Cisco has filed a patent application on a method that seeds search engine crawlers using intercepted network traffic. Cisco's method includes monitoring data packets exchanged in a computer network over which documents having respective location identifiers are distributed, so as to detect a request to access a given document.
A location identifier of the given document is extracted from the request. The location identifier is provided to a search engine that searches for data in a set of the documents, so as to cause the search engine to add the given document to the set.
I'm wondering whether Cisco has cleverly found a way for its gear to become search engine toll collectors. For example, Cisco's patent application specifically states:
"Although the embodiments described herein mainly address seeding of web-crawling search engines, the principles of the present invention can also be used for additional applications, such as for controlling the re-crawl frequency for a given Web page.
"It will thus be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove.
"Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art."
A block diagram that schematically illustrates a system for searching for data in a computer network:
FIG. 1 is a block diagram that schematically illustrates a system 20 for searching in a computer network 24, in accordance with an embodiment of the present invention. Network 24 may comprise, for example, a Wide-Area Network (WAN) such as the Internet, a Metropolitan-Area Network (MAN), a Local-Area Network (LAN) or a combination of such network types. Network 24 may comprise a public network or an enterprise network (sometimes referred to as an Intranet). Additionally or alternatively, network 24 may comprise any other suitable network type. The network typically comprises a packet-switched network, such as an Internet Protocol (IP) network.
Network 24 comprises servers 26, which store data in Web pages 28. Each page is assigned a unique location identifier, such as a Uniform Resource Locator (URL). In some embodiments, the servers host web pages that are produced a-priori. In alternative embodiments, the servers generate Web pages directly based on user input.
The methods and systems described herein can be used in any suitable network over which documents are distributed, regardless of whether the documents are stored a-priori or generated on-demand.
Although the exemplary embodiment of FIG. 1 refers to servers, the methods and systems described herein can be used with any other sort of storage or computing devices known in the art.
Moreover, although the embodiments described herein refer to web pages, the disclosed methods and systems can be used with any other suitable type of document.
In the context of the present patent application and in the claims, the term "document" refers to any kind of data resource having a location identifier, such as, for example, a file, a Web page, a database record, a web service or another generic computing service.
Network 24 comprises network elements, such as routers 32, which perform routing or forwarding of data packets in the network. Although the description that follows refers to network routers, the methods and systems described herein can be used with various other kinds of network elements that process data packets, such as switches or gateways.
System 20 comprises one or more search engines 36, which search for data in network 24 in response to user queries. Search engines 36 use web-crawling techniques, as are known in the art. For example, search engine 36 may comprise a Google search engine, which is provided by Google, or the open-source Nutch search engine provided by the Apache Software Foundation. Search engines 36 may comprise different instances of a certain search engine (e.g., multiple Google Appliance boxes) and/or search engines of different types.
Each search engine 36 maintains a web-graph or equivalent data structure, which represents a set of pages that are currently known to the search engine and the links between them. The search engine searches for data in the set of pages, typically by (1) producing an index that maps words to the pages in which they appear, and (2) querying the index in response to user queries.
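The two-step indexing idea in that passage can be illustrated with a minimal Python sketch. This is not taken from the patent application: the function names `build_index` and `query`, the tokenizing rule, and the example.com pages are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Step 1: map each lowercase word to the set of page URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def query(index, terms):
    """Step 2: return the URLs that contain every word of the query string."""
    words = re.findall(r"[a-z0-9]+", terms.lower())
    if not words:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages, used only to exercise the sketch.
pages = {
    "http://example.com/a": "Routers forward data packets",
    "http://example.com/b": "Search engines index web pages",
}
index = build_index(pages)
print(query(index, "data packets"))
```

A real search engine would add ranking, stemming and much more; the point here is only that a page absent from `pages` can never appear in a query result.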
The search engine creates the web-graph in a progressive manner. The search engine is initially provided with a set of pages, e.g., a set of popular web pages, which are referred to as a seed.
The search engine "crawls" the web by following links that appear in the seed pages and adding the linked pages to the web-graph. When a page is added to the web-graph, the search engine updates the index with the words that are found in this page.
The crawling process continues in a progressive manner by following the links in the newly-added pages, so that the web-graph is expanded constantly. Since page content may change over time, the search engine typically performs re-crawling, i.e., revisits pages that already exist in the web-graph, in accordance with a certain re-crawling policy.
As can be appreciated, search engine 36 can index and search only pages that belong to its web-graph. Pages that do not exist in the web-graph will not be indexed and the data in these pages cannot be retrieved.
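To make the limitation concrete, here is a rough sketch of a seed-based crawl over a toy link structure. It is an illustration under stated assumptions, not the crawler of any real search engine: the `crawl` function and the `links` dictionary (a stand-in for fetching a page and parsing its outgoing links) are hypothetical.

```python
from collections import deque


def crawl(seed, links):
    """Breadth-first crawl that builds a web-graph from the seed pages.

    `links` is a hypothetical in-memory stand-in for fetching a page and
    extracting the URLs it links to.
    """
    web_graph = {}
    frontier = deque(seed)
    while frontier:
        url = frontier.popleft()
        if url in web_graph:
            continue  # already crawled; a re-crawl policy would go here
        outgoing = links.get(url, [])
        web_graph[url] = list(outgoing)
        frontier.extend(outgoing)
    return web_graph


# Toy network: "orphan" links to "a", but nothing links to "orphan".
links = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": ["seed"],
    "orphan": ["a"],
}
graph = crawl(["seed"], links)
print(sorted(graph))
```

In this toy network the page named `orphan` exists and even links to other pages, but because no crawled page links to it, it never enters the web-graph and so could never be indexed.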
Embodiments of the present invention provide improved methods and systems for adding pages to the web-graphs of search engines 36.
As will be described in detail further below, routers 32 (or other network elements in network 24) monitor data packets exchanged in the network, in order to detect requests from users to access Web pages 28. When a router detects a request to access a certain page, it extracts an identifier of the requested page from the request, and forwards the identifier to the search engines. The search engines may choose to add the reported pages to their web-graphs.
Thus, pages that are not linked to the seed pages, but are requested by users, can be reached, indexed and searched by the search engines.
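The router-side step might look roughly like the following Python sketch. Several things here are assumptions that the quoted text does not specify: that the monitored request is an unencrypted HTTP/1.1 GET, that the router has already reassembled the request payload, and that the search engine exposes some reporting interface. The names `extract_url`, `SearchEngine` and `report` are hypothetical.

```python
def extract_url(payload: bytes):
    """Return the URL requested by a plain-text HTTP GET, or None.

    Assumes an unencrypted HTTP/1.x request whose payload has already
    been reassembled from the monitored packets.
    """
    try:
        text = payload.decode("ascii")
    except UnicodeDecodeError:
        return None
    lines = text.split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3 or parts[0] != "GET" or not parts[2].startswith("HTTP/"):
        return None
    target = parts[1]
    if target.startswith("http://"):
        return target  # absolute form, as sent to a proxy
    if not target.startswith("/"):
        return None
    for line in lines[1:]:
        if not line:
            break  # blank line marks the end of the headers
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            return f"http://{value.strip()}{target}"
    return None


class SearchEngine:
    """Hypothetical stand-in for a search engine's set of known pages."""

    def __init__(self, seed):
        self.known = set(seed)

    def report(self, url):
        """Record a reported URL; return True if it was previously unknown."""
        is_new = url not in self.known
        self.known.add(url)
        return is_new


request = (
    b"GET /hidden/page.html HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"User-Agent: demo\r\n\r\n"
)
engine = SearchEngine(["http://example.com/"])
url = extract_url(request)
if url is not None:
    engine.report(url)
```

In this sketch the page `/hidden/page.html` is linked from nowhere, yet it ends up in the engine's known set simply because a user requested it. Whether the engine actually crawls a reported page is left to the engine, which matches the text's statement that the search engines "may choose" to add reported pages.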