Crawler and IR Bibliography
This list should be considered a must-read for anyone working on the LARM crawling part. Add entries if you think they meet this criterion; if you do, include a link to CiteSeer where applicable.
Add summaries of the papers at will!
Crawling Strategies and Techniques
Cho, Garcia-Molina, Page (1998) Efficient Crawling Through URL Ordering. Compares different orders in which a part of the web can be crawled, among them depth-first and breadth-first crawls, and crawls with pages ordered by (estimated) PageRank. Of these, the PageRank ordering does best, and breadth-first comes second.
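The two best-performing strategies from the paper can be sketched as two frontier disciplines: a FIFO queue (breadth-first) versus a priority queue keyed by an importance estimate such as PageRank. This is only an illustrative sketch on an in-memory toy graph; the function and parameter names are made up, not taken from the paper.

```python
from collections import deque
import heapq

def bfs_order(graph, seed):
    """Breadth-first crawl order: fetch pages in FIFO discovery order."""
    seen, order = {seed}, []
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

def priority_order(graph, seed, score):
    """Best-first crawl order: always fetch the highest-scoring known URL.
    `score` would be an importance estimate, e.g. (partial) PageRank."""
    seen, order = {seed}, []
    frontier = [(-score(seed), seed)]  # max-heap via negated scores
    while frontier:
        _, url = heapq.heappop(frontier)
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return order
```

On a real crawl the frontier lives on disk and the importance estimates are recomputed periodically, but the ordering logic is the same.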
Cho (2002) Crawling the Web: Discovery and Maintenance of Large-Scale Web Data. (PDF) Junghoo Cho's Ph.D. thesis. Contains the material of Cho's other crawler papers (e.g. URL ordering).
Broder et al. (2003) Efficient URL Caching for World Wide Web Crawling. Carefully investigates several URL caching techniques for web crawling, covering both practical algorithms (random replacement, static cache, LRU, and CLOCK) and theoretical limits (clairvoyant caching and an infinite cache). The authors ran about 1,800 simulations with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests. Their main conclusion is that caching is very effective: in their setup, a cache of roughly 50,000 entries achieves a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective, while a substantially larger cache brings little additional benefit. They conjecture that such critical points are inherent to the problem and venture an explanation for the phenomenon.
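The LRU variant studied in the paper amounts to a fixed-size recency cache of seen URLs: a hit means the URL was encountered recently and the (slow, disk-resident) full seen-set need not be consulted. A minimal sketch, with hypothetical class and method names:

```python
from collections import OrderedDict

class LRUSeenCache:
    """Fixed-capacity LRU cache over URL strings.

    seen(url) returns True on a cache hit (URL encountered recently),
    otherwise records the URL and evicts the least recently used entry
    once capacity is exceeded.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def seen(self, url):
        if url in self._items:
            self._items.move_to_end(url)      # refresh recency on a hit
            return True
        self._items[url] = None               # record a miss
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict least recently used
        return False
```

In practice one would cache fixed-size hashes of URLs rather than the strings themselves, which is part of what makes small caches so memory-efficient.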
Focused Crawling
(I haven't reviewed these yet)
Descriptions of Existing Crawlers
Chakrabarti, Mining the Web: Analysis of Hypertext and Semi-Structured Data. A textbook with a chapter on crawlers.
Structure and Dynamics of the Web
Pirolli, Pitkow, Rao (1996) Silk from a Sow's Ear: Extracting Usable Structures from the Web. Experiments on how web pages may be classified, e.g. into home pages, personal pages, etc.
Searching the Workplace Web. Implications for search engines in intranets.
Broder et al. Graph structure in the web. Summarizes the in- and outdegree distributions of web pages and the sizes of strongly and weakly connected components (all following scale-free power-law distributions), and introduces the notion of the web graph forming a "bowtie" with several parts of roughly equal size (the giant SCC, IN nodes, OUT nodes, and so-called Tendrils). It finds that the diameter of the giant SCC is at least 28, and that for about 75% of randomly chosen page pairs there is no directed path from one page to the other.
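The bowtie decomposition starts from the strongly connected components of the link graph. These can be computed with two DFS passes (Kosaraju's algorithm); a sketch on a toy adjacency-dict graph, with made-up function names (real web-scale analyses use external-memory variants):

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm on a directed graph given as {node: [successors]}.
    Returns a list of SCCs, each a list of nodes."""
    # Pass 1: record nodes in order of DFS finish time on the original graph.
    visited, order = set(), []
    for start in graph:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, it = stack[-1]
            for v in it:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(graph.get(v, ()))))
                    break
            else:
                order.append(node)   # node is finished
                stack.pop()
    # Build the reversed graph.
    rev = {}
    for u, succs in graph.items():
        for v in succs:
            rev.setdefault(v, []).append(u)
    # Pass 2: DFS on the reversed graph in reverse finish order;
    # each tree found is one strongly connected component.
    assigned, comps = set(), []
    for u in reversed(order):
        if u in assigned:
            continue
        comp, stack = [], [u]
        assigned.add(u)
        while stack:
            node = stack.pop()
            comp.append(node)
            for v in rev.get(node, ()):
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps
```

Given the SCCs, IN is everything that can reach the giant SCC, OUT is everything reachable from it, and the Tendrils hang off IN and OUT.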
Dill, Kumar et al. (2001) Self-Similarity in the Web. Shows that the web exhibits self-similarity at several levels of granularity, and that these structural properties are robust and fault-tolerant.
Guillaume, Latapy (2002) The Web Graph: an Overview (not reviewed)
Accessing the Web Graph
Cooper, Frieze (2001) A general model of web graphs (not reviewed)
Cho, Garcia-Molina (2001) Estimating Frequency of Change
Mining the Hypertext
Kleinberg (1997) Authoritative Sources in a Hyperlinked Environment. Defines two scores for a web page: a "hub" score and an "authority" score. Good hubs are pages pointing to good authorities, and good authorities are pages linked to by many good hubs. This was the first major idea for using link structure to rank pages, but some disadvantages (e.g. that much of the computation has to be performed at query time, and that the algorithm is prone to a "topic drift" phenomenon) meant that PageRank (below) became more popular.
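The mutual reinforcement between hubs and authorities is just an alternating iteration with normalization. A minimal sketch of the HITS iteration on an in-memory graph (function and parameter names are ours, not Kleinberg's):

```python
def hits(graph, iters=50):
    """HITS hub/authority scores. `graph` maps each page to the pages it
    links to; returns (hub, auth) score dictionaries."""
    pages = set(graph) | {v for succs in graph.values() for v in succs}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority update: sum of hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for u, succs in graph.items():
            for v in succs:
                auth[v] += hub[u]
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {p: a / norm for p, a in auth.items()}
        # Hub update: sum of authority scores of the pages it links to.
        hub = {u: sum(auth[v] for v in graph.get(u, ())) for u in pages}
        norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        hub = {p: h / norm for p, h in hub.items()}
    return hub, auth
```

Note that `graph` here would be the query-specific neighborhood graph, which is exactly why the computation lands at query time.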
Brin, Page (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine. Describes the (original) Google search engine (and also contains some notes about its crawler). Introduces a measure of page quality, "PageRank", computed by iteratively summing the weighted in-links of each page. The measure has some interesting properties, e.g. it equals the probability of a page being visited in a random walk. It is used today in some form by most major search engines as one factor for ranking search results, and has also been used for optimizing a web crawler (see URL ordering above).
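The random-walk interpretation gives a direct way to compute PageRank by power iteration. A toy in-memory sketch (function names and the dangling-page handling are our choices, not taken verbatim from the paper):

```python
def pagerank(graph, damping=0.85, iters=50):
    """PageRank by power iteration: a random surfer follows a link with
    probability `damping`, else jumps to a uniformly random page.
    `graph` maps each page to the pages it links to."""
    pages = set(graph) | {v for succs in graph.values() for v in succs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: (1.0 - damping) / n for p in pages}  # random-jump term
        for u in pages:
            succs = graph.get(u, [])
            if succs:
                share = damping * rank[u] / len(succs)
                for v in succs:
                    nxt[v] += share
            else:
                # Dangling page: redistribute its rank uniformly.
                for v in pages:
                    nxt[v] += damping * rank[u] / n
        rank = nxt
    return rank
```

Because every unit of rank is redistributed each round, the scores stay a probability distribution, matching the random-surfer reading above.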
More about PageRank
Brandman et al. (2000) Crawler-Friendly Web Servers (not reviewed yet)
McLearn (2002) Autonomous Cooperating Web Crawlers (not reviewed yet)
Gupta, Campbell (2000) Internet Search Engine Freshness by Web Server Help (not reviewed yet)
Exploiting Web Structure
Glover et al. (2002a) Using Web Structure for Classifying and Describing Web Pages. The structure of the web is increasingly being used to improve the organization, search, and analysis of information on the web; for example, Google uses the text in citing documents (documents that link to the target document) for search. The paper analyzes the relative utility of a document's own text versus the text in citing documents near the citation, for classification and description. Results show that the text in citing documents, when available, often has greater discriminative and descriptive power than the text of the target document itself, and that combining evidence from a document and its citing documents can improve on either information source alone. Moreover, by ranking words and phrases in the citing documents according to expected entropy loss, the authors are able to accurately name clusters of web pages, even with very few positive examples. The results confirm, quantify, and extend previous research on using web structure in these areas, introducing new methods for the classification and description of pages.