Crawler and IR Bibliography

This list should be considered a must-read for people working on the LARM crawling part. Add entries if you think they meet this criterion, and include a link to CiteSeer if applicable.

Add summaries of the papers at will!

Crawling Strategies and Techniques

Focused Crawling

(I haven't reviewed these yet)

Descriptions of Existing Crawlers

Crawler Implementations

Offline (Books)

Structure and Dynamics of the Web

Page Structure

Web Structure

  • Searching the Workplace Web - Implications for search engines on intranets
  • Kumar et al. (2000) The Web as a graph
  • Broder et al. () Graph structure in the web - Summarizes the in- and out-degree distributions of web pages and the sizes of strongly and weakly connected components (all following scale-free power-law distributions), and introduces the notion of the web graph forming a "bow tie" with several parts of roughly equal size (the giant SCC, IN nodes, OUT nodes, and so-called Tendrils). It concludes that the diameter of the giant SCC is at least 28, but that for about 75% of randomly chosen page pairs there is no directed path from one page to the other.
  • Dill, Kumar et al. (2001) Self-Similarity in the Web - Shows that the web exhibits self-similarity at several levels of granularity, and that these structural properties are robust and fault-tolerant.
  • Guillaume, Latapy (2002) The Web Graph: an Overview (not reviewed)

Accessing the Web Graph

Dynamics

Mining the Hypertext

Classics

  • Kleinberg, Jon (1997) Authoritative Sources in a Hyperlinked Environment - Defines two measures for a web page: a "hub" score and an "authority" score. Good hubs are pages pointing to good authorities, and good authorities are pages linked to by many good hubs. This was the first major idea for using link structure for page ranking, but some disadvantages (e.g. the fact that much of the computation has to be performed at query time, and that the algorithm is prone to a "topic drift" phenomenon) led to PageRank (below) becoming more popular.
  • Brin, Page (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine - Describes properties of the (original) Google search engine (and also contains some notes about its crawler). Introduces a measure of page quality, "PageRank", which is computed by iteratively summing the weighted in-links of each page. This measure has some interesting properties, e.g. it corresponds to the probability that a random walk over the web graph visits the page. It is used today in some form by most major search engines as one factor for ranking search results, and has also been used for optimizing a web crawler (see above, URL-Reordering).
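The two link-analysis schemes above can both be computed as simple power iterations over the link graph. The following is a minimal sketch; the toy graph, page names, damping factor, and iteration counts are illustrative assumptions, not values from either paper.

```python
# Adjacency list: page -> pages it links to (a hypothetical toy web graph).
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(graph, damping=0.85, iterations=50):
    """Iteratively redistribute rank along out-links (Brin & Page 1998)."""
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in graph}
        for p, links in graph.items():
            if links:
                share = damping * rank[p] / len(links)
                for q in links:
                    new[q] += share
            else:  # dangling page: spread its rank uniformly
                for q in graph:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

def hits(graph, iterations=50):
    """Kleinberg's mutually reinforcing hub and authority scores."""
    hub = {p: 1.0 for p in graph}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking to a page.
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in graph}
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of authority scores of the pages a page links to.
        hub = {p: sum(auth[q] for q in graph[p]) for p in graph}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth
```

On this toy graph, page "c" (linked to by three pages) ends up with both the highest PageRank and the highest authority score, while "a" (linking to two good authorities) is the best hub, which illustrates the difference between the two notions of importance.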

More about PageRank

Other

Index Maintenance

Miscellaneous

Exploiting Web Structure

  • Glover et al. (2002a) Using Web Structure for Classifying and Describing Web Pages - From the abstract: The structure of the web is increasingly being used to improve organization, search, and analysis of information on the web. For example, Google uses the text in citing documents (documents that link to the target document) for search. We analyze the relative utility of document text, and of the text in citing documents near the citation, for classification and description. Results show that the text in citing documents, when available, often has greater discriminative and descriptive power than the text in the target document itself. The combination of evidence from a document and its citing documents can improve on either information source alone. Moreover, by ranking words and phrases in the citing documents according to expected entropy loss, we are able to accurately name clusters of web pages, even with very few positive examples. Our results confirm, quantify, and extend previous research using web structure in these areas, introducing new methods for classification and description of pages.
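Expected entropy loss is the same quantity as information gain: how much the uncertainty about cluster membership drops once we know whether a word occurs in the citing context. A small sketch of that ranking criterion might look like the following; the documents, word sets, and labels are made up for illustration and are not data from the paper.

```python
import math

def entropy(pos, neg):
    """Binary entropy (in bits) of a positive/negative label split."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_entropy_loss(docs, labels, word):
    """Entropy of the labels minus the expected entropy after splitting
    the documents on whether `word` occurs in their citing context."""
    pos = sum(labels)
    prior = entropy(pos, len(labels) - pos)
    with_w = [l for d, l in zip(docs, labels) if word in d]
    without = [l for d, l in zip(docs, labels) if word not in d]
    def cond(subset):
        p = sum(subset)
        return entropy(p, len(subset) - p)
    expected = (len(with_w) / len(docs)) * cond(with_w) + \
               (len(without) / len(docs)) * cond(without)
    return prior - expected

# Hypothetical citing-context word sets; label 1 = page is in the cluster.
docs = [{"java", "compiler"}, {"java", "jvm"},
        {"recipe", "cake"}, {"recipe", "java"}]
labels = [1, 1, 0, 0]
```

Here "recipe" separates the two groups perfectly (maximal entropy loss of 1 bit), while "java" occurs on both sides and scores lower, so "recipe" would be ranked first as a cluster descriptor.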

Other Bibliographies

Further Reading...
