= Crawler and IR Bibliography =

This list should be considered a must-read for people who work on the LARM Crawling part. Add entries if you think they fall under this criterion. If you do so, add a link to [[http://citeseer.nj.nec.com/cs|CiteSeer]] if applicable.

Add summaries of the papers at will!

== Crawling Strategies and Techniques ==

 * [[http://www.www2003.org/cdrom/papers/refereed/p096/p96-broder.html|Efficient URL caching for World Wide Web crawling]]
 * [[http://citeseer.nj.nec.com/cho02parallel.html|Cho (2002) Parallel Crawlers]]
 * [[http://rose.cs.ucla.edu/~cho/papers/cho-order.pdf|Cho et al. (2002) Efficient Crawling Through URL Reordering]] Compares different orders in which a part of the web can be crawled, among them depth-first and breadth-first crawls, and crawls with pages ordered by PageRank. Among these, the PageRank order does best, and breadth-first comes second (a frontier-ordering sketch follows this list).
 * [[http://citeseer.nj.nec.com/257424.html|Fiedler, Hammer (1999) Using the Web Efficiently: Mobile Crawlers]]
 * [[http://rose.cs.ucla.edu/~cho/papers/cho-thesis.pdf|Cho (2002) Crawling the Web: Discovery and Maintenance of Large-Scale Web Data]] (PDF) Junghoo Cho's Ph.D. thesis. Contains the contents of Cho's other crawler papers (e.g. URL Reordering).
 * [[http://www.www2003.org/cdrom/papers/refereed/p096/p96-broder.html|Broder et al. (2003) Efficient URL Caching for World Wide Web Crawling]] From the abstract: The main goal of this paper is to carefully investigate several URL caching techniques for web crawling. We consider both practical algorithms: random replacement, static cache, LRU, and CLOCK, and theoretical limits: clairvoyant caching and infinite cache. We performed about 1,800 simulations using these algorithms with various cache sizes, using actual log data extracted from a massive 33-day web crawl that issued over one billion HTTP requests. Our main conclusion is that caching is very effective - in our setup, a cache of roughly 50,000 entries can achieve a hit rate of almost 80%. Interestingly, this cache size falls at a critical point: a substantially smaller cache is much less effective while a substantially larger cache brings little additional benefit. We conjecture that such critical points are inherent to our problem and venture an explanation for this phenomenon. An LRU cache sketch also follows this list.
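
As a back-of-the-envelope illustration of the orderings compared by Cho et al., here is a minimal frontier sketch: a FIFO queue gives a breadth-first crawl, while a priority queue ordered by a precomputed importance score (e.g. PageRank from an earlier crawl) approximates the PageRank-ordered crawl. Class and method names are illustrative, not LARM or paper code, and the score is assumed to be supplied by the caller.

{{{#!java
import java.util.ArrayDeque;
import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.Set;

/** Hypothetical crawl frontier: FIFO = breadth-first, priority = importance-ordered. */
public class Frontier {
    /** A URL paired with the importance score used for ordering. */
    record ScoredUrl(String url, double score) {}

    private final Set<String> seen = new HashSet<>();   // never enqueue a URL twice
    private final Queue<ScoredUrl> queue;

    /** @param byImportance true for PageRank-style ordering, false for breadth-first */
    public Frontier(boolean byImportance) {
        this.queue = byImportance
                ? new PriorityQueue<>(Comparator.comparingDouble(ScoredUrl::score).reversed())
                : new ArrayDeque<>();
    }

    /** Adds a newly discovered URL; duplicates are silently ignored. */
    public void add(String url, double score) {
        if (seen.add(url)) {
            queue.add(new ScoredUrl(url, score));
        }
    }

    /** Returns the next URL to fetch, or null when the frontier is empty. */
    public String next() {
        ScoredUrl u = queue.poll();
        return u == null ? null : u.url();
    }
}
}}}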
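
And a minimal sketch of the kind of bounded URL cache whose hit rates Broder et al. measure, using the LRU policy (one of several the paper evaluates) via Java's LinkedHashMap. In a real crawler a miss would additionally be checked against a disk-based store of all URLs seen so far; the names here are again illustrative, not the paper's code.

{{{#!java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded LRU cache of already-seen URLs. */
public class UrlCache {
    private final Map<String, Boolean> lru;

    public UrlCache(final int capacity) {
        // accessOrder = true keeps entries ordered by recency of access.
        this.lru = new LinkedHashMap<>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;   // evict the least recently used entry
            }
        };
    }

    /**
     * Returns true on a cache hit. On a miss the URL is inserted (evicting the
     * LRU entry if full); a full crawler would also consult its disk store here.
     */
    public boolean seenBefore(String url) {
        boolean hit = (lru.get(url) != null);   // get() refreshes recency on a hit
        if (!hit) {
            lru.put(url, Boolean.TRUE);
        }
        return hit;
    }
}
}}}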

== Focused Crawling ==

(I haven't reviewed these yet)

 * [[http://citeseer.nj.nec.com/bergmark02focused.html|Bergmark et al. (2002) Focused Crawls, Tunneling, and Digital Libraries]]
 * [[http://citeseer.nj.nec.com/chakrabarti99focused.html|Chakrabarti, van den Berg, Dom (1999) Focused crawling: a new approach to topic-specific Web resource discovery]]
 * [[http://citeseer.nj.nec.com/gupta00internet.html|Menczer et al. (2002) Topic-Driven Crawlers: Machine Learning Issues]]
 * [[http://citeseer.nj.nec.com/548327.html|Chau, Chen () Personalized and Focused Web Spiders]]

== Descriptions of Existing Crawlers ==

 * [[http://citeseer.nj.nec.com/shkapenyuk01design.html|Shkapenyuk, Suel (2001) Design and Implementation of a High-Performance Distributed Web Crawler]]
 * [[http://citeseer.nj.nec.com/najork01highperformance.html|Heydon, Najork (2000) High Performance Web-Crawling]]
 * [[http://research.compaq.com/SRC/mercator/papers/www/paper.pdf|Heydon, Najork (1999) Mercator: A Scalable, Extensible Web Crawler]] (PDF)
 * [[http://citeseer.nj.nec.com/chakrabarti99focused.html|Boldi et al. (2002) UbiCrawler: A Scalable Fully Distributed Web Crawler]]

== Crawler Implementations ==

 * [[http://research.compaq.com/SRC/mercator/research.html|The Mercator Crawler]]

== Offline (Books) ==

 * [[http://www.amazon.com/exec/obidos/tg/detail/-/1558605703/qid=1055025089/sr=8-1/ref=sr_8_1/104-8508891-4793546?v=glance&s=books&n=507846|Witten, Moffat, Bell: Managing Gigabytes]]
 * [[http://www.amazon.com/exec/obidos/ASIN/020139829X/qid=1055025120/sr=2-1/ref=sr_2_1/104-8508891-4793546|Baeza-Yates, Ribeiro-Neto: Modern Information Retrieval]]
 * [[http://www.amazon.com/exec/obidos/tg/detail/-/1558607544/qid=1068135618//ref=sr_8_xs_ap_i0_xgl14/104-4927501-7811136?v=glance&s=books&n=507846|Chakrabarti, Mining the Web: Analysis of Hypertext and Semi Structured Data]] with a chapter on crawlers

== Structure and Dynamics of the Web ==

=== Page Structure ===

 * [[http://citeseer.nj.nec.com/pirolli96silk.html|Pirolli, Pitkow, Rao (1996) Silk from a Sow's Ear: Extracting usable Structures from the Web]] Experiments on how web pages may be classified, e.g. as home pages, personal pages, etc.

=== Web Structure ===

 * [[http://www.www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html|Searching the Workplace Web]] Implications for search engines in Intranets
 * [[http://citeseer.nj.nec.com/290635.html|Kumar et al. (2000) The Web as a graph]]
 * Broder et al. () Graph structure in the web - Summarizes in- and outdegree distributions of web pages and the sizes of strongly and weakly connected components (all following scale-free power-law distributions), and introduces the notion of the web graph forming a "bowtie" with several parts ("giant SCC", IN nodes, OUT nodes, and so-called Tendrils) of roughly equal size. It concludes that the diameter of the giant SCC is at least 28, but that about 75% of the time there is no directed path between two random pages on the web (a toy bowtie decomposition sketch follows this list).
 * [[http://citeseer.nj.nec.com/dill01selfsimilarity.html|Dill, Kumar et al. (2001) Self-Similarity in the Web]] - Shows that the web exhibits self-similarity on different levels, and that these properties are robust and fault-tolerant.
 * [[http://citeseer.nj.nec.com/guillaume02web.html|Guillaume, Latapy (2002) The Web Graph: an Overview]] (not reviewed)
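
To make the bowtie picture concrete, here is a toy sketch (my own illustration, not from the paper): starting from a seed page assumed to lie in the giant SCC, the SCC is the intersection of the forward- and backward-reachable sets, IN is what else reaches it, and OUT is what else it reaches. A real web graph would need the compressed representations listed in the next section.

{{{#!java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy bowtie decomposition around a seed page assumed to lie in the giant SCC. */
public class Bowtie {
    /** All nodes reachable from seed by following the given edges (DFS). */
    static Set<String> reachable(Map<String, List<String>> edges, String seed) {
        Set<String> visited = new HashSet<>();
        ArrayDeque<String> stack = new ArrayDeque<>();
        visited.add(seed);
        stack.push(seed);
        while (!stack.isEmpty()) {
            for (String next : edges.getOrDefault(stack.pop(), List.of())) {
                if (visited.add(next)) stack.push(next);
            }
        }
        return visited;
    }

    /** The same graph with every edge reversed. */
    static Map<String, List<String>> reverse(Map<String, List<String>> edges) {
        Map<String, List<String>> rev = new HashMap<>();
        edges.forEach((from, tos) -> tos.forEach(
                to -> rev.computeIfAbsent(to, k -> new ArrayList<>()).add(from)));
        return rev;
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = Map.of(
                "a", List.of("b"),         // a -> b, so "a" belongs to IN
                "b", List.of("c"),         // b and c link to each other: the SCC
                "c", List.of("b", "d"));   // c -> d, so "d" belongs to OUT
        Set<String> fwd = reachable(g, "b");            // SCC plus OUT
        Set<String> bwd = reachable(reverse(g), "b");   // SCC plus IN
        Set<String> scc = new HashSet<>(fwd);
        scc.retainAll(bwd);                             // SCC = fwd intersect bwd
        fwd.removeAll(scc);                             // leaves OUT
        bwd.removeAll(scc);                             // leaves IN
        System.out.println("SCC=" + scc + " IN=" + bwd + " OUT=" + fwd);
    }
}
}}}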

=== Accessing the Web Graph ===

 * [[http://www.henzinger.com/monika/mpapers/www7.html|Bharat, Broder, Henzinger (1998) The Connectivity Server: Fast Access to Linkage Information on the Web]]
 * [[http://citeseer.nj.nec.com/suel01compressing.html|Suel, Yuan (2001) Compressing the Graph Structure of the Web]]
 * [[http://citeseer.nj.nec.com/randall01link.html|Randall et al. (2001) The Link Database: Fast Access to Graphs of the Web]]
 * [[http://citeseer.nj.nec.com/532182.html|Cooper, Frieze (2001) A general model of web graphs]] (not reviewed)

=== Dynamics ===

 * [[http://www.www2003.org/cdrom/papers/refereed/p097/P97%20sources/p97-fetterly.html|A large-scale study of the evolution of Web pages]]

== Mining the Hypertext ==

=== Classics ===

 * [[http://citeseer.nj.nec.com/kleinberg97authoritative.html|Kleinberg Jon (1997): Authoritative Sources in a Hyperlinked Environment]] Defines two measures for a web page: a "hub" score and an "authority" score. Hubs are pages pointing to good authorities, and good authorities are linked to by many good hubs. This was the first major idea for using link structures for page ranking, but some disadvantages (e.g. the fact that much computation has to be done at query time, and that the algorithm is prone to a "topic drift" phenomenon) meant that PageRank (below) became more popular.
 * [[http://citeseer.nj.nec.com/brin98anatomy.html|Brin, Page (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine]] Describes properties of the (original) Google search engine (also contains some notes about its crawler). Introduces a measure of page quality, "PageRank", which is formed by iteratively counting the weighted in-links of each page. This measure reveals some interesting properties, e.g. it equals the probability of a page being visited in a random walk. It is used today in some form by most major search engines as one factor for ranking search results, and has also been used for optimizing a web crawler (see URL Reordering above). A power-iteration sketch follows this list.
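
Since the PageRank computation reduces to a simple iteration, here is a minimal power-iteration sketch of the idea (my own illustration with dense arrays; production systems use sparse, disk-resident link data). links[i] lists the pages that page i links to; a dangling page is treated as linking to every page.

{{{#!java
import java.util.Arrays;

/** Minimal PageRank power iteration; illustrative only, not the paper's code. */
public class PageRank {
    public static double[] rank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                  // start from the uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);  // the random-jump term
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    // Dangling page: spread its rank evenly over all pages.
                    for (int j = 0; j < n; j++) next[j] += damping * rank[i] / n;
                } else {
                    for (int j : links[i]) next[j] += damping * rank[i] / links[i].length;
                }
            }
            rank = next;
        }
        return rank;  // rank[i] ~ probability of the random surfer being at page i
    }

    public static void main(String[] args) {
        // Tiny 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
        int[][] links = { {1}, {2}, {0, 1} };
        System.out.println(Arrays.toString(rank(links, 0.85, 50)));
    }
}
}}}

Kleinberg's hub and authority scores are computed by a similar mutually reinforcing iteration, but over the neighborhood of a result set at query time, which is the cost mentioned above.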

=== More about PageRank ===

 * [[http://www.www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/index.html|Extrapolation Methods for Accelerating PageRank Computations]]

=== Other ===

 * [[http://www.www2003.org/cdrom/papers/refereed/p042/paper42_html/p42-tomlin.htm|A New Paradigm for Ranking Pages on the World Wide Web]]

== Index Maintenance ==

 * [[http://www.www2003.org/cdrom/papers/refereed/p300/p300-Yu.html|Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation]]
 * [[http://www.www2003.org/cdrom/papers/refereed/p656/p656-lim.html|Dynamic Maintenance of Web Indexes Using Landmarks]]

== Miscellaneous ==

 * [[http://citeseer.nj.nec.com/307846.html|Brandman et al. (2000) Crawler-Friendly Web Servers]] (not reviewed yet)
 * [[http://citeseer.nj.nec.com/307846.html|McLearn (2002) Autonomous Cooperating Web Crawlers]] (not reviewed yet)
 * [[http://citeseer.nj.nec.com/gupta00internet.html|Gupta, Campbell (2000) Internet Search Engine Freshness by Web Server Help]] (not reviewed yet)

== Exploiting Web Structure ==

 * [[http://www2002.org/CDROM/refereed/504/|Glover et al (2002a) Using Web Structure for Classifying and Describing Web Pages]] From the abstract: The structure of the web is increasingly being used to improve organization, search, and analysis of information on the web. For example, Google uses the text in citing documents (documents that link to the target document) for search. We analyze the relative utility of document text, and the text in citing documents near the citation, for classification and description. Results show that the text in citing documents, when available, often has greater discriminative and descriptive power than the text in the target document itself. The combination of evidence from a document and citing documents can improve on either information source alone. Moreover, by ranking words and phrases in the citing documents according to expected entropy loss, we are able to accurately name clusters of web pages, even with very few positive examples. Our results confirm, quantify, and extend previous research using web structure in these areas, introducing new methods for classification and description of pages. A sketch of the entropy-loss ranking follows below.
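
As an illustration of the "expected entropy loss" ranking mentioned in the abstract, the sketch below scores a term by the entropy of the class label (positive vs. negative pages) minus the expected entropy after splitting the pages on whether they contain the term; a higher loss means the term separates the cluster better. The counts and names are invented for illustration, not taken from the paper.

{{{#!java
/** Sketch of ranking a term by expected entropy loss (information gain). */
public class EntropyLoss {
    /** Binary entropy of a probability p. */
    static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) return 0.0;
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    /**
     * @param posWith  positive documents containing the term
     * @param posTotal all positive documents
     * @param negWith  negative documents containing the term
     * @param negTotal all negative documents
     */
    static double expectedEntropyLoss(int posWith, int posTotal, int negWith, int negTotal) {
        int total = posTotal + negTotal;
        int with = posWith + negWith;
        int without = total - with;
        double prior = entropy((double) posTotal / total);
        double hWith = (with == 0) ? 0 : entropy((double) posWith / with);
        double hWithout = (without == 0) ? 0 : entropy((double) (posTotal - posWith) / without);
        // Expected entropy after the split, weighted by branch probability.
        double posterior = ((double) with / total) * hWith
                         + ((double) without / total) * hWithout;
        return prior - posterior;   // higher = term separates the classes better
    }

    public static void main(String[] args) {
        // A term appearing in 9 of 10 positive pages but only 5 of 90 negatives.
        System.out.printf("gain = %.3f%n", expectedEntropyLoss(9, 10, 5, 90));
    }
}
}}}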

== Other Bibliographies ==

== Further Reading... ==