"readdb" is an alias for "org.apache.nutch.crawl.CrawlDbReader"

Returns or exports information on the crawl database.

Usage

  • nutch-0.8-dev/bin/nutch org.apache.nutch.crawl.CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -url <url>)

    • <crawldb>: Path to the crawldb directory.
      [-stats]: Prints the overall statistics to System.out
      [-dump <out_dir>]: Exports the crawldb to a file in <out_dir>
      [-url <url>]: Prints statistics on <url> to System.out
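The usage line above allows exactly one of the three modes per invocation. A minimal Python sketch of a helper that assembles such a command line follows. The helper name `readdb_command` and the example crawldb path are hypothetical, and it assumes the `readdb` alias is available in the launcher script named in the usage line.

```python
from typing import List, Optional

# Launcher path taken from the usage line; adjust to the local install.
NUTCH = "nutch-0.8-dev/bin/nutch"


def readdb_command(crawldb: str,
                   stats: bool = False,
                   dump_dir: Optional[str] = None,
                   url: Optional[str] = None) -> List[str]:
    """Build the argument list for exactly one of -stats, -dump, -url.

    Hypothetical helper for illustration; it only builds the list and
    does not run anything.
    """
    modes = []
    if stats:
        modes.append(["-stats"])
    if dump_dir is not None:
        modes.append(["-dump", dump_dir])
    if url is not None:
        modes.append(["-url", url])
    if len(modes) != 1:
        raise ValueError(
            "choose exactly one of -stats, -dump <out_dir>, -url <url>")
    return [NUTCH, "readdb", crawldb] + modes[0]


# Example with a made-up crawldb path:
cmd = readdb_command("crawl/crawldb", stats=True)
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` to execute it. Building a list rather than a single string avoids shell-quoting problems when a URL contains characters such as `&` or `?`.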

Configuration Files

  • hadoop-default.xml
    hadoop-site.xml
    nutch-default.xml
    nutch-site.xml

Other Files

  • None.

Caveats and Notes

-stats option

  • The -stats option is a quick way to get an overview of a completed crawl. Its output has the following meaning:

  • DB_unfetched counts pages that are linked to by fetched pages but have not been fetched yet (because they do not pass the URL filters or are not among the TopN links that Nutch selects for its next fetch cycle).
  • DB_gone means that a 404 or some other presumably permanent error was encountered. This status prevents future attempts to fetch the URL.
  • DB_fetched is the number of documents that have been fetched and indexed. This is the figure that matters most: if the output shows "status 2 (DB_fetched): 0", then something went wrong.
  • (see http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/%3C43C54E85.8040703@nutch.org%3E)
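The check described above (a DB_fetched count of zero signals a problem) can be automated. The Python sketch below assumes every status line has the same shape as the quoted "status 2 (DB_fetched): 0"; the function names and the sample text are hypothetical, and the real output may contain other lines that this pattern simply ignores.

```python
import re
from typing import Dict

# Assumed line shape, modeled on the quoted 'status 2 (DB_fetched): 0'.
STATUS_LINE = re.compile(r"status\s+(\d+)\s+\((\w+)\):\s*(\d+)")


def status_counts(stats_output: str) -> Dict[str, int]:
    """Map each status name found in the text to its count."""
    return {m.group(2): int(m.group(3))
            for m in STATUS_LINE.finditer(stats_output)}


def crawl_fetched_anything(stats_output: str) -> bool:
    """Return True when DB_fetched is present and greater than zero."""
    return status_counts(stats_output).get("DB_fetched", 0) > 0


# Made-up sample; only the DB_fetched line is quoted from the page.
sample = """\
status 1 (DB_unfetched): 120
status 2 (DB_fetched): 0
status 3 (DB_gone): 4
"""

counts = status_counts(sample)
healthy = crawl_fetched_anything(sample)
```

With this sample, `healthy` is `False`, which corresponds to the "something went wrong" case.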


nutch-0.8-dev/bin/nutch_readdb (last edited 2009-09-20 23:09:42 by localhost)