"readdb" is an alias for "org.apache.nutch.crawl.CrawlDbReader"

Returns or exports information about the crawl database (crawldb).


  • nutch-0.8-dev/bin/nutch org.apache.nutch.crawl.CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -url <url>)

    • <crawldb>: Path to the crawldb directory.
      [-stats]: Prints the overall statistics to System.out
      [-dump <out_dir>]: Exports the crawldb to a file in <out_dir>
      [-url <url>]: Prints statistics on <url> to System.out
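
The usage line above can be read as "a crawldb path followed by exactly one of the three options". As a minimal sketch under that reading, the Python helper below assembles the argument list for a single invocation. The function name build_readdb_command, its parameters, the default launcher path bin/nutch and the example path crawl/crawldb are all assumptions made for illustration; only the readdb alias and the -stats, -dump and -url options come from this page.

```python
from typing import List, Optional


def build_readdb_command(
    crawldb: str,
    stats: bool = False,
    dump_dir: Optional[str] = None,
    url: Optional[str] = None,
    nutch_bin: str = "bin/nutch",  # assumed launcher location; adjust to your install
) -> List[str]:
    """Return the argv list for one readdb invocation (hypothetical helper)."""
    modes = []
    if stats:
        modes.append(["-stats"])
    if dump_dir is not None:
        modes.append(["-dump", dump_dir])
    if url is not None:
        modes.append(["-url", url])
    # The usage line lists the three options as alternatives,
    # so this sketch insists on exactly one of them.
    if len(modes) != 1:
        raise ValueError("choose exactly one of -stats, -dump <out_dir>, -url <url>")
    return [nutch_bin, "readdb", crawldb] + modes[0]


if __name__ == "__main__":
    # "crawl/crawldb" is a made-up example path.
    print(" ".join(build_readdb_command("crawl/crawldb", stats=True)))
    # prints: bin/nutch readdb crawl/crawldb -stats
```

The returned list can be handed to subprocess.run as-is. Rejecting zero or several options up front is a design choice of this sketch, not documented behaviour of the tool.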

Configuration Files

  • hadoop-default.xml

Other Files

  • None.

Caveats and Notes

-stats option

  • The -stats option is quite useful for getting a quick overview of the crawl that was performed. The output has the following meaning:

  • DB_unfetched counts pages that are linked to by fetched pages but have not been fetched yet (because they do not pass the URL filters or are not among the TopN links that Nutch selects for its next fetch cycle).
  • DB_gone means that a 404 or some other presumably permanent error was encountered. This status prevents future attempts to fetch the URL.
  • DB_fetched is the number of documents that have been fetched and indexed. This is the figure that matters most: if you see "status 2 (DB_fetched): 0", then something went wrong.
  • (see http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/%3C43C54E85.8040703@nutch.org%3E)
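
The check described above (a DB_fetched count of 0 signals a problem) can be automated. The following is a minimal Python sketch that assumes the statistics output contains lines shaped like the one quoted above, "status 2 (DB_fetched): 0". The function names and the SAMPLE text are hypothetical; the sample counts and the status numbers other than 2 are made up for illustration and are not real tool output.

```python
import re
from typing import Dict

# Matches lines shaped like: status 2 (DB_fetched): 0
STATUS_LINE = re.compile(r"status\s+(\d+)\s+\((\w+)\):\s*(\d+)")


def parse_status_counts(stats_output: str) -> Dict[str, int]:
    """Map each status name (e.g. DB_fetched) to its reported count."""
    counts: Dict[str, int] = {}
    for line in stats_output.splitlines():
        # search() rather than match(), so a leading log prefix is tolerated.
        found = STATUS_LINE.search(line)
        if found:
            counts[found.group(2)] = int(found.group(3))
    return counts


def crawl_looks_healthy(counts: Dict[str, int]) -> bool:
    """Apply the rule of thumb above: zero fetched documents means trouble."""
    return counts.get("DB_fetched", 0) > 0


# Illustrative input only; not captured from a real crawl.
SAMPLE = """\
status 1 (DB_unfetched): 120
status 2 (DB_fetched): 0
status 3 (DB_gone): 4
"""

if __name__ == "__main__":
    counts = parse_status_counts(SAMPLE)
    if not crawl_looks_healthy(counts):
        print("warning: no documents fetched, counts =", counts)
```

A missing DB_fetched line is treated the same as a count of 0, which seems the safer default for a health check.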


