"readdb" is an alias for "org.apache.nutch.crawl.CrawlDbReader"
Returns or Exports information on the Crawl Database
Usage
nutch-0.8-dev/bin/nutch org.apache.nutch.crawl.CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -url <url>)
<crawldb>: Path to the crawldb directory.
[-stats]: Prints the overall statistics to System.out
[-dump <out_dir>]: Exports the crawldb to a file in <out_dir>
[-url <url>]: Prints statistics on <url> to System.out
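The usage line above can be sketched as a small shell wrapper, one call per option. This is only an illustration: the `readdb` function and the `NUTCH` variable are local helpers invented here, and `crawl/crawldb`, `crawldb-dump` and the example URL are placeholder values, not paths from a real installation.

```shell
#!/bin/sh
# Hypothetical wrapper: `readdb` is a local shell function, not part of Nutch.
# NUTCH (an assumed variable) defaults to the launcher path from the usage line.
readdb() {
  "${NUTCH:-nutch-0.8-dev/bin/nutch}" org.apache.nutch.crawl.CrawlDbReader "$@"
}

# Example calls (all paths and the URL are placeholders):
#   readdb crawl/crawldb -stats                        # overall statistics
#   readdb crawl/crawldb -dump crawldb-dump            # export to crawldb-dump
#   readdb crawl/crawldb -url http://www.example.com/  # statistics for one URL
```

Exactly one of -stats, -dump or -url is passed per call, after the crawldb path.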
Configuration Files
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
Other Files
- None.
Caveats and Notes
-stats option
The -stats option is useful for getting a quick overview of a completed crawl. The output has the following meaning:
- DB_unfetched is the number of pages that are linked to by fetched pages but have not been fetched yet (because they do not pass the URL filters or are not among the TopN links that Nutch selects for its next fetch cycle).
- DB_gone means that a 404 or some other presumably permanent error was encountered. This status prevents future attempts to fetch the URL.
- DB_fetched is the number of documents that have been fetched and indexed. This is the figure that matters: if the output shows "status 2 (DB_fetched): 0", then something went wrong.
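The DB_fetched check described above could be automated roughly as follows. This is a sketch that assumes the statistics contain a line shaped like "status 2 (DB_fetched): 0", as quoted above; the exact output layout may differ between Nutch versions. The function names `fetched_count` and `check_fetched` are hypothetical helpers, not part of Nutch.

```shell
#!/bin/sh
# Print the DB_fetched count found in -stats output read from stdin.
# Assumes a line like "status 2 (DB_fetched): 0"; prints 0 if no such line exists.
fetched_count() {
  awk 'tolower($0) ~ /\(db_fetched\):/ { n = $NF } END { print n + 0 }'
}

# Succeed only if at least one page was fetched; warn on stderr otherwise.
check_fetched() {
  count=$(fetched_count)
  if [ "$count" -gt 0 ]; then
    echo "OK: DB_fetched is $count"
  else
    echo "WARNING: DB_fetched is 0, something went wrong" >&2
    return 1
  fi
}
```

With a crawldb at a placeholder path such as crawl/crawldb, the output of the -stats call would be piped into `check_fetched`, and the exit status tells a calling script whether the crawl fetched anything.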