"mergedb" is an alias for "org.apache.nutch.crawl.CrawlDbMerger"

Merges several CrawlDbs together. URLFilters can optionally be used to filter out specific content.

You can merge several existing DBs into one. This is useful if you ran several partial crawls and would like to combine the resulting DBs. Optionally, you can run the current URLFilters on the URLs in the databases to filter out unwanted URLs. This also works with just one input DB, so you can use this tool to weed out unwanted URLs from a single DB.

To use this tool only for filtering, specify a single CrawlDb in the arguments.

If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of org.apache.nutch.crawl.CrawlDatum.getFetchTime(). However, all metadata information from all versions is accumulated, with newer values taking precedence over older values.
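The merge rule above can be illustrated with a minimal sketch. This is not the CrawlDbMerger source: the Datum class is a hypothetical stand-in for CrawlDatum that holds only a fetch time and a string metadata map, and the handling of equal fetch times (the later input wins) is an assumption of the sketch, not something this page specifies.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CrawlDbMergeSketch {

    /** Hypothetical stand-in for CrawlDatum: a fetch time plus metadata. */
    static final class Datum {
        final long fetchTime;
        final Map<String, String> metadata;

        Datum(long fetchTime, Map<String, String> metadata) {
            this.fetchTime = fetchTime;
            this.metadata = metadata;
        }
    }

    /**
     * Merges all versions of one URL's entry: the newest fetch time is kept,
     * and metadata from every version is accumulated, newer values
     * overwriting older ones.
     */
    static Datum merge(List<Datum> versions) {
        if (versions.isEmpty()) {
            throw new IllegalArgumentException("at least one version is required");
        }
        // Sort oldest first so that later putAll calls let newer values win.
        // The sort is stable, so among equal fetch times the later input wins
        // (an assumption of this sketch).
        List<Datum> sorted = new ArrayList<>(versions);
        sorted.sort(Comparator.comparingLong((Datum d) -> d.fetchTime));

        Map<String, String> accumulated = new LinkedHashMap<>();
        for (Datum d : sorted) {
            accumulated.putAll(d.metadata);
        }
        Datum newest = sorted.get(sorted.size() - 1);
        return new Datum(newest.fetchTime, accumulated);
    }

    public static void main(String[] args) {
        Datum older = new Datum(100L, Map.of("source", "crawl-a", "score", "old"));
        Datum newer = new Datum(200L, Map.of("score", "new"));

        Datum merged = merge(List.of(newer, older));
        System.out.println("fetchTime = " + merged.fetchTime);
        System.out.println("score     = " + merged.metadata.get("score"));
        System.out.println("source    = " + merged.metadata.get("source"));
    }
}
```

In this example the merged entry keeps the newer fetch time and the newer "score" value, while "source", which appears only in the older version, is still carried over.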

Usage

Configuration Files

Other Files

Caveats and Notes


nutch-0.8-dev/bin/nutch_mergedb (last edited 2009-09-20 23:10:06 by localhost)