"mergedb" is an alias for "org.apache.nutch.crawl.CrawlDbMerger"
Merges several CrawlDb(s) together. URLFilters can optionally be used to filter out specific content.
You can merge several existing DBs into one. This is useful if you ran several partial crawls and would like to combine the resulting DBs. Optionally, you can run the current URLFilters on the URLs in the databases to filter out unwanted URLs. This also works with just one input DB, so you can use this tool to weed out unwanted URLs from a single DB.
To use this tool for filtering only, specify a single input crawldb in the arguments together with -filter.
If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of org.apache.nutch.crawl.CrawlDatum.getFetchTime(). However, all metadata information from all versions is accumulated, with newer values taking precedence over older values.
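The merge rule described above can be sketched in a few lines of Python. This is only a simplified model and not Nutch's implementation: Entry is a hypothetical stand-in for CrawlDatum, each DB is modelled as a plain dict keyed by URL, url_filter is a hypothetical predicate standing in for the configured URLFilters, and the handling of equal fetch times is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Optional


@dataclass
class Entry:
    """Hypothetical, simplified stand-in for a CrawlDatum."""
    fetch_time: int                      # e.g. milliseconds since the epoch
    status: str
    metadata: Dict[str, str] = field(default_factory=dict)


def merge_dbs(
    dbs: Iterable[Dict[str, Entry]],
    url_filter: Optional[Callable[[str], bool]] = None,
) -> Dict[str, Entry]:
    """Merge several URL -> Entry maps into one.

    For each URL the entry with the latest fetch_time wins, while metadata
    from all versions is accumulated and newer values override older ones.
    """
    merged: Dict[str, Entry] = {}
    for db in dbs:
        for url, entry in db.items():
            # Optional filtering step: drop URLs the predicate rejects.
            if url_filter is not None and not url_filter(url):
                continue
            current = merged.get(url)
            if current is None:
                merged[url] = Entry(entry.fetch_time, entry.status, dict(entry.metadata))
                continue
            # Assumption: on equal fetch times the entry seen first is kept.
            if entry.fetch_time > current.fetch_time:
                older, newer = current, entry
            else:
                older, newer = entry, current
            combined = dict(older.metadata)
            combined.update(newer.metadata)   # newer values take precedence
            merged[url] = Entry(newer.fetch_time, newer.status, combined)
    return merged


# Usage: two partial crawls that both know about the same URL.
db1 = {"http://a.example/": Entry(100, "fetched", {"k": "old", "only1": "x"})}
db2 = {
    "http://a.example/": Entry(200, "fetched", {"k": "new"}),
    "http://b.example/": Entry(50, "unfetched"),
}
merged = merge_dbs([db1, db2])
filtered = merge_dbs([db2], url_filter=lambda url: "b.example" not in url)
```

In this sketch, passing a single DB together with a predicate corresponds to the filtering-only use of the tool: nothing is merged, and rejected URLs are simply left out of the output.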
nutch-0.8-dev/bin/nutch org.apache.nutch.crawl.CrawlDbMerger output_crawldb crawldb1 [crawldb2 crawldb3 ...] [-filter]
output_crawldb: Output CrawlDb.
crawldb1 [crawldb2 crawldb3 ...]: One or more input CrawlDb(s).
-filter: Optional. Apply the currently configured URLFilters to the URLs in the input CrawlDb(s).
Caveats and Notes