Differences between revisions 4 and 5
Revision 4 as of 2006-05-10 17:38:18
Size: 1584
Editor: LukasVlcek
Comment:
Revision 5 as of 2009-09-20 23:10:06
Size: 1584
Editor: localhost
Comment: converted to 1.6 markup

"mergedb" is an alias for "org.apache.nutch.crawl.CrawlDbMerger"

Merges several CrawlDb(s) together. URLFilters can optionally be used to filter out specific content.

You can merge several existing DBs into one. This is useful if you ran several partial crawls and want to combine their DBs. Optionally, you can run the current URLFilters on the URLs in the databases to filter out unwanted URLs. This also works with just one input DB, so you can use the tool to weed unwanted URLs out of a single DB.

The tool can also be used just for filtering; in that case, specify only one crawldb in the arguments.
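
The filter-only case can be pictured with a minimal Java sketch. It is an illustration only, not the CrawlDbMerger code: the class FilterOnlySketch, the method filterDb, the map standing in for a CrawlDb (URL to fetch time), and the predicate standing in for the configured URLFilters are all hypothetical names invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration; none of these names come from Nutch.
class FilterOnlySketch {

    // Copies every entry whose URL the predicate accepts into a new map.
    // The input map plays the role of a single input CrawlDb and the
    // returned map plays the role of the output CrawlDb.
    static Map<String, Long> filterDb(Map<String, Long> db, Predicate<String> accept) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : db.entrySet()) {
            if (accept.test(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> db = new LinkedHashMap<>();
        db.put("http://example.com/page.html", 100L);
        db.put("http://example.com/logo.gif", 200L);

        // Assumed filter rule for the example: reject URLs ending in ".gif".
        Map<String, Long> kept = filterDb(db, url -> !url.endsWith(".gif"));
        System.out.println(kept.keySet());
    }
}
```

With one input and a rejecting rule, the output simply contains fewer URLs than the input, which is the "weeding out" use described above.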

If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of org.apache.nutch.crawl.CrawlDatum.getFetchTime(). However, all metadata information from all versions is accumulated, with newer values taking precedence over older values.
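
The rule above can be sketched in a few lines of Java. This is a simplified model under stated assumptions, not the actual CrawlDbMerger implementation: the Entry class is a hypothetical stand-in for a CrawlDatum that carries only a fetch time and a string-to-string metadata map, and mergeVersions is an invented helper name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; Entry is not Nutch's CrawlDatum.
class CrawlDbMergeSketch {

    static final class Entry {
        final long fetchTime;
        final Map<String, String> metadata;

        Entry(long fetchTime, Map<String, String> metadata) {
            this.fetchTime = fetchTime;
            this.metadata = metadata;
        }
    }

    // Merges all versions of one URL: the newest fetch time wins, and
    // metadata from every version is accumulated, newer values
    // overwriting older values for the same key.
    static Entry mergeVersions(List<Entry> versions) {
        if (versions.isEmpty()) {
            throw new IllegalArgumentException("at least one version is required");
        }
        // Sort oldest first so that later putAll calls apply newer values.
        List<Entry> sorted = new ArrayList<>(versions);
        sorted.sort(Comparator.comparingLong((Entry e) -> e.fetchTime));

        Map<String, String> merged = new HashMap<>();
        for (Entry e : sorted) {
            merged.putAll(e.metadata);
        }
        Entry newest = sorted.get(sorted.size() - 1);
        return new Entry(newest.fetchTime, merged);
    }

    public static void main(String[] args) {
        Map<String, String> oldMeta = new HashMap<>();
        oldMeta.put("a", "old");
        oldMeta.put("b", "keep");
        Map<String, String> newMeta = new HashMap<>();
        newMeta.put("a", "new");

        List<Entry> versions = new ArrayList<>();
        versions.add(new Entry(200L, newMeta));
        versions.add(new Entry(100L, oldMeta));

        Entry result = mergeVersions(versions);
        System.out.println(result.fetchTime + " " + result.metadata.get("a")
                + " " + result.metadata.get("b"));
    }
}
```

In this example the key "b" exists only in the older version and still survives the merge, while "a" takes the value from the newer version.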

Usage

  • nutch-0.8-dev/bin/nutch org.apache.nutch.crawl.CrawlDbMerger output_crawldb crawldb1 [crawldb2 crawldb3 ...] [-filter]

    • output_crawldb: Output CrawlDb.
      crawldb1 [crawldb2 crawldb3 ...]: One or more input CrawlDb(s).
      -filter: Apply the current URLFilters to the URLs in the CrawlDb(s).

Configuration Files

  • hadoop-default.xml
    hadoop-site.xml
    nutch-default.xml
    nutch-site.xml

Other Files

  • None.

Caveats and Notes

  • None.

DevelopmentCommandLineOptions

nutch-0.8-dev/bin/nutch_mergedb (last edited 2009-09-20 23:10:06 by localhost)