Differences between revisions 2 and 3
Revision 2 as of 2006-05-10 17:38:43
Size: 1642
Editor: LukasVlcek
Comment:
Revision 3 as of 2009-09-20 23:09:53
Size: 1642
Editor: localhost
Comment: converted to 1.6 markup

"mergelinkdb" is an alias for "org.apache.nutch.crawl.LinkDbMerger".

Merges several LinkDb(s) together. URLFilters can optionally be used to filter out specific content.

This tool is useful if you have built partial LinkDb(s) from groups of segments and then need to integrate them into one (e.g. for indexing or for searching).

It can also be used just for filtering, to remove unwanted URLs and links from a single LinkDb. In that case only one input LinkDb should be specified in the arguments.

If more than one LinkDb contains information about the same URL, the inlinks from all of them are accumulated, but at most db.max.inlinks inlinks will ever be added for that URL.
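
The accumulation rule above can be illustrated with a small Python sketch. This is only a toy model, not the tool's actual code: each LinkDb is represented here as a plain dict mapping a target URL to a list of inlink URLs, and merge_linkdbs and its max_inlinks parameter are hypothetical names standing in for the merge step and the db.max.inlinks setting. Which inlinks survive once the limit is reached depends on input order in this sketch; the real tool's order is not described on this page.

```python
def merge_linkdbs(linkdbs, max_inlinks):
    """Merge toy LinkDbs (dict: target URL -> list of inlink URLs).

    Hypothetical model: inlinks for the same target are accumulated
    across all inputs, but never more than max_inlinks per target.
    """
    merged = {}
    for db in linkdbs:
        for target, inlinks in db.items():
            kept = merged.setdefault(target, [])
            for src in inlinks:
                if len(kept) >= max_inlinks:
                    break  # limit reached, ignore the remaining inlinks
                kept.append(src)
    return merged


# Two partial LinkDbs that both know about http://a/
db1 = {"http://a/": ["http://x/", "http://y/"]}
db2 = {"http://a/": ["http://z/", "http://w/"], "http://b/": ["http://x/"]}
merged = merge_linkdbs([db1, db2], max_inlinks=3)
print(merged)
```

With a limit of 3, the fourth inlink to http://a/ is not added, while http://b/ keeps its single inlink.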

If activated, URLFilters are applied both to the target URLs and to every incoming link URL. If a target URL is prohibited, the target URL and all inlinks to it are removed. If only some of the incoming links are prohibited, only those are removed, and they do not count toward the maximum limit described above.
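
A rough Python sketch of the filtering behaviour may help. It is a toy model under stated assumptions, not the tool's implementation: a LinkDb is a dict mapping a target URL to a list of inlink URLs, url_allowed is a hypothetical stand-in for the configured URLFilters, and max_inlinks stands in for db.max.inlinks. The sketch also assumes that a target left with no inlinks is simply absent from the output, which this page does not specify.

```python
def filter_and_merge(linkdbs, url_allowed, max_inlinks):
    """Merge toy LinkDbs while applying a URL filter (hypothetical model)."""
    merged = {}
    for db in linkdbs:
        for target, inlinks in db.items():
            if not url_allowed(target):
                continue  # prohibited target: drop it and all its inlinks
            for src in inlinks:
                if not url_allowed(src):
                    continue  # prohibited inlink: dropped, not counted
                kept = merged.setdefault(target, [])
                if len(kept) < max_inlinks:
                    kept.append(src)
    return merged


def no_spam(url):
    # Hypothetical filter rule used only for this example.
    return "spam" not in url


db = {
    "http://a/": [
        "http://spam.example/1",
        "http://x/",
        "http://spam.example/2",
        "http://y/",
        "http://z/",
    ],
    "http://spam.example/t": ["http://x/"],
}
result = filter_and_merge([db], no_spam, max_inlinks=2)
print(result)
```

Passing a single LinkDb, as here, corresponds to the filter-only use of the tool. The two prohibited inlinks do not use up the limit of 2, so both http://x/ and http://y/ are kept.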

Usage

  • nutch-0.8-dev/bin/nutch org.apache.nutch.crawl.LinkDbMerger output_linkdb linkdb1 [linkdb2 linkdb3 ...] [-filter]

    • output_linkdb: Output LinkDb.
      linkdb1 [linkdb2 linkdb3 ...]: One or more input LinkDb(s).
      -filter: Apply the currently configured URLFilters to URLs and links in the LinkDb(s).

Configuration Files

  • hadoop-default.xml
    hadoop-site.xml
    nutch-default.xml
    nutch-site.xml

Other Files

  • None.

Caveats and Notes

  • None.

DevelopmentCommandLineOptions

nutch-0.8-dev/bin/nutch_mergelinkdb (last edited 2009-09-20 23:09:53 by localhost)