Differences between revisions 3 and 4
Revision 3 as of 2006-05-10 17:39:38
Size: 3482
Editor: LukasVlcek
Comment:
Revision 4 as of 2009-09-20 23:09:45
Size: 3482
Editor: localhost
Comment: converted to 1.6 markup

"mergesegs" is an alias for "org.apache.nutch.segment.SegmentMerger"

Merges several input segments and can optionally write the result as one or more segments of fixed size.

This tool merges several input segments into one or more output segments. The output can be divided into several smaller segments of fixed size. Only the latest version of each piece of data is retained. Optionally, you can apply the current URLFilters to remove prohibited URLs.

Typical uses of this tool are to re-shape your segments (in preparation for deployment to search servers), to filter out unwanted data, or to minimize the number of active segments.

Which parts are merged?
It doesn't make sense to merge data from segments that are at different stages of processing (e.g. one unfetched segment, one fetched but not parsed, and one fetched and parsed). Therefore, prior to merging, the tool determines the lowest common set of input data, and only this data is merged. This may have unintended consequences: e.g. if the majority of input segments are fetched and parsed but one of them is unfetched, the tool falls back to merging only fetchlists and skips all other data from all segments.
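
The "lowest common set" rule described above can be pictured as a set intersection. The following is a minimal sketch, not the tool's actual code: the part names in ALL_PARTS are an assumption based on the usual layout of a Nutch segment directory, and common_parts is a hypothetical helper.

```python
# Assumed names of the data parts inside a segment directory (not stated on this page).
ALL_PARTS = ["crawl_generate", "crawl_fetch", "crawl_parse",
             "content", "parse_data", "parse_text"]


def common_parts(segments):
    """Return the parts present in every input segment, in ALL_PARTS order.

    segments maps a segment name to the list of parts it contains.
    """
    if not segments:
        return []
    shared = set(ALL_PARTS)
    for parts in segments.values():
        shared &= set(parts)  # keep only parts that this segment also has
    return [part for part in ALL_PARTS if part in shared]


# Two fully processed segments plus one that only has a fetchlist.
segments = {
    "20060510100000": ALL_PARTS,
    "20060510110000": ALL_PARTS,
    "20060510120000": ["crawl_generate"],
}
print(common_parts(segments))
```

With one unfetched segment in the input, only the fetchlist part survives the intersection, which mirrors the fallback described above.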

Merging fetchlists
Merging segments that contain just fetchlists (i.e. prior to fetching) is not recommended, because this tool (unlike org.apache.nutch.crawl.Generator) doesn't ensure that the fetchlist parts for each map task are disjoint.

Duplicate content
Merging segments removes older content whenever possible (see below). However, this is NOT the same as de-duplication, which additionally removes identical content found at different URLs. In other words, running DeleteDuplicates is still necessary.

For some types of data (especially ParseText) it's not possible to determine which version is really older. Therefore the tool always uses segment names as timestamps, for all types of input data. Segment names are compared in forward lexicographic order (0-9a-zA-Z), and data from segments with "higher" names prevails. It is therefore extremely important that segment names increase lexicographically as their creation time increases.
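
As an illustration of "higher names prevail", here is a minimal sketch, assuming segment names are digit-only timestamps (so that Python's plain string comparison agrees with the ordering described above). The function merge_latest, the example URLs, and the data layout are hypothetical and are not taken from the tool itself.

```python
def merge_latest(segments):
    """Keep, for each URL, the entry from the segment with the highest name.

    segments maps a segment name to a dict of url -> data.
    Returns a dict of url -> (segment_name, data).
    """
    merged = {}
    # Visit segments in ascending name order so later names overwrite earlier ones.
    for name in sorted(segments):
        for url, data in segments[name].items():
            merged[url] = (name, data)
    return merged


segments = {
    "20060510173938": {
        "http://example.com/a": "old page",
        "http://example.com/b": "same text",
    },
    "20060511090000": {
        "http://example.com/a": "new page",
        "http://example.com/c": "same text",
    },
}
for url, (name, data) in sorted(merge_latest(segments).items()):
    print(url, name, data)
```

In this sketch the older copy of /a is dropped, while /b and /c both remain even though their content is identical. That is the gap a separate de-duplication step is meant to close.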

Merging and indexes
The merged segment gets a different name. Since Indexer embeds segment names in indexes, any indexes originally created for the input segments will NOT work with the merged segment. Newly created merged segments need to be indexed afresh. This tool doesn't use existing indexes in any way, so if you plan to merge segments, you don't have to index them before merging.

Usage

  • nutch-0.8-dev/bin/nutch org.apache.nutch.segment.SegmentMerger output_dir (-dir segments | seg1 seg2 ...) [-filter] [-slice NNNN]

    • output_dir: Name of the resulting segment, or the parent dir of segment slices.
      -dir segments: Parent dir containing several segments.
      seg1 seg2 ...: List of segment dirs.
      -filter: Filter out URLs prohibited by current URLFilters.
      -slice NNNN: Create many output segments, each containing NNNN URLs.
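
To show how the options above combine, the following sketch builds an argument list that follows the usage line. It assumes the "mergesegs" alias is invoked through a bin/nutch script in the current directory; mergesegs_args and the crawl/... paths are hypothetical, and the sketch only assembles the arguments without running anything.

```python
def mergesegs_args(output_dir, segments_dir=None, segments=(),
                   url_filter=False, slice_size=None):
    """Build an argument list following the usage line:
    output_dir (-dir segments | seg1 seg2 ...) [-filter] [-slice NNNN]
    """
    has_dir = segments_dir is not None
    has_list = len(segments) > 0
    if has_dir == has_list:
        raise ValueError("pass exactly one of segments_dir or segments")
    # Assumption: the launcher script lives at bin/nutch.
    args = ["bin/nutch", "mergesegs", output_dir]
    if has_dir:
        args += ["-dir", segments_dir]
    else:
        args += list(segments)
    if url_filter:
        args.append("-filter")
    if slice_size is not None:
        if slice_size <= 0:
            raise ValueError("slice_size must be positive")
        args += ["-slice", str(slice_size)]
    return args


print(" ".join(mergesegs_args("crawl/merged", segments_dir="crawl/segments",
                              url_filter=True, slice_size=50000)))
```

The resulting list could be handed to subprocess.run; keeping it as a list avoids shell quoting problems with unusual directory names.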

Configuration Files

  • hadoop-default.xml
    hadoop-site.xml
    nutch-default.xml
    nutch-site.xml

Other Files

  • None.

Caveats and Notes

  • None.


nutch-0.8-dev/bin/nutch_mergesegs (last edited 2009-09-20 23:09:45 by localhost)