segslice is an alias for org.apache.nutch.segment.SegmentSlicer.

This class reads data from one or more input segments and writes it to one or more output segments, optionally deleting the input segments when it finishes.

Data is read sequentially from the input segments and appended to the current output segment until that segment reaches the target count of entries. At that point the next output segment is created, and so on.
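
The slicing behaviour described above can be sketched as follows. This is only an illustration of the idea, not SegmentSlicer's actual code: the function name slice_entries, the use of plain strings as entries, and the assumption that a final, shorter output segment is written for leftover entries are all hypothetical.

```python
from typing import Iterable, Iterator, List


def slice_entries(segments: Iterable[Iterable[str]], max_count: int) -> Iterator[List[str]]:
    """Yield output slices of at most max_count entries, preserving input order.

    Hypothetical helper: 'segments' stands in for the input segments and
    'max_count' for the value given with -max.
    """
    if max_count <= 0:
        raise ValueError("max_count must be positive")
    current: List[str] = []
    for segment in segments:      # input segments are read one after another
        for entry in segment:     # entries are read sequentially within a segment
            current.append(entry)
            if len(current) == max_count:
                yield current     # target count reached: close this output segment
                current = []      # and start the next one
    if current:
        yield current             # assumption: leftovers form a shorter last slice


# Three input segments; "a" appears twice and is NOT de-duplicated.
inputs = [["a", "b", "c"], ["d", "a"], ["e", "f"]]
slices = list(slice_entries(inputs, max_count=3))
print(slices)
```

Note that an output slice may span the boundary between two input segments, and that the repeated entry is copied as-is, in line with NOTE 1 below.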

NOTE 1: this tool does NOT de-duplicate data - use SegmentMergeTool for that.

NOTE 2: this tool does NOT copy indexes. It is currently impossible to slice Lucene indexes. The proper procedure is first to create slices, and then to index them.

NOTE 3: if one or more input segments are in non-parsed format, the output segments will also use non-parsed format. This means that any parseData and parseText data from input segments will NOT be copied to the output segments.
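
The rule in NOTE 3 can be pictured with a small sketch. It is an assumption-laden illustration rather than the tool's implementation: each input segment is modelled as a set of the data parts it contains, and the names output_parts and PARSE_PARTS are hypothetical.

```python
from typing import Iterable, List, Set

# Parts that exist only in parsed-format segments (names taken from NOTE 3).
PARSE_PARTS = {"parseData", "parseText"}


def output_parts(segments: Iterable[Set[str]]) -> List[str]:
    """Return the parts copied to the output segments.

    Hypothetical model: parse data is copied only when EVERY input
    segment is in parsed format; one non-parsed input is enough to
    make all output segments non-parsed.
    """
    all_parsed = all(PARSE_PARTS <= parts for parts in segments)
    copied = ["content"]
    if all_parsed:
        copied += ["parseData", "parseText"]
    return copied


parsed = {"content", "parseData", "parseText"}
non_parsed = {"content"}

print(output_parts([parsed, parsed]))      # all inputs parsed
print(output_parts([parsed, non_parsed]))  # one non-parsed input
```

In the second call the parse data of the parsed segment is dropped, which is the situation NOTE 3 warns about.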

Usage: bin/nutch org.apache.nutch.segment.SegmentSlicer (-local | -ndfs <namenode:port>) -o outputDir [-max count] [-fix] [-nocontent] [-noparsedata] [-noparsetext] (-dir segments | seg1 seg2 ...)
NOTE: at least one segment directory name, or the '-dir' option, is required. outputDir is always required.
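-o outputDir
    Directory where the output segments are written. Always required.

-max count
    Target count of entries per output segment.

-dir segments
    Directory containing the input segments. Use this instead of listing segments individually.

seg1 seg2 ...
    One or more input segment directories, listed individually.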


bin/nutch_segslice (last edited 2009-09-20 23:09:42 by localhost)