Differences between revisions 4 and 5
Revision 4 as of 2006-01-09 22:52:03
Size: 1762
Editor: 70-58-85-223
Comment: fixed classpath to org.apache
Revision 5 as of 2009-09-20 23:09:42
Size: 1762
Editor: localhost
Comment: converted to 1.6 markup

segslice is an alias for org.apache.nutch.segment.SegmentSlicer

This class reads data from one or more input segments, and outputs it to one or more output segments, optionally deleting the input segments when it's finished.

Data is read sequentially from the input segments and appended to the current output segment until that segment reaches the target count of entries (see -max), at which point the next output segment is created, and so on.
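
The sequential read-and-append behaviour described above can be sketched in a few lines of Python. This is only an illustration of the slicing logic under simplifying assumptions, not the SegmentSlicer implementation: segments are modelled as plain lists of entries, and the names slice_entries and max_count are hypothetical (max_count stands in for the '-max count' option and is assumed to be a positive number when given).

```python
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def slice_entries(
    segments: Iterable[Iterable[T]],
    max_count: Optional[int] = None,
) -> Iterator[List[T]]:
    """Yield output slices built by reading the input segments in order.

    Hypothetical helper: entries are appended to the current output slice
    until it holds max_count entries, then a new slice is started.
    With max_count=None everything ends up in a single slice.
    """
    current: List[T] = []
    for segment in segments:          # input segments are read one after another
        for entry in segment:         # entries keep their original order
            current.append(entry)
            if max_count is not None and len(current) >= max_count:
                yield current         # slice is full, start the next one
                current = []
    if current:                       # last slice may be only partly filled
        yield current


if __name__ == "__main__":
    inputs = [["a", "b", "c"], ["d", "e"]]
    for number, out in enumerate(slice_entries(inputs, max_count=2)):
        print(number, out)
```

Note that an output slice may span the boundary between two input segments, and that duplicate entries are passed through unchanged, in line with NOTE 1 below.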

NOTE 1: this tool does NOT de-duplicate data - use SegmentMergeTool for that.

NOTE 2: this tool does NOT copy indexes. It is currently impossible to slice Lucene indexes. The proper procedure is first to create slices, and then to index them.

NOTE 3: if one or more input segments are in non-parsed format, the output segments will also use non-parsed format. This means that any parseData and parseText data from input segments will NOT be copied to the output segments.
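
The rule in NOTE 3 amounts to an "all inputs must be parsed" check. A minimal sketch, assuming each input segment can be described by a hypothetical SegmentInfo record with a parsed flag (neither name comes from Nutch):

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SegmentInfo:
    """Hypothetical description of one input segment."""
    name: str
    parsed: bool  # True if the segment carries parseData and parseText


def output_is_parsed(segments: Sequence[SegmentInfo]) -> bool:
    """Output stays in parsed format only if every input segment is parsed."""
    return all(segment.parsed for segment in segments)


if __name__ == "__main__":
    mixed = [SegmentInfo("seg1", True), SegmentInfo("seg2", False)]
    # A single non-parsed input forces non-parsed output,
    # so parse data from seg1 would not be copied.
    print(output_is_parsed(mixed))
```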

Usage: bin/nutch org.apache.nutch.segment.SegmentSlicer (-local | -ndfs <namenode:port>) -o outputDir [-max count] [-fix] [-nocontent] [-noparsedata] [-noparsetext] (-dir segments | seg1 seg2 ...)

NOTE: at least one segment directory name, or the '-dir' option, is required. outputDir is always required.

-o outputDir

  • output directory for segments

-max count

  • (optional) output multiple segments, each with at most 'count' entries

-fix

  • (optional) automatically fix corrupted segments

-nocontent

  • (optional) ignore content data

-noparsedata

  • (optional) ignore parse_data data

-noparsetext

  • (optional) ignore parse_text data

-dir segments

  • directory containing multiple segments

seg1 seg2 ...

  • segment directories


bin/nutch_segslice (last edited 2009-09-20 23:09:42 by localhost)