Differences between revisions 5 and 6
Revision 5 as of 2006-03-06 10:05:47
Size: 1078
Comment:
Revision 6 as of 2009-09-20 23:10:05
Size: 1078
Editor: localhost
Comment: converted to 1.6 markup

"segread" is an alias for "org.apache.nutch.segment.SegmentReader"

Reads and Exports a Segment's Data

Usage

  • nutch-0.8-dev/bin/nutch org.apache.nutch.segment.SegmentReader <segment>

    • <segment>: Path to the segment to read.
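As an illustration of the usage above, here is a minimal sketch of a wrapper that checks the segment path before invoking the command through its "segread" alias. The function name run_segread and the NUTCH_HOME variable with its default value are assumptions made for this example and are not part of Nutch's documented interface; the segment path in the final comment is a placeholder.

```shell
# Hypothetical wrapper: run_segread and the NUTCH_HOME default are
# illustrative assumptions, not names defined by this page.
NUTCH_HOME="${NUTCH_HOME:-nutch-0.8-dev}"

run_segread() {
    segment="$1"
    # Refuse to run if the segment directory does not exist.
    if [ ! -d "$segment" ]; then
        echo "segment directory not found: $segment" >&2
        return 1
    fi
    # "segread" is the alias for org.apache.nutch.segment.SegmentReader.
    "$NUTCH_HOME/bin/nutch" segread "$segment"
}

# Example call (placeholder path):
# run_segread path/to/segment
```

Checking the directory first only gives a clearer error message; the reader itself is invoked exactly as described in the usage line.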

Configuration Files

  • hadoop-default.xml
    hadoop-site.xml
    nutch-default.xml
    nutch-site.xml

Other Files

  • None.

Caveats and Notes

  • Creates a directory called segdump inside <segment>. Within that directory a number of files are created: a dump file called dump and several other files prefixed with part-. The dump file contains some readable information about the fetched pages and their parsed data. The part files are consolidated to form the dump file and can be deleted afterwards. Do not 'cat' these files in a terminal, because they contain some binary data that can corrupt the terminal display. If you do end up in that state, reset the terminal with 'stty sane' or, if that fails, with 'reset'.
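One way to look at the dump file without sending binary data to the terminal is to filter out non-printable bytes first. The following is a minimal sketch; the helper name safe_view is an assumption made for this example, and the path in the final comment is a placeholder for a real segment.

```shell
# Hypothetical helper: print a file with every byte that is neither
# printable nor whitespace replaced by '?', so that control sequences
# in the binary data cannot reach the terminal.
safe_view() {
    LC_ALL=C tr -c '[:print:][:space:]' '?' < "$1"
}

# Example call (placeholder path), paged for easier reading:
# safe_view path/to/segment/segdump/dump | less
```

This keeps the readable parts of the file intact and makes the binary parts visible as runs of '?' characters.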

DevelopmentCommandLineOptions

nutch-0.8-dev/bin/nutch_segread (last edited 2009-09-20 23:10:05 by localhost)