Differences between revisions 2 and 3
Revision 2 as of 2006-03-05 03:16:26
Size: 1836
Editor: JeffRitchie
Comment:
Revision 3 as of 2009-09-20 23:10:16
Size: 1836
Editor: localhost
Comment: converted to 1.6 markup

"parse" is an alias for "org.apache.nutch.parse.ParseSegment"

Runs ParseSegment on a segment.

Usage

  • nutch-0.8-dev/bin/nutch org.apache.nutch.parse.ParseSegment <segment>

    • <segment>: Path to the segment to parse.
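
As a usage sketch, the invocation above might be wrapped as follows, using the "parse" alias. The NUTCH_HOME default and the segment path crawl/segments/20060305031626 are hypothetical placeholders, and parse_cmd is an illustrative helper, not part of Nutch.

```shell
# Hypothetical locations; adjust both to your installation and crawl.
NUTCH_HOME="${NUTCH_HOME:-nutch-0.8-dev}"
SEGMENT="${SEGMENT:-crawl/segments/20060305031626}"

# Print the command line that parses one segment via the "parse" alias.
parse_cmd() {
  printf '%s/bin/nutch parse %s\n' "$NUTCH_HOME" "$1"
}

# Run the command only if the nutch launcher is actually present;
# otherwise just show what would be run.
if [ -x "$NUTCH_HOME/bin/nutch" ]; then
  "$NUTCH_HOME/bin/nutch" parse "$SEGMENT"
else
  echo "would run: $(parse_cmd "$SEGMENT")"
fi
```

Printing the command when the launcher is missing makes it easy to check the path and segment name before running a real parse.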

Configuration Files

  • hadoop-default.xml
  • hadoop-site.xml
  • nutch-default.xml
  • nutch-site.xml

Other Files

  • None.

Caveats and Notes

  • The Parser depends upon a number of plugins to parse the various documents fetched from a crawl. Document types supported and the plugins needed are as follows:

    Content-type                     Plugin               Notes
    text/html                        parse-html           Parses HTML documents using NekoHTML or TagSoup.
    application/x-javascript         parse-js             Parses JavaScript documents (.js).
    audio/mpeg                       parse-mp3            Parses MP3 audio documents (.mp3).
    application/vnd.ms-excel         parse-msexcel        Parses MS Excel documents (.xls).
    application/vnd.ms-powerpoint    parse-mspowerpoint   Parses MS PowerPoint documents.
    application/msword               parse-msword         Parses MS Word documents.
    application/rss+xml              parse-rss            Parses RSS documents (.rss).
    application/rtf                  parse-rtf            Parses RTF documents (.rtf).
    application/pdf                  parse-pdf            Parses PDF documents.
    application/x-shockwave-flash    parse-swf            Parses Flash documents (.swf).
    text/plain                       parse-text           Parses text documents (.txt).
    application/zip                  parse-zip            Parses Zip documents (.zip).
    other types                      parse-ext            Parses documents with external commands, selected by content-type or pathSuffix.

By default, only the parse-text, parse-html, and parse-js plugins are enabled. The other plugins must be enabled in nutch-site.xml.
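
One way to enable additional parsers is to override the plugin.includes property in nutch-site.xml. The sketch below only prints a candidate property block. The helper name plugin_includes_property and the plugin list passed to it are illustrative assumptions, so start from the plugin.includes value in your own nutch-default.xml and extend its parse-(...) group rather than copying this value as-is.

```shell
# Hypothetical helper: print a <property> block that overrides plugin.includes.
# $1 is the full regular expression of plugin ids to enable.
plugin_includes_property() {
  cat <<EOF
<property>
  <name>plugin.includes</name>
  <value>$1</value>
</property>
EOF
}

# Assumed example value: only the parse-(...) group matters here; keep the
# other plugin ids from your own nutch-default.xml.
plugin_includes_property 'protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)'
```

The printed block would go inside the root element of nutch-site.xml, where it takes precedence over the value in nutch-default.xml.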


nutch-0.8-dev/bin/nutch_parse (last edited 2009-09-20 23:10:16 by localhost)