"parse" is an alias for "org.apache.nutch.parse.ParseSegment"
Runs ParseSegment on a segment.
Usage
nutch-0.8-dev/bin/nutch org.apache.nutch.parse.ParseSegment <segment>
<segment>: Path to the segment to parse.
Configuration Files
hadoop-default.xml
hadoop-site.xml
nutch-default.xml
nutch-site.xml
Other Files
- None.
Caveats and Notes
The parser relies on plugins to parse the different types of documents fetched during a crawl. The supported content types and the plugins that handle them are as follows:
Content-type | Plugin | Notes
text/html | parse-html | Parses HTML documents using NekoHTML or TagSoup
application/x-javascript | parse-js | Parses JavaScript documents (.js)
audio/mpeg | parse-mp3 | Parses MP3 audio documents (.mp3)
application/vnd.ms-excel | parse-msexcel | Parses MS Excel documents (.xls)
application/vnd.ms-powerpoint | parse-mspowerpoint | Parses MS PowerPoint documents
application/msword | parse-msword | Parses MS Word documents
application/rss+xml | parse-rss | Parses RSS documents (.rss)
application/rtf | parse-rtf | Parses RTF documents (.rtf)
application/pdf | parse-pdf | Parses PDF documents
application/x-shockwave-flash | parse-swf | Parses Flash documents (.swf)
text/plain | parse-text | Parses text documents (.txt)
application/zip | parse-zip | Parses Zip documents (.zip)
other types | parse-ext | Parses documents with external commands based upon content-type or pathSuffix
By default, only the parse-text, parse-html and parse-js plugins are enabled. The other plugins must be enabled in nutch-site.xml.
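As a rough sketch of enabling additional parsers, the fragment below overrides the plugin.includes property in nutch-site.xml to add parse-pdf and parse-msword. The non-parser entries in the regular expression (protocol-http, urlfilter-regex, index-basic, query-(basic|site|url)) are illustrative assumptions. Copy the actual default value from your own nutch-default.xml and extend only the parse-(...) group.

```xml
<!-- Goes inside the root element of nutch-site.xml. -->
<!-- Assumption: the non-parser plugin names below are placeholders; -->
<!-- start from the plugin.includes value in your nutch-default.xml. -->
<property>
  <name>plugin.includes</name>
  <!-- text, html and js are the defaults; pdf and msword are added here -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword)|index-basic|query-(basic|site|url)</value>
</property>
```

The value is a regular expression matched against plugin directory names, so each extra parser is added as another alternative inside the parse-(...) group.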