parse is an alias for org.apache.nutch.tools.ParseSegment
Parse contents in one segment.
It assumes that the given segment contains ./fetcher_output/, which is typically generated by a non-parsing fetcher run (i.e., the fetcher is started with option -noParsing).
Contents in one segment are parsed and saved in these steps:
1. ./fetcher_output/ and ./content/ are iterated over together (possibly by multiple ParserThreads), and the content of each entry is parsed. The entry number and the resulting ParserOutput are saved in ./parser.unsorted.
2. ./parser.unsorted is sorted by entry number, and the result is saved as ./parser.sorted.
3. ./parser.sorted and ./fetcher_output/ are iterated over together. At each entry, ParserOutput is split into ParseData and ParseText, which are saved in ./parse_data/ and ./parse_text/ respectively. FetcherOutput is also updated with the parsing status and saved in ./fetcher/. In the end, ./fetcher/ should be identical to the one produced by a fetcher run WITHOUT option -noParsing.
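The three steps above can be sketched in a few lines of Java. This is only an in-memory illustration of the parse / sort / merge flow, not the actual ParseSegment code: the class ParseSegmentSketch, the Parsed and Output types, the toy parse() method, and the "parsed" status string are all hypothetical stand-ins, and plain lists take the place of the on-disk ./fetcher_output/, ./content/, ./parser.unsorted and ./parser.sorted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative only: every name in this class is hypothetical, not part of Nutch.
class ParseSegmentSketch {

    // Stand-in for one record of ./parser.unsorted: entry number plus parse result.
    static final class Parsed {
        final int entry;
        final String data; // stand-in for the ParseData half
        final String text; // stand-in for the ParseText half
        Parsed(int entry, String data, String text) {
            this.entry = entry; this.data = data; this.text = text;
        }
    }

    // Stand-in for what step 3 writes per entry (parse_data, parse_text, fetcher).
    static final class Output {
        final String url, parseData, parseText, status;
        Output(String url, String parseData, String parseText, String status) {
            this.url = url; this.parseData = parseData;
            this.parseText = parseText; this.status = status;
        }
    }

    // Toy "parser": the first line becomes the data part, the rest the text part.
    static Parsed parse(int entry, String content) {
        int nl = content.indexOf('\n');
        String data = nl < 0 ? content : content.substring(0, nl);
        String text = nl < 0 ? "" : content.substring(nl + 1);
        return new Parsed(entry, data, text);
    }

    static List<Output> run(List<String> fetched, List<String> contents, int threads) {
        // Step 1: parse entries on several threads; completion order is arbitrary,
        // so each result carries its entry number.
        final List<Parsed> unsorted = Collections.synchronizedList(new ArrayList<Parsed>());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < contents.size(); i++) {
            final int entry = i;
            final String content = contents.get(i);
            pool.execute(() -> unsorted.add(parse(entry, content)));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("parsing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while parsing", e);
        }

        // Step 2: restore entry order by sorting on the entry number.
        List<Parsed> sorted = new ArrayList<Parsed>(unsorted);
        sorted.sort(Comparator.comparingInt(p -> p.entry));

        // Step 3: walk the sorted results and the fetcher records in lockstep,
        // split each result into its two halves, and record a parsing status.
        List<Output> out = new ArrayList<Output>();
        for (int i = 0; i < sorted.size(); i++) {
            Parsed p = sorted.get(i);
            out.add(new Output(fetched.get(i), p.data, p.text, "parsed"));
        }
        return out;
    }
}
```

Tagging each result with its entry number in step 1 is what allows the parser threads to finish in any order: the sort in step 2 restores the original sequence, so step 3 can pair each result with its fetcher record by position.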
By default, the intermediate files ./parser.unsorted and ./parser.sorted are removed at the end, unless option -noClean is used. However, ./fetcher_output/ is kept intact.
Check Fetcher.java and FetcherOutput.java for further discussion.
Usage: bin/nutch org.apache.nutch.tools.ParseSegment (-local | -ndfs <namenode:port>) [-threads n] [-showThreadID] [-dryRun] [-logLevel level] [-noClean] dir