
Options for intranet crawling that are not enabled by default

If you plan to crawl your local intranet, you may want to add the following options, which are not enabled by default, to your conf/nutch-site.xml configuration file.
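As a rough orientation, the properties shown on this page go inside the top-level configuration element of conf/nutch-site.xml. The sketch below assumes an otherwise empty site file; the comment is only a placeholder for the properties described in the sections that follow.

```xml
<?xml version="1.0"?>
<configuration>
        <!-- Site-specific overrides go here, one <property> element per setting. -->
</configuration>
```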

Enable additional parser plugins

        <property>
                <name>plugin.includes</name>
                <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword|rss|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        </property>

This enables the parser plugins for text, HTML, JavaScript, PDF, Excel, PowerPoint, Word, RSS and zip. Additional parsers that you can enable are listed in conf/parse-plugins.xml. If you need to parse other document types and they are listed in the parse-plugins file, add the corresponding plugins to the list.
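As an illustration of adding a parser to the list, the sketch below appends one more entry to the parse-(...) group. The name swf is only an assumed example; check conf/parse-plugins.xml and the plugins directory of your installation to confirm which parser plugins are actually available before using it.

```xml
        <property>
                <name>plugin.includes</name>
                <!-- "swf" is an assumed example; replace it with a parser listed in conf/parse-plugins.xml -->
                <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword|rss|zip|swf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        </property>
```

Each name inside the parentheses is combined with the parse- prefix, so the entry above would match a plugin called parse-swf.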

Increase the file size fetch limit

        <property>
                <name>http.content.limit</name>
                <value>2097152</value>
        </property>

This raises the file size fetch limit from its default to 2 megabytes (the value is given in bytes: 2 x 1024 x 1024 = 2097152). If your documents are larger, as PDFs often are, increase the number accordingly.
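For a larger limit, the value can be worked out the same way. The sketch below assumes a 10 megabyte limit is wanted, which is an arbitrary figure chosen for illustration: 10 x 1024 x 1024 = 10485760 bytes.

```xml
        <property>
                <name>http.content.limit</name>
                <!-- 10 MB expressed in bytes: 10 * 1024 * 1024 (example figure, adjust to your documents) -->
                <value>10485760</value>
        </property>
```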

NonDefaultIntranetCrawlingOptions (last edited 2011-07-27 21:45:45 by LewisJohnMcgibbney)