Options for intranet crawling that are not enabled by default
The following options are not enabled by default. If you plan to crawl your local intranet, you may want to add them to your conf/nutch-site.xml configuration file.
Enable additional parser plugins
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword|rss|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
This enables the parser plugins for text, HTML, JavaScript, PDF, Excel, PowerPoint, Word, RSS and ZIP. Additional parsers are listed in conf/parse-plugins.xml. If you need to parse another document type and it appears in that file, add its plugin name to the parse-(...) group in the value above.
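As an illustration of adding one more type, the sketch below appends OpenOffice documents to the list. It assumes that your Nutch release ships a parse-oo plugin and that conf/parse-plugins.xml maps the OpenOffice MIME types to it; check your plugins directory before copying it.
<!-- Assumption: a parse-oo plugin exists in your plugins directory. -->
<!-- The only change from the value above is the extra "|oo" at the end of the parse-(...) group. -->
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword|rss|zip|oo)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Because the value is a regular expression matched against plugin names, each new parser is added as one more alternative inside the parentheses rather than as a separate property.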
Increase the file size fetch limit
<property>
<name>http.content.limit</name>
<value>2097152</value>
</property>
This raises the HTTP fetch size limit to 2 megabytes (the value is given in bytes: 2 × 1024 × 1024 = 2097152). Content beyond the limit is truncated, so if your documents are larger, as PDFs often are, increase the number accordingly.
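For example, to allow documents of up to 10 megabytes, the value would be 10 × 1024 × 1024 = 10485760 bytes. The 10 MB figure is only an illustrative assumption; pick a limit that fits the largest documents on your intranet.
<!-- Assumption: 10 MB is large enough for your documents. -->
<!-- 10485760 = 10 * 1024 * 1024 bytes. -->
<property>
<name>http.content.limit</name>
<value>10485760</value>
</property>
The description of this property in conf/nutch-default.xml states how negative values are treated. In releases where a negative value disables truncation, setting it to -1 removes the limit entirely, at the cost of fetching arbitrarily large files.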