Options for intranet crawling that are not enabled by default

If you plan to crawl your local network intranet, here are some options that are not enabled by default and that you might want to add to your conf/nutch-site.xml configuration file.

Enable additional parser plugins


Extending the plugin.includes property in conf/nutch-site.xml enables the parser plugins for text, HTML, JavaScript, PDF, Excel, PowerPoint, Word, RSS and zip. Additional parsers are listed in conf/parse-plugins.xml. If you want to parse other document types and they are listed in the parse-plugins file, just add the corresponding plugins to the list.
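The original snippet for this step is missing from the page, so the following is only a sketch of what such a property might look like. The parse-* plugin names and the non-parser plugins in the value are assumptions based on an older Nutch release; they vary between versions, so check them against the plugins directory and conf/nutch-default.xml of your installation before using them.

```xml
<!-- Sketch only: the plugin names below are assumptions and differ between Nutch versions. -->
<property>
  <name>plugin.includes</name>
  <!-- The parse-(...) group is the part that enables the extra parsers;
       the remaining entries should match your existing plugin.includes value. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword|rss|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Because a plugin.includes value in nutch-site.xml replaces the default rather than adding to it, the value has to list every plugin you need, not only the parsers.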

Increase the file size fetch limit

                <property>
                  <name>http.content.limit</name>
                  <value>2097152</value>
                </property>

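This raises the fetch limit from the default of 65536 bytes (64 KB) to 2097152 bytes (2 MB). Content longer than the limit is truncated, which can cause parsing to fail for binary formats. If your documents are larger (PDFs often are), increase the number accordingly.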
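The value is given in bytes, so a larger limit is just the size in megabytes multiplied by 1024 × 1024. As an illustration, assuming your largest documents are about 10 MB (a hypothetical figure, not one from this page), the value would be 10 × 1024 × 1024 = 10485760:

```xml
<!-- Sketch only: 10 MB is an assumed example size; pick a value that fits your documents. -->
<property>
  <name>http.content.limit</name>
  <!-- 10 * 1024 * 1024 bytes -->
  <value>10485760</value>
</property>
```

A larger limit means more bandwidth and memory used per fetched document, so it is usually better to size it to your largest expected file than to pick an arbitrarily high number.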

