Options for intranet crawling that are not enabled by default
Here are some options you might want to add to your conf/nutch-site.xml configuration file if you plan on crawling your local intranet. Several parser plugins are not enabled by default, even though they handle the kinds of documents commonly found on a typical enterprise intranet.
Enable additional parser plugins
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|zip|js|swf|feed)|index-(basic|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
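The plugin.includes value is a regular expression over plugin ids, so it can be handy to check which plugins a given value would actually switch on before starting a crawl. The following is a minimal Python sketch, not part of Nutch: it assumes the expression has to match the whole plugin id, and the helper name is_enabled and the sample ids are made up for illustration.

```python
import re

# Value of plugin.includes, copied from the property above.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika|zip|js|swf|feed)|"
    "index-(basic|more)|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def is_enabled(plugin_id: str, includes: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin id matches the includes expression.

    Assumption: the expression is matched against the complete plugin id,
    not just a substring of it.
    """
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    # Sample ids chosen for illustration only.
    for plugin_id in ("parse-tika", "parse-zip", "parse-pdf", "protocol-file"):
        state = "enabled" if is_enabled(plugin_id) else "not enabled"
        print(f"{plugin_id}: {state}")
```

With the value above, parse-tika and parse-zip match, while an id such as protocol-file does not, because only protocol-http is listed.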
Increase the file size fetch limit
<property>
  <name>http.content.limit</name>
  <value>2097152</value>
</property>
This raises the file size fetch limit to 2097152 bytes (2 megabytes); content longer than the limit is truncated. If your documents are larger, as PDFs often are, increase the value accordingly.
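The limit is given in bytes, so a larger value is easiest to derive from a size in megabytes (counted here as 1024 * 1024 bytes). The following is a small Python sketch of that arithmetic; the function names content_limit_bytes and content_limit_property are hypothetical helpers, not part of Nutch.

```python
BYTES_PER_MEGABYTE = 1024 * 1024


def content_limit_bytes(megabytes: int) -> int:
    """Convert a size in megabytes to the byte count used by http.content.limit."""
    if megabytes < 0:
        raise ValueError("megabytes must not be negative")
    return megabytes * BYTES_PER_MEGABYTE


def content_limit_property(megabytes: int) -> str:
    """Render a property element for conf/nutch-site.xml with the given limit."""
    return (
        "<property>\n"
        "  <name>http.content.limit</name>\n"
        f"  <value>{content_limit_bytes(megabytes)}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    # 2 megabytes gives the value used in the property above.
    print(content_limit_property(2))
    # A larger limit, for example for big PDF files.
    print(content_limit_property(10))
```

Keep in mind that a higher limit means more bandwidth and storage per fetched document, so it is worth choosing a value close to the largest files you actually expect.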