Options for intranet crawling that are not enabled by default
If you plan to crawl your local intranet, consider adding the following options, none of which are enabled by default, to your conf/nutch-site.xml configuration file.
Enable additional parser plugins
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword|rss|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Increase the file size fetch limit
<property>
  <name>http.content.limit</name>
  <value>2097152</value>
</property>
This raises the file size fetch limit to 2 megabytes (2097152 bytes = 2 × 1024 × 1024). The value is given in bytes, and content beyond the limit is truncated, so if your documents are larger (large PDFs, for example), increase the number accordingly.
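Both properties must sit inside the root configuration element of conf/nutch-site.xml. The following is a minimal sketch of a complete file combining the two settings above; it assumes the file contains no other overrides, so merge the properties into your existing file if it already has content.

```xml
<?xml version="1.0"?>
<configuration>

  <!-- Enable the additional parser plugins (keep the value on a single line) -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword|rss|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Fetch up to 2 MB per document: 2 * 1024 * 1024 = 2097152 bytes -->
  <property>
    <name>http.content.limit</name>
    <value>2097152</value>
  </property>

</configuration>
```

To allow larger documents, scale the byte count the same way; for example, 10 megabytes would be 10 × 1024 × 1024 = 10485760.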