Options for intranet crawling that are not enabled by default

If you plan to crawl your local network intranet, here are some options that are not enabled by default and that you might want to add to your conf/nutch-site.xml configuration file.

Enable additional parser plugins

        <property>
                <name>plugin.includes</name>
                <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword|rss|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        </property>

This enables the parser plugins for text, HTML, JavaScript, PDF, Excel, PowerPoint, Word, RSS and zip. Additional parsers are listed in conf/parse-plugins.xml. If you need to parse other document types and they are listed in the parse-plugins file, just add them to the list.
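
As a sketch of adding one more parser, the property below extends the same list with parse-swf. The choice of parse-swf is only an illustration: check that the plugin is listed in your conf/parse-plugins.xml and is present in your Nutch installation before relying on it.

        <property>
                <name>plugin.includes</name>
                <!-- Same list as above with "swf" appended to the parse-(...) group. parse-swf is an assumed example. -->
                <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword|rss|zip|swf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
        </property>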

Increase the file size fetch limit

        <property>
                <name>http.content.limit</name>
                <value>2097152</value>
        </property>

This raises the file size fetch limit to 2 megabytes (2097152 bytes). If your documents are larger, as PDFs often are, increase the value accordingly.
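
The value is given in bytes, so 2 megabytes is 2 x 1024 x 1024 = 2097152. As a sketch, a limit of 10 megabytes (an arbitrary figure chosen for illustration, not a recommendation) would be 10 x 1024 x 1024 = 10485760:

        <property>
                <name>http.content.limit</name>
                <!-- 10 x 1024 x 1024 bytes. The 10 megabyte figure is an assumed example. -->
                <value>10485760</value>
        </property>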