Plugins provide a large part of the functionality of Nutch. For instance, the
HtmlParser is what gets used to parse the HTML documents that Nutch fetches.
AboutPlugins - General information on what plugins are and how they work.
WritingPluginExample-0.9 - Step-by-step example of how to write a plugin for the current development.
WritingPluginExample - A step-by-step example of how to write a plugin for the 0.7 branch. - updated by LucasBoullosa
Writing Plugins - by Stefan
Plugins that Come with Nutch (0.7)
In order to get Nutch to use any of these plugins, edit your conf/nutch-site.xml file and add the plugin's name to the plugin.includes property.
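As a rough sketch of such an override, assuming plugin.includes takes a regular expression that is matched against plugin names (as in the stock nutch-default.xml), a property like the following could be placed inside the existing root element of conf/nutch-site.xml. The particular selection of plugins is only illustrative; adjust it to the plugins you actually need.

```xml
<!-- Illustrative override of plugin.includes; the plugin selection is an example, not a recommendation. -->
<property>
  <name>plugin.includes</name>
  <!-- Alternatives are separated by "|"; parentheses group plugins sharing a prefix. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-(basic|more)|query-(basic|more|site|url)</value>
</property>
```

Here parse-pdf, index-more and query-more are enabled in addition to a basic set. Note that a value set in nutch-site.xml replaces the default rather than extending it, so any plugin you still want must remain in the expression.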
clustering-carrot2 - Online Search Results Clustering using Carrot2's components.
creativecommons - Support for crawling and searching Creative-Commons licensed content.
index-basic - Adds url, content and anchor fields to the index.
index-more - Adds date, content-length, contentType, primaryType and subType fields to the index.
languageidentifier - Adds a lang field to the index and allows you to query against it.
ontology - Helps refine queries based on OWL files.
parse-ext - A wrapper that invokes an external command to do the actual parsing.
parse-html - Parses HTML documents
parse-js - Parses JavaScript
parse-mp3 - Parses MP3s
parse-msword - Parses MS Word documents
parse-pdf - Parses PDFs
parse-rss - Parses RSS feeds
parse-rtf - Parses RTF files
parse-text - Parses text documents
protocol-file - Retrieves documents from the filesystem
protocol-ftp - Retrieves documents through FTP
protocol-http - Retrieves documents through HTTP
protocol-httpclient - Retrieves documents through HTTP and HTTPS
query-basic - Runs queries against content, url and anchor fields
query-more - Runs queries against date, content-length, contentType, primaryType and subType fields.
query-site - Runs queries against site field
query-url - Runs queries against url field.
urlfilter-prefix
urlfilter-regex
Additional Plugins in Dev Branch (0.8)
analysis-de
analysis-fr
lib-commons-httpclient
lib-http
lib-jakarta-poi
lib-log4j
lib-lucene-analyzers
lib-nekohtml
lib-parsems
parse-msexcel - Parses MS Excel documents
parse-mspowerpoint - Parses MS Powerpoint documents
parse-oo - Parses Open Office and Star Office documents (Extensions: ODT, OTT, ODH, ODM, ODS, OTS, ODP, OTP, SXW, STW, SXC, STC, SXI, STI)
parse-swf - Parses Flash SWF files
microformats-reltag - Adds rel-tag fields to the index and runs queries against them.
parse-zip
Plugins You Can Download
XMLParser Plugin (parse-xml: parses XML documents using XPath and namespaces)
index-extra - Adds user-configurable fields to the index.
protocol-smb - Allows Nutch to crawl MS Windows shared folders.
protocol-http11 - Adds support for HTTP 1.1, HTTPS, Basic, Digest and NTLM authentication. (NUTCH-557)