Clustering Plugin

Plugin name: Online Search Results Clustering using Carrot2 components

Plugin version: 1.0.3
 

Plugin Info

Installation guide

Note that the user interface in Nutch's default web application is very limited, and you will most likely need something more application-specific. See http://www.carrot2.org or http://www.carrot-search.com for inspiration.

Configuration guide

Libraries in this release are precompiled with stemming and stop words for the various languages present in the Carrot2 codebase. You should define the default language and the supported languages in the Nutch configuration file (nutch-site.xml). If nothing is specified in the Nutch configuration, English is used by default. The following properties can be added to nutch-site.xml:

<!-- Carrot2 Clustering plugin configuration -->

<property>
  <name>extension.clustering.carrot2.defaultLanguage</name>
  <value>en</value>
  <description>Two-letter ISO code of the language. 
  http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt</description>
</property>

<property>
  <name>extension.clustering.carrot2.languages</name>
  <value>en,nl,da,fi,fr,de,it,no,pl,pt,ru,es,sv,tr,ro,hu</value>
  <description>All languages to be used by the clustering plugin.
  This list includes all currently supported languages, although not all of
  them will instantiate successfully (support for Polish, for instance,
  requires additional libraries). Adjust to your needs; fewer languages take
  less memory.

  If you use the language recognizer plugin, then each hit will come with its
  own ISO language code. All hits with no explicit language take the default
  language specified in the "extension.clustering.carrot2.defaultLanguage"
  property.
  </description>
</property>
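
As a sketch of the "adjust to your needs" advice above, a site that only expects English and German content might trim the language list as shown here. The choice of en and de is an assumption made purely for illustration, and the enclosing configuration element is the usual root of nutch-site.xml; both property names are the ones described above.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Illustrative values only: restrict clustering to English and German -->
  <property>
    <name>extension.clustering.carrot2.defaultLanguage</name>
    <!-- Hits without an explicit language code are treated as English -->
    <value>en</value>
  </property>

  <property>
    <name>extension.clustering.carrot2.languages</name>
    <!-- Fewer languages take less memory; the default language stays in the list -->
    <value>en,de</value>
  </property>
</configuration>
```

Keeping the default language in the languages list is a cautious choice here, so that hits falling back to the default still have a matching language available.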

Using other Carrot2 clustering algorithms

To limit the size of the clustering plugin, the default implementation ships with only the Lingo algorithm, one of several alternatives available in the Carrot2 project. This section describes how to replace the default algorithm with a different one.

First, prepare the following:

Now you are ready to install a different clustering algorithm. The instructions below show how to run STC (Suffix Tree Clustering) instead of Lingo on a Jetty server (version 6.1.5). We will use a binary release of the DCS (the Carrot2 Document Clustering Server) as the source of the required Carrot2 JARs.

<local-process id="stc-en">
  <name>STC (+English)</name>
  <description>Suffix Tree Clustering Algorithm</description>

  <input  component-key="input-demo-webapp" />

  <filter component-key="filter-language-detection-en" />
  <filter component-key="filter-tokenizer" />
  <filter component-key="filter-case-normalizer" />
  <filter component-key="filter-stc" />

  <output component-key="output-demo-webapp" />
</local-process>

<local-process id="stc-en">
  <name>STC (+English)</name>
  <description>Suffix Tree Clustering Algorithm</description>

  <input  component-key="input-nutch" />

  <filter component-key="filter-language-detection-en" />
  <filter component-key="filter-tokenizer" />
  <filter component-key="filter-case-normalizer" />
  <filter component-key="filter-stc" />

  <output component-key="output-array" />
</local-process>

<property>
  <name>extension.clustering.carrot2.process-resource</name>
  <value>/alg-stc-en.xml</value>
</property>