<!> Solr1.4

NOTE: This code is marked as experimental, and its APIs and responses are subject to change in future releases. See https://issues.apache.org/jira/browse/SOLR-769 for discussion of the development of this feature.

Introduction

This component can cluster both search results and documents. In case you're wondering what clustering is good for, think of it as a quick way to summarize a large set of results or documents, or as a way to group similar results or documents together.

See http://en.wikipedia.org/wiki/Data_clustering for more background, as well as links to further reading.

Clustering Component

The clustering component uses a pluggable design, so any clustering engine can be plugged in.

The ClusteringComponent is responsible for taking in the request, identifying the clustering engine to be used (a SolrClusteringEngine implementation) and then delegating the work to that engine. Once the engine is done, the results are added to the response.
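The delegation described above can be illustrated with a small conceptual sketch. It is written in Python purely for illustration and is not Solr's implementation (the real component and engines are Java classes). The class names ClusteringEngine, FirstWordEngine and ClusteringComponentSketch are hypothetical; only the request parameter names clustering and clustering.engine come from this page.

```python
class ClusteringEngine:
    """Hypothetical stand-in for a SolrClusteringEngine implementation."""

    def cluster(self, docs):
        raise NotImplementedError


class FirstWordEngine(ClusteringEngine):
    """Toy engine (an assumption for this sketch): groups docs by first word."""

    def cluster(self, docs):
        groups = {}
        for doc_id, text in docs:
            words = text.split()
            label = words[0].lower() if words else ""
            groups.setdefault(label, []).append(doc_id)
        return [{"labels": [label], "docs": ids} for label, ids in groups.items()]


class ClusteringComponentSketch:
    """Takes the request, picks the engine by name, delegates, adds results."""

    def __init__(self, engines):
        self.engines = engines  # engine name -> engine instance

    def process(self, params, docs, response):
        if params.get("clustering") != "true":
            return response  # clustering not requested, response untouched
        name = params.get("clustering.engine", "default")
        engine = self.engines[name]
        response["clusters"] = engine.cluster(docs)
        return response


docs = [("1", "Hard drive 500GB"), ("2", "hard disk"), ("3", "DDR memory")]
component = ClusteringComponentSketch({"default": FirstWordEngine()})
response = component.process({"clustering": "true"}, docs, {})
print(response["clusters"])
```

Swapping in a different engine only requires registering another object under a new name and passing that name in clustering.engine, which is the point of the pluggable design.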

The ClusteringComponent currently does not support distributed processing.

Installation

The ClusteringComponent is in the contrib area of Solr. Due to some dependencies on LGPL libraries for the Carrot2 implementation, we cannot package a complete binary solution (with all the dependencies). To get the Carrot2 solution, you will need to download these libraries. To do this, on the command line in the contrib/clustering directory, run ant get-libraries. This will create a downloads directory under the lib directory for the downloaded jars.

Quick Start

Once you have downloaded the library dependencies, you can run the example using the following commands:

$ cd example
$ java -Dsolr.clustering.enabled=true -jar start.jar

This runs the main Solr example against the same index, but enables the clustering component and a SearchHandler configured to use it.

In a different window, add some docs using the post tool in the exampledocs directory (if you haven't already).

$ cd example/exampledocs
$ ./post.sh *.xml

Now try a query using the handler configured for clustering (it is configured with clustering=true as a default parameter):

http://localhost:8983/solr/clustering?q=*:*&rows=10

This should yield results that include cluster information at the bottom of the response, like:

<arr name="clusters">
 <lst>
  <arr name="labels">
        <str>DDR</str>
  </arr>
  <arr name="docs">
        <str>TWINX2048-3200PRO</str>
        <str>VS1GB400C3</str>
        <str>VDBDB1A16</str>
  </arr>
 </lst>
 <lst>
  <arr name="labels">
        <str>Car Power Adapter</str>
  </arr>
  <arr name="docs">
        <str>F8V7067-APL-KIT</str>
        <str>IW-02</str>
  </arr>
 </lst>
 <lst>
  <arr name="labels">
        <str>Hard Drive</str>
  </arr>
  <arr name="docs">
        <str>SP2514N</str>
        <str>6H500F0</str>
  </arr>
 </lst>
 <lst>
[...]

Clusters produced by Carrot2 group the results into different product categories: DDR (memory), Car Power Adapter, Display, Hard Drive. Notice that, depending on the quality of input documents, some clusters may not make much sense.
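If you want to consume the cluster section programmatically, a minimal sketch using only the Python standard library might look like the following. The SAMPLE text repeats the three clusters shown above; the enclosing response element and the closing tags are filled in here as an assumption so that the snippet parses on its own, and parse_clusters is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster section shown above, wrapped in an assumed
# <response> root so that it is well-formed XML.
SAMPLE = """<response>
<arr name="clusters">
 <lst>
  <arr name="labels"><str>DDR</str></arr>
  <arr name="docs">
   <str>TWINX2048-3200PRO</str><str>VS1GB400C3</str><str>VDBDB1A16</str>
  </arr>
 </lst>
 <lst>
  <arr name="labels"><str>Car Power Adapter</str></arr>
  <arr name="docs"><str>F8V7067-APL-KIT</str><str>IW-02</str></arr>
 </lst>
 <lst>
  <arr name="labels"><str>Hard Drive</str></arr>
  <arr name="docs"><str>SP2514N</str><str>6H500F0</str></arr>
 </lst>
</arr>
</response>"""


def parse_clusters(xml_text):
    """Return a list of (labels, doc_ids) tuples, one per cluster."""
    root = ET.fromstring(xml_text)
    clusters = []
    for lst in root.findall(".//arr[@name='clusters']/lst"):
        labels = [s.text for s in lst.findall("arr[@name='labels']/str")]
        doc_ids = [s.text for s in lst.findall("arr[@name='docs']/str")]
        clusters.append((labels, doc_ids))
    return clusters


for labels, doc_ids in parse_clusters(SAMPLE):
    print(", ".join(labels), "->", doc_ids)
```

Since the response format is marked experimental, treat the element names used here as a snapshot of the output above rather than a stable contract.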

Configuration

The ClusteringComponent gets added just like any other SearchComponent. Just declare it in the solrconfig.xml, as in:

<searchComponent class="org.apache.solr.handler.clustering.ClusteringComponent" name="clustering">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
  </lst>
</searchComponent>

Search Results Clustering

Carrot2 Clustering

Carrot2 is a scalable, BSD-licensed search results clustering engine. It can cluster many different types of search results, including those from Yahoo!, Google, etc. Our implementation, naturally, clusters Solr results.

Carrot2 is best suited for clustering small-to-medium collections of short documents. While Carrot2 may work for longer documents, processing times may be too long to meet on-line clustering requirements.

See http://project.carrot2.org

The configuration (solrconfig.xml) looks like:

<searchComponent class="org.apache.solr.handler.clustering.ClusteringComponent" name="clustering">
  <!-- Declare an engine -->
  <lst name="engine">
    <!-- The name, only one can be named "default" -->
    <str name="name">default</str>
    <!-- 
         Class name of Carrot2 clustering algorithm. Currently available algorithms are:
         
         * org.carrot2.clustering.lingo.LingoClusteringAlgorithm
         * org.carrot2.clustering.stc.STCClusteringAlgorithm
         
         See http://project.carrot2.org/algorithms.html for the algorithm's characteristics.
      -->
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    <!-- 
         Overriding values for Carrot2 default algorithm attributes. For a description
         of all available attributes, see: http://download.carrot2.org/stable/manual/#chapter.components.
         Use attribute key as name attribute of str elements below. These can be further
         overridden for individual requests by specifying attribute key as request
         parameter name and attribute value as parameter value.
      -->
    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
  </lst>
</searchComponent>

And the standard request handler looks like:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <!-- 
       <int name="rows">10</int>
       <str name="fl">*</str>
       <str name="version">2.1</str>
        -->
       <!--<bool name="clustering">true</bool>-->
       <str name="clustering.engine">default</str>
       <bool name="clustering.results">true</bool>
       <!-- The title field -->
       <str name="carrot.title">name</str>
       <!-- The field to cluster on -->
       <str name="carrot.snippet">features</str>
       <str name="carrot.url">id</str>
       <!-- produce summaries -->
       <bool name="carrot.produceSummary">true</bool>
       <!-- the maximum number of labels per cluster -->
       <!--<int name="carrot.numDescriptions">5</int>-->
       <!-- produce sub clusters -->
       <bool name="carrot.outputSubClusters">false</bool>

     </lst>
    <arr name="last-components">
      <str>clustering</str>
    </arr>
  </requestHandler>

The thing to note here is the mapping of Solr fields (name, features, id) to the title, snippet and url inputs that Carrot2 expects. Clustering takes into account the text of the title and snippet.
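The handler defaults and the Carrot2 attribute values can also be supplied per request, as the configuration comments above mention. The following sketch builds such a request URL with the Python standard library. The helper name clustering_request_url is hypothetical, and the base URL reuses the /solr/clustering handler from the Quick Start as an assumption about your setup; the parameter names are the ones shown in the configuration above.

```python
from urllib.parse import urlencode


def clustering_request_url(base, query, overrides=None):
    """Build a clustering request URL; overrides replace or extend the defaults."""
    params = {
        "q": query,
        "clustering": "true",
        "clustering.engine": "default",
        # Mapping of Solr fields to the Carrot2 title, snippet and url inputs.
        "carrot.title": "name",
        "carrot.snippet": "features",
        "carrot.url": "id",
    }
    params.update(overrides or {})
    # Keep "*" and ":" unescaped so that "*:*" stays readable in the URL.
    return base + "?" + urlencode(params, safe="*:")


url = clustering_request_url(
    "http://localhost:8983/solr/clustering",
    "*:*",
    # Per-request override of a Carrot2 attribute, keyed by attribute name.
    {"LingoClusteringAlgorithm.desiredClusterCountBase": 10},
)
print(url)
```

Passing a different field name for carrot.snippet in the overrides is a quick way to compare which field gives more meaningful clusters for your data.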

Tuning Carrot2 clustering

The easiest way to tune Carrot2 clustering for your specific data is to use a dedicated Carrot2 tool called Document Clustering Workbench.

  1. Download Carrot2 Document Clustering Workbench for your platform.

  2. Attach your Solr instance as a document source in the Workbench.

  3. Fine tune stop words, stop labels and possibly other attributes of the clustering algorithms to suit your needs.

  4. To transfer the modified stopwords.* and stoplabels.* files to your Solr instance, simply make the modified files accessible in the classpath. If you're using the Solr example scripts, try putting the files in the example/resources folder (Jetty starter from start.jar adds all files from that folder to the classpath). Alternatively, you can overwrite the corresponding stopwords.* and stoplabels.* files directly in carrot2-mini-*.jar.

Document Clustering

<!> THIS IS NOT FULLY IMPLEMENTED YET.

The Document Clustering implementation is designed to cluster whole documents across a collection. This can be done as an offline task. Once the clustering is done, the clusters can be retrieved.

Document Clustering is handled by using an implementation of the DocumentClusteringEngine. To invoke one, pass in the engine name, just as in the search results example, and also pass in the clustering.collection parameter (i.e. &clustering.collection=true). While this isn't fully worked out yet, it is likely that implementations will spawn a thread (or use a thread pool) that will perform the clustering asynchronously, returning some sort of identifier by which the clusters can be retrieved when done. Subsequent calls that use the identifier will then either return the clusters or return a percent complete.
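Because the design is not settled, the following is only a conceptual sketch of the asynchronous pattern described above, written in Python for illustration rather than as Solr code. All names (ClusteringTasks, submit, poll, wait) are hypothetical. For simplicity it reports only "running" or "done" instead of a percent complete, which would need cooperation from the engine.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class ClusteringTasks:
    """Runs clustering jobs on a thread pool and tracks them by identifier."""

    def __init__(self, workers=2):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._futures = {}  # task id -> Future

    def submit(self, cluster_fn, docs):
        """Start clustering in the background and return an identifier."""
        task_id = uuid.uuid4().hex
        self._futures[task_id] = self._pool.submit(cluster_fn, docs)
        return task_id

    def poll(self, task_id):
        """Return the clusters if finished, otherwise the current status."""
        future = self._futures.get(task_id)
        if future is None:
            return {"status": "unknown"}
        if future.done():
            return {"status": "done", "clusters": future.result()}
        return {"status": "running"}

    def wait(self, task_id):
        """Block until the task finishes (used here to keep the demo simple)."""
        self._futures[task_id].result()


def toy_cluster(docs):
    # Placeholder clustering function: puts every document in one cluster.
    return [{"labels": ["all"], "docs": list(docs)}]


tasks = ClusteringTasks()
task_id = tasks.submit(toy_cluster, ["doc1", "doc2"])
tasks.wait(task_id)
print(tasks.poll(task_id))
```

A real implementation would also need to decide how long finished results are kept and how failures are reported, neither of which this sketch addresses.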

<!> TODO <!> We likely also need a way of returning the status of all clustering tasks, that is if we support more than one task at a time.

See also Mahout: http://lucene.apache.org/mahout, which has several clustering algorithms implemented.

ClusteringComponent (last edited 2009-10-25 04:39:29 by HossMan)