Clustering Component

This search component can cluster both search results and documents. In case you're wondering what clustering is good for, think of it as a quick way to summarize a large set of results or documents, or as a way to group together semantically related results or documents. See http://en.wikipedia.org/wiki/Data_clustering for more background, and the Carrot2 demo for a live example. More information about Solr-Carrot2 integration strategies, including per-Solr-version examples, is available at http://carrot2.github.io/solr-integration-strategies/


Note: This code is marked as experimental; the APIs and responses are subject to change in future releases. See the clustering component issues in JIRA for discussions about the development of this feature.

Overview

The ClusteringComponent implements a pluggable approach that allows for the implementation of any clustering engine. The component is responsible for taking in the request, identifying the clustering engine to be used (a SolrClusteringEngine implementation) and then delegating the work to that engine. Once the engine is done, the results are added to the response.

The search results clustering implementation (CarrotClusteringEngine) is based on the Carrot2 framework.

<!> Solr3.1 The ClusteringComponent supports distributed processing, except for the carrot.produceSummary parameter (see SOLR-2282 for details of this restriction).

Quick Start

  1. <!> Solr1.4 Download library dependencies

  2. Run the example Solr configuration using the following commands:
    $ cd example
    $ java -Dsolr.clustering.enabled=true -jar start.jar

    The command uses the same configuration and index as the main Solr example, but it additionally enables the ClusteringComponent and a dedicated SearchHandler configured to use that component.

  3. In a different window, add some docs using the post tool in the exampledocs directory (if you haven't already):

    $ cd example/exampledocs
    $ ./post.sh *.xml

  4. Try a query using the clustering handler:

    http://localhost:8983/solr/clustering?q=*:*&rows=10

    This should yield results that include cluster information at the bottom of the response:
    <arr name="clusters">
      <lst>
        <arr name="labels">
          <str>iPod</str>
        </arr>
        <double name="score">3.1654221261111397</double>
        <arr name="docs">
          <str>F8V7067-APL-KIT</str>
          <str>IW-02</str>
          <str>MA147LL/A</str>
        </arr>
      </lst>
      <lst>
        <arr name="labels">
          <str>Car Power Adapter</str>
        </arr>
        [...]
      </lst>
      <lst>
        <arr name="labels">
          <str>Hard Drive</str>
        </arr>
        [...]
      </lst>
      <lst>
        <arr name="labels">
          <str>USB 2.0</str>
        </arr>
        [...]
      </lst>
      <lst>
        <arr name="labels">
          <str>Other Topics</str>
        </arr>
        <double name="score">0.0</double>
        <bool name="other-topics">true</bool>
        <arr name="docs">
          <str>GB18030TEST</str>
          <str>adata</str>
          [...]
        </arr>
      </lst>
    </arr>

    Clusters produced by Carrot2 group the results into different product categories: iPod, Car Power Adapter, Hard Drive, USB 2.0. A few things to notice:

    • Each cluster has a score that indicates the "goodness" of the cluster. The score is algorithm-specific and is meaningful only in relation to the scores of other clusters in the same set. In other words, if cluster A has a higher score than cluster B, the algorithm "thinks" cluster A is better, e.g. it has a better label and/or a more coherent document set.

    • Each cluster has an array of identifiers of documents contained in it.
    • Depending on the quality of input documents, some clusters may not make much sense.
    • Some documents may be left in the Other Topics group. Such a group is marked with the other-topics property set to true.

Installation

Downloading dependencies

<!> Solr3.1 <!> Solr4.0 Carrot2 is fully integrated into Solr and does not require special downloads.

<!> Solr1.4 Because the Carrot2 implementation depends on some LGPL libraries, we cannot package a complete binary solution (with all the dependencies). To get the Carrot2 dependencies, run ant get-libraries on the command line in the contrib/clustering directory. This will create a downloads directory under the lib directory for the downloaded JARs.

Installation on Tomcat

  1. Copy all the JAR files from contrib/clustering ( <!> Solr1.4 and contrib/clustering/downloads, see above) into ${solr.home}/lib.

  2. Copy the dist/apache-solr-clustering-*.jar file into your ${solr.home}/lib directory. This is needed to run the clustering component.

  3. Enable the clustering component by adding -Dsolr.clustering.enabled=true to $CATALINA_OPTS.

  4. Restart Tomcat and verify that the log file is error free.

The default ${solr.home}/conf/solrconfig.xml contains a preconfigured ClusteringComponent, but you may need to edit the parameters, such as carrot.title and carrot.snippet, to match your schema.

Configuration

  1. Add ClusteringComponent to your solrconfig.xml, just like any other SearchComponent:

    <searchComponent class="org.apache.solr.handler.clustering.ClusteringComponent" name="clustering">
      <lst name="engine">
        <str name="name">default</str>
        <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    
        <!-- Engine-specific parameters -->
        <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
      </lst>
    </searchComponent>

    Within the clustering search component declaration you can configure several engines with different clustering algorithms and algorithm-specific parameters. Each of these engines can be selected at query time using the clustering.engine parameter.

  2. Reference the clustering component in your request handler:
    <requestHandler name="standard" class="solr.SearchHandler" default="true">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
    
        <bool name="clustering">true</bool>
        <str name="clustering.engine">default</str>
        <bool name="clustering.results">true</bool>
    
        <!-- Fields to cluster on -->
        <str name="carrot.title">name</str>
        <str name="carrot.snippet">features</str>
      </lst>
      <arr name="last-components">
        <str>clustering</str>
      </arr>
    </requestHandler>

    The request handler configuration needs to set a number of common parameters, such as the type of clustering (clustering.results), and a number of engine-specific parameters, e.g. fields to cluster on (carrot.title, carrot.snippet).

<!> Solr4.5 Starting with Solr 4.5, there is no need to declare a "default" engine. The first declared engine becomes the default one, and it can be overridden with the clustering.engine request parameter.
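
As an illustration of declaring several engines, here is a minimal sketch of a component with two engines. The engine name "stc" is a hypothetical choice, and the STC class name is an assumption not taken from this page; check it against the Carrot2 version you use.

```xml
<searchComponent class="org.apache.solr.handler.clustering.ClusteringComponent" name="clustering">
  <!-- First engine: Lingo, as in the configuration shown earlier -->
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
  </lst>
  <!-- Second engine: the name "stc" and the class name are assumptions -->
  <lst name="engine">
    <str name="name">stc</str>
    <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
  </lst>
</searchComponent>
```

With such a declaration, a request like http://localhost:8983/solr/clustering?q=*:*&rows=10&clustering.engine=stc would select the second engine, while requests without clustering.engine would fall back to the default one.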

Parameters

This section lists common parameters that apply to both search results and document clustering. Please also see the search results clustering parameters.

clustering

When true, clustering is enabled.

clustering.engine

The clustering engine to use. If not specified, the engine named default will be used (or, starting with <!> Solr4.5, the first declared engine).

clustering.results

When true, the component will perform clustering of search results.

clustering.collection

When true, the component will perform clustering of the whole document index.

Search Results Clustering

Search results clustering is a technique of post-processing of search results that aims to group them into thematically related categories. For example, when clustering web search results for the 'apache' query, one will likely see groups related to the Apache Software Foundation, Apache Web Server, but also groups about Apache County, Apache Indians or the Attack Helicopter.

Solr search results clustering is based on the Carrot2 real-time document clustering engine. Carrot2 offers two specialized search results clustering algorithms that emphasize the quality of cluster labels.

Input for clustering

Carrot2 is best suited for clustering small-to-medium collections of short documents. While it may work for longer documents, processing times may be too long to meet on-line clustering requirements.

Carrot2 assumes that each search result provided on input can consist of three types of fields: document title, document content/snippet and URL. The document title is required; content/snippet and URL are optional. The reason to distinguish between the document's title and content is that Carrot2 can give more weight to the titles, which increases the quality of clusters and labels. Carrot2 needs at least about 20 search results to generate meaningful clusters. For more information, please see the desired qualities of the documents for clustering in the Carrot2 manual.

Note: Carrot2 can only perform clustering on stored fields. The reason for this is that Carrot2 aims to create meaningful cluster labels by using phrases (sequences of words) taken directly from the documents' text. The easiest way of providing input for such a process is feeding Carrot2 with raw (stored) document content. As a result, character and token filters are currently ignored. There are plans to implement support for character and selected token filters during clustering: https://issues.apache.org/jira/browse/SOLR-2917.

Parameters

carrot.algorithm

The fully qualified class name of the Carrot2 clustering algorithm to use. Currently, the following algorithms are available:

Please see http://project.carrot2.org/algorithms.html for the characteristics of these algorithms, and the clustering algorithm choice guidance in the Carrot2 manual.

Note: This parameter must be specified in the clustering component configuration (in the engine section) and cannot be overridden at query time.

carrot.title

The Solr field ( <!> Solr3.6: comma- or space-separated list of fields) that the clustering engine should treat as the hit document's title. It must be a stored field (or fields).

Carrot2 will give more weight to the content of this field compared to carrot.snippet. For best results, the field should contain concise, noise-free content.

If your schema does not distinguish the document's title and content, you can provide your content in carrot.title and leave carrot.snippet empty.

carrot.snippet

The Solr field ( <!> Solr3.6: comma- or space-separated list of fields) that the clustering engine should treat as the hit document's content. It must be a stored field (or fields).

For best results, the snippet should contain a summary of the document, e.g. an abstract or the first content paragraph. Very long snippet fields will significantly increase the clustering time, unless carrot.produceSummary is enabled.
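
As a sketch of the multi-field form available from Solr 3.6, the request handler defaults below map two stored fields to the title and two to the content. The field names manu and description are hypothetical and stand in for whatever stored fields your schema defines.

```xml
<lst name="defaults">
  <!-- Two stored fields treated as the title; "manu" is a hypothetical field name -->
  <str name="carrot.title">name,manu</str>
  <!-- Two stored fields treated as the content; "description" is a hypothetical field name -->
  <str name="carrot.snippet">features,description</str>
</lst>
```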

carrot.url

The Solr field that the clustering engine should treat as the hit document's target URL. Must be a stored field. This mapping is optional.

The URL field is currently not used by the Carrot2 algorithms.

carrot.lang

<!> Solr3.6

The Solr field that the clustering engine should treat as the search result's ISO 639 two-letter language code. In the case of multilingual result sets, providing the language code for each result will let the clustering engine choose the lexical resources (stemmer, stop words) appropriate for the language of each result and therefore significantly improve the quality of cluster labels. If all results are in the same language, the language can be set globally using the Carrot2 http://doc.carrot2.org/#section.attribute.lingo.MultilingualClustering.defaultLanguage attribute.

The carrot.lcmap parameter can be used to map arbitrary strings to ISO 639 codes.

carrot.lcmap

<!> Solr3.6

Mapping of arbitrary strings into ISO 639 two-letter codes used by carrot.lang. Syntax of this parameter is the same as langid.map.lcmap.
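
The following sketch shows how the two language parameters might be combined in the request handler defaults. The field name language is hypothetical, and the mapping values are illustrative; the space-separated from:to pair syntax is assumed here on the basis of the langid.map.lcmap convention mentioned above.

```xml
<lst name="defaults">
  <!-- "language" is a hypothetical stored field holding each document's language -->
  <str name="carrot.lang">language</str>
  <!-- Illustrative mapping of custom values to ISO 639 codes, assuming from:to pairs -->
  <str name="carrot.lcmap">english:en german:de polish:pl</str>
</lst>
```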

carrot.produceSummary

When true, the carrot.snippet field (or, if no snippet field is configured, the carrot.title field) will be highlighted and the highlighted text will be used for clustering. Highlighting is recommended when the snippet field contains a lot of content. Highlighting can also increase the quality of clustering because the clustered content will get an additional query-specific context.

<!> Solr3.6 The number of snippets generated for clustering is determined by the highlighter's hl.snippets parameter and can be further overridden by carrot.summarySnippets.

carrot.fragSize

<!> Solr3.1

The fragment size to use for highlighting. Meaningful only when carrot.produceSummary is true. If not specified, the default highlighting fragment size (hl.fragsize) will be used. If that is not specified either, 100 is used.

<!> In Solr versions 3.1.x, 3.2.x and 3.3.0 this parameter is incorrectly named carrot.fragzise. Solr versions 3.4.x and later use the correct parameter name carrot.fragSize.

carrot.summarySnippets

<!> Solr3.6

The number of summary snippets to generate for clustering. Meaningful only when carrot.produceSummary is true. If not specified, the default highlighting snippet count (hl.snippets) will be used. If that is not specified either, 1 is used.
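
The three summary-related parameters can be set together in the request handler defaults. The sketch below uses illustrative values only; suitable numbers depend on the length of your snippet fields.

```xml
<lst name="defaults">
  <!-- Cluster highlighted fragments rather than the full snippet field -->
  <bool name="carrot.produceSummary">true</bool>
  <!-- Fragment size for highlighting; named carrot.fragzise in Solr 3.1.x to 3.3.0 -->
  <int name="carrot.fragSize">200</int>
  <!-- Number of fragments per document; requires Solr 3.6 or later -->
  <int name="carrot.summarySnippets">2</int>
</lst>
```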

carrot.numDescriptions

The maximum number of cluster labels to produce.

carrot.outputSubClusters

When true, output subclusters.

Currently, no Carrot2 algorithm can generate hierarchical clusters.

carrot.lexicalResourcesDir

<!> Solr3.2 <!> Solr4.0 <!> Solr4.5 (deprecated, use carrot.resourcesDir).

Specifies the directory from which Carrot2 should load its lexical resources, such as stop words and stop labels files. For more information on the syntax of these files, see the overview of lexical resources in the Carrot2 manual.

The lexical resources directory can be either absolute ( <!> Solr3.4) or relative to ${solr.home}/conf. The default is clustering/carrot2, relative to ${solr.home}/conf.

If a specific Carrot2 resource (e.g. stopwords.en) is present in the specified dir, it will completely override the corresponding default one that ships with Carrot2.

Note: Carrot2 caches its lexical resources by default. The cache can be flushed either by restarting Solr or by appending the &reload-resources=true parameter to the request URL. Please note that resource reloading significantly increases the clustering time, so it should not be used when running regular production queries.

carrot.resourcesDir

<!> Solr4.5

Specifies a directory with optional resources overriding Carrot2 defaults, much like carrot.lexicalResourcesDir. In addition, this folder may contain per-engine attribute XML files exported from the Carrot2 Workbench, each configuring one algorithm. An attribute file for an engine XYZ is expected to be named, by convention, XYZ-attributes.xml. See the default Solr example for an example configuration of STC, Lingo and bisecting k-means.
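
A minimal sketch of an engine declaration that points at a resources directory is shown below. It assumes that the parameter is set in the engine section and that a relative path is resolved against ${solr.home}/conf, as for carrot.lexicalResourcesDir; the engine name lingo is a hypothetical choice.

```xml
<lst name="engine">
  <str name="name">lingo</str>
  <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
  <!-- Directory with stop words, stop labels and, by the naming convention, lingo-attributes.xml -->
  <str name="carrot.resourcesDir">clustering/carrot2</str>
</lst>
```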

Carrot2-specific parameters

Parameters of a specific clustering algorithm, e.g. LingoClusteringAlgorithm.desiredClusterCountBase, can also be specified. A complete list of attributes for each clustering algorithm is available in the Carrot2 documentation.

You can specify clustering algorithm parameters both in solrconfig.xml and at request time, e.g.:

http://localhost:8983/solr/clustering?q=*:*&rows=10&LingoClusteringAlgorithm.desiredClusterCountBase=20

<!> Solr4.5 Starting with Solr 4.5, the preferred way of configuring clustering algorithms is to export an XML file with attributes from Carrot2 Workbench and place such a file in carrot.resourcesDir, named enginename-attributes.xml.

Performance impact

Enabling search results clustering can result in two broad categories of performance penalties:

  1. Increased cost of fetching a larger than usual number of results, e.g. 50 or 100.
  2. Additional computational cost of clustering performed on the retrieved results.

For simple queries, the clustering time will usually dominate the fetching time.

The performance impact of clustering can be lowered in several ways:

  1. Feed less content for clustering by:
    1. applying highlighting on long fields,

    2. performing clustering on document titles only.
  2. Use the STC clustering algorithm instead of Lingo. STC is much faster, but cluster labels may be worse than those from Lingo.
  3. Limit the number of results being clustered to e.g. 50. The lowest reasonable number is usually around 20.
  4. Tune the performance of Carrot2 algorithms.

On reasonably modern hardware (Core2 3GHz, X25-M), with 100 search results of about 600 characters each, the default Carrot2 clustering algorithm (Lingo) would add 100-250 ms to the query processing time. When clustering the same 100 results, but using only titles (about 60 characters each) and the STC algorithm, the clustering time drops to about 5-15 ms.
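
Several of these measures can be combined in one request handler. The sketch below is one possible arrangement, not a recommended configuration: the handler name /clustering-fast is hypothetical, and it assumes that an engine named "stc" using the STC algorithm has been declared in the clustering component.

```xml
<requestHandler name="/clustering-fast" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <bool name="clustering.results">true</bool>
    <!-- Assumes an engine named "stc" is declared in the clustering component -->
    <str name="clustering.engine">stc</str>
    <!-- Cluster on titles only; carrot.snippet is intentionally left unset -->
    <str name="carrot.title">name</str>
    <!-- Limit the number of results fetched and clustered -->
    <int name="rows">50</int>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
```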

Tuning Carrot2 clustering

The easiest way to tune Carrot2 clustering for your specific data is to use a dedicated Carrot2 tool called Document Clustering Workbench. This way, you don't even need to configure search results clustering in Solr because processing will happen inside the Document Clustering Workbench.

  1. Download Carrot2 Document Clustering Workbench for your platform.

  2. Attach your Solr instance as a document source in the Workbench.

  3. When you can see the search results from your Solr instance in the Workbench, you can proceed with:
    1. Tuning of stop words

    2. Tuning of stop labels

    3. Tuning of other attributes of the algorithms, e.g. to reduce the size of the Other Topics group or improve the clustering performance.

  4. To apply the modified stopwords.* and stoplabels.* files to your Solr instance:

    1. <!> Solr3.2 <!> Solr4.0: copy the modified files to the directory configured by carrot.lexicalResourcesDir, ${solr.home}/conf/clustering/carrot2 by default.

    2. <!> Solr1.4: make the modified files accessible in the classpath. If you're using the Solr example scripts, try putting the files in the example/resources folder (Jetty starter from start.jar adds all files from that folder to the classpath). Alternatively, you can overwrite the corresponding stopwords.* and stoplabels.* files directly in carrot2-mini-*.jar.

  5. To transfer the clustering algorithm parameters modified in the Workbench to Solr:
    1. Save the modified parameters in Carrot2 XML format from Workbench

    2. Use the following XSLT transform to convert them to entries ready for pasting into the clustering component or request handler configuration:
      <?xml version="1.0" encoding="UTF-8" ?>
      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:strip-space elements="*"/>
      
        <xsl:template match="/attribute-sets/attribute-set[@id = 'overridden-attributes']//attribute">
          <str name="{@key}"><xsl:value-of select="value/@value" /></str><xsl:text>
      </xsl:text>
        </xsl:template>
      
        <xsl:template match="label" />
      </xsl:stylesheet>
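
The match expressions in the stylesheet suggest the shape of the input it expects. The fragment below is a hypothetical input reconstructed from those XPath expressions, not a verbatim Workbench export; in particular, the value-set wrapper element is an assumption.

```xml
<attribute-sets>
  <attribute-set id="overridden-attributes">
    <value-set>
      <!-- Suppressed by the empty "label" template in the stylesheet -->
      <label>overridden-attributes</label>
      <attribute key="LingoClusteringAlgorithm.desiredClusterCountBase">
        <value value="20"/>
      </attribute>
    </value-set>
  </attribute-set>
</attribute-sets>
```

For an input of this shape, the transform should emit one entry per attribute, here <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>, which is the same form as the engine-specific parameter in the Configuration section.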

Document Clustering

<!> THIS IS NOT FULLY IMPLEMENTED YET.

The Document Clustering implementation is designed to cluster whole documents across a collection. This can be done as an offline task. Once the clustering is done, the clusters can be retrieved.

Document Clustering is handled by using an implementation of the DocumentClusteringEngine. To invoke one, pass in the engine name, just as in the search results example, and also pass in the clustering.collection parameter (i.e. &clustering.collection=true). While this isn't fully worked out yet, it is likely that implementations will spawn a thread (or use a thread pool) that will perform the clustering asynchronously, returning some sort of identifier by which the clusters can be retrieved when done. Subsequent calls that use the identifier will then either return the clusters or return a percent complete.

<!> TODO <!> We likely also need a way of returning the status of all clustering tasks, that is if we support more than one task at a time.

See also Mahout: http://lucene.apache.org/mahout, which has several clustering algorithms implemented.

ClusteringComponent (last edited 2013-09-12 11:43:24 by DawidWeiss)