Differences between revisions 62 and 63
Revision 62 as of 2013-09-12 11:43:24
Size: 23839
Editor: DawidWeiss
Revision 63 as of 2015-08-24 11:56:51
Size: 177
Editor: DawidWeiss
= Clustering Component =

This [[SearchComponent|search component]] can cluster both [[#Search_Results_Clustering|search results]] and [[#Document_Clustering|documents]]. In case you're wondering what clustering is good for, think of it as a quick way to summarize a whole bunch of results/documents, or as a way to group together semantically related results/documents. See http://en.wikipedia.org/wiki/Data_clustering for more background, and the [[http://search.carrot2.org/stable/search?query=apache|Carrot2 demo]] for a live example. More information about Solr-Carrot^2^ integration strategies, including per-Solr-version examples, is available at [[http://carrot2.github.io/solr-integration-strategies/|http://carrot2.github.io/solr-integration-strategies/]].

Jump to: [[#Quick_Start|Quick Start]] | [[#Configuration|Configuration]] | [[#Parameters|Parameters]]

'''Note''': This code is marked as experimental; the APIs and responses are subject to change in future releases. See [[https://issues.apache.org/jira/browse/SOLR/component/12313050|clustering component issues in JIRA]] for discussions around the development of this feature.


= Overview =
The !ClusteringComponent implements a pluggable approach that allows for the implementation of any clustering engine. The component is responsible for taking in the request, identifying the clustering engine to be used (a !SolrClusteringEngine implementation) and then delegating the work to that engine. Once the engine is done, the results are added to the response.

The search results clustering implementation (!CarrotClusteringEngine) is based on the [[http://project.carrot2.org|Carrot2]] framework.

<!> [[Solr3.1]]
The !ClusteringComponent supports distributed processing, except for the [[#carrot.produceSummary|carrot.produceSummary]] parameter (see [[http://issues.apache.org/jira/browse/SOLR-2282|SOLR-2282]] for details on this restriction).

= Quick Start =

 1. <!> [[Solr1.4]] [[#Downloading_dependencies|Download library dependencies]]
 1. Run the example Solr configuration using the following commands:

$ cd example
$ java -Dsolr.clustering.enabled=true -jar start.jar

 The command uses the same configuration and index as the main Solr example, but it additionally enables the !ClusteringComponent and a dedicated !SearchHandler configured to use that component.

 1. In a different window, add some docs using the post tool in the {{{exampledocs}}} directory (if you haven't already):

$ cd example/exampledocs
$ ./post.sh *.xml

 1. Try a query using the {{{clustering}}} handler:


 This should yield results that include cluster information at the bottom of the response:

<arr name="clusters">
  <lst>
    <arr name="labels"><str>iPad</str></arr>
    <double name="score">3.1654221261111397</double>
    <arr name="docs">...</arr>
  </lst>
  <lst>
    <arr name="labels"><str>Car Power Adapter</str></arr>
    ...
  </lst>
  <lst>
    <arr name="labels"><str>Hard Drive</str></arr>
    ...
  </lst>
  <lst>
    <arr name="labels"><str>USB 2.0</str></arr>
    ...
  </lst>
  <lst>
    <arr name="labels"><str>Other Topics</str></arr>
    <double name="score">0.0</double>
    <bool name="other-topics">true</bool>
    <arr name="docs">...</arr>
  </lst>
</arr>

 Clusters produced by Carrot^2^ group the results into different product categories: iPad, Car Power Adapter, Hard Drive, USB 2.0. A few things to notice:
   * Each cluster has a `score` that indicates the "goodness" of the cluster. The score is algorithm-specific and is meaningful only in relation to the scores of other clusters in the same set. In other words, if cluster A has a higher score than cluster B, the algorithm "thinks" cluster A is better, e.g. has a better label and/or a more coherent document set.
   * Each cluster has an array of identifiers of documents contained in it.
   * Depending on the quality of input documents, some clusters may not make much sense.
   * Some documents may be left in the Other Topics group. Such a group is marked with the `other-topics` property set to `true`.

= Installation =

== Downloading dependencies ==

<!> [[Solr3.1]]
<!> [[Solr4.0]]
Carrot^2^ is fully integrated into Solr and does not require special downloads.

<!> [[Solr1.4]]
Due to some dependencies on LGPL libraries for the Carrot^2^ implementation, we cannot package a complete binary solution (with all the dependencies). To get the Carrot^2^ solution, on the command line in the {{{contrib/clustering}}} directory, run {{{ant get-libraries}}}. This will create a {{{downloads}}} directory under the {{{lib}}} directory for the downloaded JARs.

== Installation on Tomcat ==

 1. Copy all the JAR files from {{{contrib/clustering}}} ( <!> [[Solr1.4]] and {{{contrib/clustering/downloads}}}, [[#Downloading_dependencies|see above]]) into {{{${solr.home}/lib}}}.
 1. Copy the {{{dist/apache-solr-clustering-*.jar}}} file into your {{{${solr.home}/lib}}} directory. This is needed to run the clustering component.
 1. Enable the clustering component by adding {{{-Dsolr.clustering.enabled=true}}} to {{{$CATALINA_OPTS}}}.
 1. Restart Tomcat and verify that the log file is error free.

The default {{{${solr.home}/conf/solrconfig.xml}}} contains a preconfigured !ClusteringComponent, but you may need to edit the parameters, such as [[#carrot.title|carrot.title]] and [[#carrot.snippet|carrot.snippet]], to match your schema.

= Configuration =

 1. Add !ClusteringComponent to your {{{solrconfig.xml}}}, just like any other SearchComponent:

<searchComponent class="org.apache.solr.handler.clustering.ClusteringComponent" name="clustering">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>

    <!-- Engine-specific parameters -->
    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
  </lst>
</searchComponent>

 Within the clustering search component declaration you can configure several engines with different clustering algorithms and [[#Carrot2-specific_parameters|algorithm-specific parameters]]. Each of these engines can be selected at query time using the [[#clustering.engine|clustering.engine]] parameter.

 1. Reference the clustering component in your request handler:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>

    <bool name="clustering">true</bool>
    <str name="clustering.engine">default</str>
    <bool name="clustering.results">true</bool>

    <!-- Fields to cluster on -->
    <str name="carrot.title">name</str>
    <str name="carrot.snippet">features</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>

 The request handler configuration needs to set a number of common parameters, such as the type of clustering ([[#clustering.results|clustering.results]]), and a number of engine-specific parameters, e.g. fields to cluster on ([[#carrot.title|carrot.title]], [[#carrot.snippet|carrot.snippet]]).

<!> [[Solr4.5]] Starting with Solr 4.5 there is no need to declare a "default" engine. The first declared engine becomes the default one and it can be overridden with [[#clustering.engine|clustering.engine]] request parameter.
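
As a sketch of the multi-engine setup described above, the fragment below declares two engines within one clustering component. The engine names `lingo` and `stc` are illustrative choices, not names required by Solr; the algorithm class names are the ones listed under [[#carrot.algorithm|carrot.algorithm]].

```xml
<searchComponent class="org.apache.solr.handler.clustering.ClusteringComponent" name="clustering">
  <!-- Declared first: from Solr 4.5 on, this engine is used when clustering.engine is not given -->
  <lst name="engine">
    <str name="name">lingo</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
  </lst>
  <!-- Hypothetical second engine, selected per request with clustering.engine=stc -->
  <lst name="engine">
    <str name="name">stc</str>
    <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
  </lst>
</searchComponent>
```

With a configuration along these lines, a request carrying `clustering.engine=stc` should be clustered by STC. On versions before 4.5, one of the engines would still need to be named `default` to be picked up when the parameter is absent.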

= Parameters =

This section lists common parameters that apply to both search results and document clustering. Please also see the [[#Parameters|search results clustering parameters]].

== clustering ==

When `true`, clustering is enabled.

== clustering.engine ==

The clustering engine to use. If not specified, the engine named `default` will be used (or, starting with <!> [[Solr4.5]], the first declared engine).

== clustering.results ==

When `true`, the component will perform [[#Search_Results_Clustering|clustering of search results]].

== clustering.collection ==

When `true`, the component will perform [[#Document_Clustering|clustering of the whole document index]].

= Search Results Clustering =

Search results clustering is a technique of post-processing of search results that aims to group them into thematically related categories. For example, when [[http://search.carrot2.org/stable/search?query=apache|clustering web search results for the 'apache' query]], one will likely see groups related to the Apache Software Foundation, Apache Web Server, but also groups about Apache County, Apache Indians or the Attack Helicopter.

Solr search results clustering is based on the [[http://project.carrot2.org|Carrot2]] real-time document clustering engine. Carrot^2^ offers two specialized search results clustering algorithms that emphasize the quality of cluster labels.

== Input for clustering ==

Carrot^2^ is best suited for clustering small-to-medium collections of short documents. While it may work for longer documents, processing times may be too long to meet on-line clustering requirements.

Carrot^2^ assumes that each search result provided as input can consist of three types of fields: [[#carrot.title|document title]], [[#carrot.snippet|document content/snippet]] and [[#carrot.url|URL]]. The document title is required; content/snippet and URL are optional. The reason to distinguish between the document's title and content is that Carrot^2^ can give more weight to the titles, which increases the quality of clusters and labels. Carrot^2^ needs at least about 20 search results to generate meaningful clusters. For more information, please see [[http://download.carrot2.org/stable/manual/#section.advanced-topics.fine-tuning.input-documents-characteristics|the desired qualities of the documents for clustering in the Carrot2 manual]].

'''Note''': Carrot^2^ can only perform clustering on stored fields. The reason for this is that Carrot^2^ aims to create meaningful cluster labels by using phrases (sequences of words) taken directly from the documents' text. The easiest way of providing input for such a process is feeding Carrot^2^ with raw (stored) document content. As a result, character and token filters are currently ignored. There are plans to implement support for character and selected token filters during clustering: https://issues.apache.org/jira/browse/SOLR-2917.

== Parameters ==

=== carrot.algorithm ===

The fully qualified class name of the Carrot^2^ clustering algorithm to use. Currently, the following algorithms are available:

 * `org.carrot2.clustering.lingo.LingoClusteringAlgorithm`
 * `org.carrot2.clustering.stc.STCClusteringAlgorithm`
 * `org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm`

Please see [[http://project.carrot2.org/algorithms.html]] for the characteristics of these algorithms and [[http://download.carrot2.org/stable/manual/#section.advanced-topics.fine-tuning.choosing-algorithm|clustering algorithm choice guidance in Carrot2 manual]].

'''Note:''' This parameter must be specified in the clustering component configuration (in the engine section) and cannot be overridden at query time.

=== carrot.title ===

The Solr field ( <!> [[Solr3.6]]: comma- or space-separated list of fields) that the clustering engine should treat as the hit document's title. It must be a stored field (or fields).

Carrot^2^ will give more weight to the content of this field compared to [[#carrot.snippet|carrot.snippet]]. For best results, the field should contain concise, noise-free content.

If your schema does not distinguish the document's title and content, you can provide your content in [[#carrot.title|carrot.title]] and leave [[#carrot.snippet|carrot.snippet]] empty.
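
A minimal sketch of that title-only setup, as it might appear in the request handler defaults. The field name `body` is hypothetical and stands for whatever single stored text field your schema provides.

```xml
<lst name="defaults">
  <bool name="clustering">true</bool>
  <bool name="clustering.results">true</bool>
  <!-- Hypothetical schema with one stored text field: map it as the title -->
  <str name="carrot.title">body</str>
  <!-- carrot.snippet is intentionally left unset -->
</lst>
```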

=== carrot.snippet ===

The Solr field ( <!> [[Solr3.6]]: comma- or space-separated list of fields) that the clustering engine should treat as the hit document's content. It must be a stored field (or fields).

For best results, the snippet should contain a summary of the document, e.g. an abstract or the first content paragraph. Very long snippet fields will significantly increase the clustering time, unless [[#carrot.produceSummary|carrot.produceSummary]] is enabled.

=== carrot.url ===

The Solr field that the clustering engine should treat as the hit document's target URL. Must be a stored field. This mapping is optional.

The URL field is currently not used by the Carrot^2^ algorithms.

=== carrot.lang ===

<!> [[Solr3.6]]

The Solr field that the clustering engine should treat as the search result's ISO 639 two-letter language code. In case of multilingual result sets, providing the language code for each result will let the clustering engine choose the lexical resources (stemmer, stop words) appropriate for the language of each result and therefore significantly improve the quality of cluster labels. If all results are in the same language, the language can be set globally using the Carrot2 [[http://doc.carrot2.org/#section.attribute.lingo.MultilingualClustering.defaultLanguage|MultilingualClustering.defaultLanguage]] attribute.

The [[#carrot.lcmap|carrot.lcmap]] parameter can be used to map arbitrary strings to ISO 639 codes.

=== carrot.lcmap ===

<!> [[Solr3.6]]

Mapping of arbitrary strings into ISO 639 two-letter codes used by [[#carrot.lang|carrot.lang]]. Syntax of this parameter is the same as [[http://wiki.apache.org/solr/LanguageDetection#langid.map.lcmap|langid.map.lcmap]].
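
The two language parameters could be combined as in the sketch below. The field name `language` and the stored values `polish`, `english` and `german` are hypothetical; the mapping syntax assumes space-separated `from:to` pairs, as used by `langid.map.lcmap`.

```xml
<lst name="defaults">
  <!-- Hypothetical stored field holding each document's language -->
  <str name="carrot.lang">language</str>
  <!-- Assumed mapping from the values stored in that field to ISO 639 two-letter codes -->
  <str name="carrot.lcmap">polish:pl english:en german:de</str>
</lst>
```

If the field already stores two-letter ISO 639 codes, the `carrot.lcmap` entry should not be needed.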

=== carrot.produceSummary ===

When `true`, the [[#carrot.snippet|carrot.snippet]] field (if no snippet field, then the [[#carrot.title|carrot.title]] field) will be highlighted and the highlighted text will be used for clustering. Highlighting is recommended when the snippet field contains a lot of content. Highlighting can also increase the quality of clustering because the clustered content will get an additional query-specific context.

<!> [[Solr3.6]] The number of snippets generated for clustering is determined by the highlighter's {{{hl.snippets}}} parameter and can be further overridden by [[#carrot.summarySnippets|carrot.summarySnippets]].

=== carrot.fragSize ===

<!> [[Solr3.1]]

The fragment size to use for highlighting. Meaningful only when [[#carrot.produceSummary|carrot.produceSummary]] is `true`. If not specified, the default highlighting fragment size (`hl.fragsize`) will be used; if that is not specified either, 100 is used.

<!> In Solr versions 3.1.x, 3.2.x and 3.3.0 this parameter is [[https://issues.apache.org/jira/browse/SOLR-2692|incorrectly named]] {{{carrot.fragzise}}}. Solr versions 3.4.x and later use the correct parameter name {{{carrot.fragSize}}}.

=== carrot.summarySnippets ===

<!> [[Solr3.6]]

The number of summary snippets to generate for clustering. Meaningful only when [[#carrot.produceSummary|carrot.produceSummary]] is `true`. If not specified, the default highlighting snippet count (`hl.snippets`) will be used; if that is not specified either, 1 is used.
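
The three summary-related parameters might be set together as in the following sketch of request handler defaults. The values 200 and 2 are arbitrary illustrations, not recommendations.

```xml
<lst name="defaults">
  <bool name="clustering">true</bool>
  <bool name="clustering.results">true</bool>
  <!-- Cluster on highlighted fragments of the snippet field instead of its full content -->
  <bool name="carrot.produceSummary">true</bool>
  <!-- Illustrative fragment size; falls back to hl.fragsize when unset -->
  <int name="carrot.fragSize">200</int>
  <!-- Solr 3.6+: illustrative fragment count per document; falls back to hl.snippets when unset -->
  <int name="carrot.summarySnippets">2</int>
</lst>
```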

=== carrot.numDescriptions ===

The maximum number of cluster labels to produce.

=== carrot.outputSubClusters ===

When `true`, output subclusters.

Currently, no Carrot^2^ algorithm can generate hierarchical clusters.

=== carrot.lexicalResourcesDir ===

<!> [[Solr3.2]]
<!> [[Solr4.0]]
<!> [[Solr4.5]] (deprecated, use [[#carrot.resourcesDir|carrot.resourcesDir]]).

Specifies the directory from which Carrot^2^ should load its lexical resources, such as stop words and stop labels files. For more information on the syntax of these files, see the [[http://download.carrot2.org/head/manual/#chapter.lexical-resources|overview of lexical resources in Carrot2 manual]].

The lexical resources directory can be either absolute ( <!> [[Solr3.4]]) or relative to `${solr.home}/conf`. The default is `clustering/carrot2`, relative to `${solr.home}/conf`.

If a specific Carrot^2^ resource (e.g. `stopwords.en`) is present in the specified dir, it will completely override the corresponding default one that ships with Carrot^2^.

'''Note:''' Carrot^2^ caches its lexical resources by default. The cache can be flushed either by restarting Solr or by appending the `&reload-resources=true` parameter to the request URL. Please note that resource reloading significantly increases the clustering time, so it should not be used when running regular production queries.

=== carrot.resourcesDir ===

<!> [[Solr4.5]]

Specifies a directory with optional resources overriding Carrot^2^ defaults, much like [[#carrot.lexicalResourcesDir|carrot.lexicalResourcesDir]]. In addition to that, this folder may contain per-engine attribute XML files exported from the Carrot^2^ workbench and configuring each algorithm. An attribute file for an engine `XYZ` is expected to be named, by convention, `XYZ-attributes.xml`. See the default Solr example for an example configuration of STC, Lingo and bisecting k-means.
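
As a sketch, an engine named `lingo` (an illustrative name) could point at such a directory as shown below. The path `clustering/carrot2` is assumed to resolve relative to `${solr.home}/conf`, in the same way as for [[#carrot.lexicalResourcesDir|carrot.lexicalResourcesDir]].

```xml
<lst name="engine">
  <str name="name">lingo</str>
  <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
  <!-- Solr 4.5+: directory with lexical resources and, optionally, lingo-attributes.xml -->
  <str name="carrot.resourcesDir">clustering/carrot2</str>
</lst>
```

Following the naming convention above, the attribute file for this engine would be `lingo-attributes.xml`.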

=== Carrot2-specific parameters ===

Parameters of a specific clustering algorithm, e.g. `LingoClusteringAlgorithm.desiredClusterCountBase`, can also be specified. A complete list of attributes for each clustering algorithm is available in the Carrot2 documentation:

 * [[http://download.carrot2.org/head/manual/#section.component.lingo|Lingo clustering algorithm parameters]]
 * [[http://download.carrot2.org/head/manual/#section.component.stc|STC clustering algorithm parameters]]
 * [[http://download.carrot2.org/stable/manual/#section.component.kmeans|K-means clustering algorithm parameters]]

You can specify clustering algorithm parameters both in `solrconfig.xml` and at request time, e.g.:


<!> [[Solr4.5]] Starting with Solr 4.5, the preferred way of configuring clustering algorithms is to export an XML file with attributes from Carrot^2^ Workbench and place such a file in [[#carrot.resourcesDir|carrot.resourcesDir]], named `enginename-attributes.xml`.

== Performance impact ==

Enabling search results clustering can result in two broad categories of performance penalties:

 1. Increased cost of fetching a larger than usual number of results, e.g. 50 or 100.
 1. Additional computational cost of clustering performed on the retrieved results.

For simple queries, the clustering time will usually dominate the fetching time.

The performance impact of clustering can be lowered in several ways:

 1. Feed less content for clustering by:
   a. [[#carrot.produceSummary|applying highlighting on long fields]],
   a. performing clustering on document titles only.
 1. Use the STC clustering algorithm instead of Lingo. STC is much faster, but cluster labels may be worse than those from Lingo.
 1. Limit the number of results being clustered to e.g. 50. The lowest reasonable number is usually around 20.
 1. [[http://download.carrot2.org/stable/manual/#section.advanced-topics.fine-tuning.performance|Tune the performance of Carrot2 algorithms]].

On reasonably modern hardware (Core2 3GHz, X25-M), with 100 search results, about 600 characters each, the default Carrot^2^ clustering algorithm (Lingo) would add 100--250 ms to the query processing time. When clustering the same 100 results, but using only titles (about 60 characters each) and the STC algorithm, the clustering time drops to about 5--15 ms.
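
Several of these measures can be combined in one request handler. The sketch below is only an illustration: the handler name `/clustering-fast` and the engine name `stc` are hypothetical, and an engine with that name would have to be declared in the clustering component with the STC algorithm.

```xml
<requestHandler name="/clustering-fast" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Limit the number of results fetched and clustered -->
    <int name="rows">50</int>
    <bool name="clustering">true</bool>
    <bool name="clustering.results">true</bool>
    <!-- Hypothetical engine configured with STCClusteringAlgorithm -->
    <str name="clustering.engine">stc</str>
    <!-- Cluster on titles only: no carrot.snippet mapping -->
    <str name="carrot.title">name</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
```

The trade-off is the one noted above: STC on titles only is considerably cheaper, but its cluster labels may be worse than those produced by Lingo on richer content.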

== Tuning Carrot2 clustering ==

The easiest way to tune Carrot^2^ clustering for your specific data is to use a dedicated Carrot^2^ tool called Document Clustering Workbench. This way, you don't even need to configure search results clustering in Solr because processing will happen inside the Document Clustering Workbench.

 1. [[http://project.carrot2.org/download.html|Download Carrot2 Document Clustering Workbench]] for your platform.
 1. [[http://download.carrot2.org/head/manual/#section.getting-started.solr|Attach]] your Solr instance as a document source in the Workbench.
 1. When you can see the search results from your Solr instance in the Workbench, you can proceed with:
   a. [[http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words|Tuning of stop words]]
   a. [[http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-regexps|Tuning of stop labels]]
   a. Tuning of [[http://download.carrot2.org/head/manual/#section.component.lingo|other attributes of the algorithms]], e.g. to [[http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.reducing-other-topics|reduce the size of the Other Topics group]] or [[http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.performance|improve the clustering performance]].
 1. To apply the modified `stopwords.*` and `stoplabels.*` files to your Solr instance:
   a. <!> [[Solr3.2]] <!> [[Solr4.0]]: copy the modified files to the directory configured by [[#carrot.lexicalResourcesDir]], `${solr.home}/conf/clustering/carrot2` by default.
   a. <!> [[Solr1.4]]: make the modified files accessible in the classpath. If you're using the Solr example scripts, try putting the files in the `example/resources` folder (Jetty starter from `start.jar` adds all files from that folder to the classpath). Alternatively, you can overwrite the corresponding `stopwords.*` and `stoplabels.*` files directly in `carrot2-mini-*.jar`.
 1. To transfer the clustering algorithm parameters modified in the Workbench to Solr:
   a. [[http://download.carrot2.org/head/manual/#section.customizing.component-suites-and-attributes.saving-with-workbench|Save the modified parameters in Carrot2 XML format]] from the Workbench.
   a. Use the following XSLT transform to convert them to entries ready for pasting into clustering component or request handler configuration:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:strip-space elements="*"/>

  <!-- Emit one Solr <str> entry per attribute overridden in the Workbench -->
  <xsl:template match="/attribute-sets/attribute-set[@id = 'overridden-attributes']//attribute">
    <str name="{@key}"><xsl:value-of select="value/@value" /></str><xsl:text>&#10;</xsl:text>
  </xsl:template>

  <!-- Suppress the text of label elements -->
  <xsl:template match="label" />
</xsl:stylesheet>
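
To illustrate what the transform consumes, the fragment below is a hand-written sketch of a Workbench export, reduced to the elements the stylesheet matches (`attribute-sets`, `attribute-set`, `attribute`, `value`, `label`). The `value-set` wrapper and the `type` attribute are assumptions about the Carrot2 format, and a real export may contain more elements.

```xml
<attribute-sets>
  <attribute-set id="overridden-attributes">
    <value-set>
      <!-- Label text is suppressed by the transform -->
      <label>overridden-attributes</label>
      <!-- Each attribute becomes one str entry: key goes to name, value/@value to the content -->
      <attribute key="LingoClusteringAlgorithm.desiredClusterCountBase">
        <value type="java.lang.Integer" value="20"/>
      </attribute>
    </value-set>
  </attribute-set>
</attribute-sets>
```

Applied to this input, the transform should emit an entry along the lines of `<str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>`, which can then be pasted into the engine section or the request handler defaults.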

= Document Clustering =

The Document Clustering implementation is designed to cluster whole documents across a collection. This can be done as an offline task. Once the clustering is done, the clusters can be retrieved.

Document Clustering is handled by using an implementation of the !DocumentClusteringEngine. To invoke one, pass in the engine name, just as in the search results example, and also pass in the [[#clustering.collection|clustering.collection]] parameter (i.e. &clustering.collection=true). While this isn't fully worked out yet, it is likely that implementations will spawn a thread (or use a thread pool) that will perform the clustering asynchronously, returning some sort of identifier by which the clusters can be retrieved when done. Subsequent calls that use the identifier will then either return the clusters or return a percent complete.

<!> TODO <!> We likely also need a way of returning the status of all clustering tasks, that is if we support more than one task at a time.

See also Mahout: http://lucene.apache.org/mahout, which has several clustering algorithms implemented.
For up-to-date information regarding the Clustering Component in Solr 5.x, see the [[https://cwiki.apache.org/confluence/display/solr/Result+Clustering|Solr Reference Guide]].


ClusteringComponent (last edited 2015-08-24 11:56:51 by DawidWeiss)