Differences between revisions 8 and 9
Revision 8 as of 2008-05-20 16:59:40
Size: 5895
Editor: DNab422a4b
Comment: added missing comma, not sure if field cache should say "first" indexed value
Revision 9 as of 2008-12-03 00:22:06
Size: 5188
Editor: HossMan
Comment: updates to reflect new facet.method param and UnInvertedField
Line 40: Line 40:
Any number of [:SimpleFacetParameters#facet.field:facet.field] parameters can be passed to the request handler. For each facet.field, one of two approaches will be used based on the Field definiton in schema.xml:
  
    * '''Field Queries''': If the facet field is defined in the schema as multi-valued, boolean, or tokenized, then every indexed value for the field will be iterated and a facet query will be executed and cached (as described above). This is excellent for fields where there is a small set of distinct values. For example, faceting on a field with U.S. States e.g. `Alabama, Alaska, ... Wyoming` would lead to fifty cached queries which would be used over and over again.  It also works in the case when the facet field can have multiple values for each document. However, it requires excessive amounts of memory and time when the number of field values is large, and especially when it exceeds the filter cache size defined in [:SolrCaching#filterCache:filterCache]
Any number of [:SimpleFacetParameters#facet.field:facet.field] parameters can be passed to the request handler. For each facet.field, one of two approaches will be used based on the [:SimpleFacetParameters#facet.method:facet.method] or the field type:

    * '''Enum Based Field Queries''': If {{{facet.method=enum}}} or the field is defined in the schema as boolean, then every indexed value for the field will be iterated and a facet query will be executed and cached (as described above). This is excellent for fields where there is a small set of distinct values. For example, faceting on a field with U.S. States e.g. `Alabama, Alaska, ... Wyoming` would lead to fifty cached queries which would be used over and over again. However, it requires excessive amounts of memory and time when the number of field values is large, and especially when it exceeds the filter cache size defined in [:SolrCaching#filterCache:filterCache]
Line 44: Line 44:
    * '''Field Cache''': If the facet field is not tokenized, not multi-valued, and not boolean, then a field-cache approach will be used. This is currently implemented with the Lucene [http://lucene.apache.org/java/docs/api/org/apache/lucene/search/FieldCache.html FieldCache] mechanism used for results sorting. An array of integers (one for every document in the index) is allocated, pre-filled with the first (only?) indexed value for that field in each document (offset into a table of strings for fields indexed as strings), and cached. Every time that facet.field is used for faceting a query, all the document IDs resulting from the query are looked up in the field cache and any value found has its tally incremented. This is excellent for situations where the number of indexed values for the field is too large to be practical using the field queries mechanism, such as faceting against authors or titles. However it is currently much slower and more memory-intensive than the field query mechanism for fields with a small number of values.

Note that at this time there is no way to manually control whether facet.field is handled via field queries or field cache, other than defining in the schema whether the field is single- or multi-valued and the analyzer used: `solr.TextField` is always tokenized while `solr.StrField` is never tokenized. Control may be improved in the future, along with a means to handle multi-valued fields with a variant of the Field Cache mechanism.
    * '''Field Cache''': If {{{facet.method=fc}}} then a field-cache approach will be used. This is currently implemented using either the the Lucene [http://lucene.apache.org/java/docs/api/org/apache/lucene/search/FieldCache.html FieldCache] or (starting in Solr 1.4) an !UnInvertedField if the field is multivalued or tokenized. Every time that {{{facet.field}}} is used for faceting a query, all the document IDs resulting from the query are looked up in the cache and any value found has its tally incremented. This is excellent for situations where the number of indexed values for the field is too large to be practical using the field queries mechanism, such as faceting against authors or titles. However it is currently much slower and more memory-intensive than the field query mechanism for fields with a small number of values.

Solr provides a [http://lucene.apache.org/solr/docs/api/org/apache/solr/request/SimpleFacets.html Simple Faceting toolkit] which can be reused by various Request Handlers to include "Facet counts" based on some simple criteria. Both the StandardRequestHandler and the DisMaxRequestHandler currently use these utilities. Detailed descriptions of the parameters used to control faceting can be found (along with several examples) at [SimpleFacetParameters].

This page briefly provides some general background information:

Facet Indexing

Faceting is done on indexed rather than stored values. This is because the primary use for faceting is drill-down into a subset of hits resulting from a query, and so the chosen facet value is used to construct a filter query which literally matches that value in the index. For the stock Solr request handlers this is done by adding an fq=<facet-field>:<quoted facet-value> parameter and resubmitting the query.

Because faceting fields are often specified to serve two purposes, human-readable text and drill-down query value, they are frequently indexed differently from fields used for searching and sorting:

  • They are not tokenized into separate words
  • They are not mapped into lower case
  • Human-readable punctuation is not removed (other than double-quotes)
  • There is often no need to store them, since stored values would look much like indexed values and the faceting mechanism is used for value retrieval.
  • Depending on how the field is defined, the SimpleFacets mechanism may only allow for a single value per field per document (see below)

As an example, if I had an "author" field with a list of authors, such as:

  • Schildt, Herbert; Wolpert, Lewis; Davies, P.

I might want to index the same data differently in three different fields (perhaps using the Solr [:SchemaXml#Copy Fields:copyField] directive):

  • For searching: Tokenized, case-folded, punctuation-stripped:
    • schildt / herbert / wolpert / lewis / davies / p
  • For sorting: Untokenized, case-folded, punctuation-stripped:
    • schildt herbert wolpert lewis davies p
  • For faceting: Primary author only, using a solr.StrField:
    • Schildt, Herbert

Then, when the user drills down on the "Schildt, Herbert" string, I would reissue the query with an added fq=author:"Schildt, Herbert" parameter. If you wanted to drill down or query by multiple authors, you would add more 'fq' parameters as needed, e.g. fq=author:"Schildt, Herbert"&fq=author:"Wolpert, Lewis".
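The drill-down step described above can be sketched in a few lines of Python. This is only an illustration, not part of Solr: the helper names `quote_facet_value` and `drill_down_params` and the sample query `title:java` are hypothetical, and the backslash escaping of embedded double quotes is an assumption about what the query parser accepts.

```python
from urllib.parse import parse_qsl, urlencode


def quote_facet_value(value: str) -> str:
    """Wrap a facet value in double quotes so it is matched as one literal term.

    Escaping embedded backslashes and double quotes with a backslash is an
    assumption made for this sketch.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def drill_down_params(query, selections):
    """Return request parameters: the original query plus one fq per selection.

    `selections` is a list of (field, facet value) pairs chosen by the user.
    """
    params = [("q", query)]
    for field, value in selections:
        params.append(("fq", f"{field}:{quote_facet_value(value)}"))
    return params


# Hypothetical request: the user has drilled down on two author values.
params = drill_down_params(
    "title:java",
    [("author", "Schildt, Herbert"), ("author", "Wolpert, Lewis")],
)

# URL-encode for resubmission; repeated "fq" keys are preserved in order.
query_string = urlencode(params)

# Decoding the query string gives back the same parameter list.
decoded = parse_qsl(query_string)
```

Keeping the parameters as a list of pairs rather than a dictionary matters here, because the same key (fq) has to appear once per selected facet value.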

Facet Operation

Currently SimpleFacets has 3 modes of operation, selected by a combination of SimpleFacetParameters, request handler parameters and [:SchemaXml: schema.xml] Field definitions:

FacetQueries

Any number of [:SimpleFacetParameters#facet.query:facet.query] parameters can be passed to the request handler. Each distinct facet.query will first be executed against the entire index, and the resulting document IDs will be cached either as a hash set (if the number of matching documents is below the hashDocSet maximum size) or as a bit set (if it is larger); see [:SolrCaching#The hashDocSet Max Size:hashDocSet]. Then, every time that facet.query is used for faceting a query, the cached set will be intersected with the set of document IDs returned by the query to count the number of documents for which the facet.query condition is true.
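A toy model of this cache-then-intersect behaviour, written in Python with plain sets, may help. It is a sketch under stated assumptions, not Solr's implementation: the class `FacetQueryCache`, the in-memory `index` dictionary and the predicate standing in for a parsed facet.query are all hypothetical, and the hash set versus bit set distinction is not modelled.

```python
class FacetQueryCache:
    """Caches the set of matching document IDs for each distinct facet query."""

    def __init__(self, index):
        self.index = index      # hypothetical index: doc ID -> document fields
        self.cache = {}         # facet query string -> frozenset of doc IDs
        self.executions = 0     # how many times a query ran against the index

    def doc_set(self, facet_query, predicate):
        """Run the facet query against the entire index once, then reuse it."""
        if facet_query not in self.cache:
            self.executions += 1
            self.cache[facet_query] = frozenset(
                doc_id for doc_id, doc in self.index.items() if predicate(doc)
            )
        return self.cache[facet_query]

    def count(self, facet_query, predicate, result_ids):
        """Count result documents for which the facet query condition is true."""
        return len(self.doc_set(facet_query, predicate) & result_ids)


# Hypothetical four-document index with a single numeric field.
index = {1: {"price": 5}, 2: {"price": 50}, 3: {"price": 500}, 4: {"price": 20}}
cache = FacetQueryCache(index)


def is_cheap(doc):
    # Stand-in for a parsed facet.query such as a price range.
    return doc["price"] < 100


# Two different user queries reuse the same cached facet query result.
count_a = cache.count("price:[* TO 100]", is_cheap, {1, 3})
count_b = cache.count("price:[* TO 100]", is_cheap, {2, 3, 4})
```

After both calls the facet query has been executed against the index only once; the second count is just a set intersection.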

FacetFields

Any number of [:SimpleFacetParameters#facet.field:facet.field] parameters can be passed to the request handler. For each facet.field, one of two approaches will be used based on the [:SimpleFacetParameters#facet.method:facet.method] or the field type:

  • Enum Based Field Queries: If facet.method=enum or the field is defined in the schema as boolean, then every indexed value for the field will be iterated and a facet query will be executed and cached for each one (as described above). This is excellent for fields where there is a small set of distinct values. For example, faceting on a field of U.S. states (e.g. Alabama, Alaska, ... Wyoming) would lead to fifty cached queries which would be used over and over again. However, it requires excessive amounts of memory and time when the number of field values is large, and especially when it exceeds the filter cache size defined in [:SolrCaching#filterCache:filterCache].

  • Field Cache: If facet.method=fc then a field-cache approach will be used. This is currently implemented using either the Lucene [http://lucene.apache.org/java/docs/api/org/apache/lucene/search/FieldCache.html FieldCache] or (starting in Solr 1.4) an UnInvertedField if the field is multivalued or tokenized. Every time that facet.field is used for faceting a query, all the document IDs resulting from the query are looked up in the cache and any value found has its tally incremented. This is excellent for situations where the number of indexed values for the field is too large to be practical using the field queries mechanism, such as faceting against authors or titles. However, it is currently much slower and more memory-intensive than the field query mechanism for fields with a small number of values.
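The two approaches can be contrasted with a small Python sketch. It is a simplified illustration rather than Solr's code: the function names, the `docs` mapping and the dense document IDs starting at 0 are assumptions, only a single-valued field is modelled, and the UnInvertedField used for multivalued or tokenized fields is not covered.

```python
# Hypothetical single-valued "state" field: doc ID -> indexed value.
docs = {0: "Alabama", 1: "Alaska", 2: "Alabama", 3: "Wyoming", 4: "Alaska", 5: "Alabama"}


def facet_enum(docs, result_ids, filter_cache):
    """Enum approach: one cached doc set per indexed value, intersected with the results."""
    counts = {}
    for term in sorted(set(docs.values())):
        if term not in filter_cache:
            # Executed once per distinct value, then reused by later requests.
            filter_cache[term] = frozenset(d for d, v in docs.items() if v == term)
        n = len(filter_cache[term] & result_ids)
        if n:
            counts[term] = n
    return counts


def build_field_cache(docs):
    """Field-cache approach, build step: one integer per document.

    Each integer is an offset into a sorted table of the field's values.
    Assumes doc IDs are dense, 0 .. len(docs) - 1.
    """
    terms = sorted(set(docs.values()))
    ordinal = {term: i for i, term in enumerate(terms)}
    ords = [ordinal[docs[doc_id]] for doc_id in range(len(docs))]
    return terms, ords


def facet_fc(terms, ords, result_ids):
    """Field-cache approach, count step: look up each result doc and bump its tally."""
    tallies = [0] * len(terms)
    for doc_id in result_ids:
        tallies[ords[doc_id]] += 1
    return {terms[i]: n for i, n in enumerate(tallies) if n}


result_ids = {0, 2, 3, 4}          # document IDs matched by some user query
filter_cache = {}
enum_counts = facet_enum(docs, result_ids, filter_cache)

terms, ords = build_field_cache(docs)
fc_counts = facet_fc(terms, ords, result_ids)
```

Both functions return the same counts. In this sketch the enum approach keeps one cached document set per distinct value, while the field-cache approach keeps a single array with one entry per document and does one lookup per result document.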

SolrFacetingOverview (last edited 2014-03-06 14:39:38 by 84)