Differences between revisions 4 and 5
Revision 4 as of 2013-05-31 17:30:10
Size: 5992
Editor: SteveRowe
Comment:
Revision 5 as of 2014-02-02 05:37:50
Size: 6162
Editor: Mark Miller
Comment:
#refresh 5 https://cwiki.apache.org/confluence/display/solr/DocValues

{X} X-( This page is outdated and you will be redirected to the Solr Reference Guide. {X} X-(

DocValues is a field option, new in Solr 4.2, that builds a forward index for a field, for use in sorting, faceting, grouping, function queries, and so on.

Introduction

With a search engine you typically build an inverted index (indexed="true") for a field, in which values point to documents. DocValues is a way to build a forward index (docValues="true"), in which documents point to values.
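
The two directions can be illustrated with a minimal sketch. It is a toy model in plain Python, not how Lucene stores either structure; the corpus and the names docs, inverted and forward are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> value of one single-valued field.
docs = {0: "aardvark", 1: "beaver", 2: "aardvark"}

# Inverted index (the indexed="true" direction): value -> documents.
inverted = defaultdict(list)
for doc_id, value in sorted(docs.items()):
    inverted[value].append(doc_id)

# Forward index (the docValues="true" direction): document -> value.
forward = [docs[doc_id] for doc_id in sorted(docs)]

# Finding documents by value is one lookup in the inverted index.
print(inverted["aardvark"])  # [0, 2]

# Sorting needs each document's value, which is one lookup per
# document in the forward index.
print(sorted(range(len(forward)), key=lambda d: forward[d]))  # [0, 2, 1]
```

Searching asks "which documents contain this value?", while sorting, faceting and grouping ask "what value does this document have?", which is why the second structure suits them better.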

  1. What docvalues are:
    • NRT-compatible: These are per-segment data structures built at index time and designed to be efficient when data is changing rapidly.
    • Basic query/filter support: You can run basic term, range, and similar queries on docvalues fields without also indexing them, but these are constant-score only and typically slower. If you care about performance and scoring, index the field too.
    • Better compression than fieldcache: Docvalues fields compress better than fieldcache, and "insanity" is impossible.
    • Able to store data outside of heap memory: You can specify a different docValuesFormat on the fieldType (docValuesFormat="Disk") to load only minimal data onto the heap, keeping other data structures on disk.

  2. What docvalues are not:
    • Not a replacement for stored fields: DocValues are unrelated to stored fields in every way; they are data structures for search (sort/facet/group/join/scoring).
    • Not a huge improvement for a static index: If you have a completely static index, docvalues won't seem very interesting to you. On the other hand, if you are fighting the fieldcache, read on.
    • Not for the risk-averse: The integration with Solr is very new and probably still has some exciting bugs!

Lucene's DocValues types

Lucene has four underlying types that a docvalues field can have. Currently Solr uses three of these:

  1. NUMERIC: a single-valued per-document numeric type. This is like having a large long[] array for the whole index, though the data is compressed based upon the values that are actually used.
    • For example, consider 3 documents with these values:
             doc[0] = 1005
             doc[1] = 1006
             doc[2] = 1005
      In this example the field would use around 1 bit per document, since the values differ by at most 1 and a single bit is enough to tell them apart.
  2. SORTED: a single-valued per-document string type. This is like having a large String[] array for the whole index, but with an additional level of indirection. Each unique value is assigned a term number that represents its ordinal value. So each document really stores a compressed integer, and separately there is a "dictionary" mapping these term numbers back to term values.
    • For example, consider 3 documents with these values:
             doc[0] = "aardvark"
             doc[1] = "beaver"
             doc[2] = "aardvark"
      Value "aardvark" will be assigned ordinal 0, and "beaver" 1, creating these two data structures:
             doc[0] = 0
             doc[1] = 1
             doc[2] = 0
      
             term[0] = "aardvark"
             term[1] = "beaver"
  3. SORTED_SET: a multi-valued per-document string type. It is similar to SORTED, except that each document has a "set" of values (in increasing sorted order). It therefore intentionally discards duplicate values (frequency) within a document and does not preserve the original order of values within the document.
    • For example, consider 3 documents with these values:
             doc[0] = "cat", "aardvark", "beaver", "aardvark"
             doc[1] =
             doc[2] = "cat"
      Value "aardvark" will be assigned ordinal 0, "beaver" 1, and "cat" 2, creating these two data structures:
             doc[0] = [0, 1, 2]
             doc[1] = []
             doc[2] = [2]
      
             term[0] = "aardvark"
             term[1] = "beaver"
             term[2] = "cat"
  4. BINARY: a single-valued per-document byte[] array. This can be used for encoding custom per-document data structures. Of the four types, this is the one Solr does not currently use.
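
The examples above can be reproduced with a short sketch. It is a simplified model in plain Python, not Lucene's implementation: the function names are hypothetical, and the bit count assumes each number is stored as an offset from the smallest value, which only approximates the real compression.

```python
def numeric_bits_per_doc(values):
    """NUMERIC: bits per document if values are stored as offsets from the minimum."""
    span = max(values) - min(values)
    return span.bit_length()


def encode_sorted(values):
    """SORTED: one ordinal per document plus a sorted term dictionary."""
    terms = sorted(set(values))
    ord_of = {term: i for i, term in enumerate(terms)}
    return [ord_of[v] for v in values], terms


def encode_sorted_set(values_per_doc):
    """SORTED_SET: a sorted, de-duplicated list of ordinals per document."""
    terms = sorted({v for doc in values_per_doc for v in doc})
    ord_of = {term: i for i, term in enumerate(terms)}
    ords = [sorted({ord_of[v] for v in doc}) for doc in values_per_doc]
    return ords, terms


print(numeric_bits_per_doc([1005, 1006, 1005]))
# 1

print(encode_sorted(["aardvark", "beaver", "aardvark"]))
# ([0, 1, 0], ['aardvark', 'beaver'])

print(encode_sorted_set([["cat", "aardvark", "beaver", "aardvark"], [], ["cat"]]))
# ([[0, 1, 2], [], [2]], ['aardvark', 'beaver', 'cat'])
```

Because ordinals are assigned in sorted term order, comparing two documents by ordinal gives the same result as comparing their term values, so sorting never needs to touch the dictionary.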

Solr's DocValues types

  1. StrField (multiValued=false): This uses the SORTED type behind the scenes. This is a good choice for a sort field.

    • Example:

      <field name="manu_exact" type="str" indexed="false" stored="false" docValues="true" default=""/>

  2. StrField (multiValued=true): This uses the SORTED_SET type behind the scenes.

    • Example:

      <field name="productCategories" type="str" indexed="false" stored="false" multiValued="true" docValues="true"/>

  3. TrieXXXField (multiValued=false): This uses the NUMERIC type behind the scenes. This is a good choice for a sort field or for a scoring factor used in function queries.
    • Example:

      <field name="popularity" type="int" indexed="false" stored="false" docValues="true" default="0"/>

  4. TrieXXXField (multiValued=true): This uses the SORTED_SET type behind the scenes, encoding the numeric values such that ordinals reflect numeric sort order.
    • Example:

      <field name="specialCodes" type="int" indexed="false" stored="false" multiValued="true" docValues="true"/>
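
To see how numbers can be stored in SORTED_SET so that ordinals reflect numeric sort order, here is a minimal sketch. It assumes signed 32-bit values and a sign-shifted big-endian encoding chosen for illustration; it is not Lucene's actual term encoding, and sortable_bytes and the sample data are hypothetical.

```python
import struct


def sortable_bytes(value):
    """Encode a signed 32-bit int so that byte order matches numeric order."""
    # Adding 2**31 (equivalent to flipping the sign bit) maps the signed
    # range onto 0..2**32-1 while preserving order; big-endian packing then
    # makes lexicographic byte comparison agree with numeric comparison.
    return struct.pack(">I", (value + 2**31) & 0xFFFFFFFF)


# Hypothetical multi-valued int field for three documents.
codes = [[7, -3, 100], [], [100, 7]]

# Build the term dictionary over the encoded values, as for strings.
terms = sorted({sortable_bytes(v) for doc in codes for v in doc})
ord_of = {term: i for i, term in enumerate(terms)}
ords = [sorted({ord_of[sortable_bytes(v)] for v in doc}) for doc in codes]

# Decoding the dictionary in ordinal order yields ascending numbers.
decoded = [struct.unpack(">I", t)[0] - 2**31 for t in terms]
print(decoded)  # [-3, 7, 100]
print(ords)     # [[0, 1, 2], [], [1, 2]]
```

Encoding the numbers as plain decimal strings would not work here: "-3", "100" and "7" sort in that order as text, so the ordinals would no longer follow numeric order.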

Specifying a different Codec implementation

You can specify the docValuesFormat attribute on the fieldType to control the underlying implementation.

To enable per-field DocValues formats, SchemaCodecFactory must be configured in solrconfig.xml:

  •  <codecFactory class="solr.SchemaCodecFactory"/>

/!\ Note that only the default implementation is supported by future versions of Lucene: if you try an alternative format, you may need to switch back to the default and rewrite your index (e.g. forceMerge) before upgrading.

  • docValuesFormat="Lucene42": This is the default, which loads everything into heap memory.

  • docValuesFormat="Disk": This implementation has a different layout, to try to keep most data on disk but with reasonable performance.

  • docValuesFormat="SimpleText": Plain-text, slow, and not for production.

Example of altering the codec implementation:

  •  <fieldType name="string_disk" class="solr.StrField" docValuesFormat="Disk" />

DocValues (last edited 2014-06-23 21:48:17 by HossMan)