Differences between revisions 6 and 7
Revision 6 as of 2014-02-02 05:43:21
Size: 6198
Editor: Mark Miller
Comment:
Revision 7 as of 2014-06-23 21:48:17
Size: 4341
Editor: HossMan
Comment: prune stuff coverd by ref guide
Line 1: Line 1:
{{{#!wiki red/solid
{X} X-( '''This page is outdated and you should read about DocValues at the Solr Reference Guide instead:''' https://cwiki.apache.org/confluence/display/solr/DocValues. {X} X-(
{{{#!wiki important
This page exists for the Solr Community to share Tips, Tricks, and Advice about
[[https://cwiki.apache.org/solr/DocValues|DocValues]].

Reference material previously located on this page has been migrated to the
[[https://cwiki.apache.org/solr/|Official Solr Ref Guide]].
If you need help, please consult the ref guide for the version of Solr you are using
for the specific details about using [[https://cwiki.apache.org/solr/DocValues|this feature]].

If you'd like to share information about how you use this feature, please [[FrontPage#How_to_edit_this_Wiki|add it to this page]].
/* cwikimigrated */
Line 4: Line 13:

DocValues is a new field option in <!> Solr4.2 to build a forward index for a field, for purposes of sorting, faceting, grouping, function queries, etc.
Line 22: Line 29:
= Lucene's DocValues types = = Low Level Details =
Line 70: Line 78:

= Solr's DocValues types =
 1. StrField (multiValued=false): This uses the SORTED type behind the scenes. This is a good choice for a sort field.
  . Example:
  {{{<field name="manu_exact" type="str" indexed="false" stored="false" docValues="true" default=""/>}}}
 1. StrField (multiValued=true): This uses the SORTED_SET type behind the scenes.
  . Example:
  {{{<field name="productCategories" type="str" indexed="false" stored="false" multiValued="true" docValues="true"/>}}}
 1. TrieXXXField (multiValued=false): This uses the NUMERIC type behind the scenes. This is a good choice for a sort field or for a scoring factor used in function queries.
  . Example:
  {{{<field name="popularity" type="int" indexed="false" stored="false" docValues="true" default="0"/>}}}
 1. TrieXXXField (multiValued=true): This uses the SORTED_SET type behind the scenes, encoding the numeric values such that ordinals reflect numeric sort order.
  . Example:
  {{{<field name="specialCodes" type="int" indexed="false" stored="false" multiValued="true" docValues="true"/>}}}
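
The idea of encoding numeric values so that ordinals reflect numeric sort order can be sketched as follows. This is a simplified illustration, not Lucene's actual encoding: the helper name `sortable_bytes` is hypothetical, and it assumes 32-bit signed values. Flipping the sign bit and writing the result big-endian yields byte strings whose lexicographic order matches numeric order, so a sorted term dictionary assigns ordinals in numeric order.

```python
import struct

def sortable_bytes(value: int) -> bytes:
    """Encode a signed 32-bit int so that byte-wise order equals numeric order.

    Hypothetical helper for illustration; Lucene's real encoding differs in detail.
    """
    # Flipping the sign bit maps the signed range onto 0..2**32-1 in order;
    # big-endian packing makes byte-wise comparison match that unsigned order.
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)

# Values of a multi-valued int field for one document
codes = [30, -5, 7]

# Sorted unique encoded terms form the dictionary; ordinals follow numeric order
terms = sorted({sortable_bytes(v) for v in codes})
ordinals = {v: terms.index(sortable_bytes(v)) for v in codes}
```

Here `ordinals` maps -5 to 0, 7 to 1, and 30 to 2, even though the terms are compared as raw bytes.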

= Specifying a different Codec implementation =

You can specify the {{{docValuesFormat}}} attribute on the fieldType to control the underlying implementation.

To enable per-field DocValues formats, {{{SchemaCodecFactory}}} must be configured in [[SolrConfigXml#codecFactory|solrconfig.xml]]:

 {{{
 <codecFactory class="solr.SchemaCodecFactory"/>
 }}}

/!\ Note that only the default implementation is supported by future versions of Lucene: if you try an alternative format, you may need to switch back to the default and rewrite your index (e.g. forceMerge) before upgrading.

 * {{{docValuesFormat="Lucene42"}}}: This is the default, which loads everything into heap memory.
 * {{{docValuesFormat="Disk"}}}: This implementation has a different layout, to try to keep most data on disk but with reasonable performance.
 * {{{docValuesFormat="SimpleText"}}}: Plain-text, slow, and not for production.

Example of altering the codec implementation:

 {{{
 <fieldType name="string_disk" class="solr.StrField" docValuesFormat="Disk" />
 }}}

Introduction

With a search engine you typically build an inverted index (indexed="true") for a field: where values point to documents. DocValues is a way to build a forward index (docValues="true") so that documents point to values.
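
As a rough illustration of the two directions described above, here is a minimal Python sketch. It is a conceptual model only, not how Lucene stores either structure, and the function names are hypothetical. It assumes document ids are dense integers starting at 0.

```python
from collections import defaultdict

# Field value of each document, keyed by document id
docs = {0: "aardvark", 1: "beaver", 2: "aardvark"}

def build_inverted(docs):
    """Inverted index: each value points to the documents that contain it."""
    inverted = defaultdict(list)
    for doc_id in sorted(docs):
        inverted[docs[doc_id]].append(doc_id)
    return dict(inverted)

def build_forward(docs):
    """Forward index: each document id points to its value."""
    return [docs[doc_id] for doc_id in range(len(docs))]

inverted = build_inverted(docs)   # {"aardvark": [0, 2], "beaver": [1]}
forward = build_forward(docs)     # ["aardvark", "beaver", "aardvark"]
```

Answering "which documents contain beaver?" is a lookup in `inverted`, while sorting or faceting needs "what is the value of document 1?", which is a lookup in `forward`.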

  1. What docvalues are:
    • NRT-compatible: These are per-segment data structures built at index time and designed to be efficient for the use case where data is changing rapidly.
    • Basic query/filter support: You can do basic term, range, etc. queries on docvalues fields without also indexing them, but these are constant-score only and typically slower. If you care about performance and scoring, index the field too.
    • Better compression than fieldcache: Docvalues fields compress better than fieldcache, and "insanity" is impossible.
    • Able to store data outside of heap memory: You can specify a different docValuesFormat on the fieldType (docValuesFormat="Disk") to only load minimal data on the heap, keeping other data structures on disk.

  2. What docvalues are not:
    • Not a replacement for stored fields: These are unrelated to stored fields in every way and are instead data structures for search (sort/facet/group/join/scoring).
    • Not a huge improvement for a static index: If you have a completely static index, docvalues won't seem very interesting to you. On the other hand if you are fighting the fieldcache, read on.
    • Not for the risk-averse: The integration with Solr is very new and probably still has some exciting bugs!

Low Level Details

Lucene has four underlying types that a docvalues field can have. Currently Solr uses three of these:

  1. NUMERIC: a single-valued per-document numeric type. This is like having a large long[] array for the whole index, though the data is compressed based upon the values that are actually used.
    • For example, consider 3 documents with these values:
             doc[0] = 1005
             doc[1] = 1006
             doc[2] = 1005
      In this example the field would use around 1 bit per document, since that is all that is needed.
  2. SORTED: a single-valued per-document string type. This is like having a large String[] array for the whole index, but with an additional level of indirection. Each unique value is assigned a term number that represents its ordinal value. So each document really stores a compressed integer, and separately there is a "dictionary" mapping these term numbers back to term values.
    • For example, consider 3 documents with these values:
             doc[0] = "aardvark"
             doc[1] = "beaver"
             doc[2] = "aardvark"
      Value "aardvark" will be assigned ordinal 0, and "beaver" 1, creating these two data structures:
             doc[0] = 0
             doc[1] = 1
             doc[2] = 0
      
             term[0] = "aardvark"
             term[1] = "beaver"
  3. SORTED_SET: a multi-valued per-document string type. It is similar to SORTED, except each document has a "set" of values (in increasing sorted order). So it intentionally discards duplicate values (frequency) within a document and loses order within the document.
    • For example, consider 3 documents with these values:
             doc[0] = "cat", "aardvark", "beaver", "aardvark"
             doc[1] =
             doc[2] = "cat"
      Value "aardvark" will be assigned ordinal 0, "beaver" 1, and "cat" 2, creating these two data structures:
             doc[0] = [0, 1, 2]
             doc[1] = []
             doc[2] = [2]
      
             term[0] = "aardvark"
             term[1] = "beaver"
             term[2] = "cat"
  4. BINARY: a single-valued per-document byte[] array. This can be used for encoding custom per-document data structures.
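
The NUMERIC, SORTED, and SORTED_SET examples above can be reproduced with a short Python sketch. This is a simplified model under stated assumptions, not Lucene's implementation: the function names are hypothetical, and the bit-width estimate assumes each value is stored as an offset from the smallest value, which is only one of the compression strategies a real codec might use.

```python
def bits_per_value(values):
    """Bits needed per document when storing each value as (value - min)."""
    return (max(values) - min(values)).bit_length()

def encode_sorted(values):
    """SORTED: one ordinal per document plus a sorted term dictionary."""
    terms = sorted(set(values))
    ord_of = {term: i for i, term in enumerate(terms)}
    return [ord_of[v] for v in values], terms

def encode_sorted_set(docs):
    """SORTED_SET: a sorted, de-duplicated list of ordinals per document."""
    terms = sorted({v for doc in docs for v in doc})
    ord_of = {term: i for i, term in enumerate(terms)}
    return [sorted({ord_of[v] for v in doc}) for doc in docs], terms

# NUMERIC example: offsets from 1005 are 0, 1, 0, so 1 bit per document
numeric_bits = bits_per_value([1005, 1006, 1005])

# SORTED example: ordinals [0, 1, 0] and dictionary ["aardvark", "beaver"]
sorted_ords, sorted_terms = encode_sorted(["aardvark", "beaver", "aardvark"])

# SORTED_SET example: the duplicate "aardvark" and the original order are lost
set_ords, set_terms = encode_sorted_set(
    [["cat", "aardvark", "beaver", "aardvark"], [], ["cat"]]
)
```

Because the dictionary is sorted, comparing two ordinals gives the same result as comparing the terms themselves, which is why sorting on these fields does not need to load the string values.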

DocValues (last edited 2014-06-23 21:48:17 by HossMan)