This page exists for the Solr Community to share Tips, Tricks, and Advice about DocValues.

Reference material previously located on this page has been migrated to the Official Solr Ref Guide. For specific details about using this feature, please consult the ref guide for the version of Solr you are using.

If you'd like to share information about how you use this feature, please add it to this page.

Introduction

With a search engine you typically build an inverted index (indexed="true") for a field, in which values point to documents. DocValues is a way to build a forward index (docValues="true"), in which documents point to values.
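
The difference between the two index directions can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the corpus and the names `docs`, `inverted`, and `forward` are hypothetical, and real Lucene segments use compressed on-disk structures rather than Python dicts and lists.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> single field value.
docs = {0: "aardvark", 1: "beaver", 2: "aardvark"}

# Inverted index: each value points to the documents that contain it.
inverted = defaultdict(list)
for doc_id in sorted(docs):
    inverted[docs[doc_id]].append(doc_id)

# Forward index: each document points to its value (dense, indexed by doc id).
forward = [docs[doc_id] for doc_id in sorted(docs)]

# Searching asks "which documents have this value?" -> inverted index.
matches = inverted["aardvark"]

# Sorting or faceting asks "what value does this document have?" -> forward index.
ordered = sorted(matches + inverted["beaver"], key=lambda doc_id: forward[doc_id])
```

Answering the second kind of question from an inverted index alone would mean walking every term to find the one that lists a given document, which is the work a forward index avoids.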

  1. What docvalues are:
    • NRT-compatible: These are per-segment data structures built at index time and designed to be efficient when data is changing rapidly.
    • Basic query/filter support: You can do basic term, range, etc. queries on docvalues fields without also indexing them, but these are constant-score only and typically slower. If you care about performance and scoring, index the field too.
    • Better compression than fieldcache: Docvalues fields compress better than fieldcache, and "insanity" is impossible.
    • Able to store data outside of heap memory: You can specify a different docValuesFormat on the fieldType (docValuesFormat="Disk") to only load minimal data on the heap, keeping other data structures on disk.
  2. What docvalues are not:
    • Not a replacement for stored fields: Docvalues are unrelated to stored fields in every way; they are data structures for search (sort/facet/group/join/scoring).
    • Not a huge improvement for a static index: If you have a completely static index, docvalues won't seem very interesting to you. On the other hand, if you are fighting the fieldcache, read on.
    • Not for the risk-averse: The integration with Solr is very new and probably still has some exciting bugs!

Low Level Details

Lucene has four underlying types that a docvalues field can have. Currently Solr uses three of these:

  1. NUMERIC: a single-valued per-document numeric type. This is like having a large long[] array for the whole index, though the data is compressed based upon the values that are actually used.

    • For example, consider 3 documents with these values:
             doc[0] = 1005
             doc[1] = 1006
             doc[2] = 1005
      
      In this example the field would use around 1 bit per document: only two distinct values occur, so one bit is enough to tell them apart.
  2. SORTED: a single-valued per-document string type. This is like having a large String[] array for the whole index, but with an additional level of indirection. Each unique value is assigned a term number that represents its ordinal value. So each document really stores a compressed integer, and separately there is a "dictionary" mapping these term numbers back to term values.

    • For example, consider 3 documents with these values:
             doc[0] = "aardvark"
             doc[1] = "beaver"
             doc[2] = "aardvark"
      
      Value "aardvark" will be assigned ordinal 0, and "beaver" 1, creating these two data structures:
             doc[0] = 0
             doc[1] = 1
             doc[2] = 0
      
             term[0] = "aardvark"
             term[1] = "beaver"
      

  3. SORTED_SET: a multi-valued per-document string type. It is similar to SORTED, except that each document has a "set" of values (in increasing sorted order). So it intentionally discards duplicate values (frequency) within a document and loses the original order of values within the document.
    • For example, consider 3 documents with these values:
             doc[0] = "cat", "aardvark", "beaver", "aardvark"
             doc[1] =
             doc[2] = "cat"
      
      Value "aardvark" will be assigned ordinal 0, "beaver" 1, and "cat" 2, creating these two data structures:
             doc[0] = [0, 1, 2]
             doc[1] = []
             doc[2] = [2]
      
             term[0] = "aardvark"
             term[1] = "beaver"
             term[2] = "cat"
      

  4. BINARY: a single-valued per-document byte[] array. This can be used for encoding custom per-document data structures.
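
The NUMERIC, SORTED, and SORTED_SET examples above can be reproduced with a small Python model. This is a sketch for illustration, not Lucene's actual encoding: the function names are hypothetical, the NUMERIC estimate assumes each value is stored as an offset from the smallest value, and Python's string ordering is assumed to match the term order (which holds for the ASCII terms used here).

```python
def numeric_bits_per_doc(values):
    """Bits per document if each value is stored as an offset from the minimum.

    Assumption: a simplified model of the compression; returns 0 when all
    values are equal.
    """
    spread = max(values) - min(values)
    return spread.bit_length()


def encode_sorted(values):
    """Return (per-document ordinals, term dictionary) for single-valued strings."""
    terms = sorted(set(values))
    ordinal = {term: i for i, term in enumerate(terms)}
    return [ordinal[value] for value in values], terms


def encode_sorted_set(values_per_doc):
    """Return (per-document ordinal lists, term dictionary) for multi-valued strings.

    Duplicates within a document are dropped and ordinals are kept in
    increasing order, so the original order of values is lost.
    """
    terms = sorted({value for doc in values_per_doc for value in doc})
    ordinal = {term: i for i, term in enumerate(terms)}
    encoded = [sorted({ordinal[value] for value in doc}) for doc in values_per_doc]
    return encoded, terms


numeric_bits = numeric_bits_per_doc([1005, 1006, 1005])
sorted_ords, sorted_terms = encode_sorted(["aardvark", "beaver", "aardvark"])
set_ords, set_terms = encode_sorted_set(
    [["cat", "aardvark", "beaver", "aardvark"], [], ["cat"]]
)
```

With the inputs from the examples above, `numeric_bits` is 1, `sorted_ords` is `[0, 1, 0]`, and `set_ords` is `[[0, 1, 2], [], [2]]`, matching the structures shown for each type.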
