The schema.xml file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields.
A sample Solr schema.xml with detailed comments can be found in the Source Repository.
:TODO: we should try to make a DTD for the schema
Data Types
The <types> section allows you to define a list of <fieldtype> declarations you wish to use in your schema, along with the underlying Solr class that should be used for each type, as well as the default options you want for fields that use that type.
Any subclass of FieldType may be used as a field type class, using either its full package name, or the "solr" alias if it is in the default Solr package. For common numeric types (integer, float, etc.) there are multiple implementations provided, depending on your needs. Please see SolrPlugins for information on how to ensure that your own custom Field Types can be loaded into Solr.
Common options that field types can have are...
sortMissingLast=true|false
sortMissingFirst=true|false
indexed=true|false
stored=true|false
multiValued=true|false
omitNorms=true|false
positionIncrementGap=N
TextFields can also support Analyzers with highly configurable Tokenizers and Token Filters.
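As a rough illustration of the options above, a <types> section might look like the following sketch. The type names ("string", "text") and the option values are assumptions chosen for illustration, not required settings.

```xml
<types>
  <!-- Untokenized string type; documents without a value sort last -->
  <fieldtype name="string" class="solr.StrField"
             sortMissingLast="true" omitNorms="true"/>

  <!-- Full-text type with a simple analyzer: split on whitespace, then lowercase -->
  <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldtype>
</types>
```

Here the "solr" alias stands in for the default Solr package, as described above.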
:TODO: do omitNorms and positionIncrementGap have any meaning for non-TextFields?
Field types that store text (TextField, StrField) support compression of stored contents:
compressed=true|false
compressThreshold=<integer>
compressThreshold is the minimum length required for text compression to be invoked. This applies only if compressed=true; a common pattern is to set compressThreshold on the field type definition, and turn compression on and off in the individual field definitions.
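The pattern described above might be sketched as follows. The type name "text_stored", the field names, and the threshold of 512 are hypothetical values used only for illustration.

```xml
<!-- In the <types> section: set the threshold once on the type -->
<fieldtype name="text_stored" class="solr.TextField" compressThreshold="512"/>

<!-- In the <fields> section: switch compression on or off per field -->
<field name="body"  type="text_stored" indexed="true" stored="true" compressed="true"/>
<field name="title" type="text_stored" indexed="true" stored="true" compressed="false"/>
```

With this arrangement only "body" is a candidate for compression, and only when its stored value reaches the threshold length.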
Fields
The <fields> section is where you list the individual <field> declarations you wish to use in your documents. Each <field> has a name that you will use to reference it when adding documents or executing searches, and an associated type which identifies the name of the fieldtype you wish to use for this field. There are various field options that apply to a field. These can be set in the field type declarations, and can also be overridden at an individual field's declaration.
Common field options
Common options that fields can have are...
indexed=true|false
True if this field should be "indexed". If (and only if) a field is indexed, then it is searchable, sortable, and facetable.
stored=true|false
True if the value of the field should be retrievable during a search.
compressed=true|false
compressThreshold=<integer>
multiValued=true|false
True if this field may contain multiple values per document, i.e. if it can appear multiple times in a document.
omitNorms=true|false
This is arguably an advanced option.
Set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms.
See also FieldOptionsByUseCase, which discusses how these options should be set in various circumstances. See SolrPerformanceFactors for how different options can affect Solr performance.
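A <fields> section using these options might look like the sketch below. The field names are hypothetical, and the "string" and "text" types are assumed to have been declared in the <types> section.

```xml
<fields>
  <!-- Searchable and retrievable identifier -->
  <field name="id" type="string" indexed="true" stored="true"/>

  <!-- Full-text field: indexed for searching, but not returned in results -->
  <field name="body" type="text" indexed="true" stored="false"/>

  <!-- May appear several times per document; norms omitted since no
       length normalization or index-time boost is needed -->
  <field name="category" type="string" indexed="true" stored="true"
         multiValued="true" omitNorms="true"/>
</fields>
```

Any option not given on a <field> falls back to the value set on its field type.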
Dynamic fields
One of the powerful features of Lucene is that you don't have to pre-define every field when you first create your index. Even though Solr provides strong datatyping for fields, it still preserves that flexibility using "Dynamic Fields". Using <dynamicField> declarations, you can create field rules that Solr will use to understand what datatype should be used whenever it is given a field name that is not explicitly defined, but matches a prefix or suffix used in a dynamicField.
For example, the following dynamic field declaration tells Solr that whenever it sees a field name ending in "_i" which is not an explicitly defined field, it should dynamically create an integer field with that name...
<dynamicField name="*_i" type="integer" indexed="true" stored="true"/>
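A prefix rule works the same way. In this sketch the "attr_" prefix is a hypothetical naming convention and the "string" type is assumed to be declared elsewhere in the schema.

```xml
<!-- Any undeclared field whose name starts with "attr_" becomes a stored string field -->
<dynamicField name="attr_*" type="string" indexed="true" stored="true"/>
```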
Indexing same data in multiple fields
Note that, with textual data, it will often make sense to take what is, logically speaking, a single field (e.g. product name) and index it into several different Solr fields, each with different field options and/or analyzers.
As an example, if I had a field with a list of authors, such as:
Schildt, Herbert; Wolpert, Lewis; Davies, P.
I might want to index the same data differently in three different fields (perhaps using the Solr copyField directive):
For searching: Tokenized, case-folded, punctuation-stripped:
schildt / herbert / wolpert / lewis / davies / p
For sorting: Untokenized, case-folded, punctuation-stripped:
schildt herbert wolpert lewis davies p
For faceting: Primary author only, using a solr.StrField:
Schildt, Herbert
(See also SolrFacetingOverview.)
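One way the three author fields could be declared is sketched below. All field and type names are hypothetical, and the analyzer chains are one possible choice rather than a recommended configuration. Note that copyField duplicates the whole source value, so the primary-author field is assumed to be supplied separately by the indexing client.

```xml
<!-- In <types> -->
<fieldtype name="string" class="solr.StrField"/>
<!-- Tokenized and case-folded, for searching -->
<fieldtype name="text_search" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
<!-- Whole value kept as one token, case-folded, punctuation removed, for sorting -->
<fieldtype name="text_sort" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[^a-z ]" replacement="" replace="all"/>
  </analyzer>
</fieldtype>

<!-- In <fields> -->
<field name="authors"        type="text_search" indexed="true" stored="true"/>
<field name="authors_sort"   type="text_sort"   indexed="true" stored="false"/>
<field name="primary_author" type="string"      indexed="true" stored="true"/>

<!-- After </fields>: feed the sort field from the search field's input -->
<copyField source="authors" dest="authors_sort"/>
```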
Expert field options
The storage of Lucene term vectors can be triggered using the following field options:
termVectors=true|false
termPositions=true|false
termOffsets=true|false
These options can be used to accelerate highlighting and other ancillary functionality, but impose a substantial cost in terms of index size. They are not necessary for typical uses of Solr (phrase queries, etc., do not require these settings to be present).
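A field that stores full term vectors might be declared as in this sketch. The field name "body" and type "text" are assumptions.

```xml
<!-- Stores term vectors with positions and offsets; expect a noticeably larger index -->
<field name="body" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```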
Miscellaneous Settings
In addition to the <types> and <fields> sections of the schema, there are several other declarations that can appear in your schema...
The Unique Key Field
The <uniqueKey> declaration can be used to inform Solr that there is a field in your index which should be unique for all documents. If a document is added that contains the same value for this field as an existing document, the old document will be deleted.
It is not mandatory for a schema to have a uniqueKey field.
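A minimal sketch, assuming a field named "id" of a hypothetical "string" type:

```xml
<!-- In <fields> -->
<field name="id" type="string" indexed="true" stored="true"/>

<!-- After </fields>: adding a document with an existing id replaces the old one -->
<uniqueKey>id</uniqueKey>
```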
The Default Search Field
The <defaultSearchField> is used by Solr when parsing queries to identify which field name should be searched in queries where an explicit field name has not been used.
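For instance, assuming the schema declares a field named "text" (a hypothetical name):

```xml
<!-- A query such as "solr" with no field prefix is then searched against the "text" field -->
<defaultSearchField>text</defaultSearchField>
```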
Default query parser operator
The default operator used by Solr's query parser (SolrQueryParser) can be configured with <solrQueryParser defaultOperator="AND|OR"/>. The default operator is "OR" if unspecified.
Copy Fields
Any number of <copyField> declarations can be included in your schema, to instruct Solr that you want it to duplicate any data it sees in the "source" field of documents that are added to the index, in the "dest" field of that document. You are responsible for ensuring that the datatypes of the fields are compatible, but Solr will process the information in the "dest" field using the appropriate field type (and Analyzer if it's a TextField).
This is provided as a convenient way to ensure that data is put into several fields, without needing to include the data in the update command multiple times.
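As a sketch, two source fields could be funnelled into one catch-all search field. The field names and the "text" type are assumptions for illustration.

```xml
<!-- In <fields> -->
<field name="title"       type="text" indexed="true" stored="true"/>
<field name="description" type="text" indexed="true" stored="true"/>
<!-- Destination receives values from several sources, so it is multiValued -->
<field name="text"        type="text" indexed="true" stored="false" multiValued="true"/>

<!-- After </fields> -->
<copyField source="title"       dest="text"/>
<copyField source="description" dest="text"/>
```

The update command then only needs to supply "title" and "description"; the "text" field is populated at index time.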
Similarity
A <similarity> declaration can be used to specify the subclass of Similarity that you want Solr to use when dealing with your index. If no Similarity class is specified, the Lucene DefaultSimilarity is used. Please see SolrPlugins for information on how to ensure that your own custom Similarity can be loaded into Solr.
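A declaration might look like the following sketch. It names the Lucene default explicitly; the package path shown is an assumption about the Lucene version in use, and a custom subclass would be referenced by its own fully qualified class name instead.

```xml
<!-- Equivalent to omitting the declaration, assuming this is the default class in your Lucene version -->
<similarity class="org.apache.lucene.search.DefaultSimilarity"/>
```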