Differences between revisions 1 and 2
Revision 1 as of 2009-02-02 09:42:41
Size: 2620
Editor: adsl-75-55-126-95
Comment:
Revision 2 as of 2009-02-02 10:20:59
Size: 3576
Editor: adsl-75-55-126-95
Comment:
General tips & tricks in designing schemas.

Mapping databases to Solr
A Solr index is effectively one flat table. Storing a set of database tables in an index generally requires denormalizing some of the tables. Attempts to avoid denormalizing usually fail.
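As a rough illustration of denormalizing, the sketch below assumes two hypothetical database tables, book(id, title, author_id) and author(id, name). None of these table or field names come from a real schema; they are placeholders. Each Solr document is one book row with the matching author columns copied into it:

```xml
<!-- Hypothetical flattened fields for schema.xml.
     One Solr document = one book row joined with its author row. -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<!-- Columns copied in from the author table at index time. -->
<field name="author_id" type="string" indexed="true" stored="true"/>
<field name="author_name" type="text" indexed="true" stored="true"/>
```

The join itself has to happen before the documents reach Solr, for example in the SQL query that feeds the indexer. The cost is that a change to one author row means reindexing every book document that carries a copy of it.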

Sorting
There are two ways of sorting available in Solr 1.4: Lucene's sorting feature and function queries.

Lucene Sorting
The Solr sort parameter uses Lucene's sorting mechanism. It creates an array containing an entry for every document in the index, and sorting is done against this array. The array is cached across requests, so repeated sorts are fast. If the field type is 'integer', the array contains only that value and thus takes 4 bytes * the number of documents; for example, an index of 10 million documents needs about 40 MB. If the field type is anything else, this integer array is created and a separate array is also created holding that field's data for each entry. Sorting is also slower if the type is not 'integer'.

However, range checks do not work on an 'integer' field. If you want range checks and fast sorting, you can create a pair of fields, one of each type, with a copyField directive:

 <field name="popularity" type="sint" indexed="true" stored="true" multiValued="false"/>
 <field name="popularitySort" type="integer" indexed="true" stored="false" />
 ...
 <copyField source="popularity" dest="popularitySort"/>

Note that multiValued=false is the default for these types, so sending a value for 'popularitySort' directly will cause an indexing error: the field always receives a value from 'popularity' as well, and would end up with two. There is also no reason to store both fields, so 'popularitySort' is index-only.

Function Query Sorting
Add this clause to your query string to sort the results using 'myIndexedField'. Do not use the 'sort=field+asc' parameter. See [FunctionQuery] for more.

_val_:"ord(myIndexedField)"

There may be performance differences between this technique and the Lucene sorting algorithm.

Alternative Text Search Field types
The "text" field type in the example schema.xml provides basic text search for English text. But it has a surprise: the actual text given to this field is not indexed as-is, and therefore searching for the raw text may not work. If you store "To Be Or Not To Be" in a "text" field, none of these words will find this document, nor will the phrase in quotes. This happens because the "text" type removes stopwords, and every word in this phrase is a common English stopword.

Phrase search
If you want phrase search to work as well as individual words, you need two fields. Both should be processed similarly, but the phrase search field should not use "stemming" or "stopwords". Usually you can populate this field using the <copyField> directive.
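A minimal sketch of such a field pair follows. The names 'text_exact', 'body' and 'body_exact' are hypothetical, and the analyzer is deliberately simpler than the example "text" type; in practice you would copy the "text" filter chain and remove only its stopword and stemming filters. In schema.xml the fieldType belongs in the types section and the field and copyField entries with the other fields:

```xml
<!-- Hypothetical type for phrase matching: tokenizes and lowercases only,
     with no stopword removal and no stemming. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- General search field, using the example "text" type. -->
<field name="body" type="text" indexed="true" stored="true"/>
<!-- Phrase search field: index-only, filled by the copyField below. -->
<field name="body_exact" type="text_exact" indexed="true" stored="false"/>

<copyField source="body" dest="body_exact"/>
```

With this in place, a quoted phrase such as "to be or not to be" would be searched against 'body_exact', while ordinary word queries go to 'body'.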

Phonemes
Programmers are perfect spellers and expect the same of their users. A phoneme represents (roughly) one distinct sound of speech. Phoneme-based searching can give users a better search experience. To support misspelled search words, phoneme filters cause the index to store phoneme-based representations of the text instead of the input.

To create a phoneme-based field, you need a text filter stack that does not include stemming or stopwords, and that adds the solr.PhoneticFilterFactory (see [AnalyzersTokenizersTokenFilters]) with one of the available encoders. This filter must be in both the indexing and query stacks. Of the several encoders available, "Double Metaphone" is the most popular and does well with non-English text. There are as yet no language-specific phoneme encoders.
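One possible shape for such a field type is sketched below. The names 'text_phonetic', 'name' and 'name_phonetic' are hypothetical, and the tokenizer choice is an assumption; only solr.PhoneticFilterFactory and its encoder setting come from the description above. A single analyzer element with no type attribute applies to both indexing and querying:

```xml
<!-- Hypothetical phoneme-based type: no stopword or stemming filters. -->
<fieldType name="text_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- inject="false" indexes only the phonetic code, not the original token. -->
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
</fieldType>

<field name="name" type="text" indexed="true" stored="true"/>
<field name="name_phonetic" type="text_phonetic" indexed="true" stored="false"/>

<copyField source="name" dest="name_phonetic"/>
```

Searching 'name_phonetic' should then match words that sound alike even when they are spelled differently, at the cost of more false positives than an exact-text field.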

For another take on assisting spelling, see [SpellCheckComponent].

SchemaDesign (last edited 2012-05-21 19:13:42 by adsl-75-51-164-120)