Federated Search Design
Motivation and Goals
There is a need to support homogeneous indices that are too large to fit on a single machine while still providing acceptable latency.
- split an index into multiple pieces and be able to search across those pieces as if it were a single index.
- retain high-availability for queries... the search service should be able to survive single server outages.
- automate index management so clients don't have to determine which index a document belongs to (especially important for overwrites or deletes)
Nice to haves:
- Retain ability to have complex multi-step query handler plugins
- Retain index view consistency in a handler request (i.e. same query executed twice in a single request is guaranteed same results)
- distributed global idf calculations (a component of scoring factoring in the rareness of a term)
Merge current XML
Create an external service to simply combine the current XML results from handlers.
If sorting by something other than score, modifications would need to be made to always return the sort criteria with the document to enable merging.
Slight problem: the strings that Solr uses to represent integers and floats in a sortable/rangeable representation are *not* ordinary text, and XML isn't capable of representing all Unicode code points. Higher-level escaping would be needed, or the use of another format like JSON.
If the merger were solr-schema aware, we could use the "external" form of the sort keys in the XML and still merge correctly by translating to index form before comparing.
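The merge step described above can be sketched as a k-way merge over result lists that each sub-index has already sorted. This is only an illustration under assumptions: the `(sort_key, doc_id)` hit shape, the function name `merge_results`, and the `to_index_form` hook (standing in for the schema-aware translation from external to index form) are all hypothetical and not part of Solr.

```python
import heapq
import itertools

def merge_results(per_index_hits, rows, to_index_form=lambda key: key):
    """Merge pre-sorted hit lists from several sub-indices, keeping the top `rows`.

    per_index_hits: iterable of lists of (sort_key, doc_id) tuples (hypothetical
    shape). Each list must already be sorted ascending under to_index_form.
    to_index_form: translates the external sort key into a comparable form.
    """
    merged = heapq.merge(*per_index_hits, key=lambda hit: to_index_form(hit[0]))
    return list(itertools.islice(merged, rows))

# External integer keys compare incorrectly as strings ("10" < "9"),
# so they are translated before comparing.
hits_a = [("9", "docA"), ("20", "docC")]
hits_b = [("10", "docB")]
top = merge_results([hits_a, hits_b], rows=3, to_index_form=int)
```

Because the merge is lazy, only as many hits as `rows` are pulled from the sub-index lists.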
Merging other data
The information that could be merged would be from a pre-determined set.
- highlighting - easily merged
- debugging - might need tweaking of the debugging format to more easily pick out specific documents
- faceted browsing - doable for simple faceted browsing that is planned for the standard request handlers.
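For simple faceted browsing, merging could amount to summing the per-value counts reported by each sub-index. The sketch below assumes each sub-index returns a plain mapping of facet value to count; the function name `merge_facet_counts` is hypothetical.

```python
from collections import Counter

def merge_facet_counts(per_index_counts, limit):
    """Sum facet counts from every sub-index and return the `limit` largest."""
    total = Counter()
    for counts in per_index_counts:
        total.update(counts)  # adds counts value by value
    return total.most_common(limit)

merged_facets = merge_facet_counts(
    [{"red": 3, "blue": 1}, {"red": 2, "green": 4}],
    limit=2,
)
```

One caveat with this approach: if each sub-index returns only its own top N values, a value that is common overall but just below the cutoff on some sub-indices will be undercounted, so sub-indices may need to return more values than the final limit.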
Stateless request handlers
Have request handlers and APIs that don't use docids, and don't require query consistency.
The big distinction here is index consistency... a single request handler would be guaranteed that the view of the index does not change during a single request (just like non-federated Solr).
If the index doesn't change then:
- internal lucene docids may be used in APIs (returned to the query handler, etc)
- total consistency for multi-step requests
- total consistency for any global idf calculations (a multi-step process)
- possibly better on bandwidth (not all document info needs to be sent back from each segment)
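The multi-step nature of a global idf calculation can be illustrated as follows: first collect the document frequency and document count for a term from every sub-index, then compute a single idf from the totals. The function name and the `(doc_freq, num_docs)` input shape are assumptions, and the formula used here (1 + ln(numDocs / (docFreq + 1))) is modeled on Lucene's classic similarity rather than taken from this design.

```python
import math

def global_idf(per_index_stats):
    """Compute one idf value for a term from per-sub-index statistics.

    per_index_stats: iterable of (doc_freq, num_docs) pairs, one per sub-index
    (hypothetical shape). Both steps must see the same index view, which is
    why index consistency matters here.
    """
    stats = list(per_index_stats)
    doc_freq = sum(df for df, _ in stats)
    num_docs = sum(n for _, n in stats)
    if num_docs == 0:
        raise ValueError("cannot compute idf over an empty collection")
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# A term found in 1 of 10 documents on one slice and 0 of 10 on another.
idf = global_idf([(1, 10), (0, 10)])
```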
Follow the basic Lucene design for MultiSearcher/RemoteSearcher as a template.
SolrMultiSearcher would implement SolrSearchable via multiple SolrSearchers, implementing the logic of combining search results from multiple subsearchers. The implementation should be network friendly (no HitCollectors, avoid passing around DocSets/BitSets, etc).
Areas that will need change:
- Solr's caches don't contain enough info to merge search results from subsearchers
  - could subclass DocList and add sort info, and cache that
  - could dynamically add the sort info if requested via the FieldCache... this would make Solr's result cache smaller.
a DocList w/ field info
Should this be more of a public API, or a private one? For RMI, it should definitely be private...
- optional global idf calculations
- new style APIs geared toward faceted browsing (avoid instantiating DocSets... pass around symbolic sets)
Consistency via Retry
Maintaining a consistent view of the index can be difficult. An alternative approach is to return an index version with every request to a sub-index. If that version changes during the course of a request, the entire request can be restarted.
- no distributed garbage collection required... always use the latest one.
- scalability not as good... if different index parts are committing frequently at different times, the retry rate goes up as the number of sub-indices increases.
Related idea: allow a specific index version to be optionally specified during querying, and delay closing old indices for a short, configurable amount of time (e.g. 5 seconds) to allow requests to finish.
How can High Availability be obtained on the query side?
- sub-searchers could be identified by VIPs (top-level-searcher would go through a load-balancer to access sub-searchers).
- could do it in code via HASolrMultiSearcher that takes a list of sub-servers for each index slice.
- would need to implement failover... including not retrying a failed server for a certain amount of time (after a certain number of failures)
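The in-code failover option might be sketched as below for a single index slice. This is not the HASolrMultiSearcher itself, only an illustration of trying replicas in order and skipping a server for a cool-down period after repeated failures. The class name `HASliceClient`, its parameters, and the `send` callback are hypothetical.

```python
import time

class HASliceClient:
    """Tries the servers for one index slice in order, with a failure cool-down."""

    def __init__(self, servers, max_failures=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self._servers = list(servers)
        self._max_failures = max_failures
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = {s: 0 for s in self._servers}
        self._retry_after = {s: 0.0 for s in self._servers}

    def query(self, send):
        """send(server) performs the request and raises ConnectionError on failure."""
        now = self._clock()
        last_error = None
        for server in self._servers:
            if self._retry_after[server] > now:
                continue  # still cooling down after repeated failures
            try:
                result = send(server)
            except ConnectionError as exc:
                last_error = exc
                self._failures[server] += 1
                if self._failures[server] >= self._max_failures:
                    self._retry_after[server] = now + self._cooldown
                    self._failures[server] = 0
                continue
            self._failures[server] = 0
            return result
        raise ConnectionError("no server available for this slice") from last_error
```

A load balancer in front of VIPs would make this logic unnecessary in the top-level searcher, at the cost of extra infrastructure.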
How should the collection be updated? It would be complex for the client to partition the data themselves, since they would have to ensure that a particular document always went to the same server. Although user partitioning should be possible, there should be an easier default.
A single master could partition the data into multiple local indices and subsearchers would only pull the local index they are configured to have.
- hash based on unique key field to get target index
There could be a master for each slice of the index. An external module could provide the update interface and forward the request to the correct master based on the unique key field.
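Either variant depends on mapping a document's unique key to the same slice every time. A minimal sketch, assuming a fixed number of slices and using CRC32 as one possible stable hash (the choice of hash and the name `target_index` are assumptions, not part of the design):

```python
import zlib

def target_index(unique_key: str, num_slices: int) -> int:
    """Map a document's unique key to a slice number in [0, num_slices)."""
    # A stable hash is needed: Python's built-in hash() for strings is
    # randomized per process, so it would route the same key differently
    # after a restart.
    return zlib.crc32(unique_key.encode("utf-8")) % num_slices

slice_for_doc = target_index("doc-42", 4)
```

Since overwrites and deletes use the same key, they land on the slice that holds the original document. Note that changing `num_slices` would remap most keys under this simple modulo scheme.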
How to synchronize commits across subsearchers and top-level-searchers?