Differences between revisions 2 and 3
Revision 2 as of 2006-07-11 17:26:20
Size: 2175
Editor: YonikSeeley
Comment:
Revision 3 as of 2006-08-25 01:37:50
Size: 4389
Editor: YonikSeeley
Comment:
Federated Search Design

Motivation and Goals

There is a need to support homogeneous indices that are too large to fit on a single machine while still providing acceptable latency.

Goals:

  • split an index into multiple pieces and be able to search across those pieces as if it were a single index.
  • retain high-availability for queries... the search service should be able to survive single server outages.
  • automate index management so clients don't have to determine which index a document belongs to (especially important for overwrites or deletes)

Nice to haves:

  • Retain ability to have complex multi-step query handler plugins
  • Retain index view consistency in a handler request (i.e. the same query executed twice in a single request is guaranteed to return the same results)
  • distributed global idf calculations (a component of scoring factoring in the rareness of a term)

Simple Federation

Merge current XML

Create an external service to simply combine the current XML results from handlers.

Merging documents

If sorting by something other than score, modifications would need to be made to always return the sort criteria with the document to enable merging.

This is slightly more difficult than it appears: the strings that Solr uses to represent integers and floats in a sortable/rangeable form are *not* plain text, and XML cannot represent all Unicode code points. Higher-level escaping would be needed, or the use of another format such as JSON.

If the merger were solr-schema aware, we could use the "external" form of the sort keys in the XML and still merge correctly by translating to index form before comparing.
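The schema-aware merge described above can be sketched as follows. This is only an illustration under stated assumptions: each sub-index returns its hits already sorted ascending, every hit carries the external form of its sort key, and `to_index_form` stands in for whatever schema-driven translation the merger would use. The function name and the hit layout are hypothetical, not part of Solr.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, to_index_form, rows):
    """Merge per-shard hit lists into one globally sorted list of doc ids.

    shard_results: list of lists of (external_sort_key, doc_id) pairs,
                   each list already sorted ascending by its sort key.
    to_index_form: hypothetical translation from the external key string
                   to a value that compares in index order.
    rows:          number of merged hits to return.
    """
    keyed = (
        [(to_index_form(key), doc_id) for key, doc_id in hits]
        for hits in shard_results
    )
    # heapq.merge performs a lazy k-way merge of already sorted inputs.
    merged = heapq.merge(*keyed)
    return [doc_id for _, doc_id in islice(merged, rows)]


# External integer keys: comparing them as raw strings would put "10" before "9".
shard_a = [("9", "a1"), ("10", "a2"), ("250", "a3")]
shard_b = [("7", "b1"), ("100", "b2")]
top = merge_shard_results([shard_a, shard_b], int, rows=4)
```

Translating the keys before comparing is what keeps `"10"` after `"9"`; a merger that is not schema aware would need the sortable index form shipped in the response instead.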

Merging other data

The information that could be merged would be from a pre-determined set.

  • highlighting - easily merged
  • debugging - might need tweaking of the debugging format to more easily pick out specific documents
  • faceted browsing - whatever could be done in a single-shot request should be OK
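As a rough illustration of merging single-shot facet results, the sketch below sums per-value counts from each sub-index and keeps the highest totals. The input layout (one mapping of facet value to count per sub-index) and the function name are assumptions made for this example, not an existing Solr format.

```python
from collections import Counter


def merge_facet_counts(per_shard_counts, limit):
    """Sum facet counts across sub-indexes and return the top `limit` values.

    per_shard_counts: list of {facet_value: count} mappings, one per sub-index.
    """
    total = Counter()
    for counts in per_shard_counts:
        total.update(counts)  # Counter.update adds counts together
    # Highest count first; ties broken by value so the order is deterministic.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


merged_facets = merge_facet_counts(
    [{"red": 5, "blue": 2}, {"blue": 4, "green": 1}],
    limit=2,
)
```

If each sub-index only returns its own top values, the summed counts can undercount values that fell below the cutoff on some sub-index, so the merged totals are exact only when every sub-index reports every value.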

Stateless request handlers

Have request handlers and APIs that don't use docids, and don't require query consistency.

Complex Federation

Follow the basic Lucene design for MultiSearcher/RemoteSearcher as a template.

Areas that will need change:

  • Solr's caches don't contain enough info to merge search results from subsearchers
    • could subclass DocList and add sort info, and cache that
    • could dynamically add the sort info if requested via the FieldCache... this would make Solr's result cache smaller.
    • probably want to re-use FieldDocSortedHitQueue, which means returning TopFieldDocs, or creating them on the fly from

Network Transports

  • RMI
  • XML

Should this be more of a public API, or a private one? For RMI, it should definitely be private...

Misc:

  • optional global idf calculations
  • new style APIs geared toward faceted browsing (avoid instantiating DocSets... pass around symbolic sets)

High Availability

How can High Availability be obtained on the query side?

  • sub-searchers could be identified by VIPs (top-level-searcher would go through a load-balancer to access sub-searchers).
  • could do it in code via HASolrMultiSearcher that takes a list of sub-servers for each
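The in-code option could look roughly like the sketch below, which assumes that each piece of the index is served by several interchangeable sub-servers and that a failed sub-server raises `ConnectionError`. The function and its arguments are hypothetical stand-ins; they do not describe an existing HASolrMultiSearcher.

```python
def search_with_failover(replica_groups, query):
    """Query every index piece, trying each of its sub-servers in turn.

    replica_groups: list with one entry per index piece; each entry is a list
                    of callables that take a query and return that piece's hits.
    """
    results = []
    for replicas in replica_groups:
        last_error = None
        for search in replicas:
            try:
                results.append(search(query))
                break  # this piece answered; move on to the next piece
            except ConnectionError as err:
                last_error = err  # remember the failure and try the next sub-server
        else:
            # No sub-server for this piece answered.
            raise RuntimeError("no live sub-server for an index piece") from last_error
    return results


def down(query):
    raise ConnectionError("sub-server unreachable")


def piece_one(query):
    return ["doc1"]


def piece_two(query):
    return ["doc7"]


hits = search_with_failover([[down, piece_one], [piece_two]], "solr")
```

A load balancer in front of each group of sub-servers (the VIP option) would give the same effect without putting the retry logic in the top-level searcher.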

Master

How should the collection be updated? It would be complex for clients to partition the data themselves, since they would have to ensure that a particular document always goes to the same server. Although user partitioning should be possible, there should be an easier default.

Single Master

A single master could partition the data into multiple local indices, and subsearchers would only pull the local index they are configured to have.

  • hash based on unique key field to get target index
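A minimal sketch of the hash-based routing, assuming the unique key is a string and the number of local indices is fixed. CRC32 is used here only because it is stable across processes and machines; the choice of hash function and the name `target_index` are assumptions for illustration.

```python
import zlib


def target_index(unique_key, num_indices):
    """Map a document's unique key to the number of the local index that owns it."""
    # zlib.crc32 is deterministic, unlike Python's built-in hash() for strings,
    # which is randomized per process by default.
    return zlib.crc32(unique_key.encode("utf-8")) % num_indices


first = target_index("doc-42", 4)
again = target_index("doc-42", 4)
```

Because the same key always maps to the same index, an overwrite or delete reaches the index that holds the original document. Changing `num_indices` remaps most keys, so repartitioning would need to be handled separately.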

Commits

How to synchronize commits across subsearchers and top-level-searchers?

FederatedSearch (last edited 2009-09-20 22:05:24 by localhost)