Federated Search Design

Motivation and Goals

There is a need to support homogeneous indices that are larger than will fit on a single machine while still providing acceptable latency.

Goals:

  • split an index into multiple pieces and be able to search across those pieces as if it were a single index.
  • retain high-availability for queries... the search service should be able to survive single server outages.
  • automate index management so clients don't have to determine which index a document belongs to (especially important for overwrites or deletes)

Nice to haves:

  • Retain ability to have complex multi-step query handler plugins
  • Retain index view consistency in a handler request (i.e. same query executed twice in a single request is guaranteed same results)
  • distributed global idf calculations (a component of scoring factoring in the rareness of a term)

Simple Federation

Merge current XML

Create an external service to simply combine the current XML results from handlers.

Merging documents

If sorting by something other than score, modifications would need to be made to always return the sort criteria with the document to enable merging.

Slight problem: the strings that Solr uses to represent integers and floats in a sortable/rangeable representation are *not* plain text, and XML isn't capable of representing all Unicode code points. Higher-level escaping would be needed, or the use of another format like JSON.

If the merger were solr-schema aware, we could use the "external" form of the sort keys in the XML and still merge correctly by translating to index form before comparing.
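The merge step itself is straightforward once every shard returns its sort keys in a comparable (index) form. A minimal sketch, assuming each shard's results arrive already sorted and that sort keys compare as plain strings (class and method names here are illustrative, not Solr APIs):

```java
import java.util.*;

// Hypothetical sketch (names are illustrative, not Solr APIs): merge
// per-shard result lists that are each already sorted by a sort key,
// comparing keys in their index (internal) form so ordering matches.
public class ShardMerger {
    static class ShardDoc {
        final String id;         // unique key of the document
        final String sortKey;    // sort criterion returned with the doc
        ShardDoc(String id, String sortKey) { this.id = id; this.sortKey = sortKey; }
    }

    /** K-way merge of already-sorted shard results, keeping the top n overall. */
    public static List<ShardDoc> merge(List<List<ShardDoc>> shardResults, int n) {
        // Each queue entry: (shard index, position within that shard's list)
        PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) ->
            shardResults.get(a[0]).get(a[1]).sortKey
                .compareTo(shardResults.get(b[0]).get(b[1]).sortKey));
        for (int s = 0; s < shardResults.size(); s++)
            if (!shardResults.get(s).isEmpty()) pq.add(new int[]{s, 0});
        List<ShardDoc> out = new ArrayList<>();
        while (!pq.isEmpty() && out.size() < n) {
            int[] top = pq.poll();
            List<ShardDoc> list = shardResults.get(top[0]);
            out.add(list.get(top[1]));
            // advance within the shard the winner came from
            if (top[1] + 1 < list.size()) pq.add(new int[]{top[0], top[1] + 1});
        }
        return out;
    }
}
```

Lucene's FieldDocSortedHitQueue plays a similar role when merging TopFieldDocs; the sketch just makes the k-way merge explicit.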

Merging other data

The information that could be merged would be from a pre-determined set.

  • highlighting - easily merged
  • debugging - might need tweaking of the debugging format to more easily pick out specific documents
  • faceted browsing - doable for simple faceted browsing that is planned for the standard request handlers.

Stateless request handlers

Have request handlers and APIs that don't use docids, and don't require query consistency.

Complex Federation

The big distinction here is index consistency... a single request handler would be guaranteed that the view of the index does not change during a single request (just like non-federated Solr).

If the index doesn't change then:

  • internal lucene docids may be used in APIs (returned to the query handler, etc)
  • total consistency for multi-step requests
  • total consistency for any global idf calculations (a multi-step process)
  • possibly better on bandwidth (all document info need not be sent back from each segment)

Follow the basic Lucene design for MultiSearcher/RemoteSearcher as a template.

Areas that will need change:

  • Solr's caches don't contain enough info to merge search results from subsearchers
    • could subclass DocList and add sort info, and cache that
    • could dynamically add the sort info if requested via the FieldCache... this would make Solr's result cache smaller.
    • might want to re-use FieldDocSortedHitQueue, which means returning TopFieldDocs, or creating them on the fly from a DocList w/ field info
  • SolrQueryRequest currently returns a Searcher... it would need to be a MultiSearcher

Optional:

  • Support for plugins at the sub-searcher level? Normally, custom handlers would only be invoked at the super-searcher level... but if there are any operations that require low-level IndexReader access to perform efficiently (and we don't provide an API), it would be nice to allow a super-searcher handler to invoke a sub-searcher handler. User code would be invoked to merge results from all the sub-searchers.

Transport Syntax:

If custom query handlers are to be allowed with Federated Search, there is a problem of how to represent Queries. The Lucene MultiSearcher distributes Weights to sub-searchers... this was done for global-idf reasons... the weights are calculated in the super-searcher and can thus use doc counts from all sub-searchers.

  • Use query strings (Lucene QueryParser)
    • Query strings need to be reparsed by every sub-searcher, increasing CPU
    • Custom query handlers would be limited to using query strings to remote APIs, as the process isn't reversible (can't go from Query to QueryString)
    • Attempt to provide schema-aware re-parsable Query.toString() support for all query classes we can think of (but there could always be others we don't know about)
    • many query types aren't representable in QueryParser syntax

  • Use serialized Query objects (binary)
    • re-use built-in support already there for serializing objects
    • Solr's caches are Query based, so it would be better to use Query rather than Weight to pass to subsearchers
    • Global IDF can still be done w/o passing Weights... global term doc counts can be passed with the request, and the subsearchers should be able to weight accordingly.
    • serialized Query objects are binary... this means the transport syntax would be RMI, or at least HTTP with a binary body.
  • Use serialized Query objects (text)
    • Use some other human-readable representation of queries... good for debugging, bad for support of unknown queries.
    • could perhaps wrap binary serialization (base64) for query types we don't know about.
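The global-IDF point above can be sketched concretely: each sub-searcher reports its local docFreq and maxDoc for the query terms, the super-searcher sums them, and the summed counts travel with the request so every shard weights terms identically. A minimal sketch, where the class and method names are hypothetical; only the idf formula mirrors Lucene's classic Similarity:

```java
import java.util.*;

// Hypothetical sketch of computing global IDF from per-shard statistics.
// Names are illustrative; the formula is Lucene's classic
// idf(t) = 1 + ln(numDocs / (docFreq + 1)).
public class GlobalIdf {
    /** Per-shard statistics for one term. */
    static class ShardStats {
        final long docFreq;  // docs on this shard containing the term
        final long maxDoc;   // total docs on this shard
        ShardStats(long docFreq, long maxDoc) { this.docFreq = docFreq; this.maxDoc = maxDoc; }
    }

    /** Sum the shard counts, then apply the classic Lucene idf formula. */
    public static double globalIdf(List<ShardStats> shards) {
        long df = 0, n = 0;
        for (ShardStats s : shards) { df += s.docFreq; n += s.maxDoc; }
        return 1.0 + Math.log((double) n / (df + 1));
    }
}
```

The super-searcher would compute this once per term and pass the result (or the raw summed counts) to every sub-searcher, so no Weight objects need to cross the wire.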

Network Transports:

  • RMI
    • If distributed garbage collection is needed for maintaining a consistent view, RMI already does this
    • Are connections too sticky? Will the same client always go to the same server if we are going through a load balancer?
      • for the most control, implement load balancing ourselves
  • XML/HTTP
    • easier to debug, easier to load-balance?
    • requires a lot of marshalling code

New-style APIs geared toward faceted browsing are needed for a distributed solution (avoid instantiating DocSets... pass around symbolic sets and set operations)
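Merging simple facet counts across shards is a concrete case of the above: each shard returns its local term counts, the super-searcher sums them per term, then takes the overall top-k. A minimal sketch, assuming each shard returns complete counts (not an actual Solr API):

```java
import java.util.*;

// Hypothetical sketch: merge per-shard facet counts by summing per term,
// then return the top-k terms overall. Not an actual Solr API.
public class FacetMerger {
    public static List<Map.Entry<String, Long>> mergeTopK(
            List<Map<String, Long>> shardCounts, int k) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> counts : shardCounts)
            counts.forEach((term, c) -> totals.merge(term, c, Long::sum));
        List<Map.Entry<String, Long>> out = new ArrayList<>(totals.entrySet());
        // highest count first; break ties by term for a stable ordering
        out.sort(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry::getKey));
        return out.subList(0, Math.min(k, out.size()));
    }
}
```

Note this is only exact when each shard returns its full counts; truncated per-shard lists are exactly why the symbolic-sets approach above is attractive.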

Consistency via Retry

Maintaining a consistent view of the index can be difficult. A different method would return an index version with any request to a sub-index. If that version ever changed during the course of a request, the entire request could be restarted.
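A sketch of that retry scheme, with hypothetical interfaces (the `SubIndex` type and its methods are illustrative; the only assumption from the text is that every sub-index can report the version it is serving):

```java
import java.util.*;

// Hypothetical sketch of consistency-via-retry: snapshot each sub-index's
// version, run the searches, then re-check the versions. If any shard
// committed mid-request, throw the partial results away and restart.
public class RetrySearch {
    interface SubIndex {
        long version();                 // current index version on this shard
        List<String> search(String q);  // results for this shard
    }

    public static List<String> consistentSearch(List<SubIndex> shards, String q,
                                                int maxRetries) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            long[] before = new long[shards.size()];
            for (int i = 0; i < shards.size(); i++) before[i] = shards.get(i).version();
            List<String> merged = new ArrayList<>();
            for (SubIndex s : shards) merged.addAll(s.search(q));
            boolean consistent = true;
            for (int i = 0; i < shards.size(); i++)
                if (shards.get(i).version() != before[i]) { consistent = false; break; }
            if (consistent) return merged;   // no shard changed under us
        }
        throw new IllegalStateException("index kept changing; giving up");
    }
}
```

The retry bound matters because of the scalability concern below: the more shards commit independently, the more often the version check fails.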

Advantages:

  • no distributed garbage collection required... always use the latest index.

Disadvantages:

  • scalability not as good... if different index parts are committing frequently at different times, the retry rate goes up as the number of sub-indices increases.

Related idea: allow the optional specification of a specific index version during querying, and delay closing old indices for a short amount of time (5 seconds... configurable) to allow in-flight requests to finish.

High Availability

How can High Availability be obtained on the query side?

  • sub-searchers could be identified by VIPs (top-level-searcher would go through a load-balancer to access sub-searchers).
  • could do it in code via HASolrMultiSearcher that takes a list of sub-servers for each index slice.
    • would need to implement failover... including not retrying a failed server for a certain amount of time (after a certain number of failures)
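The HASolrMultiSearcher idea above implies client-side failover. A minimal sketch of the blacklisting behavior described (all names hypothetical; thresholds and cooldowns would be configurable):

```java
import java.util.*;

// Hypothetical sketch of client-side failover across replicas of one index
// slice: after maxFailures consecutive errors, a server is skipped for
// cooldownMillis before being retried. Names are illustrative.
public class FailoverClient {
    interface Server { String query(String q) throws Exception; }

    private final List<Server> replicas;
    private final int maxFailures;
    private final long cooldownMillis;
    private final int[] failures;            // consecutive failures per replica
    private final long[] blackedOutUntil;    // wall-clock time to retry after

    FailoverClient(List<Server> replicas, int maxFailures, long cooldownMillis) {
        this.replicas = replicas;
        this.maxFailures = maxFailures;
        this.cooldownMillis = cooldownMillis;
        this.failures = new int[replicas.size()];
        this.blackedOutUntil = new long[replicas.size()];
    }

    public String query(String q) throws Exception {
        long now = System.currentTimeMillis();
        Exception last = null;
        for (int i = 0; i < replicas.size(); i++) {
            if (now < blackedOutUntil[i]) continue;   // still cooling down
            try {
                String result = replicas.get(i).query(q);
                failures[i] = 0;                      // success resets the count
                return result;
            } catch (Exception e) {
                last = e;
                if (++failures[i] >= maxFailures)
                    blackedOutUntil[i] = now + cooldownMillis;
            }
        }
        throw new Exception("all replicas failed or blacked out", last);
    }
}
```

The VIP/load-balancer option moves exactly this logic out of the searcher and into the network layer; the trade-off is less control over retry policy.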

Master

How should the collection be updated? It would be complex for the client to partition the data themselves, since they would have to ensure that a particular document always went to the same server. Although user partitioning should be possible, there should be an easier default.

Single Master

A single master could partition the data into multiple local indices, and subsearchers would only pull the local index they are configured to have.

  • hash based on unique key field to get target index

Directory structure for indices:

  • Current: solr/data/index
  • Option A: solr/data/index0, solr/data/index1, solr/data/index2, ...
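The hash-based routing above can be sketched as follows (illustrative only; Solr does not define this helper):

```java
// Hypothetical sketch: route a document to a local index by hashing its
// unique key field, so the same key always lands in the same partition
// (important for overwrites and deletes).
public class KeyRouter {
    /** Returns the target index number in [0, numIndices). */
    public static int targetIndex(String uniqueKey, int numIndices) {
        // Math.floorMod guards against negative hashCode values
        return Math.floorMod(uniqueKey.hashCode(), numIndices);
    }
}
```

The same function must be used on every update path (adds, overwrites, deletes), or a document could end up duplicated across partitions.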

Multiple Masters

There could be a master for each slice of the index. An external module could provide the update interface and forward the request to the correct master based on the unique key field.

The directory structure for indices would not have to change, since each master would have its own solr home and hence its own index directory.

Commits

How to synchronize commits across subsearchers and top-level-searchers?

Misc

Any realistic way to use Hadoop?

FederatedSearch (last edited 2009-09-20 22:05:24 by localhost)