(This page is a child of the TaskList page)

Problem

Solr would benefit from a flexible document processing framework that meets the requirements of enterprise-grade content integration. Most search projects have some need to process the incoming content prior to indexing, for example:

The built-in UpdateRequestProcessorChain is capable of doing simple processing jobs, but it is only built for local execution on the indexer node, in the same thread. This means that any performance-heavy processing chain will slow down the indexers, with no way to scale out processing independently. We have seen FAST systems with far more servers doing document processing than indexing.
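To illustrate the distinction drawn above, here is a minimal sketch in Python, not Solr code. It assumes a document is a plain dict and a processor is a function from document to document; the names `lowercase_title`, `add_word_count`, `run_chain` and `process_batch` are hypothetical. `run_chain` mimics a chain executed locally in the calling thread, while `process_batch` hands the same chain to a worker pool whose size is chosen independently of the indexer. The thread pool is only a stand-in for a separate tier of processing nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Assumption: a document is a flat dict of field name -> value.
Document = Dict[str, object]
Processor = Callable[[Document], Document]


def lowercase_title(doc: Document) -> Document:
    """Hypothetical processor: normalize the title field."""
    out = dict(doc)  # copy, so the caller's document is not mutated
    out["title"] = str(out.get("title", "")).lower()
    return out


def add_word_count(doc: Document) -> Document:
    """Hypothetical processor: derive a field from the body text."""
    out = dict(doc)
    out["word_count"] = len(str(out.get("body", "")).split())
    return out


def run_chain(chain: List[Processor], doc: Document) -> Document:
    """Apply each processor in order, in the calling thread."""
    for processor in chain:
        doc = processor(doc)
    return doc


def process_batch(chain: List[Processor], docs: List[Document],
                  workers: int = 4) -> List[Document]:
    """Run the chain on a pool sized independently of the indexer.

    pool.map preserves input order, so results line up with docs.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: run_chain(chain, d), docs))


CHAIN = [lowercase_title, add_word_count]
DOCS = [
    {"id": "1", "title": "Hello Solr", "body": "a b c"},
    {"id": "2", "title": "PIPELINES", "body": "one two"},
]
RESULTS = process_batch(CHAIN, DOCS, workers=2)
```

In this sketch the indexer would only consume `RESULTS`; how many workers do the processing can then be tuned without touching the indexing side, which is the kind of independent scaling the local chain does not offer.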

There are many processing pipeline frameworks from which to get inspiration, such as the one in FAST ESP, OpenPipeline, OpenPipe (now on GitHub), Pypes, UIMA, Eclipse SMILA, Apache Commons Pipeline, Piped, Behemoth, Findwise's yet-to-be-announced pipeline and others. Indeed, some of these are already being used with Solr as a pre-processing server.

Having a choice of technologies is good, but the field has also become crowded and fragmented.

There has recently been interest within the search community in a true open source pipeline with a healthy community behind it and a rich pool of processors. See this presentation from Lucene Eurocon 2010 and this blog post for thoughts from Findwise, the recent solr-user thread "Pipeline for Solr", and Cominvent's talk at Lucene Eurocon 2011, "Improving Solr's Update Chain". In addition to developing a preferred, truly open source solution, it should also be possible to improve interoperability and compatibility.

Here are a few things that we could consider in order to ease this situation:

*Update*: At Lucene Eurocon 2011 in Barcelona, we formed an interest group for pipelines which has its home at http://www.meetup.com/SearchPipelines/

Wishes for a Lucene targeted pipeline

Here are some thoughts and wishes for a new pipeline project mainly targeted at Lucene-based search engines (including Solr, ElasticSearch and Lucene itself). It should probably build upon or fork one of the existing projects and follow established best practices.

Key requirements

Must

Should

Could

Anti-patterns

Proposed architecture

Jan Høydahl: I think OpenPipe is a hot candidate to fork as a new open source framework. It already supports most of the above, is Apache licensed, and is abandoned by its original developers.

Risks

TBD

Q&A

Your question here

DocumentProcessing (last edited 2011-10-27 11:00:29 by JanHoydahl)