lucene-gdata-server Summer of Code proposal

For Apache Software Foundation by Chris Bennett <chrisrbennett@gmail.com>

A PDF version of this document is available at <link included in proposal submitted through Google>

About Me

I am currently a Cognitive Science major and Computer Science minor at the University of California at San Diego (most likely class of 2009). I am an Intern Search Engineer at Eventful.com/EVDB Inc. (my full resumé can be found at http://chrisrbennett.com/images/chrisrbennett_resume.pdf). Most relevant here, I improved Eventful’s search times by refactoring the Tomcat- and Lucene-backed search implementation; see the graph below. (Note that not all API requests result in searches, so this is a very generalized comparison, but it is still representative.)

[graph contained in PDF file linked in official Google proposal - basically search times dropped from 3 seconds to 0.4 seconds, while load increased]

By the end of January, the primary index contained over 600,000 documents and was growing by thousands every day. My search and index queuing system handled higher search and indexing loads with faster response times while keeping up with this rapidly growing document set. When I began logging more aggressively in April, I realized that indexing speed was slowing down due to excessive SQL queries initiated from our index Document builders. My current work in progress involves posting XML data from our caching engine to our search cluster, eliminating unnecessary database hits. This system is very much like that of the Solr project. All of this might lead to an XML-based document storage, retrieval, and syndication mechanism similar to that defined by the Google Data API protocol.

Basic Proposal

The Google Data API (GData) is a complete protocol for delivering a standardized search solution on the backend. As searching structured data becomes increasingly important internally and over the Web, the developer community would benefit from an open source implementation of GData. Since Lucene provides the indexing and search functionality that the protocol requires, it makes sense to implement GData using Lucene on the Java platform. Several goals for such an implementation have been highlighted on the solr-dev mailing list and in email communication with Yonik Seeley and Doug Cutting:

  1. Write a front-end Java servlet to handle searching, document creation, document updating (with optimistic concurrency), and document deletion as per the GData specification.
  2. Design an interface library to allow the GData request parser (front-end) to communicate with the Lucene backend.
  3. Write a Lucene framework to support concurrent document modification, search, and syndication.
  4. Generate ant compile tasks and combine the previous three components into a WAR file for easy deployment in a Java servlet container.
  5. Begin the transition to a Solr-based backend to enable enterprise use.
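
The optimistic concurrency mentioned in goal 1 could be sketched roughly as follows. This is only an illustration under stated assumptions: VersionedStore, its in-memory map, and the integer version counter are hypothetical names invented here, not part of GData or Lucene. The idea is that an update succeeds only when the caller edited the latest version; otherwise the servlet would report a conflict (HTTP 409 in the GData protocol).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory entry store illustrating version-checked updates. */
final class VersionedStore {
    /** An entry body paired with the version the server last assigned to it. */
    static final class Entry {
        final String body;
        final int version;

        Entry(String body, int version) {
            this.body = body;
            this.version = version;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    /** Creates an entry at version 1; returns false if the id already exists. */
    boolean create(String id, String body) {
        return entries.putIfAbsent(id, new Entry(body, 1)) == null;
    }

    /**
     * Applies the update only if the caller edited the current version.
     * A false return means the caller's copy was stale (a conflict).
     */
    boolean update(String id, int expectedVersion, String newBody) {
        Entry current = entries.get(id);
        if (current == null || current.version != expectedVersion) {
            return false;
        }
        // replace(key, old, new) succeeds only if no other thread
        // swapped the entry between the get() above and this call.
        return entries.replace(id, current, new Entry(newBody, expectedVersion + 1));
    }

    Entry get(String id) {
        return entries.get(id);
    }
}
```

A caller that receives false from update would re-fetch the entry, reapply its change, and retry with the new version number.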

Implementation Details and Milestones

Build Environment / Unit Tests (M0)

The first order of business will be to implement an ant-based build environment and prepare a framework for unit testing.

Front-end Java Servlet (M1)

The front-end to the GData parser and Lucene engine will consist of a simple Tomcat servlet that captures request parameters and POST data, manages session data, directs requests to the appropriate handler, and returns formatted results. In M2, this servlet will be updated to include authentication support.
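
As a rough sketch of the "direct requests to the appropriate handler" step, the servlet could first classify each request by its HTTP method, following the GData convention of GET for queries, POST for inserts, PUT for updates, and DELETE for deletions. The GDataAction enum and forMethod are hypothetical names; a real servlet would do this inside its doGet/doPost/doPut/doDelete methods.

```java
import java.util.Locale;

/** Hypothetical classification of a request by its HTTP method. */
enum GDataAction {
    SEARCH, CREATE, UPDATE, DELETE;

    /** Maps an HTTP method name to the GData action it requests. */
    static GDataAction forMethod(String httpMethod) {
        switch (httpMethod.toUpperCase(Locale.ROOT)) {
            case "GET":
                return SEARCH;
            case "POST":
                return CREATE;
            case "PUT":
                return UPDATE;
            case "DELETE":
                return DELETE;
            default:
                // The servlet would translate this into an error response.
                throw new IllegalArgumentException("unsupported method: " + httpMethod);
        }
    }
}
```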

GData Format Parser (M1)

The parser component will use dom4j to parse GData protocol compliant requests. It will translate the POST data directly into a Lucene document object, with additional metadata for proper handling by the indexer engine. Requests that cannot be parsed will throw an error that presents a “400 – Bad Request” to the user. This component will also generate protocol compliant output XML for search results and syndication.
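
The parse-or-reject behavior might look roughly like the sketch below. To keep it self-contained it uses the JDK's built-in DOM parser rather than dom4j, and it collects fields into a plain map rather than a Lucene document; EntryParser and BadRequestException are hypothetical names. It assumes requests are Atom entries, as GData uses, and treats anything else as a bad request.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical parser: Atom entry XML in, field name/value pairs out. */
final class EntryParser {
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    /** Thrown for unparseable requests; the servlet would map it to HTTP 400. */
    static final class BadRequestException extends RuntimeException {
        BadRequestException(String message, Throwable cause) {
            super(message, cause);
        }

        int status() {
            return 400;
        }
    }

    /** Extracts the text of each direct Atom child element of the entry. */
    static Map<String, String> parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations, since the input is untrusted POST data.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            Element root = doc.getDocumentElement();
            if (!ATOM_NS.equals(root.getNamespaceURI()) || !"entry".equals(root.getLocalName())) {
                throw new BadRequestException("expected an Atom entry element", null);
            }

            Map<String, String> fields = new LinkedHashMap<>();
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && ATOM_NS.equals(child.getNamespaceURI())) {
                    fields.put(child.getLocalName(), child.getTextContent().trim());
                }
            }
            return fields;
        } catch (BadRequestException e) {
            throw e;
        } catch (Exception e) {
            // Malformed XML, I/O problems, and parser setup failures all end up here.
            throw new BadRequestException("malformed request body", e);
        }
    }
}
```

The returned map stands in for the Lucene document; the real component would add each pair as a field, using the indexing hints from the XML configuration.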

Indexer Engine (M1)

The most critical component of this project is the indexer engine, which wraps direct Lucene indexer calls. Since we do not want multiple IndexWriters (for document addition) and IndexReaders (for deletion) open at the same time, index documents must be queued when they cannot be committed immediately, although they must still be committed as soon as possible because of the dynamic requirements of GData. When an IndexWriter is closed, the available IndexSearchers in the search queue will be closed and re-opened so that they see the newest index segments.
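
One possible shape for this queuing is sketched below, with the Lucene calls abstracted away. IndexQueue and IndexBackend are hypothetical names; the backend's commit method stands in for "open an IndexWriter, add the batch, close it". Request threads only enqueue, a single flush at a time drains the queue, and a generation counter is bumped after each commit so that searchers can tell when to reopen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical write queue that serializes all index modifications. */
final class IndexQueue {
    /** Stand-in for the code that opens a writer, adds documents, and closes it. */
    interface IndexBackend {
        void commit(List<String> documents);
    }

    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final AtomicLong generation = new AtomicLong();
    private final IndexBackend backend;

    IndexQueue(IndexBackend backend) {
        this.backend = backend;
    }

    /** Called by request threads; never touches the index directly. */
    void enqueue(String document) {
        pending.add(document);
    }

    /**
     * Commits everything queued so far in one writer session and returns
     * the batch size. Synchronized so that only one session runs at a time.
     */
    synchronized int flush() {
        List<String> batch = new ArrayList<>();
        pending.drainTo(batch);
        if (batch.isEmpty()) {
            return 0;
        }
        backend.commit(batch);
        generation.incrementAndGet(); // tells searchers the index has changed
        return batch.size();
    }

    /** Number of commits so far; searchers compare this with the value they opened at. */
    long generation() {
        return generation.get();
    }
}
```

In a running server, a background thread would call flush in a loop with a short delay, which bounds how long a queued document waits before it becomes searchable.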

Search Queue (M1)

The Search handler will consist of two IndexSearchers operating in unison, accepting pre-formatted Lucene queries from a queue filled by the GData Format Parser. (One may suffice, depending on performance; two seems to work well on a dual processor system, so perhaps one IndexSearcher per processor?) As soon as an IndexSearcher returns search results to the servlet instance for output formatting, it will check a flag to determine whether the index has been updated since it was last opened, and then process the next search request in the queue.
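
The "check a flag, then take the next request" step could be sketched as follows. RefreshingSearcher is a hypothetical wrapper; the shared counter is assumed to be incremented by the indexer after each commit, and the opener function stands in for constructing a new IndexSearcher. Real code would also close the old searcher once its in-flight searches finish.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongFunction;

/** Hypothetical holder that reopens its searcher when the index has changed. */
final class RefreshingSearcher<S> {
    private final AtomicLong indexGeneration; // bumped by the indexer after each commit
    private final LongFunction<S> opener;     // opens a searcher for a given generation
    private S searcher;
    private long openedAt;

    RefreshingSearcher(AtomicLong indexGeneration, LongFunction<S> opener) {
        this.indexGeneration = indexGeneration;
        this.opener = opener;
        this.openedAt = indexGeneration.get();
        this.searcher = opener.apply(openedAt);
    }

    /** Returns the searcher for the next queued request, reopening it first if stale. */
    synchronized S acquire() {
        long current = indexGeneration.get();
        if (current != openedAt) {
            searcher = opener.apply(current);
            openedAt = current;
        }
        return searcher;
    }
}
```

With two such holders, each worker thread would loop: take a query from the queue, call acquire, run the search, and hand the results back to the servlet.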

Enable XML configuration (M2)

XML will be used to store authentication information for use in M3 and M4. The XML configuration will contain a schema that allows fine-grained control over the CRUD GData actions, either within individual schemas or across all schemas (M3). Authentication will be fully implemented in M4.

The XML configuration will also store field-level indexing and search hints that would be helpful to Lucene for storage (Tokenized vs. Un-tokenized fields) and parsing (Integer or String sort methods).

Schema Partitioning (M3)

In order to store separate schemas with separate data and keep them in the same index, the Indexer Engine will store the user-specified schema name in a document field and constrain searches to that schema when requested. The schema names will also be used to authorize access to the realms defined within the XML configuration in M2.
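
A minimal sketch of constraining a search to one schema, assuming queries are passed around as Lucene query-parser strings: the user's query is wrapped in parentheses and a required clause on the schema field is added in front. SchemaScope and the field name "schema" are hypothetical. Building the restriction programmatically (for example a TermQuery combined with the user query in a BooleanQuery) would avoid string handling and analyzer effects, and is probably the better choice in real code.

```java
/** Hypothetical helper that scopes a query string to a single schema. */
final class SchemaScope {
    static final String SCHEMA_FIELD = "schema"; // assumed field name

    /** Returns a query requiring both the schema match and the user's query. */
    static String restrict(String schema, String userQuery) {
        // Quote the schema name and escape backslashes and quotes inside it.
        String quoted = "\"" + schema.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
        return "+" + SCHEMA_FIELD + ":" + quoted + " +(" + userQuery + ")";
    }
}
```

For example, restrict("events", "title:jazz OR title:blues") yields +schema:"events" +(title:jazz OR title:blues).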

Authentication Handler (M4)

The authentication handler will allow request callers to authenticate and get a session token for further requests to the servlet. This handler will validate usernames and passwords within the requested schema.
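
The token exchange might be sketched as below. SessionTokens and CredentialCheck are hypothetical names; the credential check is left as an interface because the proposal stores users in the XML configuration, whose format is not yet defined. Tokens are random, URL-safe strings bound to the schema they were issued for.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical session-token issuer scoped by schema. */
final class SessionTokens {
    /** Stand-in for the lookup against the XML configuration. */
    interface CredentialCheck {
        boolean valid(String schema, String user, String password);
    }

    private final SecureRandom random = new SecureRandom();
    private final Map<String, String> schemaByToken = new ConcurrentHashMap<>();
    private final CredentialCheck check;

    SessionTokens(CredentialCheck check) {
        this.check = check;
    }

    /** Returns a fresh token, or null when the credentials are rejected. */
    String login(String schema, String user, String password) {
        if (!check.valid(schema, user, password)) {
            return null;
        }
        byte[] raw = new byte[24];
        random.nextBytes(raw);
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
        schemaByToken.put(token, schema);
        return token;
    }

    /** True only if the token was issued for the given schema. */
    boolean authorized(String token, String schema) {
        return token != null && schema.equals(schemaByToken.get(token));
    }
}
```

A production version would also expire tokens after a period of inactivity; that is omitted here to keep the sketch short.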

Configuration interface (M5)

It would be useful for an administrator to be able to view and modify the XML configuration and view current Indexer, Searcher, and index store status from within an HTML interface. This feature would be extremely helpful for debugging GData queries and results.

Logging in log4j (M5)

Finally, this web application will implement logging with a configurable level of detail, both for debugging the application and for generating usage statistics.

Planning

ChrisBennett/SummerOfCodeGdataProposal (last edited 2009-09-20 23:35:23 by localhost)