Summer of Code Proposal - Solr Mail Archive

Subject ID

solr-mail-archive http://wiki.apache.org/general/SummerOfCode2006#solr-mail-archive

Subject

Allowing Apache's mailing lists, including archives, to be searched via Solr

Author

Marcel Gordon

Background

Author

I'm an undergraduate student in my final year of B Computer Science / LLB at the University of Wollongong, Australia. I have worked as a programmer on web development, desktop applications and server-side Java. I presently teach Java and C++ to post-secondary students.

I am interested in this project because it combines a number of different areas - see below - and will expose me to technologies which I have not had an opportunity to use before.

Project

Solr http://incubator.apache.org/solr/ is an "open source enterprise search server". The proposal is to allow Apache's mailing lists to be searched on this platform, via a Web interface.

Moving search to a specialised application such as Solr greatly improves efficiency, because it uses a platform built for search, and it reduces the load on servers with more critical jobs (such as handling mailing lists).

Deliverables

Three main deliverables are proposed.

  1. A tool to index existing mailing list archives.
  2. A program to monitor mailing lists and index messages as they are received.
  3. A Web interface to allow the mailing lists to be searched.

Benefits for the Apache community

There is a clear benefit for the Apache community - efficient and powerful search on the Apache mailing lists. Further, the tools can be applied to any mailing list, providing the basis for a suite which Apache can offer to its user community. Finally, the Solr project would have an excellent, publicly-available example of the way in which Solr can be put to use.

Design / Approach

All parts of this project ought to be well documented. That goes without saying, but is particularly important in this case because this project could be used as a demonstration of Solr.

Archive Indexing Tool

The archive indexer would be a fairly straightforward program, processing a mail file to extract the relevant fields. It could either feed the data straight into the Solr server or write it to a data file for posting at a later date using an existing script. The program should accept a date range that limits which messages are indexed.
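A minimal sketch of such an indexer in Python follows, assuming the archives are available as mbox files. The field names (id, sender, subject, date, content) and the function names are hypothetical, and multipart messages are indexed with empty content for brevity. The output is an XML document of the kind Solr accepts for adding documents, which could be written to a file and posted later.

```python
import mailbox
from datetime import timezone
from email.utils import parsedate_to_datetime
from xml.etree import ElementTree as ET


def parse_date(msg):
    """Return the Date header as an aware UTC datetime, or None if unusable."""
    try:
        sent = parsedate_to_datetime(str(msg.get("Date", "")))
    except (TypeError, ValueError):
        return None
    if sent.tzinfo is None:
        # Treat a date without a usable zone as UTC (an assumption).
        sent = sent.replace(tzinfo=timezone.utc)
    return sent.astimezone(timezone.utc)


def message_to_fields(msg):
    """Extract the (hypothetical) index fields from one message."""
    sent = parse_date(msg)
    payload = None if msg.is_multipart() else msg.get_payload(decode=True)
    return {
        "id": str(msg.get("Message-ID", "")),
        "sender": str(msg.get("From", "")),
        "subject": str(msg.get("Subject", "")),
        "date": sent.strftime("%Y-%m-%dT%H:%M:%SZ") if sent else "",
        "content": (payload or b"").decode("utf-8", errors="replace"),
    }


def index_archive(path, start, end):
    """Yield field dicts for messages in the mbox at path dated in [start, end)."""
    box = mailbox.mbox(path)
    try:
        for msg in box:
            sent = parse_date(msg)
            if sent is not None and start <= sent < end:
                yield message_to_fields(msg)
    finally:
        box.close()


def to_solr_add_xml(docs):
    """Serialise field dicts as an <add> document for posting to Solr."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = value
    return ET.tostring(add, encoding="unicode")
```

Both bounds passed to index_archive must be timezone-aware datetimes, since they are compared against dates normalised to UTC.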

It would be advantageous to allow the program to understand formats dynamically by reading mapping files, although such a feature is not necessary for the project at hand.

Any number of languages (eg Perl, C++, Python) could be used to implement such a tool.

Monitoring Mailing Lists

The Apache mailing lists produce Atom feeds, which can be polled in order to detect new messages (thanks Doug). The monitoring program should allow configuration of the poll frequency, either globally or on a per-list basis.
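The polling step could look roughly like the Python sketch below. It assumes the feeds use the Atom 1.0 namespace and that each entry carries an id element. The configuration layout (a default_interval key plus optional per-list overrides) and all function names are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 namespace


def parse_entries(feed_xml):
    """Return id, title and updated for every entry in an Atom feed."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "id": entry.findtext(ATOM + "id", default=""),
            "title": entry.findtext(ATOM + "title", default=""),
            "updated": entry.findtext(ATOM + "updated", default=""),
        }
        for entry in root.iter(ATOM + "entry")
    ]


def new_entries(entries, seen_ids):
    """Keep only the entries whose id has not been indexed yet."""
    return [e for e in entries if e["id"] not in seen_ids]


def poll_interval(config, list_name):
    """Per-list poll interval in seconds, falling back to the global default."""
    per_list = config.get("lists", {}).get(list_name, {})
    return per_list.get("interval", config["default_interval"])


def poll_once(feed_url, seen_ids):
    """Fetch one feed and return its unseen entries, recording their ids."""
    with urllib.request.urlopen(feed_url) as response:
        entries = parse_entries(response.read())
    fresh = new_entries(entries, seen_ids)
    seen_ids.update(e["id"] for e in fresh)
    return fresh
```

A driver loop would call poll_once for each list, hand the returned entries to the indexer, and sleep for poll_interval(config, list_name) seconds between polls.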

One important issue is recovering any messages that are missed. It would be advantageous to allow the monitoring program's configuration to be changed without restarting (for example, to add new mailing lists or change polling frequency), so that regular operation is not interrupted. In addition, the program should be able to notify the administrator that a period of time has been missed (the difference between a list's last indexed message and the last message on its Atom feed), so that the administrator can run the indexing tool on the archives. Better yet, it should be able to run the tool on the archives automatically. The archive indexing component should be developed with this in mind.
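One way to detect a missed period is sketched below in Python. It rests on an assumption about how the feeds behave: a feed only holds the most recent messages, so if the oldest entry still on the feed is newer than the last message indexed, anything sent in between may have dropped off the feed unseen. The function names are hypothetical, and the check is deliberately conservative (it can report a window that turns out to contain no messages).

```python
from datetime import datetime


def parse_atom_time(text):
    """Parse an Atom timestamp such as 2006-06-21T00:00:00Z."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def missed_window(last_indexed, feed_times):
    """Return the (start, end) period to backfill from the archives, or None.

    last_indexed is the timestamp of the newest message already indexed;
    feed_times are the timestamps of the entries currently on the feed.
    """
    if not feed_times:
        return None
    oldest_on_feed = min(feed_times)
    if oldest_on_feed > last_indexed:
        return (last_indexed, oldest_on_feed)
    return None
```

The returned window can be reported to the administrator or passed directly to the archive indexing tool as its date range.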

Again, any number of languages could be used to implement this piece of software.

Web interface

The project specification indicates that "the search interface could be a form with text boxes for sender, subject and content, and perhaps pulldowns for mailing-list and date. Results could be sortable by any field. Optionally, faceted search by sender/month/mailing-list, etc. could be useful."

Given the GET/POST interface used by Solr, AJAX would be an excellent choice for the Web interface.
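To illustrate how form input might map onto a GET request, here is a small Python sketch of the query construction (the browser side would do the equivalent in JavaScript). The field names, the function name and the example URL are hypothetical. The q, sort, start and rows parameters follow Solr's query parameters as I understand them, and should be checked against the version deployed.

```python
from urllib.parse import urlencode

# Hypothetical index fields, matching the text boxes and pulldowns on the form.
SEARCH_FIELDS = ("sender", "subject", "content", "list")


def quote_phrase(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_search_url(base_url, form, sort_field="", descending=True,
                     start=0, rows=10):
    """Build a search URL from a dict of form values."""
    clauses = [
        "%s:%s" % (name, quote_phrase(form[name]))
        for name in SEARCH_FIELDS
        if form.get(name)
    ]
    params = {
        "q": " AND ".join(clauses) or "*:*",  # match everything if form is empty
        "start": start,
        "rows": rows,
    }
    if sort_field:
        params["sort"] = "%s %s" % (sort_field, "desc" if descending else "asc")
    return base_url + "?" + urlencode(params)
```

For example, build_search_url("http://localhost:8983/solr/select", {"sender": "doug"}, sort_field="date") yields a URL whose q parameter is sender:"doug" and whose sort parameter is date desc, both URL-encoded.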

Further, it should be possible to generate a generic interface from the file which specifies the fields used in Solr. This is clearly outside the scope of the project, but is highlighted as an interesting extension should time allow.

Timeline

There are three discrete tools required. I have assigned two weeks for the first, four weeks for the second and two weeks for the third, with two weeks remaining for testing and documentation. For the other two and a half weeks I will be away in NZ.

Date        Task

May 23      Project commences
June 6      Archive indexing tool complete
June 20     List monitoring program runs as a standalone application
June 21     Marcel goes on holiday to NZ
July 10     Marcel returns to Oz
July 24     List monitoring program runs as daemon, auto-recovering missed messages
August 7    Web interface complete
August 20   Documentation and testing complete
August 21   Project submission