Subject ID

solr-mail-archive

Title

Build a Solr-based search engine for Apache's email archives

Developer

Khuc Ngoc Vinh

Abstract

Solr (pronounced "solar") is an open source search server built on the Lucene Java library, with a web-service-like API: documents are indexed via XML over HTTP, queries are sent via HTTP GET, and results are returned as XML. Thanks to the Lucene search library, it provides advanced full-text search capabilities, and it can scale by connecting to other Solr search servers. Solr also has a web administration interface. The purpose of this project is to provide full-text search for Apache's email archives using the Solr search server.

Project Overview

This project will be based on the Solr nightly build of 2006-05-04 (http://cvs.apache.org/dist/lucene/solr/nightly/solr-2006-05-04.zip) and will provide search for Apache's email archives, which are located at http://mail-archives.apache.org.

As Yonik said, "Solr may be in the incubator, but it's already relatively stable and used in production systems", so starting development with Solr now should not be a problem. Thanks, Yonik.

Project Description

The ultimate goal of this project is a search interface containing a form with text boxes for sender, subject and content, and perhaps pulldowns for mailing list and date. Results will be sortable by any field, and matched words will be highlighted in different colors. In addition, the project can provide a useful tool for integrating Solr search servers with other mailing list archives similar to Apache's.
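A form like the one described above could be translated into a single Lucene-syntax query string before it is sent to Solr. The following is only a minimal sketch: it assumes the index fields are named sender, subject and content (as in Step 2), that each form value is searched as an exact phrase, and that the server answers queries at a /select path with a q parameter. The class name, method names and base URL are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class MailQueryBuilder {

    /** Wraps a value in double quotes, escaping backslashes and quotes inside it. */
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Joins all non-blank form fields into "field:"value" AND field:"value"". */
    static String buildQuery(Map<String, String> formFields) {
        StringBuilder q = new StringBuilder();
        for (Map.Entry<String, String> e : formFields.entrySet()) {
            String value = e.getValue();
            if (value == null || value.trim().isEmpty()) {
                continue; // the user left this text box empty
            }
            if (q.length() > 0) {
                q.append(" AND ");
            }
            q.append(e.getKey()).append(':').append(quote(value.trim()));
        }
        return q.toString();
    }

    /** Builds the GET URL; the "/select?q=" layout is an assumption about the server. */
    static String buildUrl(String baseUrl, String query) {
        try {
            return baseUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> form = new LinkedHashMap<String, String>();
        form.put("sender", "Yonik");
        form.put("subject", "");
        form.put("content", "nightly build");
        String query = buildQuery(form);
        System.out.println(query);
        System.out.println(buildUrl("http://SolrServer", query));
    }
}
```

Pulldown values such as mailing list or date could be added to the same map once the corresponding index fields are decided.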

Planning

Java is my main programming language, so I will use Java (1.5) for this project. I have already developed a project using the Lucene library, and Solr is built on the Lucene engine, so Solr is a natural choice for searching Apache's mail archives.

Indexing can be done simply by sending an XML file to the Solr server via HTTP POST.
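From Java, such a POST could be made with the standard HttpURLConnection class. This is a sketch under assumptions, not the project's final code: the class and method names are hypothetical, the Content-Type header value is a guess at what the server expects, and the update URL must be supplied by the caller.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class SolrPoster {

    /** POSTs an XML update message to the given URL and returns the HTTP status code. */
    static int postXml(String updateUrl, String xml) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(updateUrl).toURL().openConnection();
            try {
                byte[] body = xml.getBytes(StandardCharsets.UTF_8);
                conn.setRequestMethod("POST");
                conn.setDoOutput(true);
                // Assumption: the server accepts XML updates with this content type.
                conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
                conn.setFixedLengthStreamingMode(body.length);
                OutputStream out = conn.getOutputStream();
                try {
                    out.write(body);
                } finally {
                    out.close();
                }
                return conn.getResponseCode();
            } finally {
                conn.disconnect();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.out.println("usage: SolrPoster <update-url>");
            return;
        }
        String add = "<add><doc><field name=\"subject\">test</field></doc></add>";
        System.out.println("add:    HTTP " + postXml(args[0], add));
        System.out.println("commit: HTTP " + postXml(args[0], "<commit/>"));
    }
}
```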

This project will be divided into 5 steps:

Step 1

As Yonik mentioned, quoting Doug: "Apache supplies an Atom feed for each mailing list, so it should be possible to poll these for new messages. For example: http://mail-archives.apache.org/mod_mbox/lucene-solr-dev/?format=atom". The tool that monitors the mail archives for new messages will therefore be based on Atom 1.0.
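Polling a feed for new messages could look roughly like the sketch below. It assumes the feed uses the standard Atom 1.0 namespace and that every entry carries an id and a title element; the class and method names are hypothetical, and fetching the feed over HTTP is left out so that the example stays self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class AtomPoller {

    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    // Ids of entries already handed out by earlier polls.
    private final Set<String> seenIds = new HashSet<String>();

    /** Returns the titles of entries whose id was not seen in an earlier poll. */
    List<String> newEntryTitles(String feedXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(feedXml)));
            NodeList entries = doc.getElementsByTagNameNS(ATOM_NS, "entry");
            List<String> titles = new ArrayList<String>();
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                if (seenIds.add(childText(entry, "id"))) {
                    titles.add(childText(entry, "title"));
                }
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalStateException("cannot parse Atom feed", e);
        }
    }

    private static String childText(Element parent, String localName) {
        NodeList nodes = parent.getElementsByTagNameNS(ATOM_NS, localName);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String feed = "<feed xmlns=\"" + ATOM_NS + "\"><title>demo list</title>"
            + "<entry><id>urn:1</id><title>First message</title></entry></feed>";
        AtomPoller poller = new AtomPoller();
        System.out.println(poller.newEntryTitles(feed)); // first poll: one new entry
        System.out.println(poller.newEntryTitles(feed)); // second poll: nothing new
    }
}
```

Keeping the seen ids in memory is only enough for a demonstration; a real monitor would need to persist them between runs.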

Step 2

The indexing process will index only the sender, subject and content of each mail; it will skip the quoted "Original Message" section as well as any other lines beginning with ">". Documents can be posted with a command such as:

curl http://SolrServer --data-binary @data.xml

where data.xml contains:

<add>
  <doc>
    <field name="sender">...</field>
    <field name="subject">...</field>
    <field name="content">...</field>
  </doc>
</add>

and then commit with:

curl http://SolrServer --data-binary '<commit/>'

A commit is an expensive operation, so we should issue it only after a sufficiently large batch of documents has been added, not after every message.
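The filtering rule and the add message from this step can be sketched in Java as follows. The sketch makes two assumptions that are not settled by this proposal: that the "Original Message" section starts at a separator line such as "-----Original Message-----" and runs to the end of the mail, and that only the characters &, < and > need escaping in field values. All class and method names are hypothetical.

```java
final class MailDocBuilder {

    /**
     * Drops lines that start with ">" and everything from an
     * "-----Original Message-----" style separator onwards (assumed format).
     */
    static String stripQuoted(String body) {
        StringBuilder kept = new StringBuilder();
        for (String line : body.split("\r?\n")) {
            String trimmed = line.trim();
            if (trimmed.matches("-+\\s*Original Message\\s*-+")) {
                break; // the rest of the mail is the quoted original
            }
            if (trimmed.startsWith(">")) {
                continue; // quoted reply line
            }
            kept.append(line).append('\n');
        }
        return kept.toString().trim();
    }

    /** Escapes the characters that are not allowed in XML element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    private static String field(String name, String value) {
        return "<field name=\"" + name + "\">" + escape(value) + "</field>";
    }

    /** Builds the <add> message for one mail, using the three fields named above. */
    static String toAddXml(String sender, String subject, String body) {
        return "<add><doc>"
            + field("sender", sender)
            + field("subject", subject)
            + field("content", stripQuoted(body))
            + "</doc></add>";
    }

    public static void main(String[] args) {
        String body = "Hello\n> quoted line\nThanks\n-----Original Message-----\nold text";
        System.out.println(toAddXml("a@b.org", "Re: <test> & more", body));
    }
}
```

The resulting string is what would be sent to the server in place of data.xml; several such messages can be posted before a single commit.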

Step 3

Searching is done by sending an HTTP GET request to the Solr server, which returns the results in XML format. The XML results can be converted to HTML using XSLT.
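The XML-to-HTML conversion could be done with the javax.xml.transform API that ships with Java. The sketch below is illustrative only: the stylesheet is a made-up minimal one, and it assumes a response shaped as response/result/doc with str elements named after the indexed fields. The real response layout should be checked against the Solr build in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class ResultRenderer {

    // Hypothetical stylesheet: one list item per result document.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\" indent=\"no\"/>"
      + "<xsl:template match=\"/\">"
      + "<ul><xsl:for-each select=\"response/result/doc\">"
      + "<li><b><xsl:value-of select=\"str[@name='subject']\"/></b>"
      + " from <xsl:value-of select=\"str[@name='sender']\"/></li>"
      + "</xsl:for-each></ul>"
      + "</xsl:template></xsl:stylesheet>";

    /** Applies the stylesheet to a search response and returns the HTML fragment. */
    static String toHtml(String responseXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(responseXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Assumed response shape, reduced to the parts the stylesheet reads.
        String response = "<response><result name=\"response\" numFound=\"1\" start=\"0\">"
            + "<doc><str name=\"sender\">a@b.org</str><str name=\"subject\">Hello</str></doc>"
            + "</result></response>";
        System.out.println(toHtml(response));
    }
}
```

In the real interface the stylesheet would live in its own file so that the page layout can change without recompiling.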

Step 4

Create a web interface that provides full-text search over Apache's email archives at http://mail-archives.apache.org. Apache's email archives already use Ajax techniques for quick rendering, so adding Ajax to this web interface seems like a must-have. A highlighting library will be used to provide the highlight feature.
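This proposal does not name a specific highlighting library, so the following is only a plain-Java stand-in that shows the intended effect: every case-insensitive occurrence of a search term is wrapped in a span that a stylesheet can color. The class name, method names and the "hit" CSS class are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SimpleHighlighter {

    /** Escapes text so it can be placed safely inside an HTML page. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps each case-insensitive match of term in <span class="hit">...</span>. */
    static String highlight(String text, String term) {
        if (term == null || term.isEmpty()) {
            return escape(text); // nothing to highlight
        }
        Matcher m = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE)
            .matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(escape(text.substring(last, m.start())));
            out.append("<span class=\"hit\">").append(escape(m.group())).append("</span>");
            last = m.end();
        }
        out.append(escape(text.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("Solr & solr rocks", "solr"));
    }
}
```

Using a different CSS class per search term would give the different colors mentioned in the project description.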

Step 5

This is the final step: testing the whole project. Most of the bugs found will be fixed in this step.

Schedule

Bio

I am a student in the Computer Science Department of Moscow State University, Russia. I am interested in integrating the Lucene search engine to provide fast full-text search in complex web applications.

Development Methodology

The best individual to do this project

I am using the Lucene library in my own project, JSMBSearch (https://jsmbsearch.dev.java.net), so I have some experience with Lucene. Java is my main programming language.

I like Apache's projects and want to contribute to them, so I think I am the best individual to do this project. This summer is a good chance for me!