Subject ID: solr-mail-archive
Title: Build a Solr-based search engine for Apache's email archives
Developer: Khuc Ngoc Vinh
Abstract
Solr (pronounced "Solar") is an open source search server built on the Lucene Java library, with a web-service-like API: documents are indexed via XML over HTTP, queries are sent via HTTP GET, and results come back as XML. Thanks to the Lucene search engine library, it provides advanced full-text search capabilities, and it can scale by connecting to other Solr search servers. Solr also has an administration web interface. The purpose of this project is to provide full-text search for the Apache email archives using the Solr search server.
Project Overview
This project will be based on the Solr nightly build of May 4, 2006 (http://cvs.apache.org/dist/lucene/solr/nightly/solr-2006-05-04.zip) and will provide search for Apache's email archives, which are located at http://mail-archives.apache.org.
As Yonik said, "Solr may be in the incubator, but it's already relatively stable and used in production systems", so starting development with Solr now is not a problem. Thanks, Yonik.
Project Description
The goal of this project is to provide a search interface containing a form with text boxes for sender, subject and content, and perhaps pull-downs for mailing list and date. Results will be sortable by any field, and matched words will be highlighted in various colors. The project can also provide a useful tool that helps integrate Solr search servers with other mailing list archives similar to Apache's.
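As a rough illustration of how the form fields could be turned into a Solr request, the sketch below builds a Lucene-syntax query from the sender, subject and content boxes and URL-encodes it for an HTTP GET. The class name `MailQueryBuilder`, the field names, the `/select` path appended to a caller-supplied base URL, and the choice to join clauses with AND are all assumptions made for this example, not decisions fixed by the proposal.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns search-form values into a Solr query URL.
public class MailQueryBuilder {

    // Wraps a value as a Lucene phrase, escaping backslashes and double quotes.
    static String phrase(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Adds field:"value" to the clause list unless the form box was left empty.
    private static void addClause(List<String> clauses, String field, String value) {
        if (value != null && value.trim().length() > 0) {
            clauses.add(field + ":" + phrase(value.trim()));
        }
    }

    // Joins the non-empty form fields with AND (an assumed default).
    // Returns an empty string when every box is empty; the caller decides what to do then.
    public static String buildQuery(String sender, String subject, String content) {
        List<String> clauses = new ArrayList<String>();
        addClause(clauses, "sender", sender);
        addClause(clauses, "subject", subject);
        addClause(clauses, "content", content);
        StringBuilder query = new StringBuilder();
        for (String clause : clauses) {
            if (query.length() > 0) {
                query.append(" AND ");
            }
            query.append(clause);
        }
        return query.toString();
    }

    // Appends the URL-encoded query to an assumed select handler under baseUrl.
    public static String buildUrl(String baseUrl, String query) {
        try {
            return baseUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `buildQuery("yonik", "", "nightly build")` yields `sender:"yonik" AND content:"nightly build"`, which `buildUrl` then encodes into the query string of the GET request.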
Planning
Java is my main programming language, so I will use Java (1.5) for this project. I have already developed a project using the Lucene library, and Solr uses the Lucene engine, so using Solr to search Apache's mail archives is a good choice.
Indexing can be done easily by sending an XML file to the Solr server via HTTP POST.
This project will be divided into 5 steps:
- Step 1: Monitoring the mail archives through the Atom 1.0 feeds that Apache's mail archives provide.
- Step 2: Indexing data by sending XML files via HTTP POST.
- Step 3: Converting results from XML format to HTML.
- Step 4: Building a web interface that provides search over the mail archives.
- Step 5: Testing the whole project.
Step 1
Yonik relayed Doug's words: "Apache supplies an Atom feed for each mailing list, so it should be possible to poll these for new messages. For example: http://mail-archives.apache.org/mod_mbox/lucene-solr-dev/?format=atom". The tool that monitors the mail archives will therefore be based on Atom 1.0.
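One possible shape for the polling logic is sketched below: parse the Atom document, read the `<id>` of each `<entry>`, and report only the ids not seen in earlier polls. The class name `AtomFeedPoller` and the in-memory set of seen ids are assumptions for illustration. Fetching the feed over HTTP and persisting the seen ids between runs are left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: finds feed entries that have not been processed yet.
public class AtomFeedPoller {

    // Namespace URI defined by the Atom 1.0 specification.
    private static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    // Returns the ids of entries not already in seenIds, and records them in seenIds.
    public static List<String> newEntryIds(String atomXml, Set<String> seenIds) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for getElementsByTagNameNS
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document feed = builder.parse(new InputSource(new StringReader(atomXml)));

            List<String> fresh = new ArrayList<String>();
            NodeList entries = feed.getElementsByTagNameNS(ATOM_NS, "entry");
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                NodeList ids = entry.getElementsByTagNameNS(ATOM_NS, "id");
                if (ids.getLength() == 0) {
                    continue; // malformed entry without an id: skip it
                }
                String id = ids.item(0).getTextContent().trim();
                if (seenIds.add(id)) { // add() returns false for ids seen before
                    fresh.add(id);
                }
            }
            return fresh;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse Atom feed", e);
        }
    }
}
```

Calling `newEntryIds` twice with the same feed text and the same set returns every entry id the first time and an empty list the second time, which is the behaviour a periodic poller needs.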
Step 2
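The indexing process will index only the sender, subject and content of each mail. It will skip the quoted "Original Message" section as well as any other lines that begin with ">". We can use the command (the "@" prefix tells curl to send the contents of the file rather than the literal file name):
curl http://SolrServer --data-binary @data.xml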
where data.xml contains:
<add>
<doc>
<field name="sender">...</field>
<field name="subject">...</field>
<field name="content">...</field>
</doc>
</add>
and then commit with:
curl http://SolrServer --data-binary '<commit/>'
Commit is an expensive operation, so we should commit only after adding a sufficiently large batch of data.
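The document preparation described in this step could look roughly like the sketch below: drop quoted reply lines, stop at an "Original Message" separator, escape XML special characters, and emit the `<add>` message. The class name `MailToSolrXml` and the exact separator pattern (dashes around the words "Original Message") are assumptions. Real mail clients use several quoting styles that this sketch does not cover.

```java
// Hypothetical helper: converts one mail into a Solr <add> message.
public class MailToSolrXml {

    // Keeps only the new text of a mail body: skips lines starting with ">"
    // and discards everything from an "-----Original Message-----" line onward.
    public static String stripQuotedText(String body) {
        StringBuilder kept = new StringBuilder();
        for (String line : body.split("\r?\n")) {
            String trimmed = line.trim();
            if (trimmed.matches("-+\\s*Original Message\\s*-+")) {
                break; // assumed separator format; the rest is the quoted mail
            }
            if (trimmed.startsWith(">")) {
                continue; // quoted reply line
            }
            kept.append(line).append('\n');
        }
        return kept.toString().trim();
    }

    // Escapes the characters that are special inside XML element content.
    public static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    private static String field(String name, String value) {
        return "<field name=\"" + name + "\">" + escapeXml(value) + "</field>\n";
    }

    // Builds the <add><doc>...</doc></add> message with the three indexed fields.
    public static String toAddXml(String sender, String subject, String body) {
        return "<add>\n<doc>\n"
                + field("sender", sender)
                + field("subject", subject)
                + field("content", stripQuotedText(body))
                + "</doc>\n</add>\n";
    }
}
```

The returned string is what would be written to data.xml, or sent directly as the POST body. Escaping matters here because subjects and bodies routinely contain "<", ">" and "&".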
Step 3
When we search by sending an HTTP GET request to the Solr server, we get the results in XML format. Converting the XML results to HTML can be done with XSLT.
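A minimal sketch of this conversion, using the standard `javax.xml.transform` API, is shown below. The stylesheet and the class name `SolrResultRenderer` are made up for illustration. It also assumes that result documents appear as `response/result/doc` elements with `<str name="...">` children, which should be checked against the actual output of the Solr build in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: renders a Solr XML response as an HTML list.
public class SolrResultRenderer {

    // Illustrative stylesheet: one <li> per result document.
    // The element and attribute names selected here are assumptions about the response layout.
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"html\" indent=\"no\"/>"
          + "<xsl:template match=\"/\">"
          + "<ul><xsl:for-each select=\"response/result/doc\">"
          + "<li><b><xsl:value-of select=\"str[@name='subject']\"/></b>"
          + " from <xsl:value-of select=\"str[@name='sender']\"/></li>"
          + "</xsl:for-each></ul>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    // Applies the stylesheet to the response text and returns the generated HTML.
    public static String toHtml(String solrXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter html = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(solrXml)),
                    new StreamResult(html));
            return html.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Could not transform Solr response", e);
        }
    }
}
```

In a real deployment the stylesheet would more likely live in its own .xsl file and be compiled once rather than on every request. It is embedded as a string here only to keep the example self-contained.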
Step 4
Create a web interface that provides full-text search over Apache's email archives at http://mail-archives.apache.org. Apache's email archives use Ajax techniques to provide quick rendering, so adding Ajax to the web interface here sounds like a must-have. Use a highlight library to provide the highlighting feature.
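To make the highlighting requirement concrete, the sketch below marks every case-insensitive occurrence of a search term in a piece of text. It is a hand-written stand-in, not the highlight library the step refers to. The class name `SimpleHighlighter` and the `hit` CSS class are assumptions. The surrounding text is HTML-escaped so that mail content cannot inject markup into the results page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: wraps matches of a term in <span class="hit">...</span>.
public class SimpleHighlighter {

    // Escapes characters that are special in HTML text and attribute values.
    static String escapeHtml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;")
                   .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Returns HTML-escaped text with every case-insensitive match of term wrapped in a span.
    public static String highlight(String text, String term) {
        if (term == null || term.length() == 0) {
            return escapeHtml(text); // nothing to highlight
        }
        // Pattern.quote treats the term literally, so regex metacharacters are safe.
        Pattern pattern = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            // Escape the unmatched gap and the match separately, so the span tags stay intact.
            out.append(escapeHtml(text.substring(last, matcher.start())));
            out.append("<span class=\"hit\">")
               .append(escapeHtml(matcher.group()))
               .append("</span>");
            last = matcher.end();
        }
        out.append(escapeHtml(text.substring(last)));
        return out.toString();
    }
}
```

Different colors per search term, as mentioned in the project description, could be obtained by passing a different CSS class for each term. That variation is not shown here.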
Step 5
This is the final step, in which we test the whole project. Many of the bugs found during testing will be fixed in this step.
Schedule
- Step 1: Should be completed by June 14, 2006
- Step 2: Should be completed by June 29, 2006
- Step 3: Should be completed by July 15, 2006
- Step 4: Should be completed by August 7, 2006
- Step 5: Should be completed by August 21, 2006
Bio
I am a student in the Computer Science Department of Moscow State University, Russia. I am interested in integrating the Lucene search engine to provide fast full-text search in complex web applications.
Development Methodology
- UML Modeling
- Building the project with Ant (a must-have)
- Version control with CVS or SVN
- Generating documentation with the javadoc tool
- Testing with JUnit
The best individual to do this project
I am using the Lucene library in my project JSMBSearch (https://jsmbsearch.dev.java.net), so I have some experience with Lucene. Java is my main programming language.
I like Apache's projects and want to contribute to them, so I think I am the best individual to do this project. This summer is a good chance for me!