Ideas for James at Google Summer of Code 2011

Context: James (Java Apache Mail Enterprise Server) is a set of mail-related libraries bundled into a server (http://james.apache.org). It supports the standard mail protocols (SMTP, POP3, IMAP4) and can store mail using different technologies (maildir, database, JCR).

Students: We are looking for students to help add more distributed storage back ends (Hadoop and NoSQL). We are also looking for students to give end users more functionality, such as mail filtering/categorization and "out-of-office" replies. A good knowledge of the Java programming language is required. Knowledge of email protocols and NoSQL storage systems is welcome (but not required).


Design and implement a distributed mailbox using Hadoop

Context: The mailbox subproject (http://james.apache.org/mailbox/) supports maildir, SQL database (via JPA) and Java Content Repository (JCR) as technologies for mail storage. This flexibility is achieved thanks to an API design that abstracts mail storage from the mail protocols.

Task: We need to implement mailbox storage as a distributed system on top of Hadoop HDFS, using the James mailbox API. The first step is to design how to interact with Hadoop (native API, Gora in the Apache incubator, ...) and to address the performance questions specific to loading and parsing mail in a distributed system (whether to use map/reduce, whether to use existing local Lucene indexes for search, ...). The second step is to implement the HDFS mailbox (the maildir mailbox is similar because it stores each mail as a file, and can serve as inspiration). A single James server will still be deployed because we don't have any distributed UID generation.
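To make the "one file per mail" idea concrete, here is a minimal sketch of a maildir-like layout. Everything in it is an assumption made for illustration: the class name FilePerMessageStore, the <root>/<mailbox>/<uid>.eml layout, and the use of a local directory (java.nio.file) as a stand-in for HDFS. It is not the James mailbox API, and an HDFS version would replace the file calls with the Hadoop file system API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: one file per message, laid out as <root>/<mailbox>/<uid>.eml.
// A local directory stands in for HDFS; this is not the James mailbox API.
// Mailbox names are assumed to be safe, single path segments.
class FilePerMessageStore {
    private final Path root;

    FilePerMessageStore(Path root) {
        this.root = root;
    }

    // Resolve the file that holds the message with the given UID.
    Path pathFor(String mailbox, long uid) {
        return root.resolve(mailbox).resolve(uid + ".eml");
    }

    // Write the raw message, creating the mailbox directory on first use.
    void save(String mailbox, long uid, String rawMessage) {
        Path target = pathFor(mailbox, uid);
        try {
            Files.createDirectories(target.getParent());
            Files.write(target, rawMessage.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the raw message back as text.
    String load(String mailbox, long uid) {
        try {
            return new String(Files.readAllBytes(pathFor(mailbox, uid)), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the path computation in one method (pathFor) means the layout can change, for example to spread mailboxes over several directories, without touching the read and write code.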

Mentor: eric at apache dot org & [fill in mentor]

Complexity: medium


Design and implement Distributed UID generation

Context: IMAP4rev1 (RFC 3501) requires that every message is identified by a stable 32-bit Unique Identifier (UID) assigned in strictly ascending order within a mailbox. This is currently achieved in the James IMAP subproject (http://james.apache.org/imap) with a UidProvider interface implemented in memory. This implementation cannot be used when James is distributed over several servers.
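As an illustration of what an in-memory provider does, here is a minimal sketch of a per-mailbox counter. The class name InMemoryUidProvider, the method names and the use of a String mailbox identifier are assumptions for this example; the actual James UidProvider interface may have different signatures.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for an in-memory UID provider; names and signatures
// are illustrative and are not the actual James interface.
class InMemoryUidProvider {
    // One counter per mailbox; each counter holds the last UID handed out.
    private final ConcurrentMap<String, AtomicLong> lastUids = new ConcurrentHashMap<>();

    // Return the next UID for the mailbox; the first UID handed out is 1.
    long nextUid(String mailboxId) {
        return lastUids.computeIfAbsent(mailboxId, id -> new AtomicLong(0)).incrementAndGet();
    }

    // Return the last UID handed out, or 0 if the mailbox has none yet.
    long lastUid(String mailboxId) {
        AtomicLong counter = lastUids.get(mailboxId);
        return counter == null ? 0 : counter.get();
    }
}
```

The counters live in the memory of a single JVM, so two servers running this code would hand out the same UIDs independently. That is the limitation this project idea addresses.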

Task: A DistributedUidProvider must be designed and implemented. The design can rely on a distributed memory cache such as Hazelcast, or on any other solution (Hadoop, HBase, Cassandra, ...).
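One possible shape for such a provider, sketched under stated assumptions: the chosen back end offers an atomic compare-and-set per mailbox, and every server asks the shared store for each UID so that UIDs stay strictly ascending. The SharedCounterStore interface, the LocalCounterStore stand-in and the class DistributedUidProviderSketch are hypothetical names introduced for this example, not part of James or of any of the products listed above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical abstraction over the shared store (distributed map, table row, ...).
// The only assumption is an atomic compare-and-set per mailbox.
interface SharedCounterStore {
    // Last UID assigned in the mailbox, or 0 if none was assigned yet.
    long read(String mailboxId);

    // Atomically move the counter from expected to updated; false if another node won.
    boolean compareAndSet(String mailboxId, long expected, long updated);
}

// Local stand-in so the sketch runs without a cluster.
class LocalCounterStore implements SharedCounterStore {
    private final ConcurrentMap<String, Long> values = new ConcurrentHashMap<>();

    public long read(String mailboxId) {
        return values.getOrDefault(mailboxId, 0L);
    }

    public boolean compareAndSet(String mailboxId, long expected, long updated) {
        if (expected == 0L) {
            // No entry exists yet, so only the first writer may create it.
            return values.putIfAbsent(mailboxId, updated) == null;
        }
        return values.replace(mailboxId, expected, updated);
    }
}

class DistributedUidProviderSketch {
    private final SharedCounterStore store;

    DistributedUidProviderSketch(SharedCounterStore store) {
        this.store = store;
    }

    // Retry until this node wins the compare-and-set for the next value.
    long nextUid(String mailboxId) {
        while (true) {
            long last = store.read(mailboxId);
            long next = last + 1;
            if (store.compareAndSet(mailboxId, last, next)) {
                return next;
            }
        }
    }
}
```

Going to the shared store for every UID costs one round trip per message. Reserving blocks of UIDs per server would be cheaper, but messages arriving on different servers could then receive UIDs out of arrival order, which is why the sketch avoids it.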

Mentor: eric at apache dot org & [fill in mentor]

Complexity: medium


Design and implement machine learning filters and categorization for mail

Context: Anti-spam functionality based on SpamAssassin is available in James (based on mailets, http://james.apache.org/mailet). Bayesian mailets are also available, but are not completely integrated or documented. Nothing is available to automatically categorize mail traffic per user.

Task: We want to align the existing implementation with a modern anti-spam solution based on a powerful machine learning implementation (such as Apache Mahout). We also want to extend the use of machine learning to mail categorization (spam vs. not-spam is a first category; it can be extended to any additional category we can imagine). The implementation can run while the mail is being spooled, when the mail is stored in the mailbox, or both.
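To show what categorization beyond spam vs. not-spam could look like, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing over arbitrary category names. The class NaiveBayesMailClassifier and its simple tokenizer are assumptions made for this example. It is not the existing Bayesian mailet code, and a real implementation would more likely delegate to a library such as Mahout.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative multinomial naive Bayes with add-one smoothing; not James code.
class NaiveBayesMailClassifier {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // category -> word -> count
    private final Map<String, Integer> totalWords = new HashMap<>();              // category -> number of tokens
    private final Map<String, Integer> docCounts = new HashMap<>();               // category -> trained messages
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Lower-case the text and split on anything that is not a letter or digit.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    // Add one labelled message to the model.
    void train(String category, String text) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (word.isEmpty()) continue;
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
    }

    // Return the category with the highest log posterior, or null if untrained.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            double score = Math.log((double) docCounts.get(category) / totalDocs); // log prior
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            for (String word : tokenize(text)) {
                if (word.isEmpty()) continue;
                int count = counts.getOrDefault(word, 0);
                score += Math.log((count + 1.0) / (total + vocabulary.size()));    // smoothed likelihood
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }
}
```

Because categories are plain strings, the same model can be trained per user with labels such as "work" or "newsletter" in addition to "spam".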

Related discussions: See also the discussions on intelligent mail mining at http://markmail.org/message/2bodrwvdvtfq3f2v (Mahout related) and http://markmail.org/thread/pksl6csyvoeo27yh (Hama related).

Mentor: eric at apache dot org & [fill in mentor]

Complexity: high


SIEVE Extensions

Context: SIEVE is an email filtering language. JSieve is the James implementation. In recent years, the IETF Sieve working group has been active in defining extensions. Users would also benefit from exposing features of existing mailets as extensions.

Task: Learn to work with a modern domain-specific language, and perhaps refactor the existing architecture as needed. Negotiate a realistic set of target extensions to implement, and then get started :-)

Mentor: rdonkin at apache dot org & [fill in mentor]

Complexity: easy


Design and Implement Mailbox with NoSQL Storage

Context: The mailbox subproject (http://james.apache.org/mailbox/) supports maildir, SQL database (via JPA) and Java Content Repository (JCR) as technologies for mail storage. This flexibility is achieved thanks to an API design that abstracts mail storage from the mail protocols.

Task: NoSQL storage (for example CouchDB or Cassandra) has great potential for mail storage. Design and develop a suitable RESTful integration API and implementations for as many NoSQL targets as possible in the time available.

Mentor: rdonkin at apache dot org & [fill in mentor]

Complexity: easy


Add "out-of-office" functionality

Context: A frequently requested feature is the ability to set an "out-of-office" reply per user. When it is set, the sender automatically receives a default mail saying the recipient is not there.

Mentor: rdonkin at apache dot org

Complexity: medium

Task: The API and implementation must be defined (based on a mailet or not). The way the end user sets and unsets the "out-of-office" reply, as well as the message that will be sent, must also be designed (via the James Hupa webmail, for example). The SIEVE vacation extension (RFC 5230) is a good starting point, but whether an actual implementation of this standard is attempted is negotiable.
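Whatever API is chosen, the core of the feature is deciding whether a given incoming mail should trigger a reply. Here is a minimal sketch of that decision, loosely following the spirit of the vacation extension: no reply to automatic or mailing-list mail, and at most one reply per sender within a configurable interval. The class OutOfOfficeResponder, its method signature and the header map representation are assumptions for this example, not an existing James or JSieve API.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the reply decision for one user's out-of-office setting.
class OutOfOfficeResponder {
    private final Duration minInterval;                            // minimum time between replies to one sender
    private final Map<String, Instant> lastReply = new HashMap<>(); // sender -> time of last reply

    OutOfOfficeResponder(Duration minInterval) {
        this.minInterval = minInterval;
    }

    // headers maps lower-cased header names to their values (assumed representation).
    boolean shouldReply(String sender, Map<String, String> headers, Instant now) {
        // An empty sender marks a bounce; replying could create a mail loop.
        if (sender == null || sender.isEmpty()) return false;

        // Skip mail that declares itself automatic.
        String autoSubmitted = headers.get("auto-submitted");
        if (autoSubmitted != null && !autoSubmitted.trim().equalsIgnoreCase("no")) return false;

        // Skip mailing-list traffic.
        if (headers.containsKey("list-id")) return false;

        // Reply at most once per sender within the configured interval.
        String key = sender.toLowerCase(Locale.ROOT);
        Instant previous = lastReply.get(key);
        if (previous != null && Duration.between(previous, now).compareTo(minInterval) < 0) return false;

        lastReply.put(key, now);
        return true;
    }
}
```

The reply history is held in memory here. A real implementation would need to persist it per user so that a server restart does not cause duplicate replies.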


GSOC2011 (last edited 2011-10-08 08:51:38 by IEugenStan)