Spam detection mailets using Bayesian analysis techniques

BayesianAnalysis mailet

The BayesianAnalysis mailet scans a message and computes the probability that it is spam, using Bayesian probability techniques.

It is based on the principles described in A Plan For Spam (http://www.paulgraham.com/spam.html) by Paul Graham, extended with ideas from his Better Bayesian Filtering (http://paulgraham.com/better.html).
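As a rough illustration of the technique, the formulas given in A Plan For Spam can be written as below. The symbols g, b, n_good, n_bad and p_i are labels chosen here, and the constants are the ones quoted in Graham's essay; whether the mailet uses exactly these values is an assumption.

```latex
% Per-token spam probability, following "A Plan For Spam".
% g = 2 x (occurrences of the token in ham), b = occurrences in spam,
% n_good, n_bad = number of ham and spam messages in the corpus.
p(w) = \max\!\left(0.01,\ \min\!\left(0.99,\
  \frac{\min(1,\, b / n_{\mathrm{bad}})}
       {\min(1,\, g / n_{\mathrm{good}}) + \min(1,\, b / n_{\mathrm{bad}})}
\right)\right)

% Combined probability over the n most interesting tokens (n = 15 in the essay).
P(\mathrm{spam}) = \frac{\prod_{i=1}^{n} p_i}
                        {\prod_{i=1}^{n} p_i + \prod_{i=1}^{n} (1 - p_i)}
```

For instance, two tokens with p_1 = 0.99 and p_2 = 0.2 combine to 0.198 / (0.198 + 0.008), which is about 0.96.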

The analysis capabilities are based on token frequencies (the corpus) learned through a training process using the BayesianAnalysisFeeder mailet (see below) and stored in a JDBC database. During mailet initialization the corpus is loaded (built) from the database and kept in memory.

After a training session, the corpus must be rebuilt from the database to pick up the new frequencies. Every 10 minutes a background thread checks whether the feeder has changed the database and, if so, rebuilds the corpus for this mailet.

An org.apache.james.spam.probability mail attribute is created, containing the computed spam probability as a java.lang.Double. A message header with the name given in the headerName init parameter is also created, containing the same probability in floating-point representation.

Initialization Parameters

The init parameters are as follows:

The spam probability is prepended to the subject if it is greater than 0.1 (10%).

The required tables are created automatically if they do not already exist (see sqlResources.xml). The token field in both the ham and spam tables is case-sensitive.

A James config.xml example

The following config.xml definitions show one way to deploy the analysis mailet:

...

         <mailet match="All" class="BayesianAnalysis" onMailetException="ignore">
            <repositoryPath>db://maildb</repositoryPath>
            <maxSize>200000</maxSize>
            <headerName>X-MessageIsSpamProbability</headerName>
            <ignoreLocalSender>true</ignoreLocalSender>
         </mailet>
     
         <mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.90" class="AddHeader" onMatchException="noMatch">
            <name>X-MessageIsSpam</name>
            <value>true</value>
         </mailet>

         <mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.99" class="ToProcessor" onMatchException="noMatch">
            <processor> spam </processor>
            <notice>Spam not accepted</notice>
         </mailet>

...
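The ToProcessor mailet in the example hands matching mail to a processor named spam, which has to be defined elsewhere in config.xml. A minimal sketch of such a processor, assuming it simply stores the mail in a repository, might look like this. The use of the ToRepository mailet and the file://var/mail/spam/ path are illustrative assumptions, not part of the example above.

```xml
<!-- Hypothetical "spam" processor: stores every mail routed to it. -->
<processor name="spam">
   <!-- The repository path is a placeholder; adjust it to the local setup. -->
   <mailet match="All" class="ToRepository">
      <repositoryPath>file://var/mail/spam/</repositoryPath>
   </mailet>
</processor>
```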

BayesianAnalysisFeeder mailet

The BayesianAnalysisFeeder mailet feeds ham or spam messages to train the BayesianAnalysis mailet (see above).

The new token frequencies are stored in a JDBC database.

The Bayesian database tables are updated during training to reflect the new data. When processing finishes, the mail is destroyed (ghosted).

The correct approach is to send the original ham or spam message as an attachment to another message addressed to the feeder. All the headers of the enveloping message are removed, and only the original message's tokens are analyzed and used for feeding. This is because the BayesianAnalysis mailet examines all the tokens of a message (including headers), so the feeding process must be consistent with it.

After a training session, the frequency corpus used by the BayesianAnalysis mailet must be rebuilt from the database to take advantage of the new token frequencies. Every 10 minutes a background thread in the BayesianAnalysis mailet checks whether the database has changed and, if so, rebuilds its corpus.

Only one message at a time is scanned (the database update activity is synchronized). This limits database locking, since thousands of rows may be updated for a single message being fed.

Initialization Parameters

The init parameters are as follows:

A James config.xml example

The following config.xml definitions show one way to deploy the feeder mailet:

...

         <!-- "not spam" bayesian analysis feeder. -->
         <mailet match="RecipientIs=not.spam@thisdomain.com" class="BayesianAnalysisFeeder">
            <repositoryPath> db://maildb </repositoryPath>
            <feedType>ham</feedType>
            <maxSize>200000</maxSize>
         </mailet>

         <!-- "spam" bayesian analysis feeder. -->
         <mailet match="RecipientIs=spam@thisdomain.com" class="BayesianAnalysisFeeder">
            <repositoryPath> db://maildb </repositoryPath>
            <feedType>spam</feedType>
            <maxSize>200000</maxSize>
         </mailet>

...

The example above lets users send messages to the server, with the recipient email address indicating whether each message is ham or spam.

With this configuration, send good messages (ham) to "not.spam@thisdomain.com" and spam messages to "spam@thisdomain.com" to feed them into the corpus. It is a good idea to activate SMTP AUTH and to replace thisdomain.com with a domain that is not listed as a server in <servernames> in config.xml, so that only authenticated users can feed the corpus. For example, the addresses could be "ham@bayes.feeder" and "spam@bayes.feeder".
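Applying that suggestion to the feeder example gives a sketch like the following. Only the recipient addresses change; bayes.feeder is the illustrative domain mentioned above and is assumed not to appear in <servernames>. SMTP AUTH is configured separately and is not shown here.

```xml
<!-- "not spam" feeder, using a domain that is not a local server name. -->
<mailet match="RecipientIs=ham@bayes.feeder" class="BayesianAnalysisFeeder">
   <repositoryPath>db://maildb</repositoryPath>
   <feedType>ham</feedType>
   <maxSize>200000</maxSize>
</mailet>

<!-- "spam" feeder, same repository, opposite feed type. -->
<mailet match="RecipientIs=spam@bayes.feeder" class="BayesianAnalysisFeeder">
   <repositoryPath>db://maildb</repositoryPath>
   <feedType>spam</feedType>
   <maxSize>200000</maxSize>
</mailet>
```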

Bayesian_Analysis (last edited 2009-09-20 22:58:10 by localhost)