How it works

This plugin checks for specific keywords in image/gif attachments, using gocr (an optical character recognition program).

This plugin can be used to detect spam that puts all the real spam content in an attached image. The mail itself only random text and random html, without any URL's or identifiable information.

Requirements

You will need convert (imagemagick) and gocr installed.

Installation

Save the two files below in your local configuration directory, adjusting the score in Ocr.cf as you like, and the wordlist (my @words =) in Ocr.pm according to the spam you are receiving. You might want to run gocr by hand on the image attachments to look for words that are correctly recognized.

Remarks

ToDo

-- Author: Maarten de Boer, mdeboer -at- iua -dot- upf -dot- edu

Changelog

Version 2:

Code

Ocr.cf

loadplugin Ocr Ocr.pm
body OCR eval:check_ocr()
describe OCR Check if text in attached images contains spam words
score OCR 3.0

Ocr.pm

# Ocr plugin, version 2
package Ocr;

use strict;
use Mail::SpamAssassin;
use Mail::SpamAssassin::Util;
use Mail::SpamAssassin::Plugin;

our @ISA = qw (Mail::SpamAssassin::Plugin);

# constructor: register the eval rule
sub new {
   my ( $class, $mailsa ) = @_;
   $class = ref($class) || $class;
   my $self = $class->SUPER::new($mailsa);
   bless( $self, $class );
   $self->register_eval_rule("check_ocr");
   return $self;
}

sub check_ocr {
   my ( $self, $pms ) = @_;
   my $cnt = 0;
   foreach my $p ( $pms->{msg}->find_parts("image") ) {
      my ( $ctype, $boundary, $charset, $name ) =
        Mail::SpamAssassin::Util::parse_content_type(
         $p->get_header('content-type') );
      if ( $ctype eq "image/gif" ) {
         open OCR, "|/usr/bin/convert -flatten - pnm:-|/usr/bin/gocr -i - > /tmp/spamassassin.ocr.$$";
         foreach $p ( $p->decode() ) {
            print OCR $p;
         }
         close OCR;
         open OCR, "/tmp/spamassassin.ocr.$$";
         my @words =
           ( 'company', 'money', 'stock', 'million', 'thousand', 'buy', 'price', 'don\'t' );
         while (<OCR>) {
            my $w;
            foreach $w (@words) {
               if (m/$w/i) {
                  $cnt++;
               }
            }
         }
         unlink "/tmp/spamassassin.ocr.$$";
      }
   }
   return ( $cnt > 1 );
}

1;

OcrPlugin (last edited 2009-09-20 23:16:53 by localhost)