FuzzyOcrPlugin

How it works

NOTE: This plugin is based on the OcrPlugin written by Maarten de Boer and was extended and improved.

This plugin checks for specific keywords in image/gif, image/jpeg or image/png attachments, using gocr (an optical character recognition program).

This plugin can be used to detect spam that puts all the real spam content in an attached image. The mail itself contains only random text and random HTML, without any URLs or identifiable information.

In addition to what the original OcrPlugin does, this plugin can match words approximately, so recognition errors or attempts to obfuscate the text inside the image will not cause the detection to fail. Another improvement is that the wordlist now lives in the configuration file, so it can be easily extended.

Requirements

You will need giftopnm, jpegtopnm and pngtopnm (from netpbm) and gocr installed.

Additionally, you will need the Perl module

 String::Approx

and giffix (from giflib).

Note for Red Hat/Fedora Core users: The packages libungif and libungif-progs should be installed to provide giffix.
Note for Debian users: The package libungif-bin provides giffix.
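
Before installing, it can help to check that the helper programs listed above are on the PATH. The following is only a small sketch of such a check and is not part of the plugin; the function name missing_tools is made up for illustration.

```python
import shutil

# Helper programs named in the requirements above.
REQUIRED_TOOLS = ["giftopnm", "jpegtopnm", "pngtopnm", "gocr", "giffix"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing helper programs: " + ", ".join(missing))
    else:
        print("All helper programs found.")
```

The Perl module can be checked separately with perl -MString::Approx -e 1, which exits with an error if the module is not installed.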

Changelog

Version 2.0:

Replaced imagemagick with netpbm tools
Plugin now invokes giffix on gifs to handle intentionally corrupted gifs
Added png support
Added magic byte detection to detect the correct file format independently of the content-type
Added 3 verbosity levels
Added configuration option for tmp file path and scores

Version 2.1:

Added scoring for wrong content-type
Added scoring for broken gif images
Added configuration for helper applications
Added autodisable_score feature to disable the OCR engine if the message already has enough points

Installation

Download the tarball (see How to obtain) to your SpamAssassin configuration directory and unpack it. Open FuzzyOcr.cf and extend the wordlist as you wish.

The scoring is dynamic: more word matches lead to a higher score. Nothing is scored until focr_counts_required matches have been found. At that point the rule scores exactly focr_base_score points, and every additional match adds another focr_add_score points.

Attention: Do not add a score line to the config file. It will not be used! Scoring is done INTERNALLY and can only be configured with the parameters described above.

The variable $countreq can be adjusted via the configuration file parameter focr_counts_required and indicates the number of matches that need to be found before any score will be triggered.
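
The scoring rule described above can be written down as a small function. This is a sketch of the arithmetic only, not the plugin's actual code; the function name and the parameter values in the example are made up for illustration and are not the plugin's defaults.

```python
def fuzzy_ocr_score(matches, counts_required, base_score, add_score):
    """Score for a given number of wordlist matches.

    Below counts_required matches nothing is scored. At exactly
    counts_required matches the score is base_score, and every further
    match adds add_score.
    """
    if matches < counts_required:
        return 0.0
    return base_score + (matches - counts_required) * add_score


if __name__ == "__main__":
    # Hypothetical settings, chosen only to show how the score grows.
    for n in range(0, 6):
        print(n, fuzzy_ocr_score(n, counts_required=2, base_score=4.0, add_score=1.0))
```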

The variable $treshold is adjusted in the same way with the configuration file parameter focr_treshold. This is a float value between 0 and 1 and indicates the maximum relative edit distance allowed between the wordlist word and the obfuscated version found in the image (lower values mean the words need to be more similar, 0 means identical). The default of 0.3 normally does not need to be changed. Note that this module also matches substrings (see example).
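
To get a feeling for what a relative edit distance of 0.3 means, here is a minimal sketch. It assumes that "relative" means the edit distance divided by the length of the wordlist word; the plugin delegates matching to String::Approx, whose exact calculation may differ. The function names are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]


def relative_distance(word, candidate):
    """Edit distance relative to the length of the wordlist word
    (assumed definition, see the note above)."""
    return edit_distance(word, candidate) / len(word)


if __name__ == "__main__":
    threshold = 0.3
    for candidate in ["investor", "invstor", "inlestor", "inxxxtor"]:
        d = relative_distance("investor", candidate)
        print(candidate, round(d, 3), d <= threshold)
```

Under this assumption a threshold of 0.3 allows up to two edits in an eight-letter word such as investor, and one edit in a six-letter word.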

Explanation of the additional options:

focr_tmp_path - Absolute path to a directory where the plugin may write temporary files (without trailing slash)

focr_verbose - Verbosity level (0 - 2). (1 is currently the default)

0 means normal operation.
1 means output all words and the corresponding measured distance in the rule output:

                        6.0 FUZZY_OCR        BODY: Mail contains an image with common spam text inside
                            Words found:
                            "viagra" with fuzz of 0.2
                            "cialis" with fuzz of 0
                            "viagra" with fuzz of 0.2
                            "levitra" with fuzz of 0
                            (4 word occurrences found)

2 means the same as 1, and additionally writes the text recognized by gocr to a file debug.<number>.focr in the local directory.
The first line of this file contains the recognized format type (1 means gif, 2 jpeg, 3 png).

focr_bin_* - Tells the plugin where to find the helper applications. Change these to the full path including the binary name if your applications are not found.

focr_wrongctype_score - Score to give for a wrong content-type (e.g. the image is a GIF but the content-type says image/jpeg)
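
Detecting a wrong content-type comes down to comparing the declared type with the file's leading "magic" bytes. The sketch below shows the idea using the standard GIF, JPEG and PNG signatures and the format numbers from the debug file (1 gif, 2 jpeg, 3 png). How the plugin implements this internally is not described here, so the function names are illustrative only.

```python
GIF, JPEG, PNG = 1, 2, 3  # same numbering as in the debug file

# Standard file signatures at the start of each format.
MAGIC = [
    (b"GIF87a", GIF),
    (b"GIF89a", GIF),
    (b"\xff\xd8\xff", JPEG),
    (b"\x89PNG\r\n\x1a\n", PNG),
]

CONTENT_TYPES = {"image/gif": GIF, "image/jpeg": JPEG, "image/png": PNG}


def detect_format(data):
    """Return the format number for the image bytes, or None if unknown."""
    for signature, fmt in MAGIC:
        if data.startswith(signature):
            return fmt
    return None


def wrong_content_type(data, content_type):
    """True if the bytes are a recognized image format that does not
    match the declared content-type."""
    actual = detect_format(data)
    declared = CONTENT_TYPES.get(content_type.lower())
    return actual is not None and actual != declared


if __name__ == "__main__":
    gif_bytes = b"GIF89a" + b"\x00" * 10
    print(detect_format(gif_bytes))
    print(wrong_content_type(gif_bytes, "image/jpeg"))
    print(wrong_content_type(gif_bytes, "image/gif"))
```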

focr_corrupt_score - Score to give for a corrupted image (Currently only used with GIF images)

focr_autodisable_score - If the message already has more points than this value, the plugin will cancel all further OCR checking.

Example of work

Let's say you have defined focr_word investor in your configuration. Now you receive an image which, after conversion and recognition, gives you:

ATTENTION ALL IN\lESTORS AND DAY TRADERS

Then the plugin will find the word investor. It would even succeed if the text was ATTENTION ALL STUPUDIN\lESTORSHAHA or INVSTORSZ etc.

Generally, the plugin follows these rules:

Case is not relevant
All special characters and numbers are stripped before any matching is done
Your wordlist word will be found even if it is inside another word (submatching)
The distance is calculated from the number of character additions, deletions and substitutions that are needed to get from one word to the other.
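
The rules above can be illustrated with a short sketch. It is not the plugin's code (the plugin uses String::Approx for the matching); it is a plain dynamic-programming approximation of the same idea. It assumes that the relative distance is the edit distance divided by the length of the wordlist word and that whitespace is kept when stripping, neither of which is confirmed here. All function names are made up for illustration.

```python
import re


def normalize(text):
    """Lowercase the text and strip everything except letters and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())


def best_substring_distance(word, text):
    """Smallest edit distance between `word` and any substring of `text`.

    Same recurrence as the usual edit distance, but the first row is all
    zeros so that a match may start at any position in the text.
    """
    prev = [0] * (len(text) + 1)
    for i, cw in enumerate(word, start=1):
        cur = [i]
        for j, ct in enumerate(text, start=1):
            cost = 0 if cw == ct else 1
            cur.append(min(prev[j] + 1,          # word character missing in text
                           cur[j - 1] + 1,       # extra character in text
                           prev[j - 1] + cost))  # substitution or exact match
        prev = cur
    return min(prev)  # the match may also end anywhere in the text


def fuzzy_match(word, text, threshold=0.3):
    """True if `word` occurs in `text` within the relative distance threshold."""
    distance = best_substring_distance(word.lower(), normalize(text))
    return distance / len(word) <= threshold


if __name__ == "__main__":
    samples = [
        r"ATTENTION ALL IN\lESTORS AND DAY TRADERS",
        r"ATTENTION ALL STUPUDIN\lESTORSHAHA",
        "INVSTORSZ",
        "hello world",
    ]
    for sample in samples:
        print(fuzzy_match("investor", sample), sample)
```

Raw strings are used for the samples because \l is not a valid escape sequence in Python.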

Remarks

The words checked for are specific to some spam I have recently received a lot of.
gocr can use quite a lot of resources, so be careful. However, it is only executed for messages that contain gif, png or jpeg attachments.

ToDo

Avoid usage of tmp files for gocr, redirect output directly back to the script

– Author: Christian Holler, decoder_at_own-hero_dot_net

How to obtain

You can download the latest tarball containing FuzzyOcr.pm and FuzzyOcr.cf from http://users.own-hero.net/~decoder/fuzzyocr/
