...

Notes for Red Hat/FC users: The packages libungif and libungif-progs should be installed to provide giffix.

Notes for Debian users: The package libungif-bin provides giffix.

Changelog

Version 2.0:

  • Replaced imagemagick with netpbm tools
  • The plugin now runs giffix on GIFs to handle intentionally corrupted GIFs
  • Added PNG support
  • Added magic byte detection to determine the correct file format independently of the content-type
  • Added 3 verbosity levels
  • Added configuration option for tmp file path and scores

Version 2.1:

  • Added scoring for wrong content-type
  • Added scoring for broken gif images
  • Added configuration for helper applications
  • Added autodisable_score feature to disable the OCR engine if the message already has enough points

Installation

Download the tarball (see How to obtain) to your SpamAssassin configuration directory and unpack it, or save the two files below in your local configuration directory. Open FuzzyOcr.cf and extend the wordlist as you wish.

...

focr_tmp_path - Absolute path (without trailing slash) of a directory where the plugin may write temporary files

focr_verbose - Verbosity level (0 - 2); 1 is currently the default.

  • 0 means normal operation.
  • 1 means output all words and the corresponding measured distance in the rule output:

                        6.0 FUZZY_OCR        BODY: Mail contains an image with common spam text

...

 inside
                            Words found:
                            "viagra" with fuzz of 0.2
                            "cialis" with fuzz of 0
                            "viagra" with fuzz of 0.2
                            "levitra" with fuzz of 0
                            (4 word occurrences found)
  • 2 means the same as 1, and additionally writes the text recognized by gocr to a file named debug.<number>.focr in the local directory.
    The first line of this file contains the recognized format type (1 means gif, 2 means jpeg, 3 means png).
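
The format type written to the debug file can be illustrated with a small magic byte check. The following Python sketch is not taken from the plugin (FuzzyOcr.pm is Perl); the function name detect_format and the return value 0 for unknown data are assumptions made for illustration. Only the codes 1, 2 and 3 come from the description above, and the byte signatures are the standard ones for GIF, JPEG and PNG.

```python
# Illustrative sketch only; not the plugin's actual Perl implementation.
# Format codes as described for the debug file: 1 = gif, 2 = jpeg, 3 = png.
GIF, JPEG, PNG = 1, 2, 3
UNKNOWN = 0  # assumption: the documentation does not say what unknown data yields


def detect_format(data: bytes) -> int:
    """Guess the image format from its leading magic bytes."""
    if data.startswith((b"GIF87a", b"GIF89a")):
        return GIF
    if data.startswith(b"\xff\xd8"):  # JPEG start-of-image marker
        return JPEG
    if data.startswith(b"\x89PNG\r\n\x1a\n"):  # 8-byte PNG signature
        return PNG
    return UNKNOWN
```

Because the check looks only at the file content, it gives the same answer whatever content-type the sender declared.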

focr_bin_* - Tells the plugin where to find the helper applications. If your applications are not found, change these to the full path including the binary name.

focr_wrongctype_score - Score to give for a wrong content-type (e.g. Image is GIF but content-type says image/jpeg)

focr_corrupt_score - Score to give for a corrupted image (Currently only used with GIF images)

focr_autodisable_score - If the message already has more points than this value, the plugin skips all further OCR checking.
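
As a rough illustration of how these three options could interact, here is a Python sketch. It is an assumption-laden model, not the plugin's code: the function names, the idea that the two image scores simply add up, and all numeric values in the usage lines are hypothetical. Only the meaning of each option is taken from the descriptions above.

```python
# Illustrative sketch only; names and the additive behaviour are assumptions.

def should_run_ocr(current_score: float, autodisable_score: float) -> bool:
    """OCR is skipped once the message already has MORE points than the limit."""
    return current_score <= autodisable_score


def image_penalty(declared_ctype: str, detected_ctype: str, is_corrupt: bool,
                  wrongctype_score: float, corrupt_score: float) -> float:
    """Extra points for a mismatched content-type and for a corrupted GIF."""
    penalty = 0.0
    if declared_ctype.lower() != detected_ctype.lower():
        penalty += wrongctype_score  # e.g. image is GIF but header says image/jpeg
    if is_corrupt and detected_ctype.lower() == "image/gif":
        penalty += corrupt_score     # corruption is currently only checked for GIFs
    return penalty


# Hypothetical values, chosen only for the example:
print(should_run_ocr(current_score=12.0, autodisable_score=10.0))
print(image_penalty("image/jpeg", "image/gif", True, 1.5, 2.5))
```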

Example of work

Let's say you have defined focr_word investor in your configuration. Now you receive an image which, after conversion and recognition, gives you:

...

– Author: Christian Holler, decoder_at_own-hero_dot_net

Code

FuzzyOcr.cf

loadplugin FuzzyOcr FuzzyOcr.pm
body FUZZY_OCR eval:check_fuzzy_ocr()
describe FUZZY_OCR Mail contains an image with common spam text inside

# Here we define the words to scan for

focr_word stock
focr_word investor
focr_word international
focr_word company
focr_word money
focr_word million
focr_word thousand
focr_word buy
focr_word price
focr_word trade
focr_word banking
focr_word service
focr_word kunde
focr_word volksbank
focr_word sparkasse
focr_word software
focr_word viagra
focr_word cialis
focr_word levitra
focr_word medicine
focr_word legal
focr_word medication
focr_word click here
focr_word penis
focr_word growth
focr_word drugs
focr_word pharmacy

# These parameters can be used to change other detection settings
#
# Detection threshold (see manual)
#focr_treshold 0.3
#
# This is the score for a hit after focr_counts_required matches
#focr_base_score 4
#
# This is the additional score for every additional match after focr_counts_required matches
#focr_add_score 1
#
# Number of minimum matches before the rule scores
#focr_counts_required 2
#
# Verbosity level (see manual)
#focr_verbose 2
#
# Path for temporary files
#focr_tmp_path "/tmp"
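
The comments on focr_base_score, focr_add_score and focr_counts_required suggest a simple scoring rule, sketched below in Python. This is a reading of those comments, not the plugin's Perl code. The function name is hypothetical, and treating the commented-out values (4, 1 and 2) as defaults is an assumption.

```python
# Illustrative sketch only; defaults mirror the commented-out values above.

def fuzzy_ocr_score(matches: int, base_score: float = 4.0,
                    add_score: float = 1.0, counts_required: int = 2) -> float:
    """Score for a message given the number of matched word occurrences."""
    if matches < counts_required:
        return 0.0  # the rule does not score below the required match count
    # Base score at the required count, plus add_score per additional match.
    return base_score + add_score * (matches - counts_required)


print(fuzzy_ocr_score(4))
```

With these values, four word occurrences give 4 + 1 * (4 - 2) = 6.0, which agrees with the 6.0 FUZZY_OCR line in the sample rule output for focr_verbose 1.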

FuzzyOcr.pm

...

How to obtain

You can download the latest tarball containing FuzzyOcr.pm and FuzzyOcr.cf from http://users.own-hero.net/~decoder/fuzzyocr/