Comment: version 2.0

...

This plugin checks for specific keywords in image/gif, image/jpeg or image/png attachments, using gocr (an optical character recognition program).

...

Requirements

You will need giftopnm, jpegtopnm and pngtopnm (from netpbm) and gocr installed.

Additionally, you will need the Perl module

 String::Approx

and giffix (from giflib).

Changelog

Version 2.0:

  • Replaced imagemagick with netpbm tools
  • Plugin now invokes giffix on gifs to handle intentionally corrupted gifs
  • Added png support
  • Added magic byte detection to detect the correct file format independently of the content-type
  • Added 3 verbosity levels
  • Added configuration options for tmp file path and scores

Installation

Save the two files below in your local configuration directory. Open FuzzyOcr.cf and extend the wordlist as you wish.

The scoring is dynamic: more word matches lead to a higher score. The rule fires as soon as focr_counts_required matches have been found, and it then scores exactly focr_base_score points. Every additional match adds another focr_add_score points.

Attention: Do not add a score line to the config file. It will not be used! Scoring is done INTERNALLY and can only be configured with the two parameters described above.

The variable $countreq can be adjusted via the configuration file parameter focr_counts_required and indicates the number of matches that need to be found before any score will be triggered.
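
The scoring rule described above can be written out as a small calculation. The following is only a sketch: focr_score is a hypothetical helper that does not exist in FuzzyOcr.pm, and returning 0 for "no hit" is an assumption made for illustration (the plugin simply records no hit in that case).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of FuzzyOcr.pm.
# Arguments: number of matches found, focr_counts_required,
# focr_base_score, focr_add_score.
sub focr_score {
    my ($matches, $counts_required, $base_score, $add_score) = @_;

    # Assumption for this sketch: 0 stands for "rule did not fire".
    return 0 if $matches < $counts_required;

    # Base score once the required count is reached,
    # plus the additional score for every match beyond it.
    return $base_score + ($matches - $counts_required) * $add_score;
}

# Using focr_counts_required 2, focr_base_score 4 and focr_add_score 1:
printf "%d matches -> %.1f\n", $_, focr_score($_, 2, 4, 1) for 1 .. 5;
```

With these values, 4 matches give 4 + (4 - 2) * 1 = 6 points, which agrees with the 6.0 shown in the sample rule output for 4 word occurrences.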

The variable $treshold is similarly adjusted with the configuration file parameter focr_treshold. This is a float value between 0 and 1 and indicates the maximum relative edit distance between the wordlist word and the obfuscated version (less means the words need to be more similar, 0 means identical). The default of 0.3 normally does not need any change. Note that this module also matches substrings (see example).

Explanation of the additional options:

focr_tmp_path - String determining the absolute path to a directory where the plugin may write temporary files to (without trailing slash)
focr_verbose - Verbosity level (0 - 2).

  • 0 means normal operation.
  • 1 means output all words and the corresponding measured distance in the rule output:
    6.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
    Words found:
    "viagra" with fuzz of 0.2
    "cialis" with fuzz of 0
    "viagra" with fuzz of 0.2
    "levitra" with fuzz of 0
    (4 word occurrences found)
  • 2 means same as 1 with an additional output of the text recognized by gocr in a file debug.<number>.focr in the local directory
    This file also contains the recognized format type in the first line (1 means gif, 2 jpeg, 3 png).
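
The format numbers written to the debug file come from a check of the first bytes of the attachment. A minimal sketch of that check, using the same byte patterns as the plugin, might look like this; detect_image_type is a hypothetical name introduced only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of FuzzyOcr.pm.
# Returns 1 for gif, 2 for jpeg, 3 for png, 0 for anything else.
sub detect_image_type {
    my ($data) = @_;
    return 1 if $data =~ /^\x47\x49\x46/;        # "GIF"
    return 2 if $data =~ /^\xff\xd8/;            # JPEG start-of-image marker
    return 3 if $data =~ /^\x89\x50\x4e\x47/;    # "\x89PNG"
    return 0;                                    # unknown: the plugin skips the image
}

print detect_image_type("GIF89a"), "\n";
print detect_image_type("\xff\xd8\xff\xe0"), "\n";
print detect_image_type("\x89PNG\r\n\x1a\n"), "\n";
print detect_image_type("BM"), "\n";
```

Because the decision is based on the data itself, an image sent with a misleading content-type is still routed to the matching netpbm converter.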



Example of work

Let's say you have defined focr_word investor in your configuration. Now you receive an image which, after conversion and recognition, gives you:

...

  • The words checked for are specific for some spam I received a lot of recently.
  • gocr can take up quite a bit of resources, so be careful. But it is only executed for messages that contain gif, png or jpeg attachments.

ToDo

...

  • Avoid usage of tmp files for gocr, redirect output directly back to the script

...

loadplugin FuzzyOcr FuzzyOcr.pm
body FUZZY_OCR eval:check_fuzzy_ocr()
describe FUZZY_OCR Mail contains an image with common spam text inside

# Here we define the words to scan for

focr_word stock
focr_word investor
focr_word international
focr_word company
focr_word money
focr_word million
focr_word thousand
focr_word buy
focr_word price
focr_word trade
focr_word banking
focr_word service
focr_word kunde
focr_word volksbank
focr_word sparkasse
focr_word software
focr_word viagra
focr_word cialis
focr_word levitra
focr_word medicine
focr_word legal
focr_word medication
focr_word click here
focr_word penis
focr_word growth
focr_word drugs
focr_word pharmacy

# These parameters can be used to change other detection settings
# Normally these don't need to be changed.
#
# Detection threshold (see manual)
#focr_treshold 0.3
#
# This is the score for a hit after focr_counts_required matches
#focr_base_score 4
#
# This is the additional score for every additional match after focr_counts_required matches
#focr_add_score 1
#
# Number of minimum matches before the rule scores
#focr_counts_required 2
#
# Verbosity level (see manual)
#focr_verbose 2
#
# Path for temporary files
#focr_tmp_path "/tmp"

FuzzyOcr.pm

# FuzzyOcr plugin, version 2.0
# Changelog:
#    version 2.0
#       Replaced imagemagick with netpbm
#       Invoke giffix to fix broken gifs before conversion
#       Support png images
#       Analyze the file to detect the format without content-type
#       Added several configuration parameters
#
#
# written by Christian Holler decoder_at_own-hero_dot_net

package FuzzyOcr;

use strict;
use Mail::SpamAssassin;
use Mail::SpamAssassin::Util;
use Mail::SpamAssassin::Plugin;

use String::Approx 'adistr';

our @ISA = qw (Mail::SpamAssassin::Plugin);

our @words = ( );

# Default values
our $treshold = "0.3";
our $base_score = "4";
our $add_score = "1";
our $countreq = 2;
our $verbose = 1;
our $tmppath = "/tmp";


# constructor: register the eval rule
sub new {
   my ( $class, $mailsa ) = @_;
   $class = ref($class) || $class;
   my $self = $class->SUPER::new($mailsa);
   bless( $self, $class );
   $self->register_eval_rule("check_fuzzy_ocr");
   return $self;
}

sub parse_config {
  my ($self, $opts) = @_;
  if ($opts->{key} eq "focr_word") {
        push(@words, $opts->{value});
  } elsif ($opts->{key} eq "focr_treshold") {
        $treshold = $opts->{value};
  } elsif ($opts->{key} eq "focr_base_score") {
        $base_score = $opts->{value};
  } elsif ($opts->{key} eq "focr_add_score") {
        $add_score = $opts->{value};
  } elsif ($opts->{key} eq "focr_counts_required") {
        $countreq = $opts->{value};
  } elsif ($opts->{key} eq "focr_verbose") {
        $verbose = $opts->{value};
  } elsif ($opts->{key} eq "focr_tmp_path") {
        $tmppath = $opts->{value};
  }
}

sub check_fuzzy_ocr {
   my ( $self, $pms ) = @_;
   my @found = ( );
   my $image_type = 0;
   my $cnt = 0;
   foreach my $p ( $pms->{msg}->find_parts("image") ) {
      my ( $ctype, $boundary, $charset, $name ) =
        Mail::SpamAssassin::Util::parse_content_type(
         $p->get_header('content-type') );
         if ($ctype =~ /image/) {
         my $firstline = ($p->decode())[0];
         my $tempfile = $tmppath . "/" . "spamassassin.$$.focr";
         if ($firstline =~ /^\x47\x49\x46/) {
                $image_type = 1;
                open IMAGE_PROCESSOR, "|/usr/bin/giffix | /usr/bin/giftopnm - |/usr/bin/gocr -i - > $tempfile";
         } elsif ($firstline =~ /^\xff\xd8/) {
                $image_type = 2;
                open IMAGE_PROCESSOR, "|/usr/bin/jpegtopnm - |/usr/bin/gocr -i - > $tempfile";
         } elsif ($firstline =~ /^\x89\x50\x4e\x47/) {
                $image_type = 3;
                open IMAGE_PROCESSOR, "|/usr/bin/pngtopnm - |/usr/bin/gocr -i - > $tempfile";
         } else {
                $image_type = 0;
                print "No compatible file type detected... skipping image...\n";
                next;
         }
         foreach $p ( $p->decode() ) {
            print IMAGE_PROCESSOR $p;
         }
         if ($verbose > 1) {
                open DEBUG, ">debug.$$.focr";
                print DEBUG "File type: $image_type\n\n"
         }
         close IMAGE_PROCESSOR;
         open OCR_DATA, "<$tempfile";
         while (<OCR_DATA>) {
            s/[^a-zA-Z ]//g;
            $_ = lc;
            if ($verbose > 1) {
                print DEBUG $_;
            }
            my $w;
            foreach $w (@words) {
                $w = lc $w;
                my $matched = adistr($w, $_);
                if (abs($matched) < $treshold) {
                        $cnt++;
                        if ($verbose > 0) {
                                push(@found, "\"$w\"" . " with fuzz of " . abs($matched));
                        }
                }
            }
         }
         close OCR_DATA;
         unlink $tempfile;
         if ($verbose > 1) {
                close DEBUG;
         }
      }
   }
   if ($cnt >= $countreq) {
         my $score = $base_score + ($cnt - $countreq) * $add_score;
         my $debuginfo = "";
         if ($verbose > 0) {
                $debuginfo = ("\nWords found:\n" . join("\n", @found) ."\n($cnt word occurrences found)");
         }
         $pms->_handle_hit("FUZZY_OCR", $score, "BODY: ", $pms->{conf}->{descriptions}->{FUZZY_OCR} . $debuginfo);
   }
   return 0;
}

1;