Comment: Quote about it being unmaintained for 2 years, from project website

Attention

"This project is UNMAINTAINED as of 2009-06-01. Use it at your own risk."

The information on this page is out of date and needs updating. In particular, the Requirements, Configuration and Installation instructions might no longer be accurate for current versions.

FuzzyOcr now has a webpage located at:

http://fuzzyocr.own-hero.net/

More up-to-date information will be available there soon. Installation instructions for current versions are shipped with the tarballs available there.

How it works

NOTE: This plugin is based on the OcrPlugin written by Maarten de Boer, which it extends and improves.

...

You will need giftopnm, jpegtopnm and pngtopnm (from netpbm), ImageMagick and gocr installed.

Additionally, you will need the perl module

...

Patches for the sources can be found in the download directory of FuzzyOcr. Not applying them can cause problems under certain circumstances.

Notes for Fedora Core 5 (or higher) users: The package giflib-utils provides giffix. The package netpbm-progs provides giftopnm, etc.

Notes for other Redhat/FC users: The packages libungif and libungif-progs should be installed to provide giffix.
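
Before installing, it can help to confirm that the helper programs are actually on the PATH. The following is a minimal sketch, not part of FuzzyOcr. The function name missing_helpers and the list HELPERS are made up for this illustration, and ImageMagick is left out because this page does not say which of its commands the plugin calls.

```python
import shutil

# Helper names taken from the requirements on this page; adjust to your setup.
# (Assumption: each helper is installed under exactly this command name.)
HELPERS = ["giftopnm", "jpegtopnm", "pngtopnm", "giffix", "gocr"]


def missing_helpers(names=HELPERS):
    """Return the helper programs that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    for name in missing_helpers():
        print(f"missing: {name}")
```

If a helper is installed outside the PATH, the check reports it as missing even though the plugin could still use it once the path in the configuration file is adjusted.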

...

  • Several bugfixes
  • New debug system
  • Logfile support
  • Proper error handling for most errors

Version 2.3

  • Multiple scans with different pnm preprocessing and gocr arguments possible
  • Support for interlaced gifs
  • Support for animated gifs
  • Temporary file handling reorganized
  • External wordlist support
  • Personalized wordlist support
  • Spaces are now stripped from wordlist words and OCR results before matching
  • Experimental MD5 Database feature

Installation

Attention: If you need help installing this plugin or have other questions, please use the mailing list created for this plugin or contact me on IRC (see the end of this page for more information).

The mailing list can be found at http://lists.own-hero.net/mailman/listinfo/devel-spam

Since version 2.3, the tarball contains an INSTALL file and a FAQ file. Read both for installation instructions.

The following information is somewhat older and might no longer be accurate for version 2.3. Most new parameters are not mentioned here.

Download the tarball (see How to Obtain) to your SpamAssassin configuration directory and unpack it to /etc/mail/spamassassin/. You may choose another location, but then all necessary adjustments to the configuration file are up to you. Open FuzzyOcr.cf and extend the wordlist as you wish. If the helper binaries are in a different location than the default specified in the config file, change these entries to the correct paths.

...

The variable $countreq can be adjusted via the configuration file parameter focr_counts_required and indicates the number of matches that need to be found before any score will be triggered.

The variable $threshold is similarly adjusted with the configuration file parameter focr_threshold. This is a float value between 0 and 1 and indicates the maximum relative edit distance between the wordlist word and the obfuscated version (a lower value means the words need to be more similar; 0 means identical). The default of 0.3 normally does not need to be changed. Note that this module also matches substrings (see example).
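
To illustrate how the two parameters interact, here is a minimal Python sketch. It is not the plugin's code. It assumes that "relative" means the edit distance divided by the length of the wordlist word, that each wordlist word counts at most once, and that counts_required defaults to 2 (a value chosen only for this example). The function names are hypothetical, and substring matching is left out to keep the sketch short.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def is_match(word: str, candidate: str, threshold: float = 0.3) -> bool:
    # Assumption: relative distance = edit distance / length of the wordlist word.
    return edit_distance(word, candidate) / len(word) <= threshold


def should_score(wordlist, ocr_words, threshold=0.3, counts_required=2):
    # Assumption: a wordlist word counts once if any OCR word matches it.
    matches = sum(
        1
        for word in wordlist
        if any(is_match(word, candidate, threshold) for candidate in ocr_words)
    )
    return matches >= counts_required
```

With these assumptions, "vlagra" is one substitution away from the six-letter word "viagra", a relative distance of about 0.17, so it matches at the default threshold of 0.3. "vxxgra" needs two substitutions, about 0.33, and does not.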

Explanation of the additional options:

focr_tmp_path - String determining the absolute path to a directory where the plugin may write temporary files to (without trailing slash)

focr_logfile - String determining the file to send log messages to. Make sure this is writable!
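
The constraints on these two options can be checked before SpamAssassin is restarted. The sketch below is only an illustration, not something shipped with the plugin. The function check_paths is hypothetical, and it assumes that a log file which does not exist yet is acceptable as long as its directory is writable.

```python
import os


def check_paths(tmp_path: str, logfile: str) -> list:
    """Return a list of problems found with the two path settings."""
    problems = []
    if not os.path.isabs(tmp_path):
        problems.append("focr_tmp_path must be an absolute path")
    if len(tmp_path) > 1 and tmp_path.endswith("/"):
        problems.append("focr_tmp_path must not end with a slash")
    if not (os.path.isdir(tmp_path) and os.access(tmp_path, os.W_OK)):
        problems.append("focr_tmp_path is not a writable directory")
    # Assumption: if the log file does not exist yet, its directory must be writable.
    target = logfile if os.path.exists(logfile) else (os.path.dirname(logfile) or ".")
    if not os.access(target, os.W_OK):
        problems.append("focr_logfile is not writable")
    return problems
```

Run it as the user that SpamAssassin runs as; otherwise the writability checks say little about what the plugin will actually be able to do.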

...

  • The case is not relevant
  • All special characters, spaces or numbers are stripped before any matching is done
  • Your wordlist word will be found even if it is inside another word (submatching)
  • The distance is calculated from the number of character additions, deletions and substitutions that are needed.
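
The rules in this list can be sketched in Python as follows. This is an illustration under stated assumptions, not the plugin's implementation: normalization keeps only the letters a to z, the distance is made relative by dividing by the length of the wordlist word, and submatching is done with the standard approximate substring variant of the edit distance table (first row set to zero). All function names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    # Case is ignored; special characters, spaces and numbers are stripped.
    # Assumption: only the ASCII letters a-z are kept.
    return re.sub(r"[^a-z]", "", text.lower())


def best_substring_distance(word: str, text: str) -> int:
    """Smallest edit distance between `word` and any substring of `text`."""
    # Row 0 is all zeros, so a match may start at any position in `text`.
    prev = [0] * (len(text) + 1)
    for i, cw in enumerate(word, start=1):
        cur = [i]
        for j, ct in enumerate(text, start=1):
            cost = 0 if cw == ct else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    # Taking the minimum of the last row lets the match end anywhere in `text`.
    return min(prev)


def fuzzy_contains(word: str, text: str, threshold: float = 0.3) -> bool:
    w, t = normalize(word), normalize(text)
    if not w:
        return False
    # Assumption: relative distance = edit distance / length of the wordlist word.
    return best_substring_distance(w, t) / len(w) <= threshold
```

Under these assumptions, the wordlist word "viagra" is found in "Buy V-I-A-G-R-A 4 now!" with distance 0 after normalization, and in "cheapvlagrapills" with distance 1.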

...

  • The words checked for are specific to some spam I recently received a lot of.
  • gocr can consume considerable resources, so be careful. However, it is only executed for messages that contain gif, png or jpeg attachments.

ToDo

  • Rework animated gif handling
  • Replace plain MD5 database with a DBM file
  • Avoid usage of tmp files for gocr, redirect output directly back to the script

– Author: Christian Holler, decoder_at_own-hero_dot_net

...