Pimp my SpamAssassin / FuzzyOCR

Today I’ve added a new tool to our anti-spam system:
FuzzyOCR (update, Dec 13, 2007: the URL now contains only ads)

It’s OCR software that runs as a plugin for SpamAssassin.
OCR stands for “optical character recognition”, the process of recognizing characters and words in images. It’s quite useful for catching so-called “image spam”: mails whose text part looks harmless while the real message is hidden in images (inline GIFs, etc.).
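To give a rough idea of why the matching has to be “fuzzy”: OCR output from spam images is usually noisy, so an exact word search would miss a lot. The snippet below is only a minimal sketch of that idea and not FuzzyOCR’s actual code. It assumes the OCR step has already produced plain text, and it uses Python’s difflib as a stand-in for whatever approximate matching the plugin really does. The function name, the word list and the 0.8 threshold are made up for illustration.

```python
from difflib import SequenceMatcher

# Illustrative word list (assumption), not FuzzyOCR's real word list.
SPAM_WORDS = ["viagra", "cialis", "xanax", "valium", "pharmacy"]


def fuzzy_hits(ocr_text, words=SPAM_WORDS, threshold=0.8):
    """Count, for each spam word, the lines that contain a similar token.

    `threshold` is an arbitrary similarity cutoff chosen for this sketch.
    """
    hits = {}
    for line in ocr_text.lower().splitlines():
        tokens = line.split()
        for word in words:
            # A line counts once per word if any token is close enough.
            if any(SequenceMatcher(None, word, tok).ratio() >= threshold
                   for tok in tokens):
                hits[word] = hits.get(word, 0) + 1
    return hits


# Simulated noisy OCR output: "1" read instead of "i"/"l", "rn" instead of "m".
sample = "Cheap V1agra and Cia1is\nonline pharrnacy"
print(fuzzy_hits(sample))
```

An exact search for “viagra” would find nothing in that sample, while the similarity check still catches the misread words.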

The results are quite good and I’m confident : )

In addition to the packages listed on the FuzzyOCR homepage, you’ll need one more piece of software (at least on openSuSE 10.0): giflib-progs-4.1.3-7.i586.rpm

Here is an example that I’ve just received and that was correctly recognized as spam:
0.7 EXTRA_MPART_TYPE: Header has extraneous Content-type…
1.1 HTML_20_30 BODY: Message is 20% to 30% HTML
0.0 HTML_MESSAGE BODY: HTML included in message
0.0 BAYES_50 BODY: Bayesian spam probability is 40 to 60%
0.8 SARE_GIF_ATTACH FULL: Email has a inline gif
0.7 MY_CID_AND_STYLE SARE: cid and style
8.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
Words found:
“viagra” in 1 lines
“cialis” in 1 lines
“xanax” in 1 lines
“valium” in 1 lines
“pharmacy” in 1 lines
(5 word occurrences found)

From: Antonia [mailto:ademakerzcxl@xxxx.xxx]
Sent: Wednesday, February 26, 2007 9:04 PM
To: xxxx@xxxx.xxx
Subject: *****SPAM***** How’s It Going
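To see how much the FUZZY_OCR rule contributed, you can simply add up the scores from the report above. The snippet below is a small sketch that does this. The scores are copied from the report; the threshold of 5.0 is SpamAssassin’s default required_score and only an assumption here, since a site can configure a different value.

```python
# Rule scores copied from the report above.
scores = {
    "EXTRA_MPART_TYPE": 0.7,
    "HTML_20_30": 1.1,
    "HTML_MESSAGE": 0.0,
    "BAYES_50": 0.0,
    "SARE_GIF_ATTACH": 0.8,
    "MY_CID_AND_STYLE": 0.7,
    "FUZZY_OCR": 8.0,
}

# Assumption: SpamAssassin's default threshold; a site may use another value.
REQUIRED_SCORE = 5.0

total = round(sum(scores.values()), 1)
without_ocr = round(total - scores["FUZZY_OCR"], 1)

print(f"total score:       {total}")
print(f"without FUZZY_OCR: {without_ocr}")
print(f"spam with OCR:     {total >= REQUIRED_SCORE}")
print(f"spam without OCR:  {without_ocr >= REQUIRED_SCORE}")
```

Under that assumption the other rules add up to only 3.3 points, so this mail would have slipped through without the OCR hit.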