...
- MISSING_HEADERS: if a message doesn't have all the normal headers, such as From, To, and Subject, this will fire. Be sure to hand-verify any ham and spam messages that hit this to ensure that they're formatted correctly (in RFC-2822 format).
- NO_HEADERS_MESSAGE (or a combination of MISSING_HEADERS, MISSING_DATE, and MISSING_SUBJECT in versions < 3.2.0): generally means you've got a message without most of the important RFC-822 headers (often errors generated by MUAs/MDAs).
- EMPTY_MESSAGE: generally zero-length files, especially if accompanied by NO_RECEIVED.
- MISSING_HB_SEP: This is another danger sign, typically indicating that a header line has had a newline inserted incorrectly somehow, or an mbox "From" line has been inserted between RFC-822 headers.
- ANY_BOUNCE_MESSAGE: this indicates that the mail was a bounce message, a C/R challenge, or a "virus warning" from a broken scanner. These should be removed from both the ham and spam corpora, in general.
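The rule names above can be searched for in whatever per-message rule-hit log you have. A minimal sketch, assuming a plain-text log with one line per message that includes the message path and the rules it hit; the function name `find_suspect_hits` and the log layout are assumptions for illustration, not part of SpamAssassin:

```shell
# Hypothetical helper: print log lines that mention any of the
# "danger sign" rules listed above.
# Assumes one message per line, with rule names delimited by
# commas, spaces, or "=" (adjust the delimiters to your log format).
find_suspect_hits() {
    local log="$1"
    grep -E '(^|[ ,=])(MISSING_HEADERS|NO_HEADERS_MESSAGE|EMPTY_MESSAGE|MISSING_HB_SEP|ANY_BOUNCE_MESSAGE)([ ,]|$)' "$log"
}
```

Calling `find_suspect_hits /path/to/your.log` lists the candidate lines; each listed message still needs to be checked by hand.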
Other Corpus Cleaning Methods
DSPAM
DSPAM is a well-known standalone Bayesian tool; you can use it to cross-check your corpus quickly and easily.
It doesn't seem to be maintained anymore; this is probably the best version: https://github.com/ensc/dspam (download the master branch). If you are not comfortable compiling software, you will need to find a prebuilt package.
Example of how to build and install it in your home directory:
No Format |
---|
unzip master.zip && cd dspam-master
# autoconf/automake/gcc stuff obviously needed
./autogen.sh
./configure --prefix=$HOME/dspam --with-dspam-home=$HOME/dspam_data \
--disable-trusted-user-security --disable-syslog
make && make install
|
The steps below assume your corpus is in Maildir format (one file per message).
Learn the corpus:
No Format |
---|
# Always clear old data first
rm -rf $HOME/dspam_data
$HOME/dspam/bin/dspam_train $LOGNAME /path/to/spam /path/to/ham
|
Check the corpus:
No Format |
---|
#!/bin/bash
# Path matches the --prefix used in the build example above
DSPAM="$HOME/dspam/bin/dspam"
# Spam folder: report messages that DSPAM confidently classifies as ham
find /path/to/spam -type f | while read -r f; do
RESULT=$("$DSPAM" --user "$LOGNAME" --classify < "$f")
# Tune confidence >= 0.6 check if needed
if [[ "$RESULT" =~ (result=\"Innocent\".*confidence=(1|0\.[6-9].)) ]]; then
echo "$f ${BASH_REMATCH[1]}"
fi
done
# Ham folder: report messages that DSPAM confidently classifies as spam
find /path/to/ham -type f | while read -r f; do
RESULT=$("$DSPAM" --user "$LOGNAME" --classify < "$f")
# Tune confidence >= 0.6 check if needed
if [[ "$RESULT" =~ (result=\"Spam\".*confidence=(1|0\.[6-9].)) ]]; then
echo "$f ${BASH_REMATCH[1]}"
fi
done
|
It will output a list of messages to check. If a message is indeed in the wrong place, move it to the correct folder.
No Format |
---|
/path/to/spam/message123 result="Innocent"; class="Innocent"; probability=0.0000; confidence=0.73
/path/to/ham/message234 result="Spam"; class="Spam"; probability=0.0005; confidence=0.61
|
If you move a lot of messages around, retrain and run the check again.
If it keeps flagging messages that you have verified are in the right place, you can script a whitelist to ignore those files.
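One way to script such a whitelist, sketched under assumptions: save the check output to a file and keep a plain-text file listing the paths of messages you have already verified by hand. The function name `filter_whitelisted` and both input files are hypothetical, and message paths are assumed to contain no spaces.

```shell
# Hypothetical helper: drop already-verified messages from the check output.
#   $1 = saved check output, one "path result=..." line per flagged message
#   $2 = whitelist, one exact message path per line
filter_whitelisted() {
    local flagged="$1" whitelist="$2"
    # While reading the whitelist (first file argument), remember each path.
    # While reading the check output, print only lines whose first field
    # (the message path) was not remembered.
    awk 'FILENAME == ARGV[1] { skip[$0] = 1; next } !($1 in skip)' \
        "$whitelist" "$flagged"
}
```

For example, redirect the check script's output to `flagged.txt`, then run `filter_whitelisted flagged.txt verified.txt` to see only the messages that still need review.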