...
This assumes your corpus is in Maildir format (file per message).
Make sure PATH includes $HOME/dspam/bin if installed there.
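Before training, it can help to confirm that the tools are reachable and that the corpus folders hold roughly the number of messages you expect. The following is only a minimal sketch: the function names `count_messages` and `check_tool` are made up for illustration and are not part of dspam.

```shell
# Hypothetical helper: count regular files under a folder.
# In Maildir each message is one file, so this is the message count.
count_messages() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Hypothetical helper: report whether a command can be found via PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: NOT found, check PATH"
    fi
}
```

For example, `check_tool dspam` and `check_tool dspam_train` report whether both tools are on PATH, and `count_messages /path/to/spam` prints the number of message files in the spam folder.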
You can experiment with different learning methods. It is probably best to feed all manually verified messages first with --source=corpus. Training is not an exact science, so different methods, or a mix of them, may produce different false positives (FPs) and false negatives (FNs).
Learn the corpus (method 1):
No Format |
---|
# Always clear old data first
rm -rf $HOME/dspam/var
# This will learn the folders with --source=error
dspam_train $LOGNAME /path/to/spam /path/to/ham
|
Learn the corpus (method 2):
No Format |
---|
# Clear old data unless you are learning some additional corpus
rm -rf $HOME/dspam/var
# Feed your folders with --source=corpus
find /path/to/spam -type f | while read -r f; do
    dspam --user $LOGNAME --source=corpus --class=spam < "$f"
done
find /path/to/ham -type f | while read -r f; do
    dspam --user $LOGNAME --source=corpus --class=innocent < "$f"
done
|
Check the corpus:
No Format |
---|
#!/bin/bash
# List spam messages that are classified as innocent
find /path/to/spam -type f | while read -r f; do
    RESULT=$($HOME/dspam/bin/dspam --user $LOGNAME --classify < "$f")
    # Tune confidence >= 0.6 check if needed
    if [[ "$RESULT" =~ (result=\"Innocent\".*confidence=(1|0\.[6-9].)) ]]; then
        echo "$f ${BASH_REMATCH[1]}"
    fi
done
# List ham messages that are classified as spam
find /path/to/ham -type f | while read -r f; do
    RESULT=$($HOME/dspam/bin/dspam --user $LOGNAME --classify < "$f")
    # Tune confidence >= 0.6 check if needed
    if [[ "$RESULT" =~ (result=\"Spam\".*confidence=(1|0\.[6-9].)) ]]; then
        echo "$f ${BASH_REMATCH[1]}"
    fi
done
|
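If you would rather post-process the classification output than match it inline, the two fields used by the check can be pulled out with a small helper. This is a rough sketch, assuming only that the output of `dspam --classify` contains `result="..."` and `confidence=...` fields, as the regular expressions in the check imply. The function name `parse_result` and the sample line are illustrative, not captured dspam output.

```shell
# Hypothetical helper: print "<result> <confidence>" for one classification line.
# Assumes the line contains result="<word>" followed later by confidence=<number>.
parse_result() {
    out=$(printf '%s\n' "$1" |
        sed -n 's/.*result="\([A-Za-z]*\)".*confidence=\([0-9.]*\).*/\1 \2/p')
    if [ -n "$out" ]; then
        echo "$out"
    else
        # No recognizable fields in the input
        return 1
    fi
}

# Illustrative input only; real output may carry more fields.
parse_result 'result="Spam"; confidence=0.85'
```

Feeding the `RESULT` variable from the check to `parse_result` gives you the class and the confidence as two plain words, which are easier to sort or count than the full output line.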
...