10-Fold Cross Validation
This is a log of what I did to run a 10-fold cross-validation test of the perceptron vs the GA when testing bug 2910, http://bugzilla.spamassassin.org/show_bug.cgi?id=2910 (-- JustinMason 21/01/04).
First, I checked out the source:
{{{
svn co https://svn.apache.org/repos/asf/incubator/spamassassin/trunk
cd trunk
perl Makefile.PL
make
cd masses
}}}
get pgapack and install as "masses/pgapack". I just scp'd in an already-built tree I had here.
use the set-0 logs from the 2.60 GA run – taken from the rsync repository:
{{{
wc -l /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
  210442 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
  354479 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
}}}
we want about 2k lines in each bucket, otherwise the run will take weeks to complete. use split-log-into-buckets to cut the log files down in blocks of 10% until the ham:spam ratio and size are around 2k:2k per bucket.
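To make the bucketing idea concrete, here is a minimal sketch of a 10-way split that reads a log on stdin and writes split-1.log to split-N.log. It is not the real split-log-into-buckets script: it deals lines out round-robin with awk, which is an assumption chosen for simplicity, and the function name split_round_robin is hypothetical.

```shell
# Hypothetical stand-in for a bucket splitter (NOT the real
# tenpass/split-log-into-buckets): deal stdin lines out round-robin
# into split-1.log .. split-N.log in the current directory.
split_round_robin() {
    # $1 = number of buckets
    awk -v n="$1" '{
        out = "split-" ((NR - 1) % n + 1) ".log"   # bucket for this line
        print > out
    }'
}

# Usage (writes files into the current directory):
#   split_round_robin 10 < some.log
```

Round-robin assignment keeps every bucket within one line of the others, which is tighter than the bucket sizes recorded in this log, so the real script evidently distributes lines differently.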
ham buckets first:
{{{
./tenpass/split-log-into-buckets 10 < /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
mv split-1.log new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
  2104 split-1.log
}}}
much better!
{{{
mv split-*.log ../../logs/nonspam-jm/
./tenpass/split-log-into-buckets 10 < /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
mv split-1.log new
wc -l new
  35437 new
}}}
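given this, we want 6 of the 10 logfiles to make 21264 lines, which would result in a roughly even ham:spam ratio for testing. (the 21264 figure only works if "new" has first been split 10 ways again, so that each logfile holds about 3544 lines; that step is not shown in the commands.) let's do that.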
{{{
cat split-{1,2,3,4,5,6}.log > new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
  2126 split-1.log
}}}
perfect!
{{{
mv split-*.log ../../logs/spam-jm/
}}}
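The bucket sizes seen so far follow from plain integer arithmetic on the line counts, which can be checked with a few lines of shell. The sketch assumes the spam file "new" (35437 lines) is split 10 ways a second time before 6 of the resulting buckets are concatenated. The variable names are made up for illustration.

```shell
# Line counts taken from the wc output in this log.
ham_total=210442     # ham-set0.log
spam_tenth=35437     # "new" after the first 10-way spam split

# Ham: keep one tenth, then split that tenth 10 ways.
ham_bucket=$(( ham_total / 10 / 10 ))

# Spam: assumed second 10-way split of "new", keep 6 of those buckets,
# then split the concatenated result 10 ways.
spam_subset=$(( spam_tenth * 6 / 10 ))
spam_bucket=$(( spam_subset / 10 ))

echo "ham lines per bucket:  $ham_bucket"
echo "spam lines per bucket: $spam_bucket"
```

This prints 2104 and 2126, in line with the split-1.log sizes reported by wc.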
and doublecheck the log sizes:
{{{
wc -l ../../logs/*/*.log
   2104 ../../logs/nonspam-jm/split-1.log
   2103 ../../logs/nonspam-jm/split-10.log
   2106 ../../logs/nonspam-jm/split-2.log
   2103 ../../logs/nonspam-jm/split-3.log
   2102 ../../logs/nonspam-jm/split-4.log
   2105 ../../logs/nonspam-jm/split-5.log
   2102 ../../logs/nonspam-jm/split-6.log
   2103 ../../logs/nonspam-jm/split-7.log
   2103 ../../logs/nonspam-jm/split-8.log
   2104 ../../logs/nonspam-jm/split-9.log
   2126 ../../logs/spam-jm/split-1.log
   2127 ../../logs/spam-jm/split-10.log
   2126 ../../logs/spam-jm/split-2.log
   2126 ../../logs/spam-jm/split-3.log
   2128 ../../logs/spam-jm/split-4.log
   2126 ../../logs/spam-jm/split-5.log
   2126 ../../logs/spam-jm/split-6.log
   2126 ../../logs/spam-jm/split-7.log
   2126 ../../logs/spam-jm/split-8.log
   2125 ../../logs/spam-jm/split-9.log
  42297 total
}}}
looks fine. now run the 10pass master script.
{{{
nohup sh -x ./tenpass/10pass-run &
}}}
Results will appear in "tenpass_results" – over the course of 4 days.
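For readers unfamiliar with the scheme, the general shape of a 10-fold run is that each bucket takes one turn as the test set while the other nine are used for training. The loop below is only a generic sketch of that idea, not a transcription of 10pass-run, and the function name train_buckets is hypothetical.

```shell
# Print the nine training bucket names for a given test fold (1..10).
train_buckets() {
    test_id="$1"
    i=1
    while [ "$i" -le 10 ]; do
        if [ "$i" -ne "$test_id" ]; then
            printf 'split-%s.log\n' "$i"
        fi
        i=$((i + 1))
    done
}

# Walk over all ten folds; a real run would train and test at this point.
for fold in 1 2 3 4 5 6 7 8 9 10; do
    echo "fold $fold: test on split-$fold.log, train on $(train_buckets "$fold" | tr '\n' ' ')"
done
```

Every message is tested exactly once, and always against scores generated without it, which is why the bucket sizes and the ham:spam ratio were balanced beforehand.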
10-Fold Testing With The Perceptron Instead of GA
If all goes well, the Perceptron will take over from the GA as the main way we generate scores; in that case, this section will be obsolete.
copied ./tenpass/10pass-run to ./10pass-run-perceptron .
Changed these lines:
{{{
make clean >> make.output
make >> make.output 2>&1
./evolve
pwd; date
}}}
to
{{{
make clean >> make.output
make -C perceptron_c clean >> make.output
make tmp/tests.h >> make.output 2>&1
rm -rf perceptron_c/tmp; cp -r tmp perceptron_c/tmp
make -C perceptron_c >> make.output
( cd perceptron_c ; ./perceptron -p 0.75 -e 100 )
pwd; date
}}}
Change
{{{
cp craig-evolve.scores tenpass_results/scores.$id
}}}
to
{{{
perl -pe 's/^(score\s+\S+\s+)0\s+/$1/gs;' \
  < perceptron_c/perceptron.scores \
  > tenpass_results/scores.$id
}}}
(required to work around an extra digit output by the perceptron app: the substitution drops a standalone "0" field that follows the rule name on each "score" line) and run ./10pass-run-perceptron. This one completes a lot more quickly.