10-Fold Cross Validation
This is a log of what I did to run a 10-fold cross-validation test of the perceptron vs. the GA while investigating bug 2910, http://bugzilla.spamassassin.org/show_bug.cgi?id=2910 (-- JustinMason 21/01/04).
First, I checked out the source:
No Format |
---|
svn co https://svn.apache.org/repos/asf/incubator/spamassassin/trunk
cd trunk
perl Makefile.PL
make
cd masses
|
Next, get pgapack and install it as "masses/pgapack"; I just scp'd in an already-built tree I had here.
use the set-0 logs from the 2.60 GA run – taken from the rsync repository:
No Format |
---|
wc -l /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log \
/home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
210442 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
354479 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
|
we want about 2k lines in each bucket, otherwise it'll take weeks to complete. use ./tenpass/split-log-into-buckets (see SplitLogsIntoBuckets) to juggle the log files in blocks of 10%, getting the ham:spam ratio and bucket sizes to around 2k:2k.
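For reference, the effect of split-log-into-buckets can be sketched as a simple round-robin split. This is an illustration only, not the real script (see SplitLogsIntoBuckets for its actual behaviour, e.g. how it treats mass-check comment lines):

```shell
# illustrative sketch only -- NOT the real tenpass/split-log-into-buckets.
# deal input lines round-robin into N files named split-1.log .. split-N.log
split_into_buckets () {
    n="$1"
    awk -v n="$n" '{ print > ("split-" ((NR - 1) % n + 1) ".log") }'
}

# e.g. split a mass-check log into 10 roughly equal buckets:
# split_into_buckets 10 < ham-set0.log
```

Round-robin (rather than contiguous chunks) keeps each bucket's message mix similar, which is what you want for cross-validation folds.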
ham buckets first:
No Format |
---|
./tenpass/split-log-into-buckets 10 \
< /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
mv split-1.log new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
2104 split-1.log
|
much better!
No Format |
---|
mv split-*.log ../../logs/nonspam-jm/
./tenpass/split-log-into-buckets 10 \
< /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
mv split-1.log new
wc -l new
35437 new
|
given this, split new into 10 once more; 6 of those 10 logfiles make about 21264 lines, which would result in a roughly even ham:spam ratio for testing. let's do that.
No Format |
---|
./tenpass/split-log-into-buckets 10 < new
cat split-{1,2,3,4,5,6}.log > new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
2126 split-1.log
|
perfect!
No Format |
---|
mv split-*.log ../../logs/spam-jm/
|
and double-check the log sizes:
No Format |
---|
wc -l ../../logs/*/*.log
2104 ../../logs/nonspam-jm/split-1.log
2103 ../../logs/nonspam-jm/split-10.log
2106 ../../logs/nonspam-jm/split-2.log
2103 ../../logs/nonspam-jm/split-3.log
2102 ../../logs/nonspam-jm/split-4.log
2105 ../../logs/nonspam-jm/split-5.log
2102 ../../logs/nonspam-jm/split-6.log
2103 ../../logs/nonspam-jm/split-7.log
2103 ../../logs/nonspam-jm/split-8.log
2104 ../../logs/nonspam-jm/split-9.log
2126 ../../logs/spam-jm/split-1.log
2127 ../../logs/spam-jm/split-10.log
2126 ../../logs/spam-jm/split-2.log
2126 ../../logs/spam-jm/split-3.log
2128 ../../logs/spam-jm/split-4.log
2126 ../../logs/spam-jm/split-5.log
2126 ../../logs/spam-jm/split-6.log
2126 ../../logs/spam-jm/split-7.log
2126 ../../logs/spam-jm/split-8.log
2125 ../../logs/spam-jm/split-9.log
42297 total
|
looks fine.
Next step was to ensure the scores in ../rules are usable for the GA. I did this by grepping out all the rules that had scores of 0 in the tested score-set (score-set 0), and creating a new file called ../rules/99_ga_workaround_scores.cf containing lines like so:
No Format |
---|
score NAME_OF_RULE_1 0.0001
score NAME_OF_RULE_2 0.0001
score NAME_OF_RULE_3 0.0001
...
|
This is necessary because the mass-check log files were generated with the same ruleset, but some rules that were enabled then may since have been given a zero score (i.e. disabled). Giving every such rule a tiny nonzero score re-enables them all, so the GA will consider each one as a candidate for evolution.
now run the 10pass master script.
No Format |
---|
nohup sh -x ./tenpass/10pass-run &
|
Results will appear in "tenpass_results" over the course of about 4 days.
These will be:
- scores.{1 .. 10}: scores and GA accuracy ratings output by GA
- {ham,spam}.log.{1 .. 10}: validation log files for that set of scores
To perform the validation step, run
No Format |
---|
./tenpass/10pass-compute-tcr
|
This will compute an accuracy rating, using those scores and those validation log files, for the 10 folds. Output looks like:
No Format |
---|
# TCR: 14.173333 SpamRecall: 96.002% SpamPrec: 99.367% FP: 0.31% FN: 2.01%
# TCR: 13.986842 SpamRecall: 96.143% SpamPrec: 99.320% FP: 0.33% FN: 1.94%
# TCR: 15.865672 SpamRecall: 95.579% SpamPrec: 99.608% FP: 0.19% FN: 2.22%
# TCR: 14.173333 SpamRecall: 95.532% SpamPrec: 99.461% FP: 0.26% FN: 2.25%
# TCR: 15.748148 SpamRecall: 95.532% SpamPrec: 99.608% FP: 0.19% FN: 2.25%
# TCR: 12.807229 SpamRecall: 95.014% SpamPrec: 99.409% FP: 0.28% FN: 2.51%
# TCR: 14.561644 SpamRecall: 94.779% SpamPrec: 99.654% FP: 0.17% FN: 2.63%
# TCR: 12.432749 SpamRecall: 94.309% SpamPrec: 99.504% FP: 0.24% FN: 2.86%
# TCR: 14.358108 SpamRecall: 95.859% SpamPrec: 99.414% FP: 0.28% FN: 2.08%
# TCR: 18.318966 SpamRecall: 95.953% SpamPrec: 99.707% FP: 0.14% FN: 2.03%
|
These figures can be compared with other 10FCV runs; they're a good measurement of training accuracy. In other words, they're what you came for.
10-Fold Testing With The Perceptron Instead of GA
If all goes well, the Perceptron will take over from the GA as the main way we generate scores; in that case, this section will be obsolete.
copied ./tenpass/10pass-run to ./10pass-run-perceptron.
Changed these lines:
No Format |
---|
make clean >> make.output
make >> make.output 2>&1
./evolve
pwd; date
|
to
No Format |
---|
make clean >> make.output
make -C perceptron_c clean >> make.output
make tmp/tests.h >> make.output 2>&1
rm -rf perceptron_c/tmp; cp -r tmp perceptron_c/tmp
make -C perceptron_c >> make.output
( cd perceptron_c ; ./perceptron -p 0.75 -e 100 )
pwd; date
|
And changed
No Format |
---|
cp craig-evolve.scores tenpass_results/scores.$id
|
to
No Format |
---|
perl -pe 's/^(score\s+\S+\s+)0\s+/$1/gs;' \
< perceptron_c/perceptron.scores \
> tenpass_results/scores.$id
|
(The substitution is required to work around an extra digit output by the perceptron app.) Now run ./10pass-run-perceptron. This one completes a lot more quickly.
What 10-Fold Cross Validation Is
10-fold cross-validation (abbreviated "10FCV") is a system for testing trained classifiers. We use it in SpamAssassin development and QA.
The comp.ai.neural-nets FAQ covers it well, in http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html :
No Format |
---|
Cross-validation
++++++++++++++++
In k-fold cross-validation, you divide the data into k subsets of
(approximately) equal size. You train the net k times, each time leaving
out one of the subsets from training, but using only the omitted subset to
compute whatever error criterion interests you. If k equals the sample
size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a
more elaborate and expensive version of cross-validation that involves
leaving out all possible subsets of v cases.
|
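In terms of the split-*.log buckets used earlier, the k-fold loop looks roughly like this. The train/evaluate steps are placeholders; the real 10pass-run script differs in detail, but this shows the shape:

```shell
# generic k-fold driver sketch: for each fold i, concatenate the other
# k-1 buckets into train.$i.log, leaving split-$i.log as the held-out
# validation set.  (the actual 10pass-run differs in detail.)
make_folds () {
    k="$1"
    for i in $(seq 1 "$k"); do
        cat $(ls split-*.log | grep -v "^split-$i\.log\$") > "train.$i.log"
        # ...here the trainer would run on train.$i.log, and the
        #    resulting scores would be evaluated against split-$i.log...
    done
}

# usage, in masses/ after the bucket-splitting above:
# make_folds 10
```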
In SpamAssassin, we generally use k=10 – as pretty much everyone else does anyway – and we use 10FCV to test:
- new tweaks to the "Bayesian" learning classifier (the BAYES_* rules)
- new tweaks to the rescoring system (which is also a learning classifier, just at a higher level).
Traditionally, k-fold cross-validation uses a "train on k-1 folds, test on 1 fold" pattern; we use that for testing our rescoring system. For the BAYES rules, however, we invert it to "train on 1 fold, test on k-1 folds": that classifier is very accurate once sufficiently trained, so otherwise it is hard to collect enough false positives and false negatives to distinguish improvements in accuracy.
See RescoreTenFcv for a log of a sample 10-fold CV run against two SpamAssassin rescoring systems (the GA and the perceptron).