10-Fold Cross Validation
This is a log of what I did to run a 10-fold cross-validation test of the perceptron vs. the GA when testing bug 2910 ( http://bugzilla.spamassassin.org/show_bug.cgi?id=2910 ). – JustinMason 21/01/04

10-fold cross-validation (abbreviated "10FCV") is a system for testing trained classifiers. We use it in SpamAssassin development and QA.
[check it out:]
{{{
svn co https://svn.apache.org/repos/asf/incubator/spamassassin/trunk
cd trunk
perl Makefile.PL
make
cd masses
}}}
[also get pgapack and install as "masses/pgapack". I just scp'd in an already-built tree I had here.]
[and use the set-0 logs from the 2.60 GA run -- taken from the rsync repository:]
{{{
wc -l /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
210442 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
354479 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
}}}
[we want about 2k in each bucket, otherwise it'll take weeks to complete. Use split-log-into-buckets to juggle the log files in blocks of 10% to get the ratio and size to around 2k:2k.]
[ham buckets first:]
{{{
./tenpass/split-log-into-buckets 10 < /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
mv split-1.log new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
2104 split-1.log
}}}
[much better!]
{{{
mv split-*.log ../../logs/nonspam-jm/
./tenpass/split-log-into-buckets 10 < /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
mv split-1.log new
wc -l new
35437 new
}}}
[given this, we want 6 of the 10 logfiles to make 21264 lines, which would result in a roughly even ham:spam ratio for testing. let's do that.]
{{{
cat split-{1,2,3,4,5,6}.log > new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
2126 split-1.log
mv split-*.log ../../logs/spam-jm/
}}}
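The splitting step above can be sketched as a round-robin deal over input lines. This is a hypothetical stand-in for tenpass/split-log-into-buckets (the real script's behaviour may differ), using a made-up input.log for illustration:

```shell
# Hypothetical stand-in for tenpass/split-log-into-buckets: deal the
# lines of a mass-check log round-robin into N split-<n>.log buckets.
seq 1 25 > input.log          # made-up sample input (25 lines)
n=10
i=0
rm -f split-*.log
while IFS= read -r line; do
  bucket=$(( (i % n) + 1 ))
  printf '%s\n' "$line" >> "split-$bucket.log"
  i=$((i + 1))
done < input.log
wc -l split-1.log             # buckets differ in size by at most one line
```

With 25 input lines, buckets 1-5 end up with 3 lines each and buckets 6-10 with 2, which is why re-splitting a bucket (as above) is a cheap way to shrink the per-fold size by another factor of 10.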
[and doublecheck the log sizes:]
{{{
wc -l ../../logs/*/*.log
2104 ../../logs/nonspam-jm/split-1.log
2103 ../../logs/nonspam-jm/split-10.log
2106 ../../logs/nonspam-jm/split-2.log
2103 ../../logs/nonspam-jm/split-3.log
2102 ../../logs/nonspam-jm/split-4.log
2105 ../../logs/nonspam-jm/split-5.log
2102 ../../logs/nonspam-jm/split-6.log
2103 ../../logs/nonspam-jm/split-7.log
2103 ../../logs/nonspam-jm/split-8.log
2104 ../../logs/nonspam-jm/split-9.log
2126 ../../logs/spam-jm/split-1.log
2127 ../../logs/spam-jm/split-10.log
2126 ../../logs/spam-jm/split-2.log
2126 ../../logs/spam-jm/split-3.log
2128 ../../logs/spam-jm/split-4.log
2126 ../../logs/spam-jm/split-5.log
2126 ../../logs/spam-jm/split-6.log
2126 ../../logs/spam-jm/split-7.log
2126 ../../logs/spam-jm/split-8.log
2125 ../../logs/spam-jm/split-9.log
42297 total
}}}
[looks fine. now run the 10pass master script.]
{{{
nohup sh -x ./tenpass/10pass-run &
}}}
Results will appear in "tenpass_results" – over the course of 4 days.
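In outline, the master script makes one pass per fold and saves one scores file per pass. A minimal sketch of that loop, with placeholder commands standing in for the real build/train/evaluate steps in tenpass/10pass-run:

```shell
# Sketch of the ten-pass driver loop (placeholder body; the real
# tenpass/10pass-run rebuilds and trains the classifier on each pass).
mkdir -p tenpass_results
for id in 1 2 3 4 5 6 7 8 9 10; do
  # real script: rotate the logs so fold $id is held out, then roughly:
  #   make; ./evolve; cp craig-evolve.scores tenpass_results/scores.$id
  echo "placeholder scores for pass $id" > "tenpass_results/scores.$id"
done
ls tenpass_results
```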
THE PERCEPTRON:
Copied ./tenpass/10pass-run to ./10pass-run-perceptron and changed these lines:
{{{
make clean >> make.output
make >> make.output 2>&1
./evolve
pwd; date
}}}
to
{{{
make clean >> make.output
make -C perceptron_c clean >> make.output
make tmp/tests.h >> make.output 2>&1
rm -rf perceptron_c/tmp; cp -r tmp perceptron_c/tmp
make -C perceptron_c >> make.output
( cd perceptron_c ; ./perceptron )
pwd; date
}}}
Change
{{{
cp craig-evolve.scores tenpass_results/scores.$id
}}}
to
{{{
cp perceptron_c/perceptron.scores tenpass_results/scores.$id
}}}
Then run ./10pass-run-perceptron . This one runs quicker.
The comp.ai.neural-nets FAQ covers it well, in http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html :
{{{
Cross-validation
++++++++++++++++
In k-fold cross-validation, you divide the data into k subsets of
(approximately) equal size. You train the net k times, each time leaving
out one of the subsets from training, but using only the omitted subset to
compute whatever error criterion interests you. If k equals the sample
size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a
more elaborate and expensive version of cross-validation that involves
leaving out all possible subsets of v cases.
}}}
In other words, take a testing corpus, divided into ham and spam; each message has previously been hand-verified as being of the correct type (e.g. ham if it's in the ham corpus, spam if in the other one). Divide each corpus into k folds. (In SpamAssassin, we generally use k=10, which is what pretty much everyone else does anyway; it just seems to work well.) Then run these 10 tests:
{{{
Train classifier on folds: 2 3 4 5 6 7 8 9 10; Test against fold: 1
Train classifier on folds: 1 3 4 5 6 7 8 9 10; Test against fold: 2
Train classifier on folds: 1 2 4 5 6 7 8 9 10; Test against fold: 3
Train classifier on folds: 1 2 3 5 6 7 8 9 10; Test against fold: 4
Train classifier on folds: 1 2 3 4 6 7 8 9 10; Test against fold: 5
Train classifier on folds: 1 2 3 4 5 7 8 9 10; Test against fold: 6
Train classifier on folds: 1 2 3 4 5 6 8 9 10; Test against fold: 7
Train classifier on folds: 1 2 3 4 5 6 7 9 10; Test against fold: 8
Train classifier on folds: 1 2 3 4 5 6 7 8 10; Test against fold: 9
Train classifier on folds: 1 2 3 4 5 6 7 8 9; Test against fold: 10
}}}
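The schedule above can be generated mechanically. A small sketch in plain shell (assuming seq is available):

```shell
# Print the 10FCV schedule: for each test fold i, train on all the rest.
k=10
schedule=""
for i in $(seq 1 $k); do
  train=""
  for j in $(seq 1 $k); do
    [ "$j" -ne "$i" ] && train="$train $j"
  done
  line="Train classifier on folds:$train; Test against fold: $i"
  echo "$line"
  schedule="$schedule$line
"
done
```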
We use 10FCV to test:
- new tweaks to the "Bayesian" learning classifier (the BAYES_* rules)
- new tweaks to the rescoring system (which is also a learning classifier, just at a higher level).
Traditionally, k-fold cross-validation uses "train on k-1 folds, test on 1 fold"; we use that for testing our rescoring system. For the BAYES rules, however, we use "train on 1 fold, test on k-1 folds": that classifier is very accurate when sufficiently trained, so otherwise it can be hard to get a meaningful number of false positives and false negatives with which to distinguish improvements in accuracy.
So, for example, see RescoreTenFcv for a log of a sample 10-fold CV run against two SpamAssassin rescoring systems (the GA and the perceptron).