Like a Bayesian learning system, SpamAssassin's GeneticAlgorithm requires a corpus of hand-classified mail. Our guidelines are (quoting "masses/CORPUS_POLICY"):
- hand-verified as "spam" and "ham" (non-spam) piles – *not* just classified using existing spam-classification algorithms (such as SpamAssassin itself). Note that it's fine to use SpamAssassin to pre-filter them into the right piles, just make sure you scan the results "by hand" afterwards to verify that SpamAssassin made the correct diagnosis in each case.
- containing a representative mix of ham mail – that includes commercial-sounding-but-not-spam messages, legitimate business discussion (which may include talk of "sales", "marketing", "offers" etc), or verified opt-in mail newsletters. This is a *very* important point!
- containing no old spam mail. Older spam uses different tricks and terminology, which will impact SpamAssassin's accuracy when it's filtering "live", new mail. Please try not to scan spam older than 6 months.
- cleaned of viruses, and forwarded spam messages. These will skew the results.
- and finally, cleaned of discussion of spam or virus messages or signatures (such as SpamAssassin-talk or bugtraq mailing list messages). Even though they are ham, these often contain snippets of code that incorrectly trigger tests, and again will skew the results. (Rewriting the tests to avoid triggering on SpamAssassin-talk messages is not realistic!)
Once you run "mass-check" on a corpus (MassCheck), see the instructions in "CORPUS_SUBMIT" for details of how to verify that the top scorers are not accidental spam that got through.