Comment: [Original edit by JustinMason]

...

  • Code and corpus tests for ramping up the probability of previously unseen tokens. This could be done either heuristically or by keeping real counts of unseen tokens in the Bayes token database. The idea is that words that have never been learned before get high probabilities.
  • Custom database file and code for faster performance and space savings (probably to be compared against qdbm and tdb, since they currently look like the most promising non-custom databases).
  • Bi-grams: that is, multi-word windowing as used in CRM-114, using two-word tokens (or possibly longer ones). Not sure this will provide much higher accuracy now that spammers are using whole-phrase Bayes poisoning, though. (JustinMason)
  • Implementing Dobly noise-reduction - [http://bugzilla.spamassassin.org/show_bug.cgi?id=3078 bug 3078].
  • Dynamically determining the autolearning thresholds based on incoming email rather than using hard-coded numbers. See [http://bugzilla.spamassassin.org/show_bug.cgi?id=1829 bug 1829] for more.
  • Looking for specific header tokens when they change location between the original message and the reply. See [http://bugzilla.spamassassin.org/show_bug.cgi?id=2129 bug 2129] for more.
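
To make the bi-gram idea in the list above concrete, here is a minimal sketch of two-word windowing. It is only an illustration: the whitespace tokenizer and the names `tokenize` and `with_bigrams` are hypothetical, and this is neither CRM-114's actual windowing scheme nor SpamAssassin's tokenizer.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()


def with_bigrams(tokens, window=2):
    """Return the single-word tokens plus every run of `window` adjacent
    tokens joined by a space, in message order."""
    out = list(tokens)
    # Slide a fixed-size window over the token list; a list shorter than
    # the window yields no multi-word tokens.
    for i in range(len(tokens) - window + 1):
        out.append(" ".join(tokens[i:i + window]))
    return out


tokens = with_bigrams(tokenize("Buy cheap pills now"))
```

Each multi-word token would then be learned and scored like any other token, which is why the token database grows with the window size. Raising `window` gives longer phrases at the cost of more, and rarer, tokens.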

...