
...

  • Code and corpus tests for ramping up the probability of previously unseen tokens. This could be done either heuristically or by keeping real counts of unseen tokens in the Bayes token database. The idea is that words that have never been learned before get high probabilities.
  • Custom database file and code for faster performance and space savings (this should probably be compared against qdbm and tdb, which currently look like the most promising non-custom databases).
  • Bi-grams: that is, multi-word windowing as used in CRM-114, using two-word tokens (or possibly longer ones). It is not clear this will provide much higher accuracy now that spammers are using whole-phrase Bayes poisoning, though. (JustinMason)
  • Implementing Dobly noise-reduction - bug 3078.
  • Dynamically determining the autolearning thresholds based on incoming email rather than using hard-coded numbers. See bug 1829 for more.
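
To make the bi-gram idea above concrete, here is a minimal sketch of multi-word windowing. It is an illustration only: the function name `bigram_tokens`, the `window` parameter, and the whitespace-based word splitting are all assumptions made for this example, and it shows plain adjacent-word windowing rather than the exact scheme CRM-114 uses or any existing tokenizer code.

```python
def bigram_tokens(text, window=2):
    """Return single-word tokens plus joined tokens for each run of
    up to `window` adjacent words (hypothetical helper for illustration)."""
    # Assumption: lowercase, whitespace-separated words stand in for a real tokenizer.
    words = text.lower().split()
    tokens = []
    for i in range(len(words)):
        # Emit the word itself (n = 1), then each longer run starting at it.
        for n in range(1, window + 1):
            if i + n <= len(words):
                tokens.append(" ".join(words[i:i + n]))
    return tokens


if __name__ == "__main__":
    for token in bigram_tokens("Buy cheap pills now"):
        print(token)
```

With `window=2`, each message yields roughly twice as many tokens as single-word tokenizing, so the token database grows accordingly; that cost would need to be weighed against any accuracy gain.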

Other ideas

  • Translation: translating the rule descriptions, the manual, and the website into other languages.
  • Feedback button: a client-side button that gives users one-touch feedback to recategorize a message (correcting a false positive or false negative). Relevant page from the Anti-Spam Research Group (part of a sister organization to the one that creates RFCs).

...