...
- by default, TCP allows sender to control rate of flow; sender can achieve highest speed permitted by network
- TCP damping degrades the sender's efficiency from the receiver side: the spammer spends more time, more bandwidth, more CPU cycles
- low pain for recipients, high aggregated pain to spammers
- this needs to happen at the TCP layer; layers above and below aren't useful
- even with tarproxy or similar, a smart spammer can blast the entire message to your TCP layer in one blat, even if you're tarpitting at the application layer
- damping techniques: increase sending time (delay TCP packets); consume network bandwidth (force the sender to transmit more packets)
- increase delay: advertise a zero receive window (adv_win = 0); fake congestion; delay outgoing ACKs (the TCP connection terminates after 14 retries). cost at receiver: a long-idle TCP connection
- increase bandwidth costs: request more retransmissions; request more ACKs (reuse sequence numbers, or use sequence numbers that won't occur in this connection); send packets in reverse order. cost: roughly 1:1, the receiver spends about as much bandwidth as the sender
- used SpamAssassin at delivery time to estimate spamminess! mostly header rules during the early SMTP conversation, but body rules can run before returning "250 Message Accepted for Delivery"
- q: economics. "this increases the sender's costs, but isn't a transfer to the recipient." a: there are no existing techniques to do that transfer, and TCP damping must work within the existing system
- q: if I were a spammer and figured out you were TCP damping, I'd ignore your advertised windows and blat the entire message, hurting the network overall. a: sure, but imposing that bandwidth cost on the spammer is worth it
- q: but this encourages broken TCP implementations. a: a broken TCP stack still won't get its spam delivered
- comment from John Levine: TurnTide does exactly this, narrowing the TCP window on spammers' connections
- q: why not just use delayed ACKs? a: delayed ACKs alone aren't as effective as the other techniques
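The "14 retries" point above implies a substantial sender-side stall. A back-of-envelope model (my own, not from the talk) of how long a damped connection ties up the sender, assuming a 1 s initial retransmission timeout that doubles per retry and is capped at 120 s (common stack defaults; the exact values vary by implementation):

```python
def stall_seconds(retries=14, initial_rto=1.0, cap=120.0):
    """Total time a sender waits before giving up, if every retransmission
    is ignored (or met with a zero window) by the damping receiver.
    RTO doubles on each retry, capped at `cap` seconds."""
    return sum(min(initial_rto * 2**k, cap) for k in range(retries))

print(stall_seconds())  # → 967.0, i.e. about 16 minutes of sender-side delay
```

Under these assumed defaults, each damped connection costs the spammer on the order of a quarter hour, while the receiver only holds an idle socket.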
AOL hashing:
- I-Match: large corpus; lexicon generation
- intersection of document and lexicon gives signature
- traditional I-Match lexicon generation: reject very frequent terms and hapaxes (terms occurring only once)
- use "Mutual Information" as a measurement of fitness to avoid overlapping rules
- use multiple lexicons, so a spammer's randomization can't perturb every signature at once
- generate multiple lexicons, by removing random entries from an original lexicon
- also: distributional word clustering (Information Bottleneck) for lexicon selection (groups terms with similar P(spam|term) class distributions)
- q: is the "cluster" selection based on reports from live users? a: yep
- q: "FP rate?" a: very very low
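The I-Match scheme above (signature = hash of the document/lexicon intersection, hardened with multiple derived lexicons) can be sketched roughly as follows. The toy lexicon, parameters, and function names are my own illustration; AOL's real lexicons come from a large corpus with MI-based term selection:

```python
import hashlib
import random

def imatch_signature(doc_terms, lexicon):
    """I-Match signature: hash of the sorted intersection of the
    document's terms with the lexicon."""
    shared = sorted(set(doc_terms) & lexicon)
    return hashlib.sha1(" ".join(shared).encode()).hexdigest()

def perturbed_lexicons(lexicon, n=3, drop_frac=0.1, seed=0):
    """Derive extra lexicons by randomly removing entries from the
    original, so a spammer's random token injection can't shift
    every signature at once."""
    rng = random.Random(seed)
    return [{t for t in sorted(lexicon) if rng.random() > drop_frac}
            for _ in range(n)]

lexicon = {"mortgage", "refinance", "rates", "approval", "credit"}
doc = "low rates instant approval on your mortgage xkqz7".split()
sigs = [imatch_signature(doc, lex)
        for lex in [lexicon, *perturbed_lexicons(lexicon)]]
```

Two spam variants that differ only in random padding (tokens outside the lexicon, like "xkqz7") still collide on these signatures, which is the point of the lexicon intersection.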
Distributed, collaborative spam filtering:
- TCD, yay
- definition: "spam is email that the recipient is not interested in receiving". we disagree, of course
- P2P approach
Reputation network analysis for mail filtering:
- 75% of semweb data is FOAF files
- using web of trust
- a bit like http://web-o-trust.org/ , but not yet workable with email addresses, since there's no spoofing protection
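A web-of-trust score for a sender might look something like the toy sketch below (entirely my own illustration; the FOAF-based system presented is more involved): trust propagates outward from the recipient and decays with each hop.

```python
from collections import deque

def trust_score(graph, me, sender, decay=0.5):
    """Breadth-first trust propagation over a friend graph: direct
    friends score `decay`, friends-of-friends `decay`**2, and so on.
    Unknown senders score 0."""
    if sender == me:
        return 1.0
    seen = {me}
    frontier = deque([(me, 1.0)])
    while frontier:
        node, score = frontier.popleft()
        for friend in graph.get(node, ()):
            if friend == sender:
                return score * decay
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, score * decay))
    return 0.0

foaf = {"me": ["alice"], "alice": ["bob"], "bob": ["carol"]}
```

With this graph, alice scores 0.5, bob 0.25, and a stranger 0.0; the mail filter would then whitelist or weight senders by that score. The spoofing caveat above applies: without sender authentication, a spammer can claim any well-trusted address.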
On attacking statistical spam filters:
- spammers want to evade Bayesian filters
- tokenization attacks and obfuscation: these turn out to be good spamsigns themselves
- in my opinion, they should not have used SpamArchive spam, since it lacks headers; headers improve spam recognition greatly
- pretty similar to http://www.cs.dal.ca/research/techreports/2004/CS-2004-06.pdf
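The "obfuscation is itself a spamsign" point can be illustrated with a tokenizer that emits a synthetic feature whenever it spots an evasion trick. The regexes and feature names below are my own toy examples, assuming a very simple notion of obfuscation; real filters use much richer tokenizers:

```python
import re

# Each pattern maps an obfuscation trick to a synthetic token that the
# statistical filter can learn as evidence of spam.
OBFUSCATION = [
    (re.compile(r"\b\w*\d\w*[a-z]\w*\b", re.I), "SIGN:digit_in_word"),   # e.g. v1agra
    (re.compile(r"\b(?:\w[^\w\s]){3,}\w\b"), "SIGN:interleaved_punct"),  # e.g. v.i.a.g.r.a
]

def tokenize_with_spamsigns(text):
    """Ordinary whitespace tokenization, plus synthetic spamsign tokens:
    the evasion trick becomes evidence against the message."""
    tokens = text.lower().split()
    for pattern, sign in OBFUSCATION:
        if pattern.search(text):
            tokens.append(sign)
    return tokens
```

A message like "cheap v1agra now" yields the SIGN:digit_in_word token, so the obfuscation intended to dodge the filter ends up strengthening the spam verdict instead.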