

  • Lessig-influenced presentation style – "law professors get paid based on the number of slides produced"
  • CAN-SPAM removes individual rights to sue spammers, one thing that's really been effective so far. not good news
  • proposes a new meta tag: no-email-collection
  • cute cat photo!
  • missing a huge chunk of the spam pipeline; we're focussing on the proxy-to-recipient part of the chain. focus on the address scraping!
  • definition of spam: hard. but harvesting email addresses: everyone agrees that's a no-no
  • <meta name="no-email-collection" terms="[url of terms page]" />

  • project honeypot: like my script – email cookie addresses for scrapers (wink)
  • create subdomains for honeypots and point them at Unspam's server! they'll collect the data
  • generating a public corpus!
  • code licensed under GPL
  • q: "scraping through zombies?" a: yes, but that'll increase potential costs for spammers (hmm)
  • q: "two classes of spammer lists: resold email addrs as well as scraped"? yes, but getting one class is useful
  • q: "can a meta tag be enforceable? is a clickwrap license legally viable when it occurs between two computers?" a: if it's a community norm, that can improve legal viability; also CAN-SPAM specifically forbids scraping; also it may cause the spammer to think twice about this
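
The "email cookie addresses for scrapers" idea above can be sketched roughly as follows. This is only an illustration, not Project Honeypot's actual code: the secret key, the trap.example.com subdomain, and both function names are hypothetical. Each page view gets a unique address that encodes the visitor's IPv4 address and the time, so any spam later sent to that address can be traced back to the scrape.

```python
import hashlib
import hmac
import ipaddress
import struct
import time

SECRET_KEY = b"change-me"             # hypothetical site secret
HONEYPOT_DOMAIN = "trap.example.com"  # hypothetical honeypot subdomain


def make_trap_address(visitor_ip, when=None):
    """Return a unique address encoding who was shown it and when."""
    timestamp = int(time.time() if when is None else when)
    # 4 bytes of IPv4 address + 4 bytes of timestamp keeps the local part short.
    payload = ipaddress.IPv4Address(visitor_ip).packed + struct.pack("!I", timestamp)
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()[:12]
    return f"{payload.hex()}.{tag}@{HONEYPOT_DOMAIN}"


def trace_trap_address(address):
    """Return (visitor_ip, timestamp) for a genuine trap address, else None."""
    local_part, _, domain = address.partition("@")
    payload_hex, _, tag = local_part.partition(".")
    if domain != HONEYPOT_DOMAIN:
        return None
    try:
        payload = bytes.fromhex(payload_hex)
    except ValueError:
        return None
    if len(payload) != 8:
        return None
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()[:12]
    # The keyed tag stops a spammer from forging addresses that blame someone else.
    if not hmac.compare_digest(tag.encode(), expected.encode()):
        return None
    visitor_ip = str(ipaddress.IPv4Address(payload[:4]))
    (timestamp,) = struct.unpack("!I", payload[4:])
    return visitor_ip, timestamp
```

A page would embed make_trap_address(request_ip) somewhere a scraper will find it; when mail arrives at the honeypot, trace_trap_address(recipient) says which IP harvested the address and when.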


  • by default, TCP allows sender to control rate of flow; sender can achieve highest speed permitted by network
  • TCP damping tries to reduce net efficiency at the receiver side; more time, more bandwidth, more CPU cycles
  • low pain for recipients, high aggregated pain to spammers
  • need to do this at TCP layer; higher and lower aren't useful
  • even with tarproxy or similar, a smart spammer can blast the entire message to your TCP layer in one blat, even if you're tarpitting at the application layer
  • damping: increase sending time (delaying TCP packets); consume network bandwidth (request more packets)
  • increase delay: set adv_win = 0; fake congestion; delay outgoing ACKs (TCP conn terminates after 14 retries). cost at receiver: long idle TCP conn
  • increase bandwidth costs: request more retrans.; request more ACKs – reuse sequence numbers, use seqs that won't be used in this conn; send packets in reverse order. cost: about 1:1 ratio
  • used SpamAssassin at delivery time to estimate spamminess! mostly headers during early SMTP conversation, but you can use body rules before "250 Message Accepted for Delivery"
  • q: economics. "increases sender's costs, but not a transfer to the recipient." a: there are no existing techniques to do this, and TCP damping must work within the existing system.
  • q: if I was a spammer, and I figured out you were TCP damping, I'd ignore your advertised windows and blat entire message, hurting the network overall. a: sure, but hurting the spammer's bandwidth like this is worth it
  • q: but this encourages broken TCP implementations. a: but a broken TCP stack still won't get their spam delivered
  • q from John Levine: TurnTide does exactly this technique by narrowing the TCP window on the spammer's connections.
  • q: why not just use delayed ACKs? a: because it's not entirely as effective as the other techniques
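
To see why shrinking the advertised window hurts the sender, here is a back-of-the-envelope model. It is not from the talk; it assumes the sender obeys the advertised window and can have at most one window of data in flight per round trip, and the function name and parameters are made up for illustration.

```python
import math


def min_transfer_seconds(message_bytes, window_bytes, rtt_seconds):
    """Lower bound on sending time with one window in flight per round trip."""
    if window_bytes <= 0:
        # adv_win = 0: the sender stalls until the receiver reopens the window.
        return math.inf
    round_trips = math.ceil(message_bytes / window_bytes)
    return round_trips * rtt_seconds


if __name__ == "__main__":
    for window in (65535, 512, 0):
        seconds = min_transfer_seconds(20000, window, 0.25)
        print(f"window={window:>5} bytes -> at least {seconds} s")
```

Under these assumptions a 20,000-byte message needs one round trip with a 65,535-byte window but 40 round trips with a 512-byte window, while the receiver's only cost is holding a mostly idle connection open.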

AOL hashing:

  • I-Match: large corpus; lexicon generation
  • intersection of document and lexicon gives signature
  • trad I-Match lexicon generation: reject v frequent and hapaxes
  • use "Mutual Information" as a measurement of fitness to avoid overlapping rules
  • use multiple lexicons to avoid randomization from having an effect
  • generate multiple lexicons, by removing random entries from an original lexicon
  • also: distributional word clustering (Information Bottleneck) for lexicon selection (Terms with similar class distribution of P(spam|term))
  • q: "'cluster' selection" – is that reports from live users? yep
  • q: "FP rate?" a: very very low
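
A minimal sketch of the I-Match idea as noted above: the signature is a hash of the words a message shares with the lexicon, and several randomly thinned lexicons give several signatures per message. The tokenizer, the use of SHA-1, the helper names, and the "any signature matches" rule are assumptions for illustration, not details taken from the talk.

```python
import hashlib
import random
import re


def tokenize(text):
    """Lowercase the text and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def imatch_signature(text, lexicon):
    """Hash the sorted intersection of the document's tokens and the lexicon."""
    kept = sorted(tokenize(text) & lexicon)
    if not kept:
        return None  # nothing in common with the lexicon: no usable signature
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def make_lexicons(base_lexicon, count, drop_fraction, seed=0):
    """Return the base lexicon plus count - 1 copies with random entries removed."""
    rng = random.Random(seed)
    words = sorted(base_lexicon)
    keep = min(len(words), max(1, int(len(words) * (1 - drop_fraction))))
    lexicons = [set(words)]
    for _ in range(count - 1):
        lexicons.append(set(rng.sample(words, keep)))
    return lexicons


def is_near_duplicate(text_a, text_b, lexicons):
    """Assumed rule: two messages match if any per-lexicon signature agrees."""
    for lexicon in lexicons:
        sig_a = imatch_signature(text_a, lexicon)
        if sig_a is not None and sig_a == imatch_signature(text_b, lexicon):
            return True
    return False
```

Random filler words that fall outside the lexicon do not change the signature, and a filler word that happens to hit one lexicon may still miss a thinned copy, which is the point of using several.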

Distributed, collaborative spam filtering:

  • TCD, yay
  • definition: "spam is email that the recipient is not interested in receiving". we disagree, of course (wink)
  • P2P approach

Reputation network analysis for mail filtering:

  • 75% of semweb data is FOAF files
  • using web of trust
  • a bit like , but not yet workable with email addrs since there's no spoofing protection

On attacking statistical spam filters:

  • spammers wanted to evade bayes
  • tokenization/obfuscation: turn out to be good spamsigns
  • should not have used SpamArchive spam, due to its lack of headers, in my opinion; headers improve spam recognition greatly
  • (correction: SpamArchive spam now does include headers, I missed that change – so that's not a big deal. Also, from talking to one author post-talk, he noted that they omitted the hdrs since the spam and ham each came from a different corpus, therefore a different set of hosts. If not ignored, those tokens would have been very obvious clues for the classifier.)
  • pretty similar to (wink)