
Social network talk:

  • pretty useless spam-filtering-wise, at least; no spam angle at all

...

  • corpus analysis, from Hotmail's feedback loop
    • volunteers classify random samples of their mail as spam or good; tens of thousands of hand-classified messages per day; large "unbiased" (???) sample of spam
  • additional analysis on two sets of spam:
    • about a year between the two
    • products sold, exploits used, trends
  • viagra-type spam: 17% in 2003, up to 34% in 2004
  • graphic porn down: from 13% to 7%
  • exploits: increasing rapidly, from 1.33 in 2003 to 1.73 in 2004
  • word obscuring: up to 20% in 2004
  • URL chaffing, adding good URLs to spam: not there in 2003, 10% in 2004 – anti-SURBL attack (wink)
  • Spammers are putting more work into each spam

Introducing the Enron Corpus:

  • 1.3 million messages originally; removed msgs with "integrity problems", replaced usernames etc.
  • http://www-2.cs.cmu.edu/~enron
  • 200,399 useful, non-dupe messages
  • 158 users, 1,268 msgs/user on average
  • missing message headers, so not much use for spam filtering; Exchange-mangled; no HTML. still, maybe good for "body" rules and FP avoidance
  • no mention how much of the corpus was spam (wink)

Larry Lessig:

  • extraordinary amount going to tech fixes; very little going to how the law could address it
  • compares govt attention to "pirate radio" creating static for large commercial stations, vs the spam problem
  • multiple types of regulators: the law, social norms, the market, and architecture (example: windows in lecture theatre are closed to enforce paying attention to speakers)
  • the law also regulates the other three
  • (that was the wrong talk! starts again!)
  • 1. "regulation is always multiple modalities"
  • 2. "interests will react"
  • 3. "special interests defeat general interests"
  • in the old days, we had norms to defeat spam; that failed
  • using code to fix; so far that's failed
  • "the market will fix the problem"; ISPs trying to be the spam-free email provider; that's also failed
  • CAN-SPAM: totally failed – even displaced effective state legislation
  • not any single modality alone can fix it
  • regulation is a restriction, plus somebody to enforce it
  • CAN-SPAM: wanted truthful headers
  • opt-out doesn't provide any way for you to know if you've really been opted-out
  • enforcement: state AGs, ISPs, federal - centralised; too big though. they have better things to do with their time than bust spammers
  • solution: marries legal/architectural/market
  • legal: has two parts: (1) labels ("ADV" in the subject line)
  • (2) a bounty
  • (q: SEXUALLY-EXPLICIT tag is a label, already massively flouted by spammers. other labels would be flouted just as much.)
  • architecture: filter code then blocks mails with "ADV"
  • market: spammers would then have to incentivise people to receive their mail by sending offers they actually want (yeah right (wink))
  • enforcement: spam will only be sent if you can be paid, so "follow the money" – part of CAN-SPAM states "the business that benefits is responsible"
  • market in enforcement: bounty hunters who identify label-less spam (ah). amateurs, not law enforcement, large population
  • during CAN-SPAM development: labels were undesirable. Reason: "labels are too effective", because e.g. Amazon would have to have labelled their ads (because there was no distinction between opt-in and opt-out) and would be filtered
  • fundamental problem: corruption due to vested interests lobbying (cf CAN-SPAM)
  • sees difficulties in differentiating
  • q: tracing spam to the business that benefits often involves getting forwarding addresses from e.g. a CGI script running on a server in the Ukraine. *needs* law-enforcement power to get that IMO. a: "yes, and law-enforcement power is available, and jurisdiction problems are easy" (not sure about that! at least for the non-LE bounty-hunter case)
  • q: opt-in would have fixed it, like it has in Australia; but DMA keeps emasculating the laws into YOU-CAN-SPAM. a: agrees that there are multiple answers, but prefers not requiring opt-in across the board and uses the UCE definition as it allows political speech without adding to their costs. (I disagree, personally; the "UBE" definition works for me --jm)
  • Jon Praed: enforcement requires tremendous resources, and in some cases you've got to get to that IP address within 7 days to get those logs, with LE power. This is not easy. Notes that spammer margins are incredibly low, and those bounties as a result would be small and/or hard to get.
  • JP again: also suggests labels to label "good" commercial mail, personal mail, and then leave over "unknown" mail – which is then suspect. also suggests that the *headers* are the labelling, in reality.
  • q: "special interests always seem to wipe out general interest on this issue in laws. what can we do, law-wise?" "my brand is pessimism", "there was this moment, when they passed CAN-SPAM, when legislators were keen to fix it – then the special interests came in".
  • observation from audience: notes the parallel with UK pirate radio in the late 60s, where a McCain-style anti-advertiser provision was also passed to deal with it.
  • Dave Crocker: believes that the suggestion would result in little real effect on spammers, and quite a heavy hit on legit businesses

Hal Varian:

  • "who is annoyed by spam?" one-fifth of US residents acknowledge buying products from spam; 77% considered spam an annoyance. That means 23-30% don't find it annoying! who are they?
  • before federal DNC: multiple lists, state, DMA, company-specific
  • federal DNC: lots more teeth; major fines, enforcement
  • mapped DNC lists (with last 4 digits redacted, obtained via FOIA) against census data
  • very popular in predominantly-Asian areas
  • income under $10k/yr: very low prob of signup
  • income over $100k/yr: very high
  • lots more interesting correlations, too many to write up
  • income, education, and number of kids are the main significant determinants
  • almost no corr between having internet and signing up for the DNC!
  • est. signup rate for do-not-email: 31.5% (iirc)
  • summary: telemarketing and spam annoy the same people
  • 70% of the variation in signups can be explained with only 4 vars: median income, presence of teens, education, and presence of a state list that was merged into the federal DNC (a regression sketch follows this list)
  • not many people used state lists, even though some were effective and cheap; seemingly small costs can be big barriers
  • DNC: effective because it had teeth ($11k fine), lots of publicity, and nationwide.
  • q: "how do income and education correlate?" a: regression determines independent effects, accounts for correlation
  • q: "do tmers target upper-middle-class?" a: yes, very much! q: "different to spam then, since spam is a lot more scattershot" a: agreed. also note that targeting sometimes aimed at *less* middle-class consumers, on the basis that they want to sucker people sometimes
  • q on statistical reliability; answered, questioner seems satisfied
  • q: about getting the NPA data via FOIA: apparently Telcordia asks for a fortune for that! also wondered if he'd checked in Europe. a: not yet, interested in the idea
  • q: "what about using same techniques to find who benefits from spamming?" a: lots of interesting questions about looking at the spamming industry from an economic POV

Nicola Lugaresi: EU vs Spam - a legal response

  • why legislate? social norms, the market, self-regulation, and self-help have all failed; "code has failed"; the law wants its chance to fail
  • law probs: lack of jurisdiction and/or intl cooperation, lack of enforcement, lack of coord with other tools, bad laws
  • EU law: 3 main goals: practical: fight spam; ethical: protect privacy, and state its relevance; political: don't just trail the US, lead sometimes!
  • approach: opt-in and "soft opt-in" (transactional – in the context of a sale, similar products, same company; opportunity to object when the email address is collected); no disguising identity; a valid address to cease further comms
  • other tools: labels, registries, spam boxes, codes of conduct – not convinced by any of these
  • compared 2 anti-spam cases: one in Paris, a property case between a spammer and AOL/MS; one in Napoli, a privacy case for an individual – €1000 damages and €750 costs paid to the individual!
  • spam definition in the EU: not bulk, just direct marketing (not merely "commercial"). too narrow in my opinion
  • not great against hard spam (proxy-abusers etc.): needs other approaches
  • q: worried about definition of spam narrowed to direct mktg, and not bulk: a: agreed
  • q: if I recall correctly, it doesn't protect corporate accounts, just individual "natural person" accounts at an ISP? a: not in the Italian transposition at least; may be just in the Irish version (or my misreading (wink))
  • q: role accounts considered? a: his opinion, yes, in Italy, that's the case. good news

No-Email-Collection flag: Matthew Prince, Unspam LLC

  • Lessig-influenced presentation style – "law professors get paid based on the number of slides produced"
  • CAN-SPAM removes individual rights to sue spammers, one thing that's really been effective so far. not good news
  • proposes a new meta tag: no-email-collection
  • cute cat photo!
  • we're missing a huge chunk of the spam pipeline by focussing on the proxy-to-recipient part of the chain. focus on the address scraping instead!
  • definition of spam: hard. but harvesting email addresses: everyone agrees that's a no-no
  • the proposed markup: <meta name="no-email-collection" terms="[url of terms page]" />

  • http://www.unspam.com/noemailcollection
  • Project Honeypot: like my script – "cookie" email addresses handed out to scrapers (wink) (a sketch follows this list)
  • create subdomains for honeypots and point them at Unspam's server! they'll collect the data
  • generating a public corpus!
  • code licensed under GPL
  • http://www.projecthoneypot.org/ , http://www.unspam.com/
  • q: "scraping through zombies?" a: yes, but that'll increase potential costs for spammers (hmm)
  • q: "two classes of spammer lists: resold email addrs as well as scraped"? yes, but getting one class is useful
  • q: "can a meta tag be enforceable? is a clickwrap license legally viable when it occurs between two computers?" a: if it's a community norm, that can improve legal viability; also CAN-SPAM specifically forbids scraping; also it may cause the spammer to think twice about this

Paula Bruening, CDT: Tech Responses to the Problem of Spam: Preserving Free Speech and Open Internet Values

  • CAN-SPAM not entirely working
  • worried about antispam tech hurting speech capabilities on the net
  • concerned that "only popular speech will be delivered"
  • what's key: tech must not be only part; devs must think of access issues; "let a thousand flowers bloom"
  • q: "spam filters won't block websites" a: yes, but urgent updates do require email
  • q: "is the CDT advocating political UBE?" a: good question, no answer (wink)

Barry Leiba, IBM: a multifaceted approach to spam filtering

  • cowritten with Nathaniel Borenstein (woo!)
  • "The Anti-Spam Gauntlet": describes exactly the SpamAssassin philosophy
  • cooperation required with others
  • open standards are required and are key to implementing anti-spam measures
  • (I suggested open source as well as open standards (wink))
  • Daniel: patents kill open standards; answer from NSB: IBM is committed to "respecting IP rights", but not to letting these stop open standards being used by open source and other parties

SpamGuru:

  • "enterprise-class anti-spam filter", but aren't we all (wink)
  • centralized filter with personalized performance
  • includes a "Bulk Mail Manager" for outbound *bulk* mail, interesting
  • uses a "DNS analysis" step which sounds like it performs SPF checks
  • DNS and domain analysis: check open relays, reverse DNS lookups and static IP tables; mail from dyn IPs; recency of dom registration; probabilistic analysis of Received trail
  • bayes learning also feeds blacklist/whitelist; AWL is actually probabilistic
  • "plagiarism detection": signature based really: "fast analysis of common k-grams"; learns from few examples; almost guaranteed not to be a FP; high FN rate though
  • text classifier: Linear Discriminant: regularized linear classifier; approximates SVM
  • Chung-Kwei (which rocks): really really effective: 86% with < 0.01% FPs on their test corpus
  • test: corpus: 173k msgs, 130k spam, 42k good
  • spam defn: UCE (not UBE). cleaned repeatedly
  • combining algorithms: right in line with SpamAssassin dogma (wink)
  • nice graph of aggregated performance; 96% with < 0.01% FPs
  • SpamAssassin TODO: we need to add short-circuiting again!
  • http://www.research.ibm.com/spam
  • q: "what period, who were the 100 users?" a: users at IBM Watson
  • q: how do you get your "recency of domain registration" data? a: straight from WHOIS
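
The "plagiarism detection" stage is only described as "fast analysis of common k-grams"; a generic near-duplicate sketch in that spirit (hashed k-word shingles compared by Jaccard overlap, not IBM's actual algorithm) could look like this:

    # Hedged sketch of k-gram (shingle) based near-duplicate detection, in the
    # spirit of the "plagiarism detection" stage above; generic, not SpamGuru's code.
    import hashlib

    def kgram_signature(text: str, k: int = 5, keep: int = 100) -> set:
        """Hash every k-word shingle and keep the `keep` smallest hashes."""
        words = text.lower().split()
        shingles = [" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))]
        hashes = {int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles}
        return set(sorted(hashes)[:keep])

    def similarity(sig_a: set, sig_b: set) -> float:
        """Jaccard overlap of two signatures; close to 1.0 means near-duplicate."""
        if not sig_a or not sig_b:
            return 0.0
        return len(sig_a & sig_b) / len(sig_a | sig_b)

    # Usage: keep signatures of known spam; an incoming message whose signature
    # overlaps one of them above a high threshold is almost certainly the same
    # mailing (few FPs, but plenty of FNs, as the talk notes).
    known = kgram_signature("buy cheap meds online now limited time offer act today")
    incoming = kgram_signature("buy cheap meds online now limited time offer act now friend")
    print(similarity(known, incoming))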

Richard Clayton: Stopping Spam by Extrusion Detection

  • from demon.co.uk
  • ISPs can spot smarthost load going up, and suspect that there's a spammer active
  • insecure customers are the main problem for UK ISPs
  • ISP's real problem: blacklisting of IP ranges and smarthosts; rapid action is req'd
  • hard problem to solve: expensive to examine outgoing content; legal issues with blocking, and FP may cost you customers; volume is not good indicator!; "incorrect" sender domain doesn't indicate spam
  • solution: spot delivery failure errors (due to user unknown, remote blocks) in smarthost logs
  • heuristics: "too many" delivery failures (40/day is sufficient); ignore "bounces" – these have a null <> return-path; ignore "mailing lists" (most destinations work, few fail) – a rough sketch follows this list
  • when first turned on, was finding 40 infected customers *per day*!
  • http://www.cl.cam.ac.uk/~rnc1/
  • q: "direct-to-MX spam? trapping port 25?" a: no we don't do that and don't mind about that, as much as spammers using our smarthost and getting that blocklisted
  • q: "sending outbound (or parts thereof) through SpamAssassin?" a: SpamAssassin is too expensive (in terms of load)
  • q: "hair-trigger nature of listing?" a: it's not automatic. there's always a manual verification, and it's usually very obvious at that step

Resisting Spam Delivery through TCP damping:

  • by default, TCP allows sender to control rate of flow; sender can achieve highest speed permitted by network
  • TCP damping tries to reduce net efficiency at the receiver side; more time, more bandwidth, more CPU cycles
  • low pain for recipients, high aggregated pain to spammers
  • need to do this at TCP layer; higher and lower aren't useful
  • even with tarproxy or similar, a smart spammer can blast the entire message to your TCP layer in one blat, even if you're tarpitting at the application layer
  • damping: increase sending time (delaying TCP packets); consume network bandwidth (request more packets)
  • increase delay: set adv_win = 0; fake congestion; delay outgoing ACKs (the TCP conn terminates after 14 retries). cost at receiver: a long idle TCP conn (a rough userspace sketch of the zero-window idea follows this list)
  • increase bandwidth costs: request more retrans.; request more ACKs – reuse sequence numbers, use seqs that won't be used in this conn; send packets in reverse order. cost: about 1:1 ratio
  • used SpamAssassin at delivery time to estimate spamminess! mostly headers during early SMTP conversation, but you can use body rules before "250 Message Accepted for Delivery"
  • q: economics. "it increases the sender's costs, but there's no transfer to the recipient." a: there are no existing techniques to do this, and TCP damping must work within the existing system.
  • q: if I was a spammer, and I figured out you were TCP damping, I'd ignore your advertised windows and blat entire message, hurting the network overall. a: sure, but hurting the spammer's bandwidth like this is worth it
  • q: but this encourages broken TCP implementations. a: but a broken TCP stack still won't get their spam delivered
  • q from John Levine: TurnTide does exactly this technique by narrowing the TCP window on the spammer's connections.
  • q: why not just use delayed ACKs? a: because it's not entirely as effective as the other techniques
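
The paper does this at the TCP layer itself; a rough userspace approximation of the zero-window idea is still possible, since a receiver that simply stops reading lets its kernel buffer fill and advertise a zero window. A hedged sketch, with the SpamAssassin scoring step reduced to a stub:

    # Hedged sketch: an SMTP-ish listener that, once a connection looks spammy,
    # stops reading. The kernel's receive buffer fills and TCP advertises a zero
    # window, stalling the sender - a userspace approximation of the "adv_win = 0"
    # damping above, not the paper's TCP-layer implementation.
    import socket
    import time

    def looks_spammy(early_data: bytes) -> bool:
        """Stub: the talk used SpamAssassin over the early SMTP headers here."""
        return b"viagra" in early_data.lower()    # placeholder heuristic only

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # a tiny receive buffer (inherited by accepted sockets) closes the advertised
    # window quickly once we stop draining it
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)
    srv.bind(("", 2525))
    srv.listen(16)

    while True:
        conn, peer = srv.accept()
        conn.sendall(b"220 damping.example.org ESMTP\r\n")
        early = conn.recv(4096)
        if looks_spammy(early):
            time.sleep(600)   # stop reading: window goes to zero, sender stalls
        conn.close()          # real code would proxy clean mail on to the MTA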

AOL hashing:

  • I-Match: large corpus; lexicon generation
  • intersection of document and lexicon gives signature
  • traditional I-Match lexicon generation: reject very frequent terms and hapaxes
  • use "Mutual Information" as a measurement of fitness to avoid overlapping rules
  • use multiple lexicons to avoid randomization from having an effect
  • generate multiple lexicons by removing random entries from an original lexicon (a sketch follows this list)
  • also: distributional word clustering (Information Bottleneck) for lexicon selection (Terms with similar class distribution of P(spam|term))
  • q: "'cluster' selection" – is that reports from live users? yep
  • q: "FP rate?" a: very very low

Distributed, collaborative spam filtering:

  • TCD, yay
  • definition: "spam is email that the recipient is not interested in receiving". we disagree, of course (wink)
  • P2P approach

Reputation network analysis for mail filtering:

  • 75% of semweb data is FOAF files
  • using web of trust
  • a bit like http://web-o-trust.org/ , but not yet workable with email addrs since there's no spoofing protection

On attacking statistical spam filters:

  • spammers wanted to evade bayes
  • tokenization/obfuscation attacks: these turn out to be good spam signs in themselves (a sketch follows this list)
  • should not have used SpamArchive spam, due to its lack of headers, in my opinion; headers improve spam recognition greatly
  • (correction: SpamArchive spam now does include headers, I missed that change – so that's not a big deal. Also, from talking to one author post-talk, he noted that they omitted the hdrs since the spam and ham each came from a different corpus, therefore a different set of hosts. If not ignored, those tokens would have been very obvious clues for the classifier.)
  • pretty similar to http://www.cs.dal.ca/research/techreports/2004/CS-2004-06.pdf (wink)
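
A small sketch of that observation: instead of being fooled by obfuscated tokens, a filter can emit "obfuscation" features of its own. The homoglyph map and spam-word list are illustrative only:

    # Hedged sketch: obfuscated tokens ("v1agra", "f.r.e.e") become spam signs in
    # their own right. The homoglyph map and tiny spam-word list are illustrative.
    import re

    LEET = str.maketrans("013457$@", "oieastsa")   # '1'->'i', '$'->'s', etc.

    def obfuscation_features(token: str) -> dict:
        """Emit boolean features that fire when a word looks deliberately obscured."""
        lowered = token.lower()
        deobfuscated = re.sub(r"[._\-]", "", lowered).translate(LEET)
        return {
            "digit_inside_word": bool(re.search(r"[a-z]\d+[a-z]", lowered)),
            "separators_inside_word": bool(re.search(r"[a-z][._\-][a-z]", lowered)),
            "deobfuscates_to_spamword": deobfuscated in {"viagra", "free", "mortgage"},
        }

    print(obfuscation_features("v1agra"))    # digit and spam-word features fire
    print(obfuscation_features("f.r.e.e"))   # separator and spam-word features fire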