Speakers at the 2003 Spam Conference. (http://spamconference.org)

Sparse Binary Polynomial Hash Message Filtering and The CRM114 Discriminator
Bill Yerazunis, MERL

Bayesian statistical mail filtering to sort incoming messages into valid and invalid sets is rapidly becoming the method of choice for single-ended antispam filters. In this talk, we will examine the Sparse Binary Polynomial Hash filtering technique, a generalization of the Bayesian method that can match mutating phrases as well as single words. As implemented in the GPLware "CRM114 Discriminator", and combined with supervised ADABOOST learning, SBPH can deliver >99.5% accuracy on real-time email without whitelists or blacklists, from as little as 250K of example text. A short explanation of the CRM114 Discriminator language will be included.
__________________________________________________
Adaptive Spam Filtering
Jason Rennie, MIT AI Lab

Spam is a rampant problem with an annoying characteristic. As quickly as heuristics are developed to spot spam, the spammers change their tactics. Current systems all have one hole that spammers love to sneak through: they can't adapt. Hand-crafted rule-based classifiers have static rules. Bayesian approaches use static pre-processing that ignores "!!!!" and/or Japanese (for example). We need a new approach---a way to dynamically learn patterns that can identify spam. I describe one such approach: spam filtering as a compression problem. Given a set of e-mails and their labels (spam/non-spam), the objective is to encode a program for identifying spam in less space than it takes to encode the labels. In conjunction with a classification algorithm, this framework provides a natural way to score patterns. As part of a spam filtering system, it can be used to adapt the set of features used for labeling e-mail as spam. I describe the rationale for this approach and give examples of its performance on real data. Work done in conjunction with Tommi Jaakkola.
__________________________________________________
Following Their Patterns
John Draper, ShopIP

I've spent considerable time tracking specific spammers, to try to get an idea of how they operate. Using the Crunchbox security system, we've been able to track them (almost in real time), keeping track of the times their spam arrives in our mailbox and studying the patterns. They are not as consistent as I had hoped, though some are incredibly persistent, almost to the point of harassment. We are writing "snort" rules, which the Crunchbox instantly triggers when any specific spam we are looking for comes into our network.
__________________________________________________
The Spammers' Compendium
Dr John Graham-Cumming, POPFile

POPFile is a POP3 proxy that performs Naive Bayes classification of email into an arbitrary number of classes. POPFile and other Naive Bayes-based email classifiers have become very popular for spam fighting. And the spammers know it. I'll present a compendium of spammers' tricks designed to confuse and bypass keyword and text analysis filtering programs and present the state of the POPFile art in handling these tricks. Come learn about Invisible Ink, Speaking in Tongues, Slice and Dice and many other games you can play with email and HTML.
__________________________________________________
The Case for Spam Research Infrastructures
Paul Judge, CipherTrust

The scale and effect of the spam epidemic lead us to suggest that spam is no longer simply a nuisance, but is a type of information security problem. Therefore, we encourage systematic efforts to understand and analyze the problem and propose solutions. As part of these efforts in spam research, there is a need for the types of infrastructures that have proved useful in other areas of computer research. We identify three types of such infrastructures: 1) public trace data, 2) research tools, and 3) technical conferences.
Public trace data has been used for years in networking research and in network security research. Recently, SpamArchive.org has been established to provide publicly available spam and non-spam archives useful for testing, training, and benchmarking. We discuss the goals, current status, and possible future directions of SpamArchive.org. Research tools are necessary for collecting, processing, and analyzing spam-related data. In the past, developers interested in contributing to anti-spam efforts largely have written spam filters. We stress the importance of other types of tools and discuss examples of necessary tools including: 1) tools to anonymize spam and non-spam messages; 2) tools to measure global spam activity; and 3) tools to perform automated testing including automated effectiveness and accuracy measurements.
__________________________________________________
eXpurgate: a different approach in filtering E-Mail and detecting SPAM
Robert Rothe, eleven GmbH, Germany

eXpurgate is a new service developed and provided by eleven allowing companies and consumers to reliably protect themselves against SPAM. Furthermore, eXpurgate categorizes E-Mails into clean, bulk and dangerous and therefore allows its users to differentiate between important, less important, dangerous and unsolicited messages. eXpurgate tests the main characteristic of SPAM E-Mail, that is, its characteristic of being sent en masse. This does not necessarily require an E-Mail to be forwarded through the system; a short fingerprint or "key" communicated to the eXpurgate system is sufficient to allow the system to perform the categorization. This fingerprint gives no evidence of the textual content of the E-Mail.
In my presentation I will describe the concept of eXpurgate and will address the following issues: Absence of a common SPAM definition; DNA of SPAM?; SPAM is just part of the problem; limitations of single-ended approaches.
__________________________________________________
Spam Filtering at the Network Level
Matt Sergeant, MessageLabs

Filtering at the network level presents some slightly different problems as far as anti-spam goes compared to filtering an individual's email, or the email for one small company. For example, the definition of "spam" varies greatly among different types of users, and the contents of their inboxes look entirely different. This talk will discuss the implications of this on various technologies, specifically Bayesian or probability based detection methods, and the implications for SpamAssassin. We will also discuss the new "Bayesian" component to SpamAssassin.
__________________________________________________
Better Bayesian Spam Filtering
Paul Graham, Arc Project

Last year I found, to my great surprise, that a very simple algorithm could filter out over 99% of my spam with near zero false positives. Since then I've found several ways to improve the performance of this algorithm. This talk will describe the new techniques I've tried, how well they work, and what I plan to do next.
__________________________________________________
Anti-Spam Techniques at Python.org
Barry Warsaw, Pythonlabs at Zope Corporation

The primary Python and Zope mail server supports over 100 mailing lists of varying traffic volumes and dozens of personal email addresses and non-mailing list exploder aliases. These lists and aliases support the operation, development, and maintenance of Python and Zope. Most lists fall into the SIG category -- being largely technical in nature, with some smaller hobbyist lists thrown in for good measure.
In this presentation I will talk about the tiered anti-spam defenses we have in place, from low-level Exim4 rules, to integration with SpamAssassin, to the anti-spam techniques in Mailman. I'll talk about how effective these defenses are, the administrative burden they impose, and our future plans for deploying additional tools to ease the management and reduce the false hits of spam and ham moving through our site. If no one else talks about it, I'll give some overviews of the Spambayes project and how that might fit in with the Python.org/Zope.org mail system.
__________________________________________________
Smartlook: An E-Mail Classifier Assistant for Outlook
Jean-David Ruvini, e-lab Bouygues SA, France

In this talk we will present Smartlook, an assistant integrated into Microsoft Outlook that helps users file the E-mails they receive into folders. When the user selects a message, it predicts the three most likely folders for that message and provides shortcut buttons that facilitate filing into one of the predicted folders. However, the user also has the possibility to specify some folders as "active". In that mode, whenever the user receives a message, Smartlook predicts the most likely folder but does not make a suggestion: if its best guess is an active folder, Smartlook automatically files the message in that folder. This active mode is particularly useful to filter out spam: provided the user has created an active "Spam" folder in which he keeps examples of Spam email he has received, Smartlook automatically moves spam messages from the user's inbox to the "Spam" folder. Smartlook uses statistical techniques (a variant of the Naive Bayes classifier using the Kullback-Leibler measure and Witten-Bell smoothing) to learn a model of the user's filing habits and has been shown to achieve excellent classification/filtering accuracy.
__________________________________________________
Spam: Threat or Menace? An ISP's View
Barry Shein, CEO, The World

Spam, taken from the point of view of an ISP, is much worse than you may think. To most people spam is generally a grinding nuisance. To ISPs it is a threat to their very existence ("death of the net, film at 11!"). This talk will survey many sound reasons why you should be much more depressed and upset about spam than you already are. I will also propose one possible solution likely to annoy most people in the room.
__________________________________________________
Lessons from Bogofilter
Eric Raymond, Open Source Initiative

Despite being a LISP-head himself, Eric Raymond wrote the first C implementation of Paul Graham's algorithm from "A Plan For Spam". The aim was to produce a fast, lightweight implementation that would attract general interest. The project succeeded in an unexpected way. Raymond will discuss the lessons in his talk.
__________________________________________________
Gnus vs. Spam
Teodor Zlatanov, spam.el Maintainer

The Gnus news- and mail-reader runs in Emacs (GNU Emacs and XEmacs). The talk will be on spam.el, a Lisp package in the Gnus distribution, which deals with spam classification and filtering. Other possibilities for fighting spam in Gnus will be mentioned as well.
__________________________________________________
Spam Filtering: From the Lab to the Real World
Joshua Goodman, Microsoft Research

Spam filters from Microsoft Research have moved from the lab to the real world-- they are currently being used by millions of people. I'll talk about some of the issues that we discovered as we went through this process. How do you evaluate a spam filter and determine the right tradeoffs between catching spam and making mistakes, and what are the fairest ways to do the evaluation? I'll talk about the big target problem: the bigger you are, the more customers you need to satisfy, and the faster spammers adapt to you. Personalized filters solve some of these problems, but create new ones.
Finally, what is the right approach for the future? Like many others, we believe machine learning techniques in general, and especially probabilistic techniques, are the right approach.
__________________________________________________
Integrating Heuristics with n-grams using Bayes and LMMSE
Michael Salib, MIT

While Bayesian spam filters have become popular recently due to the work of Paul Graham and others, they can be improved upon. I describe some new work on integrating heuristic spam detectors with frequency based models in a Bayesian framework. I also describe extensions of this work to n-gram language models and attempts to use linear minimum mean square error estimation techniques instead of Bayesian inference. The heuristics used include many network services like DCC and the realtime blackhole lists while eschewing phrase based heuristics, since those are better dealt with using frequency based approaches.
__________________________________________________
Forty Years of Machine Learning for Text Classification
David D. Lewis, Independent Consultant

The first application of machine learning to text classification appeared in the Journal of the ACM in 1961. (Coincidentally, it used the Naive Bayes learning algorithm, now popular for spam filtering.) I will review the high points of what's been learned since then, in both the academic and operational text classification communities. I'll argue that, particularly for spam classification, techniques for selecting and preparing training data are more important than the choice of a particular learning algorithm or classifier form.
__________________________________________________
How Lawsuits Against Spammers Can Aid Spam-Filtering Technology: A Spam Litigator's View From the Front Lines
Jon L. Praed, Esq., Partner, Internet Law Group

Winning the battle over spam will require the Internet community to utilize every tool available to it-- both technical and legal.
Fortunately, the law already provides numerous grounds for legal actions against spammers, and major ISPs have an excellent track record of successfully using those laws to hold spammers accountable for the costs of their actions. In addition, the law is adapting to recognize new causes of action against spammers-- including proposed federal legislation that could lead to criminal penalties against spammers. This presentation will review the current state of the law outlawing spam, as well as proposed new laws. In addition, the presenter will share his experiences in the trenches, having successfully handled dozens of lawsuits against spammers for major ISPs, including AOL (versus LCGM, CN Productions & Cyber Entertainment Network), and Verizon Online (versus Alan Ralsky). This presentation will demonstrate how legal action against spammers can advance at least three important and distinct goals: (1) support filter technology by pursuing spammers who evade filters through fraud and other criminal means; (2) permit filter software developers to recover the costs of development; and (3) identify those who are likely to be committing (or planning) other cyber-crimes for financial gain.
__________________________________________________
Desperately Seeking: An Anti-Spam Consortium
David Berlind, CNET

We have anti-spam solutions and ideas coming out of our ears. Some at the heart. Many at the edge. Others are legislative in nature. Charge a penny. The global nature of the Internet and the email ecosystem and the degree of universal cooperation that's required make the pragmatism of many of these approaches questionable. The current system is clearly broken. Until there's a royalty-free standard anti-spam protocol that is transparently embedded, like SMTP is, into all parts of the email ecosystem (clients, servers, service providers, etc.) in a way that those parts are cooperating with each other to make that ecosystem air-tight, this problem isn't going away.
So, it's time for the same vendors that make spam possible (basically, all of them) to do something that they've never done. They have to get together to figure out how to make it impossible. Only then will there be enough disincentive to make spammers find another line of work. If vendors can cooperate over specifications like XML, SOAP, Wi-Fi and Bluetooth, they can cooperate over this.
__________________________________________________
Fighting Spam in Real Time
Ken Schneider, Brightmail

Spam is an extremely dynamic problem. Spam attacks are mutating and changing as never before-- spammer tools continue to become more sophisticated. Identifying and combating spam in real-time allows an anti-spam solution to have very high effectiveness and very high accuracy. I'll discuss a real-time approach to filtering spam and spam trends that are having an impact on email users. Lastly, I'll discuss a spam policy that enables an aggressive approach to filtering, while allowing legitimate, solicited messages to pass through.