A Unified Model of Spam Filtration

Bill Yerazunis
Mitsubishi Research Lab

Abstract

A large number of spam filtering and other mail classification systems have been proposed and implemented in the recent past. This paper describes a possible unification of these filters, allowing their technology to be described in a uniform way, and allowing comparison between similar systems to be considered in a more analytic style. In particular, describing these filters in a uniform way reveals a large commonality of design, and explains why so many filters have such similar performance.

The full paper in .pdf format is: http://crm114.sourceforge.net/UnifiedFilters.pdf and other formats are available.


People and Spam

John Graham-Cumming

This talk presents the results of a large survey of email users, designed to understand their use of spam filtering techniques and their experience of and attitudes toward spam. The survey was designed to answer the following questions:

  1. How much time are people spending dealing with spam even once a filter is installed?
  2. How much and what types of spam do people receive?
  3. What technologies do people prefer to use and is there a correlation with computer experience, gender, ...
  4. How much are people willing to pay to be rid of spam?

The last part of the survey consisted of a test to determine how good people are at spotting false positives and whether anything can be done to make them easier to pick out of a spam folder.

After the conference I'll make available all the raw data from the survey for others to analyze as they wish.


Project Honey Pot

Matthew Prince

In early November 2004 Unspam launched Project Honey Pot. The Project allows web site administrators to install honey pot pages on their sites. These pages, when accessed, display a legal warning and a unique email address. While these email addresses look completely normal (e.g., joe.smith@xyz.com) they are keyed to a moment in time and the IP address of the particular visitor. They are also created in such a way that any mail sent to them is directed to the Project Honey Pot mail servers.

Our purpose in creating this service was to see a relatively uninvestigated segment of the "spam cycle." This cycle generally consists of a spammer first obtaining a list of email addresses, then procuring contracts from businesses wanting to be advertised, obtaining bandwidth or proxies through which to send the messages, designing the messages so they will avoid spam filters, and finally sending the messages.

To this point, most anti-spam efforts have focused only on the last steps in the spam cycle. Filters, for example, try to stop messages sometime between when a spammer has pressed the send button and when the messages begin arriving in the recipient's inbox. This is true whether the filter is traditional, Bayesian, based on an IP RBL, or even a URL RBL. In any case, the economics are in the spammer's favor. Because the cost of sending messages is so low, even with a filter that stops 99.99% of messages, once a spammer has your address eventually he will be able to get some messages through.

To this end we wanted to investigate the process spammers use to obtain email addresses in the first place. Two weeks after its launch, Project Honey Pot has hundreds of users and has been installed on web sites on every continent (except Antarctica; we're working on that). We've definitively caught a number of harvesters and are adding more to our list every day. You can find more information about the Project online at:
http://www.projecthoneypot.org/
(By the way, we'd love web sites involved in the 2005 Spam Conference to sign up and install honey pots. In addition to helping us gather data to present at the Conference, it can give you a sense of when and how the spammers targeting gh@archub.org first obtained the address.)

Given the data we've collected already, we believe by January we'll be able to provide insight into many of the following questions:

The other cool part about this, and the reason we at Unspam are particularly excited, is that it may lead to a new and more fruitful avenue of legal prosecution. For example, the Federal CAN-SPAM Act clearly and explicitly forbids email address harvesting. There is no "prior business relationship" exception, no allowance that harvesters get one bite at the apple so long as they let individuals opt out later. In fact, while it's hard for some to agree on the definition of "spam," even the Direct Marketing Association agrees that if you are harvesting email addresses you are a spammer. Period.

Moreover, harvesting may provide a more traceable route back to the spammers. When a spammer sends a message it is a virtually one-way transaction. Spammers can dump their messages on a proxy server and forget about them. On the other hand, harvesting is a two-way street. In order to be useful, harvesters must visit a huge number of web pages, gather the email addresses they contain, then return those addresses to a central location. While not impossible, it is certainly more challenging to obscure this process.

Our evidence so far seems to confirm this point. While the spam messages our honey pot addresses are receiving are exactly what you'd expect -- being bounced off machines in China, Korea, and other zombies scattered around the world -- by and large the harvester IPs we've discovered are being run out of the United States. Moreover, they appear fairly stable and to be leased by the actual individuals doing the harvesting. If that continues to be the case as we gather more data, we may have found a new route back to the actual individuals responsible for the spam problem, and a new way to attach legal liability to them.

To this end, we've partnered with Jon Praed from the Internet Law Group and are working with other law enforcement officials. While we're early in the process, by January I think we'll have a lot of exciting data and news to announce. I'd love to be able to do so at the 2005 MIT Spam Conference.

Please let me know if you have any questions. And, anyone with a website, we'd love to have them install a honey pot and help with the Project!


You've Got Jail: Some First Hand Observations from the Jeremy Jaynes Spam Trial.

Jon Praed

I'd like to speak on the legal developments over the past year in the fight against spam. I plan to focus on the effectiveness of the Federal CAN-SPAM statute, which went into force January 1, 2004, and provide a first-hand account of Virginia's successful criminal prosecution of the Jeremy Jaynes/Gaven Stubberfield Spam Gang.

By way of background, Jaynes was listed as the #8 spammer on ROKSO's top 10, and is rumored to be worth over $20 million (at age 29). He was arrested for spamming in December 2003, and was tried in October 2004, along with his sister and a co-worker. He and his sister were convicted of 3 spam felonies each under Virginia's anti-spam statute (the co-worker was acquitted), and Jaynes was sentenced to 9 years in prison. The trial provided a fascinating glimpse inside the spam world, and revealed a lot about how spam filters can be used to make it easier to prove criminal spam conduct by forcing spammers to adopt clearly illegal means of distributing their spam.


Classifier Aggregation

Speaker: Richard Segal
Contributors: Richard Segal, Jeff Kephart, Shlomo Hershkop, V.T. Rajan, and Mark Wegman

IBM Research

Every algorithm for detecting and filtering spam has its advantages and weaknesses. The goal of classifier aggregation is to combine multiple antispam filtering techniques into a meta classifier that ideally combines the best of each classifier while avoiding their weaknesses. This talk will explore the question of whether this ideal can be met in practice. The work I plan to present is an extension of earlier work on static classifier aggregation [1]. In that work we demonstrated that statically learned weights produce a combination spam filter that outperforms any single algorithm in both effectiveness and accuracy. Our new work looks at classifier aggregation in the dynamic, online learning case in which the relative merits of the individual classifiers can vary over time. Real-time adaptation is important for robust antispam solutions since spammers can learn to defeat individual antispam algorithms. The talk will compare several approaches to classifier aggregation and show what we have learned about their relative effectiveness for antispam filtering. The talk will discuss what algorithms combine well, which do not, how to manage the computational complexity of multiple algorithms, and the open problem of how to handle highly skewed data from asymmetries in user voting.

References

[1] R. Segal, J. Crawford, J. Kephart and B. Leiba, SpamGuru™: An Enterprise Anti-Spam Filtering System. In Proceedings of the First Conference on Email and Anti-Spam, July, 2004. http://www.research.ibm.com/spam/papers.html

Spam Kings

Brian McWilliams
Investigative Journalist
http://www.spamkings.biz/

I'd like to present a summary of some of my key findings from the 12-plus months I spent researching and writing the book. My talk would describe, for example, the rise and fall of Davis Hawke, a former neo-Nazi and chess expert who started spamming in 1999 and eventually became a millionaire from selling penis pills -- until AOL sued him in March 2004. (Hawke then turned to cell-phone spamming as well as selling lists of email addresses stolen by an AOL engineer.)

My presentation would also cover some of my conclusions about online consumer behavior. It is my belief that, despite better technology and laws, spam will continue to vex us until people stop buying from spammers. I will describe what I term "furtive e-commerce" as well as the types of shoppers who are susceptible to pitches from junk emailers.


Regulation Instead of Stopping

Rui Dai
Georgia Tech
Kang Li
Georgia Tech Univ. of Georgia

An ancient Chinese tale tells how Great Yu tamed the waters: once upon a time, there were great swampy areas and much flooding in China. Many people tried to stop flooding by simply building higher dikes. However, all these efforts failed since the waters couldn't find their way to the sea and the dikes eventually collapsed because of more and more pressure from water. Then the emperor appointed Great Yu to solve this serious problem. Great Yu worked very hard on this project but in a smart way. Instead of following his predecessors' strategy, Great Yu regulated the courses of the rivers so that the waters could flow smoothly to the sea. He is also credited with the invention of the irrigation of fields and dams. And Great Yu's strategy worked.

We are faced with a similar problem today, and the majority of anti-spam work applies the strategy of Great Yu's predecessors, i.e., stopping spam with filters. Spam and anti-spam are currently in an arms race. E-marketers (including spammers) try hard to reach as many potential customers as possible. Their current practice is to send marketing messages to a large number of recipients to compensate for a very low "open rate". Anti-spam approaches, on the other hand, are limited to blocking spam messages at the recipient side, which reduces the open rate further. The outcome of this arms race is that a high volume of messages is delivered but most of them get dropped, and the volume only grows. This result is far from efficient. Why do e-marketers keep sending out emails? Obviously, there are recipients who do read the messages and may buy products. Using filters simply shuts down many of these potential transactions. So the question is: is there a way to regulate email so that senders send only to those who are interested in the content? This is similar to building an irrigation system to direct water to where it is needed.

One might say that perfect Bayesian filters could accomplish this. That is correct, provided the filters correctly reflect people's preferences (in real time) and spammers cannot find any way to crack the filters. Unfortunately, we do not expect either condition to be met in the near future, if ever (given John Graham-Cumming's presentations over the last two years).

Our presentation proposes an alternative anti-spam approach: helping e-marketers find potential recipients. In this proposal, we use aggregated user preferences combined with a cost mechanism to build a channel for communicating with marketers before a message is delivered to its final recipients. A third party in our system, called the mail arbiter, communicates indirectly between the sender and the recipients before email delivery. This system differs from previously proposed third-party/economic solutions in three ways: 1) We use spam filters as the way to express user preferences. Each recipient only needs to supply an interface that lets the mail arbiter test a message against the user's preferences. In contrast to the inaccuracy of individual filters, we find aggregated preferences more predictable. 2) Our solution can provide objective measures of the quality of emails. 3) The approach can be coupled with any cost mechanism, such as monetary payment, tarpits, etc.

When a sender wants to send a message to a list of users, the message is first forwarded to the mail arbiter. The arbiter tests the message against many recipients' filters and comes up with a "price" (not limited to money) based on the aggregated filter results. A message that is welcomed by many users (as indicated by the filters) is assigned a low price, and vice versa. The sender can either pay the price to send the message or refine the target list. Since the arbiter has access to recipients' preferences, it can suggest how to refine the list. For example, the arbiter can tell the sender that with a new target list X, the expected open rate will be Y and the price will be Z. The sender can then pick the best target list by trading off benefit against cost. This is similar to directing water to where it is needed. We have developed a prototype to demonstrate the whole process and are at the stage of experimenting with our solution.
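The pricing step described above can be illustrated with a minimal sketch. It is not the authors' prototype: the function name `quote`, the representation of each recipient's filter as a callable returning True when the message is welcome, and the linear pricing rule are all assumptions made for illustration.

```python
def quote(message, filters, max_price=1.0):
    """Return (expected_open_rate, price) for one candidate target list.

    `filters` maps a recipient name to a callable that returns True when
    that recipient's filter would accept the message (hypothetical interface).
    The linear rule below is an assumption: the more recipients welcome the
    message, the lower the price.
    """
    if not filters:
        raise ValueError("target list is empty")
    welcomed = sum(1 for accepts in filters.values() if accepts(message))
    open_rate = welcomed / len(filters)
    # Only the aggregate is returned, never any individual filter verdict.
    return open_rate, max_price * (1.0 - open_rate)


# Two hypothetical recipients: one interested in golf, one who rejects everything.
filters = {
    "alice": lambda m: "golf" in m,
    "bob": lambda m: False,
}
rate, price = quote("golf clubs on sale", filters)
```

A sender could call `quote` on several candidate lists and keep the one whose open rate best justifies its price; returning only the aggregate is one way to avoid revealing individual filter results.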

One challenge for cost-based approaches is to avoid exposing user preferences to senders; otherwise marketers might manipulate their messages to appear to fit users' preferences. We avoid this by negotiating prices based on aggregated user preferences, so the behavior of individual spam filters is not exposed to the sender. It is also easier for the third party to detect this kind of probing. Moreover, senders may not even have an incentive to do so, since they should be better off following the suggestions of the third party.


Using Personal Email Network Structure to Fight Spam

P. Oscar Boykin
Assistant Professor Electrical and Computer Engineering, University of Florida

I wrote the following paper, which got some attention on Slashdot and in Nature. The method is to look at the graph created from the "From", "To" and "CC" headers of email to reconstruct the recipient network of all one's email. Then, simply using graph metrics, we can classify about half of the email as spam or nonspam without any training. This method could be used for automated whitelist/blacklist generation, or to train content-based filters without any human input.
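As a rough illustration of the idea, the sketch below builds a graph from message headers and computes one graph metric per address. The abstract does not name the metric used; the local clustering coefficient is chosen here as an assumption, as are the message dictionary layout and the function names.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(messages, owner):
    """Link every pair of addresses that appear together in one message's
    From/To/CC headers, leaving out the mailbox owner's own address."""
    graph = defaultdict(set)
    for msg in messages:
        people = {msg["from"], *msg.get("to", []), *msg.get("cc", [])}
        people.discard(owner)
        for a, b in combinations(sorted(people), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clustering(graph, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    neighbours = graph.get(node, set())
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
    return 2.0 * links / (k * (k - 1))


messages = [
    {"from": "s", "to": ["me", "x"]},                # bulk-style sender
    {"from": "s", "to": ["me", "y"]},
    {"from": "a", "to": ["me", "b"], "cc": ["c"]},   # small social group
]
graph = build_graph(messages, owner="me")
```

In this toy data the addresses around `s` do not know each other, while `a`, `b` and `c` form a triangle; a classifier could use that kind of difference as a signal, under the assumptions stated above.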


Report on the French Government's Approach to Fighting Spam

Eric Walter, Constance Bommelaer
Direction du Développement des Médias
Prime Minister Services FRANCE

Paper here (in French)

To fight spam effectively requires the implementation of a series of actions on several levels: the effective application of anti-spam law, awareness raising among surfers, the development of technical solutions and strong international cooperation.

It appears that various technical measures are currently available; each one can play a role in the battle against spam. When combined, these measures can provide a "good enough solution" to the spam problem for email users. Coupled with appropriate legislative and legal action, such measures may even help turn the tide against the spammers.

In July 2003, the French Government announced the creation of a group for dialogue and action against spam, whose objective is to support dialogue between public and private actors in the fight against spam and the coordination of their actions, both in France and at the international level. A number of working groups are currently working on regulation, technical measures, how to deal with complaints, cooperation at the international level, and other topics.


Distinctions Between Message Authentication and User Authentication

Jim Fenton
Cisco
co-author of the Identified Internet Mail draft

Several proposals have recently emerged for authenticating email messages. This has led to some concern about the effect of these proposals on sender anonymity, especially for signature-based mechanisms which can support per-user signatures. This paper will explore the distinctions between the assertions made by message signatures as compared with conventional signatures such as PGP or S/MIME. In particular, is it possible for a message to be signed and still anonymous? Will unsigned messages become extinct?


Bayesian Noise Reduction: Progressive Noise Logic for Statistical Language Analysis

Jonathan A. Zdziarski
3069 Heritage Rd. Milledgeville GA 31061

Abstract

Modern-day language classification requires the use of machine learning, which relies heavily on the learning input presented to it. Most of today's algorithms (Bayes, Chi-Square, etcetera) are inherently sound and accurate; however, regardless of which algorithm is used, a great deal of the algorithm's accuracy is related directly to the quality of data provided - the Garbage In, Garbage Out theory. Bayesian Noise Reduction is a statistical approach to pattern identification and feature omission which can be implemented as a "pre-filter" in front of existing language classification functions to provide better data for processing. BNR attempts to solve the problem commonly referred to as "Bayesian Noise". Bayesian Noise in its simplest definition refers to irrelevant data present in a message being classified. Bayesian Noise Reduction, in short, dubs irrelevant text in order to provide cleaner classification.

1 Introduction

All samples of text contain some degree of noise, that is, data which is either intentionally or unintentionally irrelevant to accurate classification of the sample and whose removal would give cleaner results. With the noisy data removed from the sample, what is left is only data relevant to the classification. We'll discuss the algorithm in the setting of spam filtering, as this is the area where it has been most widely studied, although the detection and removal of noise in text is not specific to spam and could apply to any other type of text.

Version 2.0 of the Bayesian Noise Reduction algorithm incorporates a statistical implementation of pattern identification in order to learn the particularly interesting patterns of text for an individual. The algorithm itself does not make any determination about the text sample, but rather attempts to filter out any noise so the classifier can more accurately perform this function. BNR relies heavily on the statistical value of individual tokens in a text (as assigned by the classifier) and uses these probabilities to instantiate and record "patterns" of token p-values across a window, then compares the "pattern" of text with the actual p-values of the tokens that fit into the pattern.

1.1 Different Types of Noise

The problem: Noise present in text samples can lead to misclassification.

There are four primary types of Bayesian Noise which can be categorized into the following groups:

• Common Noise: Common noise represents the general noise present in all text samples.
• Junk Words: Junk words or alphanumeric tokens injected into or present in a message.
• Arbitrary Word list Attack: An insertion of a long series of dictionary words, last names, or other collections of words such as a novel, insensible text, and the like. This is done in an attempt to purposely veil the text.
• Directed Word list Attack: The intentional mining and insertion of specific samples of text that will result in the misclassification of the sample.[1]

This algorithm provides relief in all areas of Bayesian Noise allowing classification to benefit in samples of both veiled and noisy text.

1.2 Sparse Noise Pattern Identification

Sparse noise can be defined as data inconsistent with the disposition of the pattern it belongs to. If a pattern of text has been learned by the filter to be "spammy", but an individual token within the pattern has a "nonspam" disposition, that token can be considered noise and dubbed, or eliminated, from classification of the sample. The patterns themselves, covered more in-depth later in this paper, become specific to a user's particular training set and are learned in training.

The identification of noise is very specific to the contextual data stored within the classifier's dataset, and depends on the classifier's calculated p-values for the tokens in a message and the patterns they belong to, which are ultimately learned from the user's behavior. BNR's basic operations are, in part, based on the theories of Habituation and Surround Inhibition in the fields of Biology and Artificial Intelligence research.

Habituation is the brain's way of filtering out background noise. Habituation is seen, for example, in the adaptation of bipolar and amacrine cells in the retina causing still objects to no longer be seen by the human brain after 20 or 30 seconds. BNR identifies well-known patterns which have been "seen" over long periods of time, whose p-values in comparison to those of the tokens within the pattern determine the stimulus and applies artificial Surround Inhibition.

Surround inhibition allows certain neighboring cells to be inhibited by the stimulation of others (in many cases, of the opposite condition) - for example a dark cell in the retina may become stimulated when there is dim light, inhibiting adjacent neighbors, which may be light cells. When paralleled to language classification, many of the tokens we discussed falling into a condition of habituation (and therefore our stimulus) may also inhibit their neighbors.[5]

In plain terms, sparse noise can be described as pockets of text the classifier should ignore (based on inconsistencies between patterns and the underlying tokens) which lead to the creation of larger blocks of text that should also be ignored.

2 The Bayesian Noise Reduction Algorithm

The Solution: Remove the noise, leaving only relevant text for the classifier

There are three primary methods employed in the Bayesian noise reduction algorithm. The first method is the pattern learning period, where patterns are created and their disposition learned by the filter. The second method uses the patterns learned and performs "dubbing" or elimination of tokens whose disposition is inconsistent with the pattern of text they belong to. The third method performs concurrent elimination of data from the sample up to a stop marker. Once a stop marker has been reached, certain checks are performed on the length of the concurrent elimination to determine if the elimination should be made permanent. In short, the steps involve:

  1. Learning and identifying patterns of interest
  2. Identifying the pattern windows whose underlying tokens are inconsistent with the disposition of the pattern
  3. Identifying adjacent windows in between windows believed to be noisy

2.1 Learning and Identifying Patterns of Interest

The process begins with the identification of patterns during a learning phase. The patterns BNR examines are the patterns of token p-values inside a window. Each p-value is assigned to a band with a width of 0.05. For example, with a window size of 3, the following patterns may be formed.

Tokens: Viagra (0.92000) is (0.64000) great (0.34000) for (0.71000)
Pattern: 0.90 0.65 0.35 0.70
Meta-Tokens: bnr.s.0.90.0.65.0.35  
  bnr.s.0.65.0.35.0.70

A metatoken is then created out of the pattern created by this token set. In the example above, bnr denotes that the token is a meta-token used by the Bayesian Noise Reduction algorithm, while s denotes that the pattern consists of single tokens. An alternative character could be used to denote patterns consisting of nGrams, such as bnr.2 or bnr.3. The name of the pseudo-token is entirely at the discretion of the implementor.
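The banding and naming scheme in the example above can be sketched as follows. The function names `band` and `meta_tokens` are illustrative assumptions, as is the choice to round each p-value to the nearest multiple of 0.05, which is what the example values suggest.

```python
BAND_WIDTH = 0.05


def band(p_value, width=BAND_WIDTH):
    """Snap a p-value to the nearest band and format it with two decimals.
    Note: Python's round() sends exact ties to the nearest even step."""
    return f"{round(p_value / width) * width:.2f}"


def meta_tokens(p_values, window_size=3, kind="s"):
    """Build one bnr.<kind>.<band>... meta-token per sliding window."""
    bands = [band(p) for p in p_values]
    return [
        "bnr." + kind + "." + ".".join(bands[i:i + window_size])
        for i in range(len(bands) - window_size + 1)
    ]


# Viagra (0.92) is (0.64) great (0.34) for (0.71)
tokens = meta_tokens([0.92, 0.64, 0.34, 0.71])
```

With a window size of 3, four tokens yield two overlapping windows and therefore two meta-tokens.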

A set of metatokens is generated for each message being processed, and patterns are learned through both supervised and unsupervised (filter-automated) training. It may, however, improve effectiveness to train patterns only on email that the filter does not have a high confidence level in. Each pattern is stored with a counter for each disposition available, for example spamHits and innocentHits. Once a minimum training threshold has been reached, the patterns may be assigned a p-value using Graham's[3] approach for calculating p-values. For example:
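A minimal sketch of a Graham-style p-value calculation from the two counters, under stated assumptions: the names `total_spam` and `total_innocent` (the number of messages trained in each class), the threshold `min_hits`, and the clamping bounds are illustrative, and the upper bound of 0.9999 is chosen only because the pattern listing in this section shows values of 0.99990. Graham's published formula additionally doubles the innocent count and clamps to 0.01 and 0.99; that bias is left out here.

```python
def pattern_p_value(spam_hits, innocent_hits, total_spam, total_innocent,
                    min_hits=5, low=0.0001, high=0.9999):
    """Return the spam probability of a pattern, or None when the pattern
    has not yet been seen often enough to be trusted."""
    if spam_hits + innocent_hits < min_hits:
        return None  # below the minimum training threshold
    spam_rate = min(1.0, spam_hits / total_spam)
    innocent_rate = min(1.0, innocent_hits / total_innocent)
    p = spam_rate / (spam_rate + innocent_rate)
    # Clamp so that no single pattern is ever treated as absolutely certain.
    return max(low, min(high, p))
```

For example, a pattern seen 10 times in 100 spam messages and never in 100 innocent ones is clamped to 0.9999, while a pattern seen only twice in total returns None.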


Standard rules for assigning p-values to hapaxes may also be applied at the implementor's discretion. After some initial training, some patterns will take on the disposition of one particular class of text:

bnr.c.0.25.1.00.1.00 [0.99990] bnr.s.0.35.1.00.1.00 [0.99990]
bnr.s.1.00.1.00.0.20 [0.99990] bnr.s.1.00.0.40.1.00 [0.81868]
bnr.s.1.00.1.00.0.25 [0.99990] bnr.s.0.55.1.00.1.00 [0.99990]
bnr.c.1.00.1.00.0.35 [0.99990] bnr.s.0.25.1.00.1.00 [0.99990]
bnr.c.1.00.1.00.0.15 [0.99990] bnr.c.0.15.1.00.1.00 [0.99990]
bnr.c.0.10.1.00.1.00 [0.99990] bnr.s.0.35.1.00.0.40 [0.99990]
bnr.s.0.40.0.35.1.00 [0.99990] bnr.c.0.20.1.00.1.00 [0.99990]

In the selected patterns above, notice that each pattern is considered very "spammy", yet there are bands of innocent tokens within it. These are the tokens we'll identify and dub in section 2.2.

2.2 Identifying Pattern Window Inconsistencies

After an initial period of learning, the pattern bnr.s.0.65.0.35.0.70 may take on a very spammy disposition (for example 0.95000). The first step in identifying inconsistencies is to identify interesting pattern windows. This can be done by creating an exclusionary radius from a neutral 0.5. A good radius for many implementations is 0.25.

Next, to identify inconsistencies between the pattern window and the tokens it overlays, we create another exclusionary radius around the tokens in the classification medium as a delta from the value of the pattern window, identifying any tokens which fall outside of that exclusionary radius. We use ABS(windowProbability-tokenProbability). A good radius for token distance for most implementations is 0.30. We see that the middle token's band is out of range. Similarly, should this pattern have an innocent disposition (for example, 0.15000), we would see that the two end tokens' bands are out of range. The two radii may be tuned by the implementor for adjusting sensitivity.

Once we have identified tokens with an inconsistent disposition to the pattern they're included in, these tokens can be "dubbed", or eliminated, out of the classification criteria.
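The two radius checks from this section can be sketched as below, using the pattern bnr.s.0.65.0.35.0.70 as input. The function names are assumptions, and the differences are rounded to six decimals (also an assumption) so that a band lying exactly on the radius, such as 0.95 versus 0.65, is not pushed over it by floating-point error.

```python
def is_interesting(window_p, window_radius=0.25):
    """A window is interesting when its p-value is far enough from 0.5."""
    return round(abs(0.5 - window_p), 6) > window_radius


def inconsistent_tokens(window_p, token_bands, token_radius=0.30):
    """Flag each token whose band differs from the window's p-value by
    more than the token radius."""
    return [round(abs(window_p - t), 6) > token_radius for t in token_bands]


bands = [0.65, 0.35, 0.70]
spammy = inconsistent_tokens(0.95, bands)    # pattern learned as spam
innocent = inconsistent_tokens(0.15, bands)  # pattern learned as innocent
```

With a spammy window only the middle token is flagged, and with an innocent window the two end tokens are flagged, matching the description above.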

2.3 Elimination of concurrent blocks of noisy text or "dubbing"

Once inconsistencies have been identified and eliminated, we can span across multiple adjacent windows, continuing over tokens that fall within a certain band radius until we reach one that does not. These adjacent tokens can then also be dubbed (eliminated) from the calculation.

When an inconsistency is identified, a dubbing counter equal to the window size is initiated for the next tokens in the chain. The dubbing process loops to the next token in the stream, comparing its p-value with that of the original pattern window to determine whether the token falls outside of the exclusionary radius. If it does, the dubbing counter is reset to the window size (3 in this example), and if the next token's window is also considered "interesting" (that is, outside of the window exclusionary radius), then the dubbing loop begins again using the p-value of the new window. If the token is within the exclusionary radius (and therefore not dubbed out), the dubbing counter is decremented by 1 and the loop continues to the next token until the counter is exhausted.

2.4 The elimination and dubbing process

This illustration makes the following assumptions from the previous description in this paper:

let windowSize = 3
let windowRadius = 0.25
let tokenRadius = 0.30

program start
let dubbingCounter = 0
begin loop (each token in ordered stream)
    generate pattern window (window) from next windowSize tokens
    load windowPValue for window
    let interestingWindow = (ABS(0.5-windowPValue)>windowRadius)
    if (interestingWindow or dubbingCounter > 0)
        begin loop (each token in window)
            load tokenPValue at token
            let inconsistentToken = (ABS(windowPValue-tokenPValue)>tokenRadius)
            let dubbableToken = (ABS(dubbingPValue-tokenPValue)>tokenRadius)
            if (inconsistentToken or (dubbingCounter > 0 and dubbableToken))
                eliminate token
                if (inconsistentToken)
                    let dubbingPValue = windowPValue
                let dubbingCounter = windowSize
            else if (dubbingCounter > 0 and !dubbableToken)
                dubbingCounter--
                if (! dubbingCounter and !inconsistentToken)
                    break loop
        end loop
end loop
program end
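The pseudocode above can be transliterated into runnable Python roughly as follows. This is a sketch rather than a reference implementation. It assumes that the caller supplies `window_p_value`, a hypothetical lookup returning the learned p-value for a window of token p-values, that only full windows are examined, that a token is never "dubbable" before the first inconsistency has set `dubbing_p`, and that differences are rounded to six decimals to avoid floating-point edge effects.

```python
def bnr_eliminate(p_values, window_p_value, window_size=3,
                  window_radius=0.25, token_radius=0.30):
    """Return the set of token indexes eliminated ("dubbed") as noise."""
    def differs(a, b):
        return round(abs(a - b), 6) > token_radius

    eliminated = set()
    dubbing_counter = 0
    dubbing_p = None  # p-value of the window that started the current dubbing run
    for start in range(len(p_values) - window_size + 1):
        window = tuple(p_values[start:start + window_size])
        window_p = window_p_value(window)
        interesting = round(abs(0.5 - window_p), 6) > window_radius
        if not (interesting or dubbing_counter > 0):
            continue
        for offset, token_p in enumerate(window):
            inconsistent = differs(window_p, token_p)
            dubbable = dubbing_p is not None and differs(dubbing_p, token_p)
            if inconsistent or (dubbing_counter > 0 and dubbable):
                eliminated.add(start + offset)
                if inconsistent:
                    dubbing_p = window_p
                dubbing_counter = window_size
            elif dubbing_counter > 0:
                # Token is consistent with the dubbing window: run winds down.
                dubbing_counter -= 1
                if dubbing_counter == 0:
                    break
    return eliminated


# Hypothetical lookup: one learned spammy pattern, everything else neutral.
def lookup(window):
    return 0.95 if window == (0.65, 0.35, 0.70) else 0.5


noise = bnr_eliminate([0.65, 0.35, 0.70, 0.30], lookup)
```

In this run the token at index 1 is eliminated as inconsistent with the spammy window, and the token at index 3 is eliminated afterwards because the dubbing counter is still active when it is reached. Because this follows the pseudocode literally, tokens are also tested against an uninteresting window's p-value whenever the counter is active.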

2.5 Further Improvements

At the discretion of the implementor, the pattern metatokens themselves may also serve as useful tokens in the statistical combination. This may further help the effectiveness of the interesting patterns identified.

3 Supporting Data and Experimentation

4 Conclusions

5 Acknowledgments

6 References

[1] Graham-Cumming, J. "How to beat a Bayesian Spam Filter", MIT Spam Conference 2004
[2] Graham, P. "So Far, So Good", August 2003
[3] Graham, P. "Better Bayesian Filtering", January 2003
[4] Yerazunis, W. "The Spam Filtering Accuracy Plateau", MIT Spam Conference 2003
[5] Jackson, C. Introduction to Artificial Intelligence ISBN 0-486-24864-X

Using Lexigraphical Distancing to Block Spam

Jonathan Oliver
Director of Research
MailFrontier, Inc.

Spammers are continually developing new tricks to get their messages past spam filters, including altering content so it is difficult for a filter to identify the content, but humans can still understand it. Bayesian classification is well suited to identifying spam because it is adaptive, allowing spam filtering to change as spam evolves [1]. However, the adversarial environment of spam provides a challenge [2]. The adversarial-parsing technique called lexigraphical distancing can correctly identify the intended content of mutated or morphed expressions. Results from using this technique will be presented to demonstrate how this approach significantly increases the effectiveness of Bayesian spam filters.

Tricks used by spammers to get through Bayesian filters

Bayesian filters can estimate the probability that an email is spam by computing the product of individual probabilities for each feature. The effectiveness of the estimation is largely based on the set of features selected and the accuracy of the estimation of the individual feature probability. Spammers actively try to bypass spam filters by using tricks to hide or minimize the impact of the spam words included in their email messages. Features that are indicators of these tricks should also be used to estimate the overall probability that an email is spam.

Spammers are aware that content-based email filters analyze the words, headers and other content used in the emails to determine whether the email is spam. The majority of spam today includes content which is placed there to assist the spam in getting through Bayesian filters. This content comes in two forms:

(1) Spam-like content is mutated or morphed, in such a way that it is difficult for a filter to identify it, but the human eye can readily identify the intended content. Examples of this include:
  (a) Misspelling: "Viagra" as "Vlarga"
  (b) Using symbols which look similar: writing "Viagra" as "\/1@gr@", and
  (c) Combinations of these techniques.
Using these techniques, there are at least 600,426,974,379,824,381,952 ways to spell Viagra [3].

(2) Good content is added to encourage the Bayesian filter to identify the email as good. This good content can consist of random words (sometimes referred to as word-salad), or strings of random letters (such as "qyywiq").
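The product-of-probabilities estimate described above can be sketched as follows. This is a minimal illustration, assuming independent features and the combining rule popularized by [1]; the feature names and probability values are hypothetical, and the abstract does not say which exact rule the presented filter uses.

```python
from math import prod


def combine_probabilities(feature_probs):
    """Combine per-feature spam probabilities, assuming independent features.

    Each value is the estimated probability that a message containing the
    feature is spam. Inputs of exactly 0.0 and 1.0 together would divide by
    zero, so real filters usually clamp values (for example to 0.01..0.99).
    """
    probs = list(feature_probs)
    spam_product = prod(probs)
    ham_product = prod(1.0 - p for p in probs)
    return spam_product / (spam_product + ham_product)


# Hypothetical per-feature estimates, including one "trick indicator" feature.
features = {
    "viagra": 0.95,
    "meeting": 0.10,
    "has_symbol_substitution": 0.90,
}
score = combine_probabilities(features.values())
print(f"{score:.3f}")
```

Treating an indicator of a trick (here the hypothetical `has_symbol_substitution`) as an ordinary feature lets it raise the score in the same way a spam word would.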

This presentation will focus on issue (1), mutated or morphed content. The adversarial-parsing technique called lexigraphical distancing will be presented which can correctly identify the intended content of mutated or morphed expressions. This parsing technique calculates the edit distance between the content in an email and a library of spam-like terms and phrases. If a portion of the email and a spam-phrase are suitably close in terms of edit distance then the email can be scored as if the email contained the spam-phrase rather than the actual phrase in the email. Results of using lexigraphical distancing will be presented and will demonstrate that such an approach significantly increases the effectiveness of Bayesian spam filters.
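The edit-distance matching described above can be sketched as follows. This is a minimal illustration that uses plain Levenshtein distance on single tokens; the term library `SPAM_TERMS`, the threshold `max_distance`, and the function names are assumptions for the example, not details of the presented technique.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


# Hypothetical library of spam-like terms.
SPAM_TERMS = ["viagra", "mortgage"]


def normalize_tokens(tokens, max_distance=2):
    """Replace each token by the closest spam term if it is close enough."""
    result = []
    for token in tokens:
        lowered = token.lower()
        best = min(SPAM_TERMS, key=lambda term: edit_distance(lowered, term))
        if edit_distance(lowered, best) <= max_distance:
            result.append(best)  # score as if the spam term itself appeared
        else:
            result.append(lowered)
    return result


print(normalize_tokens(["V1agra", "meeting"]))
```

A fixed threshold is a simplification: short legitimate words can fall within a small distance of a spam term, and multi-character symbol substitutions such as "\/" for "v" cost more than one edit, so a practical system would likely need length-aware thresholds and a look-alike character mapping.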

Effective spam filtering does not happen in a static environment. There is an adversarial relationship between spammers and spam filters. Spammers aggressively try to trick spam filters and pass their emails off as legitimate, while spam filters update their methods in an attempt to catch the newest forms of spam. The methods discussed in this presentation significantly improve the effectiveness of Bayesian filters in stopping spam.

References

[1] Graham, Paul. A Plan for Spam. August 2002.
[2] Graham-Cumming, John. How to Beat an Adaptive Spam Filter.
[3] There are 600,426,974,379,824,381,952 ways to spell Viagra. 7 April 2004.

Speaker

Dr. Oliver is a machine learning expert with 17 years of experience in predictive and statistical technologies. He is responsible for the development and deployment of MailFrontier™ technology that identifies and filters out email threats, including spam, phish and viruses. Oliver leads the MailFrontier™ Research team's efforts in analyzing the latest email threat trends and techniques, and applying this to the MailFrontier™ development process. Prior to joining MailFrontier™, he performed research and development for organizations including NASA, eBay, MSN, FAA, and UC Berkeley. Oliver received his Ph.D. in Computer Science from Monash University.


Bayesian Spam Classification Applied to Phishing Fraud

Andrew Klein, Product Manager
Eugene Koontz, Jonathan Oliver and Andrew Klein, Phishing Fraud Authors

Introduction

Bayesian spam filtering has a well-established history as an anti-spam weapon [1]. Spam is generally email that markets a product or service. However, there is a newer kind of spam, distinct from most spam studied previously, which is called "phishing" email. These emails are scams that spoof legitimate companies in an attempt to defraud people of personal information such as logins, passwords, credit card numbers, bank account information or social security numbers. The term phishing was coined because the fraudsters are "fishing" for personal information that can be used for identity theft or financial gain. Unlike marketing spam, phishing emails are specifically designed to resemble good email. Traditional spam filters cannot effectively catch these phishing frauds because the frauds are designed to resemble legitimate emails. However, Bayesian classification can be specifically tailored to estimate the probability that an email is a phishing fraud. This presentation will discuss the features that are specific to phishing emails and how Bayesian email filters can be modified to correctly identify these fraudulent emails.

Feature Detection in Phishing Emails

Phishing fraud email presents a tokenizer with a set of challenges that are distinct from those of spam. The most significant type of tokenizer evasion in phishing fraud involves manipulating the URLs in the HTML portion of the email. Common spam uses URL obfuscation to evade detection of hosting location by anti-spammer filters. Phishing fraud email also uses URL obfuscation, but of greater sophistication than normally seen in spam. The following are some of the specific kinds of URL manipulation observed:

  1. Link inconsistency: a URL is displayed within an A element, but the link target (the value of the HREF attribute) does not match this text. This is the most common form of deception practiced by phishers. Sophisticated email users attempt to detect this by "mousing over" a link to determine the actual link target displayed in the lower left of their email client's display. However, fraudsters exploit a range of security holes to avoid detection by this "mouse over" technique.
  2. Using punctuation or spelling variation to mimic legitimate hostnames and domains. Phishers register domains that resemble legitimate domains; observed examples include:
       a. updatepaypal.com
       b. securityupdt.org
       c. signin_ebay_com_account.ministop.co.kr
  3. Non-standard ports. Rather than using the standard HTTP port (80) or HTTPS port (443), spammers often use a higher-numbered port. Such use may indicate a host that, unbeknownst to the legitimate owners, has been penetrated by fraudsters who have established a webserver process listening to this non-standard port to conduct fraud.

Other important features to consider in Bayesian phishing analysis are:

  • Words and phrases which are commonly in phishing emails,
  • The difference between the reputation of the actual links and the reputation of the displayed links and domains in the email, and
  • The use of forms requesting information in email.
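Two of the URL manipulations listed above, link inconsistency and non-standard ports, can be turned into features with a short parser. The sketch below uses only the Python standard library; the feature names, the `looks_like_url` heuristic, and the example hostnames are assumptions made for illustration, not the tokenizer described in the presentation.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

STANDARD_PORTS = {None, 80, 443}  # None means no explicit port in the URL


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every A element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(url):
    """Hostname of a URL, tolerating displayed text without a scheme."""
    if "//" not in url:
        url = "//" + url
    return urlsplit(url).hostname


def link_features(html):
    """Return the set of (hypothetical) phishing feature names found in html."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    features = set()
    for href, text in collector.links:
        try:
            target = urlsplit(href)
            target_host, target_port = target.hostname, target.port
            looks_like_url = "." in text and " " not in text
            shown_host = host_of(text) if looks_like_url else None
        except ValueError:  # raised for unparsable ports or hosts
            features.add("malformed_url")
            continue
        if target_port not in STANDARD_PORTS:
            features.add("nonstandard_port")
        if looks_like_url and shown_host != target_host:
            features.add("link_inconsistency")
    return features


sample = '<a href="http://203.0.113.7:8081/signin">https://www.example-bank.com</a>'
print(sorted(link_features(sample)))
```

The resulting feature names could then be fed to the Bayesian classifier alongside ordinary word tokens.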

When applying Bayesian classification to filtering fraud, other spam tricks are deemphasized, such as nonsense strings of letters, word salad, or scrambled or additional letters. Phishers want to impersonate a legitimate communication from a bank and are therefore less likely to use these techniques.

This presentation will highlight these techniques that are unique to phishing emails and will present results showing that Bayesian filtering can correctly identify phishing emails when these features are added.

References

[1] Graham, Paul. A Plan for Spam. August 2002.

Speaker

Andrew Klein is the anti-fraud product manager with MailFrontier™. With more than 20 years of software experience, he is an industry speaker on customer-facing enterprise systems. Previously Klein developed classified software systems for the government. His current focus is eradicating the scourge of email fraud from inboxes everywhere. He is a member of the Anti-Phishing Working Group and a selected speaker at the RSA Security Conference on the topic of phishing. He has been quoted in multiple press articles on phishing and email-related security matters, and was instrumental in developing the Phishing IQ test, which has been taken by over 250,000 consumers to date.

Mail Avenger

David Mazières

I will describe a system that makes it easy to prototype new spam-fighting ideas. The system is portable to most Unix platforms, and freely available from http://www.mailavenger.org/.


Spam-I-am: A Proposal for Spam Control using Distributed Quota Management

Hari Balakrishnan, David Karger
3rd ACM SIGCOMM Workshop on Hot Topics in Networks (HotNets)
San Diego, CA, November 2004

Abstract

Email spam has reached alarming proportions because it costs virtually nothing to send email; even a small number of people responding to a spam message is adequate incentive for a spammer to send as many messages as possible. Since spammers need to send messages at high rates to as many recipients as they can, quotas on email senders could throttle spam. We argue for separating the allocation of quotas, a relatively rare activity, from the enforcement of quotas, a frequent activity that must scale to the billions of messages sent daily.

This paper tackles the quota enforcement problem, where the goal is to ensure that no sender can grossly violate its quota. The challenge is to design an enforcement scheme that is scalable, is robust against malicious attackers or participants, and preserves the privacy of communication, in a large, distributed, and untrusted environment. We discuss the design of such a system, Spam-I-am, based on a managed distributed hash table (DHT) interface, showing that it can be used in conjunction with electronic stamps (for quota allocation) to ensure that any non-negligible reuse of stamps will be detected.
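The idea of detecting stamp reuse through a shared key-value store can be illustrated with a toy sketch. Everything here is an assumption made for illustration: an in-memory dictionary stands in for the managed DHT, stamps are opaque byte strings, and only a hash of each stamp is stored. It is not the Spam-I-am protocol itself, whose details are not given in this abstract.

```python
import hashlib


class StampLedger:
    """In-memory stand-in for a DHT that maps a stamp's hash to a use count."""

    def __init__(self):
        self._uses = {}

    def record_use(self, stamp: bytes) -> int:
        """Record one use of the stamp and return the total uses seen so far."""
        key = hashlib.sha256(stamp).hexdigest()  # store the hash, not the stamp
        self._uses[key] = self._uses.get(key, 0) + 1
        return self._uses[key]


def is_reused(ledger: StampLedger, stamp: bytes) -> bool:
    """True if this stamp has already been presented to the ledger before."""
    return ledger.record_use(stamp) > 1


ledger = StampLedger()
first = is_reused(ledger, b"stamp-0001")   # first presentation
second = is_reused(ledger, b"stamp-0001")  # same stamp presented again
print(first, second)
```

A real deployment would have to address what this sketch ignores, including distribution across untrusted nodes, malicious participants, and scale.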


Standardized Spam Filter Evaluation

Gordon Cormack
University of Waterloo

Although this audience largely accepts that filtering is an effective approach to spam abatement, there are as yet no standard tests to measure effectiveness. Instead, we rely on testimonial evidence or on unrepeatable, uncontrolled and statistically unsound measurements to form this opinion. What's the harm?  After all, we all know the assertion is true.  I suggest three major reasons in support of standardized evaluation:  (1) we must convince more skeptical others that filters work well, (2) we must have a basis for comparison among existing approaches, (3) we must be able to measure effectiveness so as to continue to improve (e.g. beyond Yerazunis' "99.9% plateau").

TREC (The Text Retrieval Conference), sponsored by the National Institute of Standards and Technology, is hosting a spam evaluation track in 2005.  I am the coordinator of the spam track and, as such, responsible for developing an evaluation toolkit and methodology.  I have done a pilot evaluation and created a preliminary set of guidelines for the spam track.  Academic and industrial organizations are invited to participate in the spam track - TREC's call for participation will be issued in January 2005. The creation of a standard set of evaluation tools presents a number of challenges, which I shall detail in this talk.  One major problem is the fact that email is private, so a public corpus can at best approximate real email.  Therefore, we will use an open architecture that clearly separates the roles of filtering, testing, and evaluating the results. All components of the architecture will be available so that participants can perform all three tasks.  The official TREC task will involve a combination of private and public test suites, all executed using the same test tools.  That is, there will be two modes of testing:  one using public data that is administered by the participant; one using private data that is administered by the proprietor of the data.

In this talk I will describe the test tools, and the evaluation methods that are proposed for TREC 2005.  I will demonstrate an open-source evaluation kit with the three components described above, and show the results of running several contemporary spam filters using the kit.

I will argue that "accuracy" tells an incomplete and misleading story, and advance a handful of quantitative and categorical effectiveness measures I believe more accurately characterize "effectiveness."  I'll illustrate the measures using the results from my pilot evaluation.
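A small worked example shows how a single accuracy figure can hide the difference that matters. The counts below are hypothetical, and the per-class error rates are one common choice of measure assumed here for illustration; they are not necessarily the measures proposed in the talk.

```python
def rates(ham_total, ham_misclassified, spam_total, spam_misclassified):
    """Return (accuracy, ham error rate, spam error rate) for one filter."""
    errors = ham_misclassified + spam_misclassified
    accuracy = 1 - errors / (ham_total + spam_total)
    ham_error = ham_misclassified / ham_total     # legitimate mail lost
    spam_error = spam_misclassified / spam_total  # spam delivered
    return accuracy, ham_error, spam_error


# Hypothetical corpus: 1000 legitimate messages and 9000 spam messages.
filter_a = rates(1000, 50, 9000, 50)  # misfiles 50 legitimate messages
filter_b = rates(1000, 1, 9000, 99)   # misfiles 1 legitimate message

for name, (accuracy, ham_error, spam_error) in (("A", filter_a), ("B", filter_b)):
    print(f"filter {name}: accuracy={accuracy:.4f} "
          f"ham error={ham_error:.4f} spam error={spam_error:.4f}")
```

Both hypothetical filters make 100 errors in 10,000 messages and so report the same accuracy, yet filter A loses fifty times as much legitimate mail as filter B.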

Note

TREC 2005 will be hosting a spam filter evaluation track, and I'll be the coordinator.  A standard toolkit will be available so that participants can submit filters, or run evaluation suites, or both.  For 2005, the task will be based on the model developed in http://plg.uwaterloo.ca/~gvcormac/spamcormack.html.

At the spam conference I'll talk about the evaluation model, some results on contemporary spam filters, and the organization of the spam track for TREC 2005.  I will encourage filter developers and administrators to participate in TREC 2005.