IBM scientists decode the DNA of spam

August 25, 2004

IBM scientists have used a mix of DNA analysis techniques and Feng-Shui to combat the growing tide of spam.

IBM have developed a junk-mail filter, dubbed "Chung-Kwei" after an ancient Chinese symbol of protection, using DNA analysis techniques to learn patterns of spam vocabulary.

In tests conducted by scientists at the TJ Watson Research Center, the filter has so far proved 96.5% effective at finding spam, with only one false positive in 6,000 messages. The filter was trained with 87,000 messages, then another 88,000 messages containing spam and normal email were passed through the algorithm.

The scientists responsible for the technique, at IBM's bioinformatics and pattern discovery research group, have been working on the algorithm for just over a year. It grew out of another research project, Teiresias, which researchers were using in biological sequencing of DNA and proteins.

The Teiresias algorithm determined the structure of proteins from how they are strung together. The same process was used to look for strings of characters that are found in spam but not in normal email.

The method builds up a database of the patterns it has found and continually adds new patterns to it. It looks for patterns that occur two or more times, either within a single message or across two or more messages. In effect it is learning a vocabulary of spam, as opposed to relying on the statistical analysis of keywords used by many present-day spam filters.
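
As a rough illustration of the idea, and not IBM's Teiresias algorithm itself, the Python sketch below collects fixed-length character substrings that recur in at least two spam messages and never appear in normal mail. The function name learn_spam_patterns, the substring length of 6 and the sample messages are all assumptions made for this example; the real algorithm discovers variable-length patterns.

```python
from collections import Counter


def learn_spam_patterns(spam, ham, length=6, min_support=2):
    """Return substrings found in >= min_support spam messages and in no ham.

    Simplified stand-in for pattern discovery: only fixed-length
    substrings are considered (an assumption of this sketch).
    """
    # Count in how many distinct spam messages each substring occurs.
    support = Counter()
    for msg in spam:
        text = msg.lower()
        seen = {text[i:i + length] for i in range(len(text) - length + 1)}
        support.update(seen)

    # Collect every substring of the same length seen in normal mail.
    ham_substrings = set()
    for msg in ham:
        text = msg.lower()
        ham_substrings.update(
            text[i:i + length] for i in range(len(text) - length + 1)
        )

    # Keep patterns that recur across spam and never occur in normal mail.
    return {
        pattern
        for pattern, count in support.items()
        if count >= min_support and pattern not in ham_substrings
    }


# Hypothetical training messages for illustration only.
spam = ["Buy cheap pills now", "cheap pills shipped fast"]
ham = ["Lunch is cheap today", "Meeting notes attached"]
patterns = learn_spam_patterns(spam, ham)
```

Here the fragment "cheap " is discarded because it also turns up in normal mail, while fragments such as " pills" survive as part of the learned spam vocabulary.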

Incoming mail is processed to see whether it matches these patterns; the more patterns that match, the more likely the message is to be spam. This is known as the "guilt by association" methodology, which is popular in life science and computer security applications.
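
A minimal sketch of that scoring step, assuming a pattern database has already been learned, might count how many known spam patterns occur in an incoming message and flag it once the count reaches a threshold. The patterns, the threshold of 2 and the names spam_score and is_spam are hypothetical, chosen for this example rather than taken from Chung-Kwei.

```python
def spam_score(message, patterns):
    """Count how many known spam patterns occur in the message."""
    text = message.lower()
    return sum(1 for pattern in patterns if pattern in text)


def is_spam(message, patterns, threshold=2):
    """Flag the message when enough patterns match (threshold is assumed)."""
    return spam_score(message, patterns) >= threshold


# Hypothetical pattern database for illustration only.
patterns = {"cheap pills", "act now", "no prescription"}

suspect = "Cheap pills, no prescription needed. Act now!"
normal = "Can we act now on the budget review?"

print(spam_score(suspect, patterns), is_spam(suspect, patterns))
print(spam_score(normal, patterns), is_spam(normal, patterns))
```

A single matching pattern, as in the second message, is not enough to condemn it; the verdict rests on the number of matches.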

The new algorithm will form part of IBM's anti-spam product, SpamGuru, which will also incorporate several other filtering techniques. One of the first products to benefit from this will be Lotus Workplace.

http://www.ceas.cc/papers-2004/153.pdf