Spam Filters, Alien Technology and Ruby on Rails

July 5th, 2006 by Arto

Lisp - made with secret alien technology

When Paul Graham’s A Plan for Spam made its dramatic entrance into the anti-spam battle four years ago, it heralded the beginning of the end for spam — as we knew spam back then, anyway. Applying a simple statistical approach, based on word frequency analysis with a naive Bayesian classifier, Graham described how to create a spam filter accurate enough (99+%) that false positives effectively ceased to be an issue.
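For the curious, the core of that approach fits in a few lines of Ruby. What follows is only a minimal sketch, not Graham's code and not ours: the method names and the toy corpora are made up for illustration, and the rule from the essay that ignores tokens seen fewer than five times is left out so the tiny example has something to work with. The per-token formula (good counts doubled, result clamped to 0.01..0.99), the 0.4 default for unseen tokens, and the combination of the 15 most "interesting" tokens follow the essay.

```ruby
# Sketch of Graham-style scoring. All names and data here are hypothetical.

def tokenize(text)
  text.downcase.scan(/[a-z0-9$'-]+/)
end

def count_tokens(messages)
  counts = Hash.new(0)
  messages.each { |m| tokenize(m).each { |t| counts[t] += 1 } }
  counts
end

# Probability that a message containing this token is spam.
# Good counts are doubled to bias against false positives.
def token_probability(token, good, bad, ngood, nbad)
  g = 2 * good[token]
  b = bad[token]
  return 0.4 if g + b == 0 # never seen before: assume mildly innocent
  g_ratio = [1.0, g.to_f / ngood].min
  b_ratio = [1.0, b.to_f / nbad].min
  p = b_ratio / (g_ratio + b_ratio)
  [[p, 0.01].max, 0.99].min
end

# Combine the probabilities of the tokens farthest from a neutral 0.5.
def spam_probability(text, good, bad, ngood, nbad, top = 15)
  probs = tokenize(text).uniq.map do |t|
    token_probability(t, good, bad, ngood, nbad)
  end
  interesting = probs.sort_by { |p| -(p - 0.5).abs }.first(top)
  prod = interesting.inject(1.0) { |acc, p| acc * p }
  inv  = interesting.inject(1.0) { |acc, p| acc * (1.0 - p) }
  prod / (prod + inv)
end

# Toy corpora, invented for this example.
GOOD_MESSAGES = ["meeting notes attached for the project",
                 "lunch on the beach tomorrow"]
BAD_MESSAGES  = ["buy cheap pills now", "cheap pills click now"]

GOOD_COUNTS = count_tokens(GOOD_MESSAGES)
BAD_COUNTS  = count_tokens(BAD_MESSAGES)
NGOOD = GOOD_MESSAGES.size
NBAD  = BAD_MESSAGES.size

puts spam_probability("cheap pills now", GOOD_COUNTS, BAD_COUNTS, NGOOD, NBAD)
puts spam_probability("project meeting on the beach", GOOD_COUNTS, BAD_COUNTS, NGOOD, NBAD)
```

With corpora this small the scores land very close to 1 and 0 respectively; a real filter needs thousands of messages on each side before the numbers mean much.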

The central idea of the Graham Algorithm was quickly adopted en masse by spam filters, and as a result, the spam arms race has in the past few years tipped in favor of the good guys. “Successful” spam has devolved into exactly what Graham predicted it would: “some completely neutral text followed by a URL.” For me personally, the combination of good server- and client-side filters has made spam yesterday’s problem. (Well, that, and using Gmail as a front-end for lower-priority e-mail; spam all you want, it’s Google’s problem and they’re up to the task.)

Recently at MakaluMedia, we’ve succeeded in applying similar text classification principles to another unrelated problem area, with the intent of forcing the computer to do the tedious job it was invented for, allowing us super-apes, in turn, to spend more time under a palm tree on the nearby beach, sipping tinto de verano and working out answers to deep existential questions, or whatever else it is that one does on the beach (note to self: need more practice).

The exact details of this covert project will have to await its escape into the wild, should it ever evolve the capability for that. For the time being, some of the technically more interesting tidbits will have to do as fodder for my ramblings.

First, as with most of our internal development, and an ever-increasing percentage of our client projects, this system was developed using the high-productivity Ruby on Rails framework, and reached the magical 0.1.0 mark (i.e., pre-alpha, but usable enough to solve many of the developers’ own needs) in less than a man-week of intensive coding (not to forget the skimming of a good number of research papers related to the subject).

However, to ensure a permanent gap on the competition (after all, it sometimes seems like half the world has already jumped onto Rails), we also pulled out the big guns: the top-secret alien technology known as Lisp.

(Don’t be fooled by the devious code name, intended to confound us earthlings with spurious ideas relating to speech defects — this is seriously powerful stuff: exposure is guaranteed to subtly but permanently alter your brain structure in ways not yet fully understood. In fact, the aliens have theorized that the Universe may actually be one giant Quantum Lisp machine, explaining how it is possible that lots of irritating, seemingly superfluous parentheses can act as magic incantations conveying an apparently inexhaustible power as per the principle of Clarke’s Third Law. But that’s neither here nor there for our present purposes.)

In our case, we simply integrated into the Rails system an interpreter for a subset of Scheme, a Lisp derivative, thus no doubt confirming Greenspun’s Tenth Rule once again. (Well, to be fair, the Lisp interpreter in question is only a few hundred lines of fairly elegant Ruby code.)

The system’s top-level classification and scoring algorithms are implemented in this Lisp subset, allowing us to easily fine-tune and try out new tweaks at runtime, and perhaps in the future letting us semi-automatically pit various competing implementations against each other in a manner not dissimilar to a genetic algorithm.
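To give a flavor of how little Ruby such an interpreter needs, here is a deliberately tiny sketch. It is not our interpreter: it is a generic evaluator covering only numbers, symbols, if, define, lambda and a handful of arithmetic primitives, and every name in it (Env, evaluate, run, the score function) is invented for this example. It does show the part that matters, which is that a scoring rule is just data that can be redefined while the system is running.

```ruby
# Minimal evaluator for a tiny Scheme subset. Illustrative only.

# Turn source text into nested Ruby arrays of symbols and numbers.
def parse(src)
  read(src.gsub("(", " ( ").gsub(")", " ) ").split)
end

def read(tokens)
  raise ArgumentError, "unexpected end of input" if tokens.empty?
  tok = tokens.shift
  case tok
  when "("
    list = []
    list << read(tokens) until tokens.first == ")"
    tokens.shift # drop the closing ")"
    list
  when ")" then raise ArgumentError, "unexpected )"
  else atom(tok)
  end
end

def atom(tok)
  Integer(tok)
rescue ArgumentError
  begin
    Float(tok)
  rescue ArgumentError
    tok.to_sym
  end
end

# A chain of variable scopes.
class Env
  def initialize(vars = {}, outer = nil)
    @vars = vars
    @outer = outer
  end

  def lookup(name)
    return @vars[name] if @vars.key?(name)
    raise NameError, "unbound symbol: #{name}" unless @outer
    @outer.lookup(name)
  end

  def define(name, value)
    @vars[name] = value
  end
end

def global_env
  Env.new({
    :+ => ->(*n) { n.inject(:+) },
    :* => ->(*n) { n.inject(:*) },
    :- => ->(a, b) { a - b },
    :/ => ->(a, b) { a.fdiv(b) }, # always float division
    :< => ->(a, b) { a < b },
    :> => ->(a, b) { a > b }
  })
end

def evaluate(x, env)
  case x
  when Symbol then env.lookup(x)
  when Array
    op, *args = x
    case op
    when :if
      test, conseq, alt = args
      evaluate(evaluate(test, env) ? conseq : alt, env)
    when :define
      env.define(args[0], evaluate(args[1], env))
    when :lambda
      params, body = args
      ->(*vals) { evaluate(body, Env.new(params.zip(vals).to_h, env)) }
    else
      fn = evaluate(op, env)
      fn.call(*args.map { |a| evaluate(a, env) })
    end
  else x # numbers evaluate to themselves
  end
end

def run(src, env)
  evaluate(parse(src), env)
end

env = global_env
run("(define score (lambda (hits total) (/ hits total)))", env)
puts run("(score 3 4)", env)

# Swap in a smoothed variant at runtime, with no restart or redeploy.
run("(define score (lambda (hits total) (/ (+ hits 1) (+ total 2))))", env)
puts run("(score 3 4)", env)
```

Because the rule lives in a string, it could just as well come from a database row or an admin form, which is what makes trying out competing variants cheap.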

Due to the combined RAD-factors of Rails and Lisp, we quickly proceeded through a number of intermediate prototypes along the way, starting from a short-lived Ferret-based implementation, evolving to a hand-rolled tokenizer and SQL-backed corpora storage, and eventually ending up with the current version that delegates the content classification to a Dr. Strangelove-inspired piece of excellent software called the CRM114 Controllable Regex Mutilator.

(Speaking of CRM114, I was surprised to not find any existing Ruby bindings for it, and thus took the time to transcribe a previous Python wrapper into a Ruby version, to be released shortly.)

Anyway, the system appears to work more or less according to spec, but definitely still needs some more tweaking before embarking on world domination. For one thing, all this number crunching is, well, rather heavy (let’s just say it’s a good thing we have A/C in the server room where the development box is located). Although CRM114 itself is pretty light on its feet, we’re dealing with an exponentially growing data set, and the next challenge will be to put some checks on resource consumption.

So, for the time being, we’re not going to let the development box interact with our space systems department, to prevent any non-regulated growth or inadvertent contact with the aliens. More updates to follow as they happen.

Note to self: too many tintos de verano can make you forget there’s a fine line between tongue-in-cheek surrealism and plain-bad geek humor.

Update 2006/11/06: I’ve released the Ruby interface to CRM114 on RubyForge.

One Response to “Spam Filters, Alien Technology and Ruby on Rails”

  1. Alex Says:

    Just read about CRM114 today, looking forward to seeing your Ruby wrapper.

