An Architect's View

CFML, Clojure, Software Design, Frameworks and more...

An Architect's View

Parsing PowerMTA Accounting Files

July 2, 2011 ·

At World Singles, like many other companies, we use a Power MTA server to handle the email we send out - because we send a lot of email every day. When you have a large user community that you email regularly, one of the problems you have to deal with is that users change or abandon their email accounts and you often end up with a lot of stale email addresses that you need to process. The Power MTA server produces a daily accounting log file that is a giant CSV file containing the delivery status of every email it sends. If you're sending a lot of email, these accounting files can be very big. We send close to half a million emails a day so our accounting files are about 200MB.

To automatically handle bounced emails, all we need to do is extract the list of email addresses for which the Power MTA server was unable to deliver messages. What is a CSV file? It's a sequence of records, with each record representing the columns in one line of the CSV file. Given the size of the files, we don't want to load the whole thing into memory and convert it to a data structure - we'd quickly run out of memory! Fortunately, Clojure has lazy data structures which allow us to process large amounts of data, one piece at a time.

David Santiago has created a very nice abstraction for CSV files - clojure-csv - and very quickly responded to my request to update the library to work with Clojure 1.3.0, which is what we're using at World Singles. With just that library, here's what it takes to extract all the email addresses from a 200MB CSV file without using a great deal of memory:

(ns worldsingles.data.mail-bounce
  (:require [clojure.java.io])
  (:require [clojure-csv.core :as csv]))

(defn to-csv [file]
  (csv/parse-csv (csv/char-seq (clojure.java.io/reader file))))

(defn get-bounces [csv]
  (filter #(= (first %) "b") csv))

(defn get-bouncing-emails [bounces]
  (map #(% 5) bounces))

(def test-file "/Developer/workspace/worldsingles/ws/pmta/acct-2011-06-29-0000.csv")

(count (get-bouncing-emails (get-bounces (to-csv test-file))))

This is just a proof of concept to show the feasibility of such parsing. For production code, the use of reader should be wrapped in with-open to ensure the file handle is closed after the data is processed - and of course the actual list of emails needs to be processed against our member database so that we can flag bouncing email addresses. So, how does it work?

to-csv opens the file with a reader, produces a (lazy) character sequence and then parses that sequence to produce a (lazy) sequence of vectors, where each vector is a row of the file.

get-bounces filters the sequence of vectors to return just those marked as bounces (first column is "b"). Again, it's a lazy sequence.

get-bouncing-emails extracts just column 5 (numbered from 0) which represents the recipient ("rcpt" in the original CSV file), again as a lazy sequence.

I picked an arbitrary CSV file and then ran that last line which: converts the test file to a CSV sequence, filters it to get just bounced records, extracts just the email address and counts how many addresses we found. That line took about 45 seconds to process just over 450,000 records in an almost 200MB CSV file. A total of 46,000 email attempts bounced. (count (distinct (get-bouncing-emails (get-bounces (to-csv test-file))))) told me there were about 38,500 unique email addresses in that list.

Tags: clojure

8 responses

  • 1 Raymond Camden // Jul 3, 2011 at 6:37 AM

    That's fascinating Sean. Thanks for sharing it.
  • 2 Tobias Herkula // Jul 5, 2011 at 5:40 PM

    You could handle this even easier, with the PowerMTA Accounting File Directive. You could specify that the PMTA should create a separate acct file, only with bounced email addresses and without any other data. The only thing you have to do than, is to distinct the values with grep or any other tool and then mark your 'dead' recipients.
  • 3 Fred Tabsharani // Jul 5, 2011 at 6:10 PM

    Tobias wrote: "But why they don't use the Accounting File Directive to create a separate file only with bounced Mail Addresses?"
  • 4 Sean Corfield // Jul 6, 2011 at 2:57 PM

    @Tobias, @Fred, we initially looked at using the Java API for PMTA but it didn't look like you could read the accounting data remotely and it was easier to just use WinSCP to sync the files to a Linux box where we could process them however we want.

    Right now we only pull bounces out but we might want all email delivery results at some point.

    This didn't take long to code up - less time than it would take to read the PMTA manual :)

    I do expect to discuss additional options with someone at Port 25 tho' (Vlad at EWH suggested Scott as an initial contact).
  • 5 Scott Habicht // Jul 7, 2011 at 6:26 AM

    Drop us a line any time, we're more than happy to help.
  • 6 Wayne Mehl // Jul 14, 2011 at 8:50 AM

    We are doing a proof of concept with the new PMTA async processing, in addition to the sync bounces in the existing product. We are injesting the data in to MSFT SQL, as our core product is based on SQL server. It will be interesting to see the results.
  • 7 Sean Corfield // Jul 14, 2011 at 10:27 AM

    I talked with Scott about the options and we've decided to go with a relatively simple solution that has WinSCP on the PMTA server, scripted to sync the log directory out to one of our Linux boxes and then run the Clojure program via cron (using Leiningen). A couple of minutes processing a day and we have lots of data in MySQL to work from, producing graphs as well as marking user profiles that have bouncing / invalid email addresses.
  • 8 Ben // Aug 27, 2011 at 12:09 AM

    Awesome post! I'm actually doing something pretty unique. You can configure the accounting files to pipe to an executable file instead of writing CSVs. I have a python script reading the STDIN, parsing what I need and pushing it into a MongoDB database. ( I've also used SQLite). Then I have a cron running every few minutes, assessing the severity of the bounce and removing the hard bounces.