I touched on this in an earlier blog post about parsing PowerMTA accounting files, but I wanted to cover it again as part of this blog series and go into a bit more depth. In that earlier post, I showed an outline for processing the large CSV files produced by PowerMTA, but there are some interesting smaller pieces of the puzzle that warrant coverage too.

The log files are synced from the PowerMTA Windows server to a directory on one of our Linux servers. The first part of the task is to get a list of just the CSV files (we sync other log files too, but right now we only want the CSV files). The java.io.File object has a list() method that takes a java.io.FilenameFilter - an interface defining an accept() method, which takes a directory (as a java.io.File object) and a filename (as a java.lang.String) and returns true if the file should be included in the list. I want to pass a regex instead, and get a sorted list back:
(defn- wildcard-filter
  "Given a regex, return a FilenameFilter that matches."
  [re]
  (reify java.io.FilenameFilter
    (accept [_ dir name]
      (not (nil? (re-find re name))))))

(defn- directory-list
  "Given a directory and a regex, return a sorted seq of matching filenames."
  [dir re]
  (sort (.list (clojure.java.io/file dir) (wildcard-filter re))))
directory-list is an interesting mix of Clojure and Java interop: (clojure.java.io/file dir) is a convenient wrapper that returns a java.io.File object representing the file or directory; (.list ...) is a call to the Java method on that File object; (wildcard-filter re) is a Clojure call - wildcard-filter uses reify to create an anonymous object on the fly that implements the java.io.FilenameFilter interface (and uses the regex to determine whether the filename matches); finally, the resulting String array - returned by File.list() - is sorted as a Clojure sequence.
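To see the filter and the sort working together, here is a small self-contained sketch. It repeats the two functions (made public so they can be called from a REPL) and runs them against a scratch directory. The directory name and the file names acct-1.csv, acct-2.csv and pmta.log are made up for illustration; they are not the real PowerMTA file names.

```clojure
(require '[clojure.java.io :as io])

;; Same shape as the functions above, repeated so this snippet runs on its own.
(defn wildcard-filter
  "Given a regex, return a FilenameFilter that matches."
  [re]
  (reify java.io.FilenameFilter
    (accept [_ dir name]
      (not (nil? (re-find re name))))))

(defn directory-list
  "Given a directory and a regex, return a sorted seq of matching filenames."
  [dir re]
  (sort (.list (io/file dir) (wildcard-filter re))))

;; Hypothetical scratch directory under the JVM temp dir, for illustration only.
(def demo-dir
  (doto (io/file (System/getProperty "java.io.tmpdir")
                 (str "pmta-demo-" (System/nanoTime)))
    (.mkdirs)))

;; Create two empty CSV files (deliberately out of order) and one non-CSV file.
(doseq [n ["acct-2.csv" "acct-1.csv" "pmta.log"]]
  (spit (io/file demo-dir n) ""))

;; Only the CSV names come back, sorted.
(def csv-files (directory-list demo-dir #"\.csv$"))
;; csv-files is ("acct-1.csv" "acct-2.csv")
```

One thing worth knowing: File.list() returns null when the path is not a directory, and (sort nil) is an empty list, so directory-list quietly returns () for a missing directory rather than throwing.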
For each file in that list, we run this function to process the CSV data and return counts of delivered and bounced messages:
(defn- process-pmta-accounting-file
  "Process a single PMTA accounting file.
  Return the number of delivered / bounced records processed."
  [directory file]
  (let [file-path (str directory "/" file)]
    (reset! failures 0)
    (with-open [r (clojure.java.io/reader file-path)]
      (reduce count-email [0 0]
              (map process-bounce-record (to-csv r))))))
This is very similar to the code in my previous blog post. It opens the CSV file, parses it using the clojure-csv library, calls process-bounce-record on each row of data and then accumulates delivered and bounced counts. process-bounce-record returns a bounced / delivered indicator; count-email takes a pair of numbers and an indicator and returns a pair of numbers with either the first or second incremented, according to the indicator. We also record (in an atom) the number of times that process-bounce-record fails to update our email status table in MySQL.

Looking at this code today, particularly the parts I'm not showing you, it could be cleaned up quite a bit and made more idiomatic. It gets the job done, however, and it has been happily processing close to half a million lines of CSV every day for about four months in production. It's simple, fairly elegant and reasonably fast.
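For readers who want to picture the accumulation step, here is a minimal sketch of what a count-email-style function could look like. This is not the production code, which isn't shown in this post: it assumes the indicator is one of the keywords :delivered or :bounced, and that anything else leaves the counts alone. Both assumptions are mine.

```clojure
;; Hypothetical stand-in for count-email; the real function is not shown here.
;; Assumes the indicator is the keyword :delivered or :bounced.
(defn count-email
  "Given a [delivered bounced] pair and an indicator, return the updated pair."
  [[delivered bounced] indicator]
  (case indicator
    :delivered [(inc delivered) bounced]
    :bounced   [delivered (inc bounced)]
    ;; Unknown indicators leave the counts unchanged (an assumption).
    [delivered bounced]))

;; Fold a few made-up indicators, the way the reduce call does for each file.
(def counts
  (reduce count-email [0 0] [:delivered :bounced :delivered :delivered]))
;; counts is [3 1]
```

Because reduce is eager, it consumes the whole sequence produced by map before with-open closes the reader, so the file is read completely while it is still open.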