How To: Convert mbox to Mailman archives using procmail

Posted by – December 10, 2008

Update: NATURALLY, after doing all of this I learned that I had been given the wrong information. It turns out Mailman is more than happy to take one huge mbox file as input for its arch script. I did learn that running cleanarch on the mbox first is a good idea…

Are you like me? Do you get upset when you have to deal with an almost decade-old problem that you had nothing to do with? Well then I’ve got a story for you…

So we’ve got these archives…

Our organization used Lyris ListServ for about the past 10 years to handle all of our discussion lists. Like most MLMs, ListServ can keep list archives… but naturally we opted not to use them for all of our lists. Big mistake.

Instead of list archives we have a user on our webserver called ‘archive’. Archive is subscribed to each and every list and gets copies of all the messages. When the messages come in, Archive processes them with a procmail script and separates them into mbox mailboxes for each of the lists.

Each mail is then piped to a program called MHonArc, which converts it into HTML and provides an index, etc., which can be displayed on our current (old) website. But now we’ve got a new website coming up…

Enter GNU Mailman

Me:

Thank you for coming, Mailman. I’m really glad to have you because you do a really good job, not to mention you’re free and uber-powerful… One thing though… We’ve got these uh, gulp, “archives”. We, uh, need to keep them and everything but you know, they’re like, not in the greatest shape. See, they’re actually in mbox format…

Mailman:

Oh yeah, that’s not a problem at all. I’ve got a built-in script to do that. Just take all of the monthly mailbox files for each of your lists and drop them in my folder and I’ll knock them out in no time!

Me:

SWEET! But what did you say about monthly whatevers?

Mailman:

The mailbox files that you create every month for each list… You are using your procmail script to start a new mailbox file every month, aren’t you? Putting 10 years’ worth of emails into a single monolithic file would be ridiculous…

Me:

Oh yeah yeah… Of course we did that. I thought you were talking about something else. Silly me. Anyway, so uh, yeah, I’ll get those files to you real soon.

Breaking up the monolith

So clearly I needed to edit the procmail script a little bit and reprocess all the mail – but WTH?

Last month I had a help desk ticket come my way about a list which was not appearing on the website, and hadn’t been for a number of months. After digging around, I realized that someone (almost certainly me) had made a mistake in the .procmailrc file which had kept it from processing mail for that list. This was embarrassing, but I discovered how to reprocess mail with formail.

I knew I could probably reprocess all of the mail (many thousands of messages) but had absolutely no idea how to split it up by month. I had only this one clue from my procmail recipe:

<pre>
LOGFILE=$PMDIR/list_archive-`date +%Y-%m`.log
</pre>

They had written it to rotate the logs, but not the mailbox names! Uncool! But at least I had my answer – `date +%Y-%m` can get the date into the names…But wait!

When I rewrote my procmail file like so:

<pre>
#this is just one of many
:0 E
     * ^Sender:.*LIST-L
     {
       :0 c
       LIST-L.`date +%Y%m` #to match mailmans archive format...
     }
</pre>

Totally not working! It created only one file, and the date was this month and this year.

I’ll fast-forward for the benefit of the reader at this point and just share an insight with you: date is a *nix command. It has nothing to do with procmail and cannot get any data, like dates, out of the emails themselves! Yes that’s right, you can put any shell command you want in between those little backticks, but because the date command only returns the system date, we’ve got to do two things:

  1. Get the date the email was sent out of the email header (Magic)
  2. Process the date field to have only the 4 digit year and 2 digit month (More Magic)
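If you want to see the problem for yourself outside of procmail, here’s a minimal sketch. The sample header is made up for illustration; the only point is that date never reads what you pipe into it, so both values come out as the current year and month.

```shell
#!/bin/sh
# Made-up sample header, standing in for a real message's Date: line.
header='Date: Mon, 3 Jan 2005 08:30:00 -0500'

# date ignores its standard input, so piping the header in changes nothing.
from_pipe=$(printf '%s\n' "$header" | date +%Y%m)

# The same call with nothing piped in at all.
from_clock=$(date +%Y%m)

echo "piped: $from_pipe  clock: $from_clock"
```

Both numbers match the system clock, which is exactly why the recipe above produced a single file named for the current month.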

It gets easier from here…

Getting the date field

The first thing we need to do is nab the Date: header from the emails. This part is fairly straightforward. When a condition contains the \/ token, procmail stores whatever matched to the right of it in the variable $MATCH. We can use this to hold our Date header and then just pipe it to a script for processing.

Here’s the recipe magic!

<pre>
# NOTE: I later found that this rule only seems to work
# when used as an ELSE rule (:0 E). I'm not sure why, but
# it was only matching the 'Date:' portion, and not the entire
# line. If you can help me understand why, please leave a comment.
# ANOTHER NOTE: The ticks in `echo $MATCH.... are backticks (the
# un-shifted tilde key) and not single quotes.

:0 E
     * ^Sender:.*LIST-L
     {
       :0 c
       * ^\/Date:.*
       LIST-L.`echo $MATCH | php /path/to/dateconvert.php`
     }
</pre>

Transforming the date into Mailman’s monthly format

Here’s the PHP code (more magic) to get your date cut down and switched around to a format Mailman will love. It uses PHP’s built-in functions (fairy dust) to put the Date header into an array and drop the empty elements. It was also necessary to create an array that maps the three-letter month names used in the email header to their numerical equivalents.

<pre>
<?php
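 // Read the Date: header line that procmail pipes in via $MATCH.
 $date = trim(fgets(STDIN));
 
 // Split on spaces; array_filter() drops the empty strings left by double spaces.
 $datearray = array_values(array_filter(explode(" ", $date)));
 
 // Assumes the form "Date: Wed, 10 Dec 2008 ...": [3] is the month, [4] the year.
 $month = $datearray[3];
 $year = $datearray[4];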
 
 $montharray = array(
 "Jan" => "01",
 "Feb" => "02",
 "Mar" => "03",
 "Apr" => "04",
 "May" => "05",
 "Jun" => "06",
 "Jul" => "07",
 "Aug" => "08",
 "Sep" => "09",
 "Oct" => "10",
 "Nov" => "11",
 "Dec" => "12"
 );
 
 echo $year . $montharray[$month];
?>
</pre>

At this point you’re probably thinking one of two things:

  1. “ZOMG you’re such a hack. You could have done that with so much less code! You have no style!”
  2. “ZOMG you can totally do that with a sed/awk one liner!”

Sorry for wasting everyone’s time yet again.
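For the record, commenter number 2 has a point. Here is a rough sketch of what the awk version might look like. The function name date_to_yyyymm is made up for illustration, and it assumes a four-digit year sits right after a three-letter English month name. Unlike the PHP above, it looks for the month name instead of counting fields, so it should also cope with Date headers that leave out the weekday.

```shell
#!/bin/sh
# Hypothetical helper: read one "Date:" header line on stdin, print YYYYMM.
date_to_yyyymm() {
  awk '{
    # Scan for the first field that is a three-letter month name;
    # the field right after it is taken to be the year.
    for (i = 1; i < NF; i++) {
      pos = index("JanFebMarAprMayJunJulAugSepOctNovDec", $i)
      if (length($i) == 3 && pos > 0 && (pos - 1) % 3 == 0) {
        # pos is 1 for Jan, 4 for Feb, ... 34 for Dec.
        printf "%s%02d\n", $(i + 1), int((pos + 2) / 3)
        exit
      }
    }
  }'
}

# Example with a made-up header:
printf 'Date: Wed, 10 Dec 2008 09:15:00 -0500\n' | date_to_yyyymm
```

In principle something like this could stand in for the php call in the recipe, but that swap is untested here, so treat it as a starting point.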

After you’ve got the recipe in place and the php file all ready to go, just give it one of these:



<pre>
formail -s procmail -m /path/to/yourprocmailfile < /path/to/LIST-L
</pre>

Final product

Anyway, after a few hours or days you should end up with a grip of mailbox files like this:

<pre>
               LIST-L.200306  LIST-L.200502  LIST-L.200701
               LIST-L.200307  LIST-L.200503  LIST-L.200702
LIST-L.200112  LIST-L.200308  LIST-L.200504  LIST-L.200703
LIST-L.200201  LIST-L.200309  LIST-L.200505  LIST-L.200704
LIST-L.200202  LIST-L.200310  LIST-L.200506  LIST-L.200705
LIST-L.200203  LIST-L.200311  LIST-L.200507  LIST-L.200707
LIST-L.200204  LIST-L.200312  LIST-L.200509  LIST-L.200708
LIST-L.200205  LIST-L.200401  LIST-L.200511  LIST-L.200709
LIST-L.200206  LIST-L.200402  LIST-L.200512  LIST-L.200710
LIST-L.200207  LIST-L.200403  LIST-L.200601  LIST-L.200711
LIST-L.200208  LIST-L.200404  LIST-L.200602  LIST-L.200712
LIST-L.200209  LIST-L.200405  LIST-L.200603  LIST-L.200801
LIST-L.200210  LIST-L.200406  LIST-L.200604  LIST-L.200802
LIST-L.200211  LIST-L.200407  LIST-L.200605  LIST-L.200803
LIST-L.200212  LIST-L.200408  LIST-L.200606  LIST-L.200804
LIST-L.200301  LIST-L.200409  LIST-L.200607  LIST-L.200805
LIST-L.200302  LIST-L.200410  LIST-L.200608  LIST-L.200806
LIST-L.200303  LIST-L.200411  LIST-L.200610  LIST-L.200807
LIST-L.200304  LIST-L.200412  LIST-L.200611
LIST-L.200305  LIST-L.200501  LIST-L.200612
</pre>
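One way to check that nothing got lost along the way is to compare message counts between the original mbox and the monthly files. This is only a rough sketch: the function names are made up for illustration, and it assumes every message starts with a "From " separator line and that no message body contains an unescaped line starting with "From " (which would inflate the count).

```shell
#!/bin/sh
# Count messages in one mbox file by counting "From " separator lines.
# grep exits non-zero when the count is 0, but it still prints the 0.
count_msgs() {
  grep -c '^From ' "$1" || true
}

# Sum the counts over every monthly file that shares a prefix,
# e.g. /path/to/LIST-L matches /path/to/LIST-L.200112 and friends.
count_split() {
  total=0
  for f in "$1".[0-9][0-9][0-9][0-9][0-9][0-9]; do
    [ -f "$f" ] || continue
    total=$((total + $(count_msgs "$f")))
  done
  echo "$total"
}
```

If count_msgs on the original file and count_split on the monthly files print the same number, the split probably went fine. If they differ, a message without a usable Date header is a likely suspect.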

So there ya go. Now make with the comments.
