Update: NATURALLY, after doing all of this I learned that I was given the wrong information. Turns out Mailman is more than happy to take a huge mbox file as input for the arch script. I did learn that running cleanarch on the mbox first is a good idea…
Are you like me? Do you get upset when you have to deal with an almost decade-old problem that you had nothing to do with? Well then I’ve got a story for you…
So we’ve got these archives…
Our organization was using Lyris ListServ for about the past 10 years to handle all of our discussion lists. Like most MLMs, ListServ does have the ability to keep list archives…but naturally we opted not to use them for all of our lists. Big mistake.
Instead of list archives we have a user on our webserver called ‘archive’. Archive is subscribed to each and every list and gets copies of all the messages. When the messages come in, Archive processes them with a procmail script and separates them into mbox mailboxes for each of the lists.
Each message is then piped to a program called mhonarc, which converts it into HTML and provides an index, etc – which can be displayed on our current (old) website. But now we’ve got a new website coming up…
Enter GNU Mailman
Me:
Thank you for coming Mailman. I’m really glad to have you because you do a really good job, not to mention you’re free and uber-powerful… One thing though… We’ve got these uh, gulp “archives”. We, uh, need to keep them and everything but you know, they’re like, not in the greatest shape. See, they’re actually in mbox format…
Mailman:
Oh yeah, that’s not a problem at all. I’ve got a built-in script to do that. Just take all of the monthly mailbox files for each of your lists and drop them in my folder – I’ll knock them out in no time!
Me:
SWEET! But what did you say about monthly whatevers?
Mailman:
The mailbox files that you create every month for each list… You are using your procmail script to start a new mailbox file every month, aren’t you? Putting 10 years’ worth of emails into a single monolithic file would be ridiculous…
Me:
Oh yeah yeah… Of course we did that. I thought you were talking about something else. Silly me. Anyway, so uh, yeah, I’ll get those files to you real soon.
Breaking up the monolith
So clearly I needed to edit the procmail script a little bit and reprocess all the mail – but WTH?
Last month I had a help desk ticket come my way about a list which was not appearing on the website, and hadn’t been for a number of months. After digging around, I realized that someone (almost certainly me) had made a mistake in the .procmailrc file which had kept it from processing mail for that list. This was embarrassing, but I discovered how to reprocess mail with formail.
So I knew I could probably reprocess all of the mail (many thousands of messages), but I had absolutely no clue how to split it into monthly files along the way. I had only this one clue from my procmail recipe:
LOGFILE=$PMDIR/list_archive-`date +%Y-%m`.log
They had written it to rotate the logs, but not the mailbox names! Uncool! But at least I had my answer – `date +%Y-%m` can get the date into the names…But wait!
When I rewrote my procmail file like so:
# this is just one of many
:0 E
* ^Sender:.*LIST-L
{
    # to match Mailman's archive format...
    :0 c
    LIST-L.`date +%Y%m`
}
Totally not working! It created only one file, and the date was this month and this year.
I’ll fast-forward for the benefit of the reader at this point and just share an insight with you: date is a *nix command and has nothing to do with procmail and cannot get any data out of the emails themselves – like dates! Yes that’s right, you can put anything you want in between those little ticks, but because the date command only returns the system date, we’ve got to do two things:
- Get the date the email was sent out of the email header (Magic)
- Process the date field to have only the 4 digit year and 2 digit month (More Magic)
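If you want to see the problem for yourself, here's a minimal sketch. The header line is made up for illustration, and the awk field number assumes the weekday is present in the Date: header.

```shell
#!/bin/sh
# A made-up header line; the date is hypothetical.
msg='Date: Mon, 12 May 2003 10:15:00 -0400'

# date never reads its standard input, so this prints the current year
# and month no matter which message is piped in.
from_system=$(printf '%s\n' "$msg" | date +%Y%m)

# The year we actually want is sitting in the header text itself
# (fifth whitespace-separated field when the weekday is present).
from_header=$(printf '%s\n' "$msg" | awk '{ print $5 }')

echo "system clock says: $from_system"
echo "header says year:  $from_header"
```

The first value changes with the calendar; the second always comes from the message, which is what the mailbox name needs.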
It gets easier from here…
Getting the date field
The first thing we need to do is nab the Date: header from the emails. This part is fairly straightforward. Procmail has a variable, $MATCH, which holds whatever a condition matched to the right of the \/ token. We can use this to hold our Date header and then just pipe it to a script for processing.
Here’s the recipe magic!
# NOTE: I later found that this rule only seems to work
# when used as an ELSE rule (:0 E). I'm not sure why, but
# otherwise it was only matching the 'Date:' portion, and not the
# entire line. If you can help me understand why, please leave a comment.
# ANOTHER NOTE: The ticks in `echo $MATCH.... are backticks (the
# unshifted tilde key) and not single quotes.
:0 E
* ^Sender:.*LIST-L
{
    :0 c
    * ^\/Date:.*
    LIST-L.`echo $MATCH | php /path/to/dateconvert.php`
}
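If you'd like to check outside of procmail what that condition ought to be capturing, here's a rough shell sketch. The helper name and the sample message are my own inventions for illustration; this only mimics the idea of grabbing the whole Date: line from the headers, it is not how procmail does it internally.

```shell
#!/bin/sh
# Hypothetical helper: print the first Date: header of one message read
# from stdin, stopping at the blank line that ends the headers.
first_date_header() {
    awk '/^$/ { exit } /^Date:/ { print; exit }'
}

# A made-up message for illustration.
sample='From archive@example.org Mon May 12 10:15:00 2003
Sender: owner-LIST-L@example.org
Date: Mon, 12 May 2003 10:15:00 -0400
Subject: test

Date: this line is in the body and should be ignored'

printf '%s\n' "$sample" | first_date_header
```

That prints the full header line, "Date: Mon, 12 May 2003 10:15:00 -0400", which is the string the date script expects on its standard input.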
Transforming the date into Mailman’s monthly format
Here’s the PHP code (more magic) to get your date cut down and switched around to a format Mailman will love. It uses PHP’s built-in functions (fairy dust) to split the Date header into an array and drop the empty elements. It was also necessary to create an array that maps the three-letter month names used in the email header to their numerical equivalents.
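<?php
// Read the header line piped in by procmail, e.g.
// "Date: Mon, 12 May 2003 10:15:00 -0400".
$date = trim(fgets(STDIN));
// Split on spaces and drop the empty elements left by repeated spaces.
$datearray = array_values(array_filter(explode(" ", $date)));
// Element 0 is "Date:", 1 the weekday, 2 the day, 3 the month, 4 the year.
// This assumes the weekday is present in the header.
$month = $datearray[3];
$year = $datearray[4];
// Map the three-letter month names to two-digit numbers.
$montharray = array(
    "Jan" => "01",
    "Feb" => "02",
    "Mar" => "03",
    "Apr" => "04",
    "May" => "05",
    "Jun" => "06",
    "Jul" => "07",
    "Aug" => "08",
    "Sep" => "09",
    "Oct" => "10",
    "Nov" => "11",
    "Dec" => "12"
);
// Print YYYYMM with no trailing newline, ready to append to the mailbox name.
echo $year . $montharray[$month];
?>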
At this point you’re probably thinking one of two things:
- “ZOMG you’re such a hack. You could have done that with so much less code! You have no style!”
- “ZOMG you can totally do that with a sed/awk one liner!”
Sorry for wasting everyone’s time yet again.
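For the awk crowd, here's roughly what that one-liner could look like, stretched out so it's readable. This is a sketch, not what I actually ran: the function name is hypothetical, and it assumes four-digit years and English month abbreviations in the Date: header.

```shell
#!/bin/sh
# Hypothetical helper: read one "Date:" header line on stdin, print YYYYMM.
date_to_yyyymm() {
    awk '{
        # Find the first field that is a three-letter month name; the year
        # is the field right after it. This also copes with a missing weekday.
        for (i = 1; i < NF; i++) {
            m = index("JanFebMarAprMayJunJulAugSepOctNovDec", $i)
            if (length($i) == 3 && m % 3 == 1) {
                printf "%s%02d\n", $(i + 1), (m + 2) / 3
                exit
            }
        }
    }'
}

echo 'Date: Mon, 12 May 2003 10:15:00 -0400' | date_to_yyyymm   # prints 200305
```

Unlike the fixed array positions in the PHP version, this looks for the month name wherever it sits, so a header without the weekday still works. A header with a two-digit year would still come out wrong.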
After you’ve got the recipe in place and the php file all ready to go, just give it one of these:
formail -s procmail -m /path/to/yourprocmailfile < /path/to/LIST-L
Final product
Anyway, you should, after a few hours or days, end up with a grip of mailbox files like this:
               LIST-L.200306  LIST-L.200502  LIST-L.200701
               LIST-L.200307  LIST-L.200503  LIST-L.200702
LIST-L.200112  LIST-L.200308  LIST-L.200504  LIST-L.200703
LIST-L.200201  LIST-L.200309  LIST-L.200505  LIST-L.200704
LIST-L.200202  LIST-L.200310  LIST-L.200506  LIST-L.200705
LIST-L.200203  LIST-L.200311  LIST-L.200507  LIST-L.200707
LIST-L.200204  LIST-L.200312  LIST-L.200509  LIST-L.200708
LIST-L.200205  LIST-L.200401  LIST-L.200511  LIST-L.200709
LIST-L.200206  LIST-L.200402  LIST-L.200512  LIST-L.200710
LIST-L.200207  LIST-L.200403  LIST-L.200601  LIST-L.200711
LIST-L.200208  LIST-L.200404  LIST-L.200602  LIST-L.200712
LIST-L.200209  LIST-L.200405  LIST-L.200603  LIST-L.200801
LIST-L.200210  LIST-L.200406  LIST-L.200604  LIST-L.200802
LIST-L.200211  LIST-L.200407  LIST-L.200605  LIST-L.200803
LIST-L.200212  LIST-L.200408  LIST-L.200606  LIST-L.200804
LIST-L.200301  LIST-L.200409  LIST-L.200607  LIST-L.200805
LIST-L.200302  LIST-L.200410  LIST-L.200608  LIST-L.200806
LIST-L.200303  LIST-L.200411  LIST-L.200610  LIST-L.200807
LIST-L.200304  LIST-L.200412  LIST-L.200611
LIST-L.200305  LIST-L.200501  LIST-L.200612
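One sanity check worth doing before handing these over: the monthly files together should hold the same number of messages as the original monolith. Here's a small sketch for that. The helper name is hypothetical and the paths in the comments are placeholders. It counts mbox "From " separator lines, so any unescaped "From " at the start of a body line will inflate the numbers.

```shell
#!/bin/sh
# Hypothetical helper: count messages across one or more mbox files by
# counting the "From " separator lines.
count_messages() {
    cat "$@" | grep -c '^From '
}

# Usage (paths are placeholders):
#   count_messages /path/to/LIST-L
#   count_messages /path/to/LIST-L.[0-9]*
```

If the two numbers disagree, stray "From " lines in message bodies are a likely suspect.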
So there ya go. Now make with the comments.


