Sunday, June 24, 2012

Puppet and Poetry

I have written literally thousands of haiku since the mid-90s. Most of them are archived, thanks to a program called Hypermail, which converts individual emails into individual HTML files. Ever since September 1999 I have been emailing my haiku to a special email address, which in turn saves them in a dedicated email folder. Every so often I'll execute a shell script that processes the haiku in that folder with the aforementioned Hypermail. I put the resulting HTML files in a directory and then FTP them to my website, so I can view the archived email at www.haikupoet.com/archive/.
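To give a sense of the moving parts, here's a minimal sketch of that periodic run. The paths, folder names, and upload step are placeholders rather than my actual script; the heart of it is the Hypermail call, which takes a mail folder in via -m and writes a directory of HTML files via -d.

    #!/bin/sh
    # Sketch of the archive run (all paths are placeholders).
    MBOX="$HOME/mail/haiku-2012"      # folder the haiku address delivers to
    OUTDIR="$HOME/archive/2012"       # one directory of HTML files per year

    # Hypermail converts each message into its own HTML file, plus
    # date- and subject-sorted index pages.
    hypermail -m "$MBOX" -d "$OUTDIR" -l "Haiku Archive 2012"

    # Upload step goes here (ftp/scp/rsync to www.haikupoet.com/archive/).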

You'll notice that the archive starts at the beginning of the year. That's because Hypermail runs more slowly as more files accumulate in the folder, so I got around it by renaming the archive folder at the end of each year and starting from scratch on January 1st. I have a separate folder for each year dating back to 1999. This made it easier to browse archived haiku by date or by subject (which, for my haiku, is also the first line). But if I wanted to search on a word appearing anywhere else in a haiku, I had to resort to onerous finds piped into greps - or primitive shell scripts.
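The kind of thing I mean by "onerous": to find every haiku containing a given word, I'd end up running something like the one-liner below against every yearly folder (the paths and the search term are just examples).

    # List every archived HTML file, in any year, that mentions "heron".
    find ~/archive -type f -name '*.html' -exec grep -il 'heron' {} +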

Not one to reinvent the wheel, I contacted David G. Lanoue, whose Kobayashi Issa website includes a nifty search feature. All 10,000 of the Haiku Master's poems are saved in a single CSV file, which is then searched using PHP code. My haiku, however, were not in a CSV file, but in individual HTML files stored in multiple folders.

I busied myself with writing yet another shell script to process the HTML files into a single text file. I added some post-processing with sed to translate unprintable characters and strip extraneous text from the file. Then I taught myself just enough PHP to write a very simple search function, which I added to my website. Victory at last!
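The consolidation script amounted to something like the sketch below. The file names and sed expressions are illustrative stand-ins for the real ones, but the shape is the same: concatenate the per-message HTML files, strip the markup and extraneous text, and write one big text file for the PHP search to read.

    #!/bin/sh
    # Sketch of the consolidation step (file names and sed expressions
    # are examples, not the exact ones I use): concatenate the message
    # files, strip tags and a couple of entities, drop blank lines.
    ARCHIVE="$HOME/archive"
    OUT="$HOME/www/haiku.txt"

    cat "$ARCHIVE"/*/[0-9]*.html \
      | sed -e 's/<[^>]*>//g' -e 's/&nbsp;/ /g' -e 's/&amp;/\&/g' \
      | sed -e '/^[[:space:]]*$/d' \
      > "$OUT"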

...except that I still had to email each new haiku to myself and then use Hypermail to convert it to a new HTML file, and I still had to process the resulting HTML file into a new line of text to be appended to the ever-growing file. The fact that I still write haiku daily - often several times a day - means that this is not a static archive but a living document. I couldn't help thinking that the primitive techniques used to aggregate my haiku and make them searchable mirrored some of the challenges I saw every day in the workplace. Scope creep: what had been a simple archive had evolved into a searchable archive. Scalability: what worked for dozens or hundreds of haiku is insufficient for thousands. Maintainability: the tools being used may not be around forever, after which the whole process breaks down.

There's also the issue of execution, which now happens in two parts. The shell script that invokes Hypermail was written in 1999. I usually run it manually at the command line, though I used to run it via cron - that is, until I decided to make the archive searchable. Now I have a second, more recent script that calls the first one and then concatenates all of the HTML files created this year into a single text file. I could "automate" this by running it once a night via cron, but what if I write several haiku over the course of a day and want the archive to be as up to date as possible at all times? What if I don't write anything for a day or two? Then the cron job runs in vain. Why isn't there an easy way to sense when I've added a haiku and append it to the existing archive, without a time trigger or a manual process?
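For the record, the two-part execution and the cron option I decided against look roughly like this; the script names, paths, and schedule are placeholders, not the actual files.

    #!/bin/sh
    # Sketch of the newer wrapper script (placeholder name:
    # update-haiku-archive.sh): run the 1999 Hypermail script, then
    # rebuild this year's slice of the searchable text file.
    "$HOME/bin/hypermail-1999.sh"
    cat "$HOME/archive/2012"/[0-9]*.html \
      | sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' \
      > "$HOME/www/haiku-2012.txt"

    # The crontab entry I could use, if once a night were good enough:
    # 5 0 * * * $HOME/bin/update-haiku-archive.sh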

Enter Puppet Labs. Their flagship product, Puppet, is software that lets systems administrators automate configuration management and application deployment. My employer uses it for this and more, deploying and maintaining system and application software across hundreds of servers in a sprawling, complex enterprise. Surely it's up to the task of automating updates to my haiku archive.

So here's what it needs to do: 1) detect a new email sent to my haiku archive address, 2) convert the email into a format readable and searchable on my website, and 3) append it to the existing data. Pretty easy, huh?
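To make those three steps concrete, here's a very rough sketch of how they could hang together, assuming the haiku address delivers each message as its own file into a Maildir-style "new/" directory and the archive stays a one-record-per-line text file. Every path, field layout, and detail below is hypothetical - it's the shape of the job, not the solution.

    #!/bin/sh
    # Hypothetical sketch of steps 1-3.
    NEWDIR="$HOME/Maildir/haiku/new"
    ARCHIVE="$HOME/www/haiku.txt"

    for msg in "$NEWDIR"/*; do
        [ -f "$msg" ] || continue                            # 1) new email detected
        subject=$(sed -n 's/^Subject: //p' "$msg" | head -n 1)
        body=$(sed '1,/^$/d' "$msg" | paste -s -d '/' -)     # 2) headers off, body on one line
        printf '%s\t%s\n' "$subject" "$body" >> "$ARCHIVE"   # 3) append to the archive
        mv "$msg" "$HOME/Maildir/haiku/cur/"                 # don't process it twice
    done

What the sketch doesn't address is the detection itself - the part I'd like Puppet to orchestrate, rather than a timer or my own hands.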

To do this, I'm going to need to know Puppet much better than I do now. Like most lazy sysadmins (which I realize might be a redundant term), I tend to copy an existing Puppet configuration file and modify it for my own use. The Puppet ecosystem we use at my workplace was put together by another team and handed to us; I've never built one out from scratch.

Hypermail is still available for free download from SourceForge, but it hasn't been updated since 2004. Who knows how long it will remain available? Besides, now that the goal is a single repository of searchable content, there's no need for the interim step of converting individual emails into individual HTML files and then concatenating them into a text file. Instead, each email should be processed as it arrives, directly into a searchable format, and added to the existing repository.

Puppet works with an ever-increasing number of tools, so as the technology changes, the Puppet code can change with it. I'll use Puppet to detect the new email and to orchestrate its inclusion in the archive. Under the hood, I have some ideas on how to replace Hypermail (ActiveMQ? RSS?), as well as alternatives to a flat text file (MySQL? NoSQL?). The PHP code would need to change in order to search a database instead of a text file, but maybe I'll switch to a language like Ruby instead.

I don't know how to do many of the things I've suggested above, but I can't wait to get started...
