Friday, June 29, 2012

Progress of sorts

I had two goals in mind when I started thinking about this project. The first was to process only the newest haiku into searchable text and append it to the existing archive. Previously I had been processing a folder containing all the haiku I had written to date. That's fine in January, but by 29 June I've already written over 500! I reasoned that this should be fairly trivial to implement and that it would be far more efficient than re-processing older haiku day after day. The second was to eliminate the manual step currently required to initiate the processing of new haiku - I wanted the act of sending an email to a dedicated folder to kick this off instead. Right now I have to execute a shell script to do this. Acting upon a change to a file is something Puppet is very good at; hence the name of this blog.

I'm happy to report that I've mostly accomplished the first goal. Thanks to procmail, it was easy to write a rule that creates a new mailbox with a solitary purpose: to hold new, unprocessed email. Once processed, the mailbox can be discarded. I was already using procmail to create a folder for all of the haiku I'd written since the beginning of the year, whose messages are converted to individual HTML files by Hypermail. It was only a matter of adding three lines to my original shell script to process the new folder as well. I then modified my newer script to perform post-processing on only the newly-created HTML files and to append the results to the existing archive.

Therein lies the rub: I'm still kicking off a script to update the archive. It's much quicker now that it only processes the haiku written since its last execution, but it would be so nice to have that already-completed act of sending an email initiate this process. And I refuse to believe that it can't be done.
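In shell terms, the three-line addition to the original script amounts to something like the following sketch; the mailbox and directory names here are placeholders rather than the ones I actually use, and it relies on Hypermail's -m (mailbox to read) and -d (output directory) options:

NEWBOX=$HOME/Mail/haiku-unprocessed    # the one-shot mailbox procmail now fills
NEWDIR=$HOME/haiku/new-html            # Hypermail output for this run only
# convert only the new, unprocessed messages; do nothing if there are none
[ -s "$NEWBOX" ] && hypermail -m "$NEWBOX" -d "$NEWDIR"

The newer script then post-processes just the files in that output directory and appends the resulting text to the archive, instead of re-flattening the whole year.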

Thursday, June 28, 2012

Or maybe not...

Something unexpected happened when I went to a Puppet forum and asked for guidance regarding the "puppetization" of my haiku archive. I was advised against it. By a number of people. Unanimously. Among the recommendations: write a procmail rule, inject the haiku into a local database, and so on. I may have to go back to the proverbial drawing board - not that there was too much written there anyway...

Sunday, June 24, 2012

Getting Started

Well, my first step was to get puppet installed on my desktop. As my home network consists of only a desktop and a netbook, I decided to make my desktop both the puppet master and a puppet client. The Puppet Labs website discourages users from using the version of Puppet provided by their Linux distribution of choice (mine being CrunchBang, a light-weight Debian Stable variation), so I created a puppet.list file under /etc/apt/sources.list.d consisting of only the following line:
deb http://apt.puppetlabs.com squeeze main
The subsequent "sudo apt-get update" complained about an invalid GPG key, but I ignored it. I ran "apt-cache search puppet" and noticed the puppetlabs-release package; installing it put the GPG key in place and created an apt repo file of its own called puppetlabs.list. I deleted my initial repo file and reran the apt-get update, this time without errors. Then I installed the puppet and puppetmaster packages. The familiar /etc/puppet directory structure was there, but there were no files under manifests, modules or templates. This truly was starting from scratch.
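Compressed into a single shell sequence, the steps above look roughly like this - a sketch of what I did rather than a transcript of my terminal session:

# add the Puppet Labs repo by hand, then let puppetlabs-release take over
echo 'deb http://apt.puppetlabs.com squeeze main' | sudo tee /etc/apt/sources.list.d/puppet.list
sudo apt-get update                        # complains about the GPG key
sudo apt-get install puppetlabs-release    # installs the key and puppetlabs.list
sudo rm /etc/apt/sources.list.d/puppet.list
sudo apt-get update                        # clean this time
sudo apt-get install puppet puppetmaster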

Deep Breath...

Puppet and Poetry

I have written literally thousands of haiku since the mid-90s. Most of them are archived, thanks to a program called Hypermail, which converts individual emails to individual HTML files. Ever since September 1999 I have been emailing my haiku to a special email address, which, in turn, saves them in a dedicated email folder. Every so often I'll execute a shell script to process the haiku in that folder using the aforementioned Hypermail. I put the resulting HTML files in a directory and then ftp them to my website so that I can view the archived haiku at www.haikupoet.com/archive/.
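The script itself has always been short. Stripped of specifics - the paths, the login details and the hostname below are placeholders, not the real ones - it boils down to this:

MAILBOX=$HOME/Mail/haiku       # the dedicated folder the special address files into
HTMLDIR=$HOME/haiku/archive    # one HTML file per emailed haiku

# convert the folder's messages to HTML pages
hypermail -m "$MAILBOX" -d "$HTMLDIR"

# upload the generated pages to the website's archive directory
cd "$HTMLDIR"
ftp -n ftp.example.com <<'EOF'
user USERNAME PASSWORD
prompt
cd archive
mput *.html
EOF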

You'll notice that the archive starts at the beginning of the year. That's because Hypermail runs more slowly as more files are added to the folder. I got around this by renaming the archive folder at the end of each year and starting from scratch on January 1st. I have a separate folder for each year dating back to 1999. This made it easy to browse archived haiku by date or by subject (which, in my case, is also the first line). But if I wanted to search for a word appearing anywhere else in the haiku, I had to resort to onerous finds piped into greps - or primitive shell scripts.
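Searching on anything other than the date or the first line meant something along these lines, one year folder at a time (the path and the word "heron" are just examples):

# list every archived page that mentions the word, case-insensitively
find "$HOME"/haiku -name '*.html' | xargs grep -l -i 'heron'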

Not one to reinvent the wheel, I contacted David G. Lanoue, whose Kobayashi Issa website includes a nifty search feature. All 10,000 of the Haiku Master's poems are stored in a single CSV file, which is then searched using PHP code. My haiku, however, were not saved in a CSV file, but in individual HTML files stored in multiple folders.

I busied myself with writing yet another shell script to process the HTML files into a single text file. I added some post-processing using sed to translate unprintable characters and to strip extraneous text from the file. Then I taught myself just enough PHP to write a very simple search function, which I then added to my website. Victory at last!
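The shape of that script, with placeholder paths and only a representative pair of sed expressions standing in for the real clean-up rules, is roughly:

OUT=$HOME/haiku/haiku.txt
: > "$OUT"    # start the searchable text file fresh

for f in "$HOME"/haiku/archive/[0-9]*.html; do
    # strip the markup and drop blank lines, appending whatever is left
    sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' "$f" >> "$OUT"
done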

...except that I still had to email each new haiku to myself and then use Hypermail to convert it to a new HTML file; and I still had to process the resulting HTML file into a new line of text to be appended to the ever-growing file. The fact that I still write haiku daily - often several times a day - means that this is not a static archive but a living document. I couldn't help thinking that the primitive techniques used to aggregate my haiku and make them available for searching mirrored some of the challenges I saw every day in the workplace. Scope creep: what had been a simple archive had evolved into a searchable archive. Scalability: what worked for dozens or hundreds of haiku is insufficient for thousands. Maintainability: the tools being used may not be around forever, after which the whole process breaks down.

There's also the issue of execution - it's in two parts. The shell script that invokes Hypermail was written in 1999. I usually run it manually at the command line, but I used to run it via cron - that is, until I decided to make the archive searchable. Now I have another more recent script that calls the first script and then concatenates all of the HTML files created this year into a single text file. I could "automate" this by running it once a night via cron, but what if I write several haiku during the course of a day and want the archive to be as up-to-date as possible at all times? What if I don't write anything for a day or two? The cron job is running in vain. Why isn't there an easy way to sense when I've added a haiku and then append it to the existing archive without a time trigger or a manual process?
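For the record, the once-a-night version would be a single crontab entry along these lines (the script name is made up), with exactly the drawbacks just described:

# update the searchable archive at 2 a.m., whether or not anything changed
0 2 * * * $HOME/bin/update_haiku_archive.sh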

Enter Puppet Labs. Their flagship product, Puppet, is software that enables systems administrators to automate configuration management and application deployment. My employer uses it for this and more, deploying and maintaining systems and application software to hundreds of servers in a sprawling, complex enterprise. Surely it's up to the task of automating updates to my haiku archive.

So here's what it needs to do: 1) detect a new email sent to my haiku archive address, 2) convert the email into a format readable and searchable on my website, and 3) append it to the existing data. Pretty easy, huh?

To do this, I'm going to need to know puppet much better than I do now. Like most lazy sys admins (which I realize might be a redundant term), I tend to copy an existing puppet configuration file and modify it for my own use. The puppet ecosystem we use in my workplace was put together by another team and handed to us. I've never built it out from scratch.

Hypermail is still available for free download from SourceForge, but it hasn't been updated since 2004. Who knows whether it will continue to be available? Besides, now that the goal is a single repository of searchable content, there's no need for the interim step of converting individual emails to individual HTML files and then concatenating them into a text file. Instead, each email should be processed as it arrives, directly into a searchable format, and added to the existing repository.

Puppet will work with an ever-increasing number of tools, so as the technology changes, the puppet code can change with it. I'll use puppet to detect the new email and to orchestrate its inclusion into the archive. Under the hood, I have some ideas on how to replace Hypermail (ActiveMQ? RSS?), as well as alternatives to a flat text file (MySQL? NoSQL?). The PHP code would need to change in order to search a database instead of a text file, but maybe I'll use a programming language like Ruby instead.

I don't know how to do many of the things I've suggested above, but I can't wait to get started...