Sun May 29, 2005

Regexps Rule

Regular readers will know that my list of indispensable applications includes the Emacs text editor, the TeX/LaTeX typesetting system, and a whole array of ancillary utilities that make the two play nice together. The goal is to produce beautiful and maintainable documents. Also it gives Dan further opportunity to defend Microsoft Office. I am happy to admit that a love of getting the text to come out just so can lead to long-run irrationalities. The more complex the underlying document gets, the harder it is to convert it to some other format. And we all know which format we mean.

Well, yesterday morning the long run arrived: I finished the revisions to my book manuscript and it was now ready to send to the publisher for copyediting. Except for one thing. The University of Chicago Press is not interested in parsing complex LaTeX files. They are quite clear about what they want, and it isn’t unreasonable. I had a horrible vision of spending weeks manually futzing with a book’s worth of formatted text. But thanks largely to the awesome power of regular expressions, or regexps, and the availability of free tools that implement them, the whole thing was pretty painless.

There is no truly satisfactory way to convert a LaTeX document to something that retains the document’s structure and formatting and that Microsoft Word can reliably read. By far the best option is to convert to HTML first, using something like Hevea. LaTeX and HTML are both markup languages and their specifications are publicly available (unlike Microsoft’s DOC format), so the conversion tools are good. If your LaTeX document was fairly straightforward, then Hevea would probably do the trick for you, even if you fed it a book manuscript. But if, like me, you use fancy-pants LaTeX stuff like Jurabib and the Memoir Class then you are out of luck, because Hevea doesn’t know about any of that. I wanted to keep the bells and whistles because they allowed for a sane approach to the structure of the document, especially the notes. The author-year citation method is terrible for notes in a book, as is the insane ibid/idem system. Jurabib can automatically create notes that cite the full reference the first time a work is cited and a shortened reference (the author’s surname and an abbreviated title, say) thereafter. So with that as a given, Hevea was out.

The Memoir document class is a superb piece of work, and one of the many things it can do is produce manuscript-style output: that is, something that looks like it was prepared on a typewriter, set ragged right and double spaced in a monospace font with no hyphenation. My original naive hope was that Adobe Acrobat could take the PDF output in this form and produce something serviceable using its “Save as RTF” or “Save as Microsoft Word” options. Acrobat dutifully saved the PDF as a Word file, all right. It took forever, but when I opened the result in Word it looked fantastic, like a carbon copy of the original. This is because it was a carbon copy. Acrobat had simply converted every page to an image and stuck the result in a Word file! I should have known better, I suppose. Getting a structured, editable document out of a PDF file seems like a hard task: it’s been likened to producing a live pig from a packet of sausages. The best Acrobat could do was save the text, without any formatting at all, or even any spaces between the words. Not very helpful.

Things weren’t looking good. The manuscript-like output was great; I just needed a way to get it into an editable format. Then I wondered whether the manuscript could be marked up in a way that signaled the most important formatting in some obvious way. The most important formatting elements to keep were the paragraph breaks, section headings and the italics that made the notes and bibliography readable. No way was I manually typing any of those. They were all in a BibTeX database, anyway. A quick refresher course in regexps later, and the solution began to take shape. I redefined LaTeX’s section commands so that each heading carried an appropriate callout signaling what it was. I wrote simple expressions to do stuff like put three asterisks at the beginning of every paragraph and three exclamation points on either side of every italicized word.
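For the curious, a markup pass of this kind can be sketched in a few lines. This is an illustration in Python, not the expressions actually used on the manuscript. The function name `add_callouts` is made up, and the sketch assumes that italics appear as simple, un-nested `\emph{...}` or `\textit{...}` commands and that paragraphs are separated by blank lines. It also wraps each italic span as a whole rather than word by word.

```python
import re

# Matches \emph{...} or \textit{...} with no nested braces inside.
# (Assumption: the real manuscript may well have had messier cases.)
ITALIC = re.compile(r"\\(?:emph|textit)\{([^{}]*)\}")


def add_callouts(tex: str) -> str:
    """Tag each paragraph with *** and each italic span with !!!...!!!."""
    # LaTeX separates paragraphs with one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", tex) if p.strip()]
    marked = []
    for para in paragraphs:
        # Replace the italic command with plain-text callouts.
        para = ITALIC.sub(r"!!!\1!!!", para)
        # Three asterisks flag the start of a real paragraph.
        marked.append("***" + para)
    return "\n\n".join(marked)


sample = "See \\emph{Moby-Dick} for details.\n\nA second paragraph."
print(add_callouts(sample))
```

The point of such ugly markers is that they survive any conversion that preserves plain text, and they are very unlikely to occur in the text by accident.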

That just left the problem of the notes and bibliography. These are automatically generated and so are not available in the tex file anywhere to have formatting tags attached via a regexp. It turns out, though, that the remarkable BibTool has the ability to do regular expression operations on individual fields in your BibTeX database. It can do much else besides, too: it was able, for example, to extract all and only the references cited in the book, make a .bib file out of them, and then insert the relevant manual markup in the right places for books, articles and so on. I am constantly amazed by how much excellent free software there is in the world. Maybe the next book will be about that.

Having done all that, I re-created the manuscript in its new Frankenmarkup form. It was a big mess. But it had structure, which was the important thing. It turns out that Word has its own regular expression engine, and I was able to use that, along with the macro recording feature, to take out the thousands of useless paragraph symbols, reconstruct the real paragraphs and formatted text, put the section headings in, reconstitute superscripted note numbering and formatted notes, and create the bibliography right down to the six hyphens preferred when citing multiple works by the same author. I did use Hevea for one thing: translating the four or five tables into HTML. Some of them were quite complex, but it worked perfectly. Word read them without a problem and its ‘Autoformat’ feature gave them a standard look.
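The reverse pass follows the same logic, whatever tool performs it. The sketch below is in Python and emits HTML, so it is not the Word search-and-replace that did the real work. It assumes the `***` and `!!!` markers described earlier, treats every line break that is not followed by `***` as wrapping noise, and uses a made-up function name, `rebuild`.

```python
import html
import re


def rebuild(marked: str) -> str:
    """Turn ***-delimited, hard-wrapped text back into real paragraphs."""
    # Every *** starts a genuine paragraph; other line breaks are noise.
    chunks = [c for c in marked.split("***") if c.strip()]
    out = []
    for chunk in chunks:
        # Collapse hard line breaks and runs of spaces into single spaces.
        text = " ".join(chunk.split())
        # Escape characters that are special in HTML before adding tags.
        text = html.escape(text, quote=False)
        # Restore italics from the !!!...!!! callouts (non-greedy match).
        text = re.sub(r"!!!(.+?)!!!", r"<i>\1</i>", text)
        out.append(f"<p>{text}</p>")
    return "\n".join(out)


marked = "***See !!!Moby-Dick!!! for\ndetails.\n***A second\nparagraph.\n"
print(rebuild(marked))
```

Headings, note numbers and bibliography entries would each need their own marker and their own substitution, but each one is the same kind of find-and-replace.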

The whole thing went far more smoothly, and far faster, than I had any right to expect. In retrospect, of course, I’d urge LaTeX dweebs to lay off the bells and whistles when producing long manuscripts. It’s a waste of time. But so is watching TV, and you probably do a lot of that, too. If you have a standard book or article with little in the way of add-ons, it should be relatively easy to convert it to HTML. (Daniel will be along momentarily in the comments to tell you to start using Word in the first place.) But even a complex document can be handled quickly and with relatively little fuss.

I should add, by the way, that the people at Chicago have been absolutely great throughout the process, and I’ve known for ages that the Day of Reckoning would come when I had to convert the manuscript. I also know that, in the end, Chicago’s designers and typesetters will do a much better job than me when it comes to producing a good-looking book. They really know how to do it, whereas all my efforts to get TeX to do nice things are really just amateurish messing about.

The result is now sitting in a burn folder on my Desktop. The Word files look uglier than a NASCAR driver afraid of losing to a girl, but they conform to the submission guidelines and the whole thing is ready to be put in the post first thing after Memorial Day.