Sat Dec 11, 2004

Indispensable Applications

Picking up on an old item over at 43 Folders (this post has been marinading for a while), here’s a discussion of the applications and tools I use to get work done. I do get work done, sometimes. Honestly.

I’ll give you two lists. The first contains examples of software I find really useful, but which doesn’t directly contribute to the work I’m supposed to be doing. (Some of it actively detracts from it, alas.) The second list comprises the applications I use to do what I’m paid for, and it might interest graduate students in departments like mine. If you just care about the latter list, then a discussion about choosing workflow applications [pdf] might also be of interest. (That note overlaps with this post: it doesn’t contain the first list, but adds some examples to the second.) If you don’t care about any of this, well, just move along quietly.

Why this matters

You can do productive, maintainable and reproducible work with all kinds of different software set-ups. This is the main reason I don’t go around encouraging everyone to convert to the group of applications I myself use. (My rule is that I don’t try to persuade anyone to switch if I can’t commit to offering them technical support during and after their move.) So this discussion is not geared toward convincing you there is One True Way to do your work. I do think, however, that if you’re in the early phase of your career as a graduate student in, say, Sociology or Political Science, you should give some thought to how you’re going to organize and manage your work. This is so for two reasons. First, the transition to graduate school is a good time to make a switch in your software platform. Early on, there’s less inertia and cost associated with switching things around than there will be later. Second, in the social sciences, text and data management skills are usually not taught explicitly. This means that you may end up adopting the practices of your advisor or mentor, continuing to use what you’re taught in your methods classes, or just copying whatever your peers are using. Following any one of these paths may lead you to an arrangement that you’re happy with. But not all solutions are equally useful or powerful, and you can find yourself locked in to a less-than-ideal setup quite quickly.

Although I’m going to describe some specific applications, in the end it’s not really about the software. For any kind of formal data analysis that leads to a scholarly paper, however you do it, there are basic principles that you’ll want to adhere to. The main one is never to do anything interactively. Always write it down as a piece of code or an explicit procedure instead. That way, you leave the beginnings of an audit trail and document your own work to save your future self six months down the line from hours spent wondering what the hell it was you thought you were doing. A second principle is that a file or folder should always be able to tell you what it is: that is, you’ll need some method for organizing and documenting papers, code, datasets, output files or whatever it is you’re working with. A third principle is that repetitive and error-prone processes should be automated as much as possible. This makes it easier to check for mistakes. Rather than copying and pasting code over and over to do basically the same thing to different parts of your data, write a general function that can be called whenever it’s needed. This idea applies even when there’s no data analysis. It pays to have some system to automatically generate and format the bibliography in a paper, for example.

There are many ways of implementing these principles. You could use Microsoft Word, Endnote and SPSS. Or Textpad and Stata. Or a pile of legal pads, a calculator, a pair of scissors and a box of file folders. It’s the principles that matter. But software applications are not all created equal, and some make it easier than others to do the Right Thing. For instance, it is possible to produce well-structured, easily maintainable documents using Microsoft Word, but you have to use its styling and outlining features strictly and responsibly. Most people don’t bother to do this. So it’s probably a good idea to invest some time learning about the alternatives, especially if they are free or very cheap to try.
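As a small illustration of the third principle (write one general function instead of pasting the same code for each part of your data), here is a minimal sketch. It is written in Python purely for illustration; the same idea carries over to R or any other language. The function name summarize and the three groups of scores are made up for the example.

```python
from statistics import mean, stdev

def summarize(values):
    """Return the count, mean and sample standard deviation of some numbers."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}

# Hypothetical data: scores for three made-up groups.
scores = {
    "group_a": [2.0, 4.0, 6.0],
    "group_b": [1.0, 1.5, 2.0, 2.5],
    "group_c": [10.0, 12.0],
}

# One call per group, instead of three pasted copies of the same arithmetic.
summaries = {name: summarize(vals) for name, vals in scores.items()}

for name, s in summaries.items():
    print(f"{name}: n={s['n']} mean={s['mean']:.2f} sd={s['sd']:.2f}")
```

If the summary needs to change (say, to add a median), there is one place to edit and one place to check for mistakes.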

Day-to-Day

These are applications that I use routinely but that fall outside the core “Workflow” category. A lot of other people use them too, because they’re good (or the best) tools for everyday jobs. All of them are Mac OS X applications.

  • Quicksilver. A fantastic application launcher, file-finder, task-executor and other-stuff-doer. It took about two days for it to become the natural way for me to carry out all kinds of tasks. Quicksilver gives you automatic keyboard shortcuts for most of the entities on your hard drive (files, folders, applications, addresses, music tracks and playlists, bookmarks, etc.), and then lets you perform (and chain together) many different sorts of actions on those entities: find files or email addresses, launch applications, attach files to email, find addresses or phone numbers, play music, append text to files, and lots of other stuff, too. To paraphrase a post I forgot to bookmark, Quicksilver is the kind of application that you get used to using immediately and, pretty soon, any computer you sit in front of that doesn’t have it installed seems broken. It’s free. Read more about it.

  • Safari. Yer basic Apple browser. Works great. Apart from not using Explorer, I never could get into the Browser Wars.

  • NetNewsWire. The best way to keep up with all them bloggers.

  • Ecto. The best way to be one of the bloggers. Manages my posts to this blog. May face competition in future from Mars Edit.

  • CalendarClock. Replaces your system clock and, as well as showing you the time, lets you see your iCal calendar, appointments and to-dos in a handy drop-down menu. Very handy. I have an older, free version but now there’s an updated commercial version.

  • Mail.app. I’m sure my email should be more organized and I should have all kinds of filters in place and all the rest of it, but Apple’s bundled application does what I (think I) need.

  • Terminal.app. Mac OS’s built-in terminal is just the thing for when you want to use the Unix command line. It. Just. Needs. Tabs.

Workflow Essentials

These applications form the core of my own work environment: that is, the things I need (besides ideas, data and a sharp kick) to write papers. Papers will generally contain text, the results of data analysis (in Tables or Figures) and the scholarly apparatus of notes and references. I want to be able to easily edit text, analyze data and minimize error along the way. I like to do this without switching in and out of different applications. All of these applications are freely available for Mac OS X, Windows, and Linux (and other more esoteric platforms, too).

Edit Text.

  • Emacs. A text editor, in the same way the Blue Whale is “a mammal.” The Mac version is still a tiny bit flaky, but almost everything else in this list works best inside Emacs. I use Enrico Franconi’s Enhanced Carbon Emacs, which comes with some of the bells and whistles described below. There’s also a version available from Mindlube. Emacs is very powerful, and free. Combining Emacs with some other applications and add-ons allows me to integrate writing and data-analysis effectively.

  • LaTeX. A document processing and typesetting system. Produces beautiful documents from marked-up text files. Very powerful, and free. Available in convenient form for Mac OS X via Gerben Wierda’s i-Installer. If you want to try it out, but don’t want to learn Emacs, download TeXShop and use that as your editor instead.

  • AUCTeX. Enhances Emacs no end for use with LaTeX. Makes it easy to mark up, process and preview LaTeX documents. AUCTeX is part of Emacs, though not always in its most recent version. If you’re a Mac user, it’s worth getting the most up-to-date version of AUCTeX because you can configure its “LaTeX this” command to produce a PDF file by default.

  • RefTeX. Enhances AUCTeX to help you outline documents more easily, and manage references to Figures, Tables and bibliographic citations in the text. Both AUCTeX and RefTeX could also be under the “Minimize Error” section below, because they automagically ensure that, e.g., your references and bibliography will be complete and consistent.

Analyze Data and Present Results.

  • R. An environment for statistical computing. Very powerful, and free. (Are you detecting a pattern here?) Exceptionally well-supported, continually improving, and with a very active user community, R is a model example of the benefits of Free and Open Source Software. There’s plenty of contributed documentation that’s freely available. It’s got a growing, and very strong, supporting literature, comprising several introductory texts, companions to existing textbooks, and books on implementing modern statistical methods, regression modeling strategies, and specialized types of models. R integrates very well with LaTeX. I use it with ESS (see next entry) inside Emacs, but it also has a graphical interface on the Mac. R also has very powerful graphics capabilities.

  • ESS. Emacs Speaks Statistics. An Emacs package that allows you to edit R files and run R sessions inside of Emacs. Does syntax highlighting and other things as well, to make your code easier to read. ESS is free software.

Minimize Error.

  • Sweave. A literate programming framework for mixing text and R code in a way that allows you to reliably document and reproduce your data analysis within a LaTeX file. In the ordinary way of doing things, you have the code for your data analysis in one file, the output it produces in another, and the text of your paper in a third file.[1] You do the analysis, collect the output and copy the relevant results (often reformatting them) into your paper. Each transition introduces the opportunity for error. It also makes it harder to reproduce your own work later. Almost everyone who has written a paper has been confronted with the problem of reading an old draft containing results they want to revisit but can’t quite remember how they produced. With Sweave, you just have one file. You write the text of your paper (or, more often, your report of a data analysis) as normal, in LaTeX markup. When the time comes to do some data analysis, produce a table or display a figure, you write a block of R code to produce the output you want right into the paper. Then you ‘weave’ the file: R processes it, replaces the code with the output it produces, and spits out a finished LaTeX file that you can then turn into a PDF. An example will make this easier to understand. It’s pretty straightforward in practice. The only downside to the Sweave work model is that when you make changes, you have to reprocess all of the code to reproduce the final LaTeX file. If your analysis is computationally expensive this can take up time. There are ad hoc ways around this (selectively processing code chunks, for instance) that may eventually appear as features in a new version of Sweave. Sweave comes built-in to R.

  • RCS. A Revision Control System. Allows you to keep a complete record of changes to a file, creating a tree of versions as you make changes. This allows you to revisit earlier versions of papers and data analyses without having to keep directories full of files with names like Paper-1.tex, Paper-2.tex, Paper-3-a-i.tex, and so on. RCS is the oldest of the revision control managers directly supported by Emacs. CVS is a newer system that supports multiple authors, and Subversion is newer still. I haven’t used these: Subversion looks interesting, but integration with Emacs’ version control menu isn’t quite there yet. RCS is free.

  • Unison. I have a laptop and a desktop. I want to keep certain folders in both home directories synchronized. Unison is an efficient command-line synchronization tool that can work locally or use SSH for remote clients. There’s also a GUI version. Unison is free. Many other file synchronization tools are available for Mac OS X, but I haven’t used them.
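To make the ‘weave’ step from the Sweave entry above a little more concrete, here is a toy sketch of the idea. It is not Sweave and does not run R: it is a few lines of Python that borrow the chunk markers (<<>>= to open a chunk, @ to close it) and replace each chunk with whatever that chunk prints. The weave function, the sample text and the data in it are all made up for the illustration.

```python
import io
import re
from contextlib import redirect_stdout

# A chunk opens with a line "<<>>=" and closes with a line "@".
# The code inside is Python here, standing in for R.
CHUNK = re.compile(r"^<<>>=\n(.*?)^@\n", re.DOTALL | re.MULTILINE)

def weave(source):
    """Replace each code chunk in `source` with the output it prints."""
    env = {}  # shared namespace, so later chunks can see earlier results

    def run(match):
        buffer = io.StringIO()
        with redirect_stdout(buffer):
            exec(match.group(1), env)
        return buffer.getvalue()

    return CHUNK.sub(run, source)

# A made-up fragment of a paper with two code chunks in it.
paper = """\\section{Results}
The mean of the (made-up) sample is
<<>>=
data = [3, 5, 10]
print(sum(data) / len(data))
@
and its maximum is
<<>>=
print(max(data))
@
"""

woven = weave(paper)
print(woven)
```

The point of the design is that the numbers in the finished text are never typed by hand: change the data in the first chunk, weave again, and both results update together. It also shows the downside mentioned above, since every chunk is re-run on every weave.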

Pros and Cons

From my point of view, the Workflow applications I use have three main advantages. First, they’re free and open. Second, they deliberately implement “best practices” in their default configurations. Writing documents in LaTeX markup encourages you to produce papers with a clear structure, and the output itself is of very high quality aesthetically. By contrast, there are strong arguments to the effect that, unless you’re very careful, word processors are stupid and inefficient.[2] Similarly, by default R implements modern statistical methods in a high-quality way that discourages you from thinking in terms of canned solutions. It also produces figures that accord with accepted standards of efficient and effective information design. (There’s no chartjunk.) And third, the applications are well-integrated. Everything works inside Emacs, and all of them talk to or can take advantage of the others. R can output LaTeX tables, for instance, even if you don’t use Sweave.

At the same time, I certainly didn’t start out using all of them at once. Some have fairly steep learning curves. There are a number of possible routes into the applications. You could try LaTeX first, using any editor. (A number of good ones are available for Mac OS and Windows.) Or you could try Emacs and LaTeX together first. You could begin using R and its GUI. Sweave can be left till last, though I’ve found it increasingly useful since I’ve started using it, and wish that all of my old data directories were documented in this format.

A disadvantage of the particular applications I use is that I’m in a minority with respect to other people in my field. Most people use Microsoft Word to write papers, and if you’re collaborating with people (people you can’t boss around, I mean) this can be an issue. Similarly, journals and presses in my field generally don’t accept material marked up in LaTeX. Converting files to Word can be a pain (the easiest way is to do it by converting your LaTeX file to HTML first) but I’ve found the day-to-day benefits outweigh the network externalities. Your mileage, as they say, may vary.

A Broader Perspective

It would be nice if all you needed to do your work was a bunch of well-written and very useful applications. But of course it’s a bit more complicated than that. In order to get to the point where you can write a paper, you need to be organized enough to have collected some data, read the right literature and, most importantly, be asking an interesting question. No amount of software is going to solve those problems for you. Believe me, I speak from experience. The besetting vice of an interest in productivity-enhancing applications is the temptation to waste a tremendous amount of time installing productivity-enhancing applications. The work-related material on my computer tends to be a lot better organized than my approach to generating new ideas and managing the projects that come out of them, and of course those are what matter in the end. The process of idea generation and project management can be run efficiently, too, but I’m not sure I’m the person to be telling people how to do it.

Notes

[1] Actually, in the worst but quite common case, you use a menu-driven statistics package and do not record what you do, so all you have from the data analysis is the output.

[2] I think that the increase in online writing and publishing has made Word Processors look even worse than they used to. If you want to produce text that can be easily presented as a standards-compliant Web page or a nicely-formatted PDF file, then it’s much easier to use a text editor and a “rendering pipeline” that supports a markup system like Textile or Markdown. But that’s a rant for another day.