Sometimes data arrives as a series of individual files, each of which is organized in the same way—which is to say, each of which has the same variables, features, or columns. Imagine a series of tables reporting mandated information about every school in the state, or a hundred spreadsheets each with information about a different country, or thirty-seven CSVs each with the same columns of information about representatives in each U.S. Congressional Session since 1945. Because these data were not provided to us as a single big table, we generally want to combine them into one. If the files are CSVs on our local computer, R has some nice functions that allow us to iterate over a vector of filenames and produce a tidy table of data.
Case 1: Lots of local CSV files
For example, our Congressional project might have a
data folder with a subfolder called
congress. We can get a listing of the CSV files inside it like this:
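A sketch of that step, using the fs package; the data/congress/ path follows the layout described above:

```r
library(fs)

## List the full paths of every CSV in the subfolder
filenames <- dir_ls(path = "data/congress", glob = "*.csv")
```

The base R equivalent is list.files("data/congress", pattern = "\\.csv$", full.names = TRUE).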
We can feed that vector to
read_csv() and it will be quite happy. It reads each file and binds them all together by row, in a big stack. (Again, remember we know ex ante that the files all have the same column structure. We’ll get errors or warnings if this isn’t true.)
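Something like this, assuming the vector from the previous step is named filenames (and readr version 2.0 or later, which is when the id argument arrived):

```r
library(readr)

## read_csv() accepts a vector of paths; `id` records which
## file each row came from
df <- read_csv(filenames, id = "path")
```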
You can see how
path is created as an id column, to help us keep track of which file each row of data came from. We create the
congress column after we read in the data by extracting the congressional session from the filename with a regular expression. It is very convenient to be able to read a whole folder of files in a single step like this.
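That extraction step might look like this—a sketch, assuming the session number appears somewhere in each filename (e.g. "data/congress/116.csv"):

```r
library(dplyr)
library(stringr)

## Pull the first run of digits out of the file path
df <- df |>
  mutate(congress = str_extract(path, "\\d+"))
```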
Case 2: Lots of local Excel files
What if the data is in a file format whose read-in function doesn’t know this trick about accepting a vector of file paths? In that case we can do what
read_csv() is doing behind the scenes and map the vector of file names to the read-in function, and explicitly bind the results together. The default
map() function collects its results in a list, whatever they are. But there are type-specific mappers, too. We are dealing with data frames—with regular tables of data—so we have a function,
map_dfr(), that takes input and binds its output by row into a data frame (or dies trying). So for example, if our Congressional data were all Excel files in
.xlsx format, and they all had the same structure with a header row of column names, we could write this:
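A sketch of that, again assuming the files live in data/congress/:

```r
library(fs)
library(readxl)
library(purrr)

filenames <- dir_ls(path = "data/congress", glob = "*.xlsx")

## Apply read_xlsx() to each element of filenames and
## row-bind the results into one data frame
df <- filenames |>
  map_dfr(~ read_xlsx(path = .x), .id = "path")
```

Because dir_ls() returns a named vector, the .id argument gives us the same path column as before.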
And we would get (almost) the same result. Here we start with the vector of filenames and pass it down a pipe to
map_dfr(), which goes ahead and maps, or applies, the
read_xlsx() function to each element of the filenames vector—i.e. to each file. The
.x there is a placeholder or pronoun that means ‘whatever one we’re working on right now’. Mapping functions like this is just a kind of iteration where you don’t have to explicitly write a loop. This makes it easier to cleanly compose sequences or chains of functions without having to explicitly create a bunch of placeholder objects or declare counters and so on.
Case 3: Lots of remote CSV or Excel files in a bare directory
Now, what if the files we want are stored remotely on a server? These days there’s often an API for such things. But quite often, even now, you may find yourself dealing (as I did yesterday) with a bare directory of files that looks like this:
Now, if these were provided as CSVs our task would be a little easier because in addition to being able to deal with a vector of filenames at once,
read_csv(), and indeed all the read-in functions in
readr in general, will happily read URLs as well as local file paths. However, the
read_xlsx() function in the
readxl package can’t do this yet. It only wants file paths. A second issue is that the Excel files themselves are not entirely tidy. At the top they look like this:
Those first two rows are a mild violation of one of the rules of thumb for entering data in spreadsheets, helpfully outlined by Karl Broman and Kara Woo. The first row is metadata; the second is a more verbose description of the standard lifetable headers in the third row. Except for Age, which is not labeled in the third row. That big box labeled “Age (years)” is actually a super-sized cell in the second row. That means the first element of row three, our actual column headers, is blank! This is annoying. Fortunately these issues are easily dealt with, as we can just tell our read function to skip those two lines. There’s also a
Source row at the bottom (not shown here) that we’ll have to strip out.
But the first order of business is getting a vector of the actual file URLs to download. You could just copy and paste the listing, like an animal, but we are not going to do that. Instead, we’ll take advantage of the old-school empty-directory listing to get the file names. We’ll do this using the curl package, R’s interface to the libcurl library.
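A sketch using the curl package; the directory URL here is a placeholder for the real remote folder:

```r
library(curl)

url <- "ftp://ftp.example.com/lifetables/"  # placeholder URL

## dirlistonly asks the server for file names only, not file contents
h <- new_handle(dirlistonly = TRUE)
con <- curl(url, "r", handle = h)
files <- readLines(con)
close(con)
```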
We open a connection to the remote folder and set the
dirlistonly flag to restrict what we get back. Then we read the lines received from the FTP server into the
files object. This gives us the bare file names of everything in this remote folder.
The spreadsheets are named according to a scheme with a two-letter state abbreviation followed by a number. The number signifies the kind of life-table it is. The files ending in
1 have the life-table for the population as a whole, which is what we are interested in.
If we wanted to do things at a slightly higher level of abstraction we could use
rvest to get the filenames, extract all the link elements, and get the text from inside of them. The
rvest package handles the business of opening and closing web connections for us. It also provides handy functions to get pages and extract their text based on their position in a CSS selection hierarchy (which isn’t relevant here) or based on particular HTML elements. First we get the page, then we grab all the link elements in it, and then convert their content to a character vector:
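A sketch of that approach, assuming an HTML directory listing at a placeholder URL:

```r
library(rvest)

url <- "https://www.example.com/lifetables/"  # placeholder URL

## Grab every <a> element on the page and keep its text
files <- read_html(url) |>
  html_elements("a") |>
  html_text()
```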
In this case we’d need to clean up the resulting vector (to remove navigation links to the parent directory and so on) but would end up in the same place.
Now that we have a vector of the file names (but just the file names at this point) we can do a bit of prep:
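A sketch of the prep: keep the whole-population tables (per the naming scheme above, files whose names end in 1—I'm assuming something like "AL1.xlsx"), prepend the directory URL (a placeholder here), and name each element with its two-letter state abbreviation:

```r
library(stringr)
library(purrr)

base_url <- "ftp://ftp.example.com/lifetables/"  # placeholder URL

## Keep only the all-population life tables
files <- str_subset(files, "1\\.xlsx$")

## Full URLs, each named with the state abbreviation from the filename
fnames <- paste0(base_url, files) |>
  set_names(str_sub(files, 1, 2))
```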
You can see that the vector elements (the actual URLs) also have a name or label (the state abbreviation). This will make it easier to create a
state id column, because
map_dfr() will use those labels to fill its id column.
The last step is to have
read_xlsx() read all the remote files, which it cannot do directly. It won’t even accept a single URL; it only wants file paths. So we will have to write a one-off function that fetches the file and puts it in a temporary location that
read_xlsx() can see.
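A sketch of that function; get_lifetable is my name for it, and the skip/rename/filter details follow the spreadsheet layout described above:

```r
library(httr)
library(readxl)
library(dplyr)

get_lifetable <- function(x) {
  ## Fetch the remote file and write it to a temporary .xlsx file
  GET(x, write_disk(tf <- tempfile(fileext = ".xlsx")))
  ## Read it, skipping the two junk header rows; quiet the
  ## name-repair chatter about the blank first column header
  suppressMessages(read_xlsx(tf, skip = 2)) |>
    rename(age = 1) |>
    ## Drop the trailing Source row at the bottom of the sheet
    filter(!grepl("^Source", age))
}
```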
The first line inside the function uses
GET to fetch the file, and immediately saves it locally using
write_disk(), taking care to specify that the temporary file we save should have an
.xlsx extension. (Otherwise
read_xlsx() will complain.) The second line actually reads in the file that’s been downloaded. We take the opportunity to suppress chatter about the name repair that has to happen on that blank first column header in the third row, rename that location
age, and strip the trailing line about the source of the data that I mentioned above.
This function reads one given URL. Now we just need to map a vector of URLs to it and bind the results by row:
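Assuming the download helper from the previous step is called get_lifetable() (my name for it) and fnames is our named vector of URLs, that's one line:

```r
library(purrr)

## Apply the helper to each URL; the names of fnames
## become the values of the state id column
df <- map_dfr(fnames, get_lifetable, .id = "state")
```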
As I said, a nice thing is that
map_dfr() will use the name attribute of
fnames to create its id column, which we can therefore name
state, because the name of each URL element is the abbreviation for the state it is providing data about.
And we’re done:
A few thousand rows of data programmatically extracted from fifty spreadsheets, now ready for a bit more cleaning and any amount of confusion about what it is that life-tables actually show.