Sun, Nov 10, 2019

Cleaning the Table

While I'm talking about getting data into R this weekend, here's another quick example that came up in class this week. The mortality data in the previous example were nice and clean coming in the door. That's usually not the case. Data can be and usually is messy in all kinds of ways. One of the most common, particularly in the case of summary tables obtained from some source or other, is that the values aren't directly usable.

Sat, Nov 9, 2019

Reading in Data

Here's a common situation: you have a folder full of similarly-formatted CSV or otherwise structured text files that you want to get into R quickly and easily. Reading data into R is one of those tasks that can be a real source of frustration for beginners, so I like collecting real-life examples of the many ways it's become much easier. This week in class I was working with country-level historical mortality rate estimates.
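As a rough illustration of the situation described above, here is a minimal base-R sketch. It is not necessarily the approach taken in the post itself. The file names, column names, and values are all made up for the example, and the files are written to a temporary folder so that the code is self-contained.

```r
# Create a throwaway folder with two small, similarly-formatted CSV files.
# Everything here (names, columns, numbers) is hypothetical.
demo_dir <- tempfile("csv_demo_")
dir.create(demo_dir)

write.csv(data.frame(country = "A", year = 1900:1901, rate = c(0.020, 0.019)),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(country = "B", year = 1900:1901, rate = c(0.030, 0.028)),
          file.path(demo_dir, "b.csv"), row.names = FALSE)

# List every CSV file in the folder, with full paths
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file and stack the results into a single data frame
combined <- do.call(rbind, lapply(files, read.csv))
```

This relies on every file having the same columns. If they differ, `rbind()` will stop with an error rather than silently misalign the data.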

Mon, Oct 28, 2019

Dogs of New York

The other week I took a few publicly-available datasets that I use for teaching data visualization and bundled them up into an R package called nycdogs. The package has datasets on various aspects of dog ownership in New York City, and amongst other things you can draw maps with it at the zip code level. The package homepage has installation instructions and an example. Using this data, I made a poster called Dogs of New York.

Sun, Oct 27, 2019

Reconstructing Images Using PCA

A decade or more ago I read a nice worked example from the political scientist Simon Jackman demonstrating how to do Principal Components Analysis. PCA is one of the basic techniques for reducing data with multiple dimensions to some much smaller subset that nevertheless represents or condenses the information we have in a useful way. In a PCA approach, we transform the data in order to find the “best” set of underlying components.

Mon, Oct 21, 2019

Widening Multiple Columns Redux

Last year I wrote about the slightly tedious business of spreading (or widening) multiple value columns in Tidyverse-flavored R. Recent updates to the tidyr package, particularly the introduction of the pivot_wider() and pivot_longer() functions, have made this rather more straightforward to do than before. Here I recapitulate the earlier example with the new tools. The motivating case is something that happens all the time when working with social science data. We'll load the tidyverse, and then quickly make up some sample data to work with.
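To give a flavor of what widening several value columns at once looks like, here is a minimal sketch using `pivot_wider()` from tidyr. The data frame and its column names are invented for illustration and are not the sample data from the post.

```r
library(tidyr)

# Hypothetical long-format data: two value columns (mean, n) per id and group
df <- data.frame(
  id    = c(1, 1, 2, 2),
  group = c("a", "b", "a", "b"),
  mean  = c(1.5, 2.5, 3.5, 4.5),
  n     = c(10L, 20L, 30L, 40L)
)

# Widen both value columns in one step; the new column names combine
# the value column name and the group label (mean_a, mean_b, n_a, n_b)
wide <- pivot_wider(df, names_from = group, values_from = c(mean, n))
```

Passing more than one column to `values_from` is what replaces the older gather-unite-spread routine.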

Tue, Oct 15, 2019

Parsing SDA Pages

SDA is a suite of software developed at Berkeley for the web-based analysis of survey data. The Berkeley SDA archive lets you run various kinds of analyses on a number of public datasets, such as the General Social Survey. It also provides consistently-formatted HTML versions of the codebooks for the surveys it hosts. This is very convenient! For the gssr package, I wanted to include material from the codebooks as tibbles or data frames that would be accessible inside an R session.

Thu, Oct 10, 2019

Back in the GSSR

The General Social Survey, or GSS, is one of the cornerstones of American social science and one of the most-analyzed datasets in Sociology. It is routinely used in research, in teaching, and as a reference point in discussions about changes in American society since the early 1970s. It is also a model of open, public data. The National Opinion Research Center already provides many excellent tools for working with the data, and has long made it freely available to researchers.

Mon, Aug 26, 2019

This Is Just the Verse

They pluck your plums, your mum and dad
They eat them for their supper, too
They gobble all the fruit you had
And leave some bullshit note for you

But they were robbed blind in their day
Of damsons, prunes, and blackthorn sloes
Their breakfast treats all poached away
Thefts justified with old-style prose

“Forgive us” both your parents moan
“They were delicious, sweet, and cold”
They wonder why I never phone

Sat, Aug 3, 2019

Rituals of Childhood

Back in April, in Ireland, my nephew Luke made his first communion alongside his school classmates. I did much the same thing myself in much the same place about forty years ago. My brother tells me that the preparation nowadays is a little more humane than the version we enjoyed. But there is as much anticipation beforehand, and no less excitement on the day. Luke’s little suit lacked the stylish navy-blue velvet panels mine sported in 1980, but in essence the event was the same in its purpose, its form, and in most of its details.

Sun, Jun 23, 2019

Earned Doctorates

PhDs awarded in selected disciplines, 2006-2016. Thierry Rossier asked me for the code to produce plots like the one above. The data come from the Survey of Earned Doctorates, a very useful resource for tracking trends in PhDs awarded in the United States. The plot is made with geom_line() and geom_label_repel(). The trick, if it can be dignified with that term, is to use geom_label_repel() on a subset of the data that contains the last year of observations only.
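The labeling trick described above can be sketched roughly as follows. The data here are made up for illustration and are not the Survey of Earned Doctorates figures. The field names and counts are hypothetical, and the ggrepel package is assumed to be installed.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical yearly counts for two fields, 2006-2016
phds <- data.frame(
  year  = rep(2006:2016, times = 2),
  field = rep(c("Field A", "Field B"), each = 11),
  n     = c(seq(500, 600, by = 10), seq(400, 300, by = -10))
)

# Keep only the last year of observations: one row per field
last_year <- subset(phds, year == max(year))

p <- ggplot(phds, aes(x = year, y = n, color = field)) +
  geom_line() +
  # Labels are drawn from the last-year subset only, so each line
  # gets a single label at its right-hand end
  geom_label_repel(data = last_year, aes(label = field), nudge_x = 0.5) +
  theme(legend.position = "none")
```

Because the labels identify the lines directly, the legend is redundant and is dropped here.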

Sociology and other distractions, since 2002. View an index of posts by category. R-related posts also appear on R-Bloggers.


I am Professor of Sociology at Duke University. I’m also affiliated with the Kenan Institute for Ethics. Read a brief overview of my work or my Curriculum Vitae.
