Looking at Data

Jeremy Freese is doing some analysis:

So, the General Social Survey reinterviewed a large subset of 2006 respondents in 2008. They have released the data that combines into one file the respondents interviewed for the first time in 2008 and the 2008 reinterviews of the respondents originally interviewed in 2006. In a separate file, of course, you can get the original 2006 interviews for the latter people.

What has not yet been released, however, is the variable that would identify which row in the first file corresponds to which row in the second file. In other words, you know that person #438 in the reinterview data is somebody originally interviewed in 2006, but you don’t know which person in the 2006 data they are.

Well, especially because the last thing I need to be doing right now is procrastinating, that sounded like a challenge. Just as I have learned that just because there are no microwave instructions for a frozen dinner doesn’t mean you can’t microwave it, just because there isn’t a merge variable doesn’t mean you can’t merge the data. At least if no secure data agreement is involved.

All I have to say is: holy crap. You’d think knowing somebody’s sex, survey ballot (which was kept the same both times), zodiac sign, year of birth, self-identified race, region where they lived when they were 16, whether they lived with their parents when they were 16, whether they lived in the same place they did growing up, who they said they voted for in 2004, their marital status, their education, what they say they did for a living, how many years their mother went to school, inter alia, would allow you to pretty easily pinpoint who is who. I am here to tell you this is not the case.

I was able to devise some convoluted scheme and check how well it was doing thanks to a pretty big clue that I’ll refrain from posting, but even then there ended up being 50 cases out of 1500 where I wasn’t sure who they were. In general the experience affirmed a fundamental suspicion I’ve had about analyzing survey data: the data seem so much less real once you ask the same person the same question twice.
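For readers who want to see the basic idea, here is a minimal sketch of this kind of matching. It is not Freese's scheme, which he doesn't describe. It assumes each respondent is a dictionary and that a handful of supposedly stable answers can serve as a makeshift key. The field names, IDs, and toy records are all invented for illustration, not real GSS variables or data.

```python
from collections import defaultdict

# Hypothetical field names; real GSS variable names will differ.
KEY_FIELDS = ("sex", "birth_year", "zodiac")


def match_key(row):
    """Build a makeshift merge key from answers that should not change."""
    return tuple(row[field] for field in KEY_FIELDS)


def link_records(first_wave, second_wave):
    """Pair second-wave rows with first-wave rows that share the same key.

    Returns three things: rows with exactly one candidate, rows with
    several candidates, and rows with no candidate at all.
    """
    index = defaultdict(list)
    for row in first_wave:
        index[match_key(row)].append(row["id"])

    matched, ambiguous, unmatched = {}, {}, []
    for row in second_wave:
        candidates = index.get(match_key(row), [])
        if len(candidates) == 1:
            matched[row["id"]] = candidates[0]
        elif candidates:
            ambiguous[row["id"]] = candidates
        else:
            unmatched.append(row["id"])
    return matched, ambiguous, unmatched


# Toy data, invented for illustration.
wave_2006 = [
    {"id": "A", "sex": "F", "birth_year": 1970, "zodiac": "Leo"},
    {"id": "B", "sex": "M", "birth_year": 1980, "zodiac": "Aries"},
    {"id": "C", "sex": "M", "birth_year": 1980, "zodiac": "Aries"},
]
wave_2008 = [
    {"id": "r1", "sex": "F", "birth_year": 1970, "zodiac": "Leo"},
    {"id": "r2", "sex": "M", "birth_year": 1980, "zodiac": "Aries"},
    # Same person as someone in 2006, perhaps, but the zodiac answer changed.
    {"id": "r3", "sex": "F", "birth_year": 1970, "zodiac": "Virgo"},
]

matched, ambiguous, unmatched = link_records(wave_2006, wave_2008)
print(matched)
print(ambiguous)
print(unmatched)
```

Even in the toy version both failure modes show up: two people can share every "identifying" answer, and one person can give a different answer the second time around, so exact matching leaves a residue of cases that need a fuzzier rule or outside information.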

The real distinction between qualitative and quantitative is not widely appreciated. People think it has something to do with counting versus not counting, but this is a mistake. If the interpretive work necessary to make sense of things is immediately obvious to everyone, it’s qualitative data. If the interpretive work you need to do is immediately obvious only to experts, it’s quantitative data.