Posts in “Data”
I’m teaching a short graduate seminar on Data Visualization with R this semester. Following Matt Salganik, I wanted students to be able to submit homework or other assignments as R Markdown files, but to have a way to make sure their R code passed some basic stylistic checks provided by lintr before they submitted it to me. Students write .Rmd files containing discussion or notes interspersed with chunks of R code.
Another week, another mass shooting in the United States. I’ve linked before to my posts America is a Violent Country, and Assault Deaths Within the United States. I thought I would update the figures with the latest data from the OECD. The method and scope are the same as before. Here is the main figure, showing assault death rates for the US and 23 other OECD countries. There’s a PDF available.
The other day, Jonathan Marshall posted a nice graphic showing population age profiles of electoral constituencies in New Zealand, ordered by their tendency to vote left or right. He put the data on github, and on a long transatlantic flight yesterday I ended up messing around with it a bit. Almost the only bit of Demography I know is the old saw that women get sicker but men die quicker. So I thought I’d take a look at differences in the sex composition of age cohorts by constituency.
In an effort to not lose all of my lucrative Consulting Thinkfluanalyst income to the snowman, I redrew my LOESS and STL decompositions of Apple’s quarterly sales data by product. They now extend to Q2 2015. First, here’s a plot of the trends showing the individual sales figures with a LOESS smoother fitted to them. Figure 1. Quarterly sales data for Apple Macs, iPhones, and iPads. Here’s the Mac by itself, which continues to grow healthily (unlike the rest of the PC industry), just on a smaller scale than other Apple products.
Choropleth maps of the United States are everywhere these days, showing various distributions geographically. They’re visually appealing and can be very effective, but then again not always. They’re vulnerable to a few problems. In the U.S. case, the fact that states and counties vary widely in size and population means that they can be a bit misleading. And they make it easy to present a geographical distribution to insinuate an explanation.
This morning, Social Science Twitter is consumed by the discovery of fraud in a very widely-circulated political science paper published last year in Science magazine. “When contact changes minds: An experiment on transmission of support for gay equality”, by Michael LaCour and Donald Green, reported very strong and persistent changes in people’s opinion about same-sex marriage when voters were canvassed by a gay person. The paper appeared to have a strong experimental design and, importantly, really good follow-up data.
The hosts at Accidental Tech Podcast have been thinking about how to broaden their base of listeners to include more women. Good for them. They’re getting plenty of advice (and a certain amount of flak), which I won’t add to. But in general when doing this kind of thing it can be helpful to look back on what your past practice has been. For example, it can be useful to audit one’s own habits of linking and engagement.
This UK Election data is really too much fun to play around with. Here’s a (probably final) collection of pictures. First, a map of the turnout (that is, the percentage of the electorate who actually voted) by constituency, with London highlighted for a bit more detail. Constituencies by Turnout. There’s a strong suggestion here that Labour areas have lower turnout. Here’s a scatterplot of all seats showing the winning candidate’s share of the electorate plotted against turnout.
I’m still playing around with the UK Election data I mapped yesterday, which ended up at the Monkey Cage blog over at the Washington Post. On Twitter, Vaughn Roderick posted a nice comparison showing the proximity of many Labour seats to coalfields. That got me thinking about how much the landscape of England is embedded in its political life. In particular, what do the names of places tell you about their political leanings?
The United Kingdom’s election results are being digested by the chattering classes. So, yesterday afternoon I thought I’d see if I could grab the election data to make some pictures. Because the ever-civilized BBC has election web pages with a sane HTML structure, this proved a lot more straightforward than I feared. (Thanks also in no small part to statistician Hadley Wickham’s rvest scraping library, alongside many other tools he has contributed to the community of social scientists who use R to do data analysis.) Here are two maps.
A side-note to the enjoyable exchange with Dr Drang about sales trends in Apple products, which was picked up by John Gruber. The LOESS decompositions I posted looked like this: Quarterly sales decomposition for iPhones. One or two people remarked that these figures were shorter and wider than they were used to seeing. I did this on purpose—following the approach taken by William Cleveland and others, the charts are banked, meaning the aspect ratio is set to make it easier to pick out trends.
Update (April 30th): I redrew the decomposition plots this morning, and added a couple more. Another Twitter conversation, this time in the evening. Dr Drang put up a characteristically sharp post looking at sales trends in Apple Macs, iPhones, and iPads. He used moving averages to show long-term sales trends effectively, and he made a convincing argument that iPad sales are in decline. I ended up grabbing the sales data myself from barefigur.es and more or less copying him.
Update: Updated to identify Catholic schools. (And again later, with more Catholic schools ID’d.) I took another look at the vaccination exemption data I discussed the other day. This time I was interested in getting a closer look at the range of variation between different sorts of schools. My goal was to extract a bit more information about the different sorts of elementary schools in the state, just using the data from the Health Department spreadsheet.
California Kindergarten PBE Rates by Type of School, 2014-15. (PDF available.) I came across a report this afternoon, via Eric Rauchway, about high rates of vaccination exemption in Sacramento schools. As you are surely aware, this is a serious political and public health problem at the moment. Like Eric, I was struck by just how high some of the rates were. So I went and got the data from the California Department of Public Health, just wanting to take a quick look at it.
The new Philosophical Gourmet Report Rankings are out today. The report ranks a selection of Ph.D programs in English-speaking Philosophy departments, both overall and for various subfields, on the basis of the judgments of professional philosophers. The report (and its editor) has been controversial in the past, and of course many people dislike the idea of rankings altogether. But as these things go the PGR is pretty good. It’s a straightforward reputational assessment made by a panel of experts from within the field.
Nate Silver’s relaunched FiveThirtyEight has been getting some flak from critics—including many former fans—for failing to live up to expectations. Specifically, critics have argued that instead of foxily modeling data and working the numbers, Silver and his co-contributors are looking more like regular old opinion columnists with rather better chart software. Paul Krugman has been a prominent critic, arguing that “For all the big talk about data-driven analysis, what [the site] actually delivers is sloppy and casual opining with a bit of data used, as the old saying goes, the way a drunkard uses a lamppost — for support, not illumination.” Silver has put his tongue at least part way into his cheek and pushed back a little with an article titled, in true Times fashion, “For Columnist, a Change of Tone”.
Jim Moody and I are writing an article on data visualization in Sociology. Here’s a picture that won’t be in the final version, but I like it all the same.
Corrections and Changes as of June 26th, 2013: See the end of the post for details on some corrections and changes to the analysis. In the previous post I promised I would say something about the influence of David Lewis, and also something about citation frequency by gender. Some caveats at the outset. First, as I said before, this is exploratory work. I’m still in the process of cleaning the data and correcting mistakes, so things may change (although hopefully just around the margins).
Corrections and Changes as of June 26th, 2013: See the end of the post for details on some changes and fixes to errors in the data. What have English-speaking philosophers been talking about for the last two decades? I’m asking—and presenting an answer to—this question partly out of an ongoing research interest in philosophy, partly out of some recent “Does anyone know …?” questions I’ve been asked, and partly to play with some new text-processing and visualization methods.
Yesterday’s post on Using Metadata to Find Paul Revere really caught fire. It’s still going, in fact, and it will probably break a hundred thousand unique pageviews some time this afternoon. It’s always exciting and a little anxiety-making when something like that happens. Overall, I’m delighted that the response has been so positive. By way of follow-up, I’d just say that it’s a single post that was meant to make a point in an accessible and hopefully entertaining way.
London, 1772. I have been asked by my superiors to give a brief demonstration of the surprising effectiveness of even the simplest techniques of the new-fangled Social Networke Analysis in the pursuit of those who would seek to undermine the liberty enjoyed by His Majesty’s subjects. This is in connection with the discussion of the role of “metadata” in certain recent events and the assurances of various respectable parties that the government was merely “sifting through this so-called metadata” and that the “information acquired does not include the content of any communications”.
When I posted the Sociology Department Rankings for 2013 I joked that Indiana made it to the Top 10 “due solely to Fabio mobilizing a team of role-playing enthusiasts to relentlessly vote in the survey. (This is speculation on my part.)” Well, some further work with the dataset on the bus this morning suggests that the Fabio Effect is something to be reckoned with after all. The dataset we collected has—as best we can tell—635 respondents.
Last week we launched the OrgTheory/AAI 2013 Sociology Department Ranking Survey, taking advantage of Matt Salganik’s excellent All Our Ideas service to generate sociology rankings based on respondents making multiple pairwise comparisons between departments. That is, questions of the form “In your judgment, which of the following is the better Sociology department?” followed by a choice between two departments. Amongst other advantages, this method tends to get you a lot of data quickly.
Update: See the addendum at the end of this post for the response I got from the Times. Yesterday I got an email from an editorial assistant at the Times: Hi Professor Healy, We are publishing a column today that may reference the data you use here in your post here: http://www.kieranhealy.org/blog/archives/2012/12/18/assault-death-rates-in-america-some-follow-up/ You mention that the OECD stats are gated — any chance you could share them with us, for fact checking purposes?
The Newtown elementary school shooting led people to link to and share my graphs of OECD and CDC data on assault deaths in the United States. I made them last July, in the wake of the Aurora movie theater shooting. What a depressing reason to be in the newspapers. Here are the original posts: America is a Violent Country, and Assault Deaths Within the United States. The original posts clearly explain what the data show and what the sources are.
Trends in the Death Rate from Assault, 1999–2009, by Region. Click for a larger PNG or PDF. Update: You can click here for some further followup to this post, answering some common questions. The chart in “America is a Violent Country” has been getting a lot of circulation. Time to follow up with some more data. As several commentators at CT noted, the death rate from assault in the U.S. is not uniform within the country.
Update (October 2015). For an update including more recent data, see this post: Assault Death Rates, 1960-2013. Update (December 2012). For answers to some frequently-asked questions about this post, see this follow-up discussion. You can also read more about patterns of assault deaths within the United States. The terrible events in Colorado this morning prompted me to update an old post about comparative death rates from assault across different societies. The following figures are from the OECD for deaths due to assault per 100,000 population from 1960 to the present.
Thanks to a link from Marco Arment, the Talk Radio post got a lot of traffic. Following up on some twitter and email requests, here is some additional stuff. First, the data again, but with The Incomparable added: Click for a larger version. Second, a different look at variation by show: Click for a larger version. And finally, for those interested, there’s a github repo with the data and R code.
After our analysis of the Hypercritical data it only seemed fair to check whether other 5by5 hosts were prone to talk longer the longer their show has been on the air. As it happens, the spreadsheet-like layout of iTunes makes it easy to copy and paste episode data into a usable format. (Although, inevitably, some cleaning is required—a pox on you, inconsistent time formats.) I used the episode-length data for the 5by5 shows I subscribe to that also had a large-enough number of episodes to look at.
Via John Siracusa, a really nice exercise in crowdsourcing and data visualization on Bostonography. we’re running an ongoing project soliciting opinions on Boston’s neighborhood boundaries via an interactive map. We want to keep collecting data, but we’ve already received excellent responses that we’re itching to start mapping, and when we hit 300 submissions recently it seemed like a good enough milestone to take a crack at it. (That’s actually 300 minus some junk data.
Following original work by Nic at 2000 Nickels (a fellow Octopress user, I notice), here’s another effort to answer the vital question of the moment about Hypercritical, namely whether John Siracusa’s effort to control his logorrhea has met with any success. Click for a larger version. Click for a larger version. The lines (they are loess lines) show the trend in the length of Hypercritical shows. The upper panel shows the overall trend.
I’ve made some updates to the Emacs Starter Kit for the Social Sciences. The kit builds on Phil Hagelberg’s original and Eric Schulte’s org-mode version, and incorporates some packages and settings that are particularly useful for the social sciences. See the Starter Kit’s Homepage for more details. The new version requires Emacs 24, which is not quite officially released but is in very good shape. See the project page for more information about what’s included in the starter kit and how to install it.
The other day Brett Terpstra posted a gigantic and quite beautifully-executed feature comparison of all of the text editors available for iOS devices. The table is really terrific and also a bit overwhelming, as there’s so much data. On the bus home yesterday, it struck me that it might make for a nice data visualization exercise. There are all kinds of ways one might choose to represent the information, of course—how you visualize data depends on what you want to do with it.
The philosopher Ruth Marcus died two weeks ago, but—as Brian Leiter noted—no obituary for her has appeared in a major newspaper. Michael Della Rocca and some colleagues have circulated a letter calling on the New York Times to rectify this, which I agree they should. In the comments over at Feminist Philosophers, Catarina asks how many of the philosophers who did get an obituary in the NYT were women. In partial answer, I looked at the number of obituaries that have appeared in the Times since 2000 of people who were described primarily as philosophers.
From ITO comes this very nice—and very sobering—map of road accident fatalities in the United States between 2001 and 2009. As someone who wrote a book about blood and organ donation in Europe and the United States, I’ve spent time analyzing NHTSA data on traffic accidents. I remember that, during Q&As at talks, people were often surprised to learn just how many road deaths there are in the U.S.: about forty thousand per annum (though 2009 saw a very sharp drop, interestingly).
Oh look, some evidence that inflammatory claims in something written by Satoshi Kanazawa may not rest on the deep structure of reality or spring from his special ability to speak uncomfortable truths, but may instead arise from an inability to analyze AddHealth data properly. I for one am stunned.
The current issue of New Left Review has an article by Franco Moretti applying a bit of network analysis to the interactions within some pieces of literature. Here is the interaction network in Hamlet, with a tie being defined by whether the characters speak to one another. (Notice that this means that, e.g., Rosencrantz and Guildenstern do not have a tie, even though they’re in the same scenes.) And here is Hamlet without Hamlet: I think we can safely say that he is a key figure in the network.
Here is a very old joke. A soldier is captured during a long-running war and thrown into the most stereotypical prison cell imaginable. Inside the cell is another soldier. He has an enormous, disgusting-smelling beard and has clearly been there a long time. The young soldier immediately sets about trying to escape. He is resourceful and possessed of great willpower. He bribes a guard with his emergency supply of cash. The guard gets him into a supply truck and he makes it to the prison garage, but is found during a routine vehicle search while exiting the compound.
I saw this report go by on the Twitter saying that, in the wake of the latest budget deal, the Census Bureau is planning on eliminating the Statistical Abstract of the United States, pretty much the single most useful informational document the Government produces. The report says, When readying the FY2011 budget, the Census Bureau tapped teams to do thorough, systematic program reviews looking for efficiencies and cost savings. Priorities for programs were set according to mission criticality, and some cuts were made to the economic statistics program.
This morning I listened to an interesting interview on one of Dan Benjamin’s shows. He was talking to Erin Kissane about her new book, The Elements of Content Strategy. Say you are using a website to communicate something to someone, or enable communication between a group of people, or both. The something you are conveying or facilitating is your content. According to Kissane, the job of a “content strategist” is to figure out how best to make sure that content is assembled, presented, and maintained in a way that’s appropriate to its audience.
A useful bit of interactive data visualization for Emmanuel Saez’s time-series on historical trends in income growth and distribution in the United States. As you can see, between 1970 and 2008 people in the bottom 90 percent of the income distribution typically chose not to partake of annual increases in total income, presumably because of a tendency to prefer and thus self-select into lower-paying jobs, or possibly because of an innate dislike for the more complex mathematics (surrounding tax calculations, car payments, and budgeting generally) that is associated with earning more money.
Following up on a conversation with a friend in Philosophy, I took a quick look at the Survey of Earned Doctorates to see the breakdown by gender for Ph.Ds awarded in the United States in 2009. Some nice pictures: Percent female by Division (with Philosophy picked out); Percent female for selected disciplines; and a giant percent female for (almost) all disciplines, with Philosophy picked out for emphasis. The links go to PDFs.
Note (September 2013): Recent changes to Org-Mode since version 8 mean that the instructions here are no longer valid. My Emacs Starter Kit for the Social Sciences contains a more up-to-date export setup consistent with Org-Mode 8 and higher. The reason the instructions below were complicated was partly because of difficulties exporting with XeLaTeX but partly because I wanted—for perhaps irrational reasons—to preserve the ability to have different export pipelines for XeLaTeX and pdfLaTeX.
Because the next official release of Emacs will finally have a built-in package management system, I’ve been able to update the Emacs Starter Kit for the Social Sciences to make it easier to set up. AucTeX is now installed directly as a package, and so is ESS. While the AucTeX package is official, I host the ESS package myself. I haven’t made any changes to ESS, just added a short .el file that the package manager needs.
So, where should you go to find sociology rankings? The NRC? No. U.S. News and World Report? Perhaps, if you also need to catch up on AARP-related news and events. Google, however, shows a new player in the game: So there you have it. If you search for sociology department rankings you’ll find OrgTheory-related material is the first and third hit, with US News managing 2nd (for now) and NRC back somewhere in the dust.
More starter kit stuff. Up till now, the Emacs Starter Kit for the Social Sciences included ESS, but bundled it with the git repo. A better option would be to have it installed via the package mechanism, like AucTeX is now, but it’s not included. The ELPA system allows you to specify repositories besides the official ones, so I’ve created a repository on my own site containing just ESS. I’ve updated the starter kit to include a pointer to it, so now on first install the kit will pull in ESS from there, and compile it for you.
New in nerdery this week, it’s now a bit easier to install the Emacs Starter Kit for the Social Sciences that I put together (based on lots of great work by Phil Hagelberg and, more recently, Eric Schulte). In the past, the fact that AucTeX was both necessary and had to be compiled locally made for some awkward steps in the installation. But AucTeX is now part of the new Emacs Package Manager, so it’s possible to install it automatically.
I think we can all agree that the NRC rankings were a disaster. So, Steve Vaisey and I think that we can generate a new list from scratch. Using Matt Salganik’s excellent “All Our Ideas” site, we’ve set up a tool for pairwise comparison of Sociology Departments. The goal is to get as many head-to-head snap judgments as possible. You can vote as many times as you like — in fact, it’s encouraged.
Following up on some work Gabriel has been doing, here’s a way to accomplish the same sort of thing, with less reliance on loops and more on functions that work on lists. Also, a way to manage the conversion of the .png files to an animated .gif without having to manually rename files. As I say in the comments over at Code and Culture, if the code works as a loop there’s not necessarily a strong reason to vectorize it, but I’d be interested to see whether this approach was at all faster.
I found this post that provides a nice function for conveniently showing some information about R objects in ESS mode. ESS already shows some information about functions as you type them (in the status bar) but this has wider scope. Move the point over an R object (a function, a data frame, etc), hit C-c C-g and a tooltip pops up showing some relevant information about the object, such as the arguments a function takes or a basic summary for a vector and so on.
Via Cosma Shalizi, reports of a very interesting piece of work: Prejudice and truth about the effect of testosterone on human bargaining behaviour, C. Eisenegger, M. Naef, R. Snozzi, M. Heinrichs & E. Fehr, Nature 463, 356-359 (21 January 2010). The abstract: Both biosociological and psychological models, as well as animal research, suggest that testosterone has a key role in social interactions. Evidence from animal studies in rodents shows that testosterone causes aggressive behaviour towards conspecifics.
Jeremy Freese is doing some analysis: So, the General Social Survey reinterviewed a large subset of 2006 respondents in 2008. They have released the data that combines into one file the respondents interviewed for the first time in 2008 and the 2008 reinterviews of the respondents originally interviewed in 2006. In a separate file, of course, you can get the original 2006 interviews for the latter people. What has not yet been released, however, is the variable that would identify what row in the first file corresponds to what row in the second file.
If you use Emacs and ESS to run R, then here’s a nice tweak I found on the Emacs Wiki. The following bit of elisp goes in your .emacs file (or equivalent). Starting with an R file in the buffer, hitting shift-enter vertically splits the window and starts R in the right-side buffer. If R is running and a region is highlighted, shift-enter sends the region over to R to be evaluated.
Via a FB friend: As of April 1, 2006, out of a 2004 Census estimated population of 18 in Teterboro, there were 39 registered voters (216.7% of the population, vs. 55.4% in all of Bergen County). Sadly, the answer may be prosaic. From earlier in the same Wikipedia entry: The 2000 census failed to count any of the residents of the Vincent Place housing units who had moved into the newly built homes in 1999.
Over on the Edge of the American West, Eric has been working up some graphs on what the WPA spent its money on. Eric’s own presentation of the data showed clearly that the WPA spent much more of its money on highways, roads and streets than anything else—so much so, in fact, that graphing it directly obscured some of the variation across the other categories. There was some discussion. My own view in that conversation was (a) that dotplots were preferable to barcharts, and, more substantively, (b) that if you wanted to hone in on the smaller categories, it would be easiest to graph them separately, having established “Highways, roads and streets” as the predominant focus of spending.
Studies of network contagion in health outcomes and behaviors (such as obesity and smoking) are all the rage these days. So it is interesting to read this paper in the current BMJ by Cohen-Cole and Fletcher that uses Add-Health data to establish some statistically significant but substantively rather implausible effects of just this sort: Objective To investigate whether “network effects” can be detected for health outcomes that are unlikely to be subject to network phenomena.
There’s an article on the Netflix Prize in the Times today. You know, where Netflix made half of its ratings data available to people and offered a million bucks to anyone who could write a recommendation algorithm that would do some specified percent better than Netflix’s own. What tripped me up was this sentence about one of the more successful teams: The first major breakthrough came less than a month into the competition.
I’m so far behind on this one. Here’s a figure based on a table Eric sent me. There is a PDF version. There is also a 4-category version (with a PDF too), that breaks out farm workers from the main category.
My periodically-updated guide to choosing your workflow applications has received one of its periodic updates. It has grown an abstract and more up-to-date stuff on backups and versioning. Plus extra jokes. Click through to read—you may have to reload it if you have an old version lurking in your browser cache. This release is officially named “Dime On You Crazy Shriner” in honor of orgtheory’s favorite pseudonymously-produced, bot-driven, stylistically-elliptical demisemisocblog.
Over the past few months, I’ve been messing around with Git and Mercurial, two modern, distributed version control systems (DVCSs). While designed by software engineers, these systems are very useful to people who, like me, write papers and do data analysis in some plain-text file format or other, who very often revise those files, sometimes splitting them off into different branches as projects develop, and who do this work on more than one computer.
A discussion about Mac applications at Scatterplot (which is threatening to spill over into a Windows vs OS X war) reminded me of something. Although not by any means a quant jock, a good deal of my work involves analyzing quantitative data. Almost since I learned how to do that kind of thing at all, I have used software tools designed to make the process easier and less error-prone. The most basic of these is a proper programmer’s text editor with support for whatever statistical software I’m using.
By now you’ve probably all seen this ridiculous graphic from today’s WSJ, which purports to show that the Laffer curve is somehow related to the data points on the figure. Brad DeLong, Kevin Drum, Matt Yglesias, Mark Thoma and Max Sawicky have all rightly had a good old laugh at it, because it’s spectacularly dishonest and stupid. I just want to make a point about so-called outlying cases, like Norway. In discussion threads about this kind of thing, you’ll find people saying stuff like, “I want to see a line showing x, y or z”, or “I want to know what happens when you …”, and very often they’ll add “excluding outliers like Norway from the analysis.” Now, it’s true that in this plot Norway is very unlike the other countries.
Jeff Han works on multi-touch interfaces: touch screens that can recognize more than one point of input, and thus combinations of gestures and so on. Here’s a cool video showing some of the interface methods his company is developing. (Warning: cheesy music.) You can see some cool possibilities for educational bells and whistles, such as the taxonomic tree one of the operators is seen navigating. The possibilities for high-dimensional dynamic data visualization are also obvious.