Data Visualization: A Practical Introduction will begin shipping next week. I've written an R package that contains datasets, functions, and a course packet to go along with the book. The socviz package contains about twenty-five datasets and a number of utility and convenience functions. The datasets range in size from things with just a few rows (used for purely illustrative purposes) to datasets with over 120,000 observations, for practicing with and exploring.
Yesterday, Vox ran a story about changes in food consumption patterns in the United States over the past few decades. It featured this graph:
Vox Time Series
When I saw it, one of those little bells went off in my head:
As a rule, when you see a sharp change in a long-running time-series, you should always check to see if some aspect of the data-generating process changed—such as the measurement device or the criteria for inclusion in the dataset—before coming up with any substantive stories about what happened and why.
Every couple of years, usually after one of the inevitable mass shootings, I find myself updating this graph. The originals were done in 2012. You can read “America is a Violent Country” and “Assault Deaths Within the United States” to see those. This morning I pulled the latest figures from the OECD Health Status database. The method and scope are the same as before. Here is the main figure, showing assault death rates for the US and 23 other OECD countries.
Data Visualization for Social Science: A Practical Introduction with R and ggplot2
I'm writing a book on data visualization, provisionally titled Data Visualization for Social Science: A Practical Introduction with R and ggplot2. As part of that process, largely because I've benefited so much myself from the availability of open and widely shared tools for software development, I'm making the draft version of the book available as its own website.
I saw this pie chart via Beth Popp Berman on Twitter yesterday:
Pie charts of student debts by percent of all borrowers and percent of all debt.
As you probably know, the perceptual qualities of pie charts are not great. In a single pie chart, it is usually harder than it should be to estimate and compare the values shown, especially when there are more than a few wedges and when there are a number of wedges reasonably close in size.
The Congressional Budget Office released its cost estimate report for the American Health Care Act yesterday. There are a few tables at the back summarizing the various budgetary and coverage effects of the proposed law. Of these, Table 4 is pretty interesting. The CBO “projected the average national premiums for a 21-year-old in the nongroup health insurance market in 2026 both under current law and under the AHCA. On the basis of those amounts, CBO calculated premiums for a 40-year-old and a 64-year-old, assuming that the person lives in a state that uses the federal default age-rating methodology”.
Update: Since writing this post, I repeatedly tried to delete the offending review from my profile, but Google Scholar kept re-inserting it as part of its automated trawl through its corpus of articles. The robots were determined to grant me these citations whether I wanted them or not. Finally, in January of 2018, John Fox got the citations he deserved and the error was fixed. True to form, the correction appeared out of the blue and its rationale was completely opaque.
I was playing with some county-level data from the U.S. general election, partly out of a spirit of honest inquiry and partly out of a feeling of morbid curiosity. Because I had some county-level census data to hand, I took a look at the results using some extremely basic demographic information—the two variables that structure America's ur-choropleths, namely population density and percent black. I focused on the counties that flipped from their vote in the 2012 general election.
Yesterday I had a conversation on Twitter with Josh Zumbrun that followed on from this tweet:
This is one of the most horrifying graphics I've ever seen: https://t.co/wM0VJZn0Wg pic.twitter.com/qaUaNFtRPl
Josh Zumbrun (@JoshZumbrun), September 28, 2016
The striking maps he linked to tracked the rise in deaths due to drug-related overdoses over the past 15 years, caused in large part by the surge in use of heroin and synthetic opiates. The details are in the WSJ report on the problem.
Last year I wrote about vaccination exemptions in California kindergartens, drawing on school-level data provided by the state of California about the number of kindergarteners with “personal belief exemptions” (or PBEs) that allow them not to be vaccinated. Today I came across a ggplot package called ggbeeswarm that's designed to create a “beeswarm plot”, or a 1-D scatterplot with a bit of information about the density of the distribution. I had used geom_jitter to do something like this for one of my plots last year, but the geoms in ggbeeswarm are better.
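To make the difference concrete, here is a minimal sketch of the comparison, assuming ggplot2 and ggbeeswarm are installed. It uses ggplot2's built-in mpg data as a stand-in rather than the California PBE data, and the width, cex, and alpha values are arbitrary illustrative choices.

```r
library(ggplot2)
library(ggbeeswarm)

# Stand-in data: ggplot2's built-in mpg dataset, NOT the school-level PBE data.
p_base <- ggplot(mpg, aes(x = class, y = hwy))

# geom_jitter: points are displaced horizontally by random noise,
# so the width of each cloud says nothing about the distribution.
p_jitter <- p_base + geom_jitter(width = 0.2, height = 0, alpha = 0.5)

# geom_quasirandom: horizontal displacement follows the local density,
# so the cloud is wider where more observations fall.
p_quasi <- p_base + geom_quasirandom(alpha = 0.5)

# geom_beeswarm: points are offset just far enough to avoid overlapping.
p_swarm <- p_base + geom_beeswarm(cex = 1.5, alpha = 0.5)
```

Printing each object (for example, print(p_quasi)) draws the plot. In the two ggbeeswarm versions the outline of each group gives a rough picture of its density, which random jitter does not.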