I was teaching some dplyr and ggplot today. Because Coronavirus is in the, uh, air, I decided to work with the mortality data from http://mortality.org and have the students practice getting a bunch of data files into R and then plotting the resulting data quickly and informatively. We took a look at the years around the 1918 Influenza Epidemic and, after poking at the data for a little while, came to realize why it was called the Spanish Flu. Here’s some code you can run if you download the (freely available) 1x1 mortality files from <mortality.org>.
library(here)
library(janitor)
library(tidyverse)

## Where the data is locally
path <- "data/Mx_1x1/"

## Colors for later
my_colors <- c("#0072B2", "#E69F00")

## Some utility functions for cleaning
get_country_name <- function(x){
  read_lines(x, n_max = 1) %>%
    str_extract(".+?,") %>%
    str_remove(",")
}

shorten_name <- function(x){
  str_replace_all(x, " -- ", " ") %>%
    str_replace("The United States of America", "USA") %>%
    snakecase::to_any_case()
}

make_ccode <- function(x){
  str_extract(x, "[:upper:]+((?=\\.))")
}
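To see what the two regular expressions in those helpers are doing, here is a minimal sketch that runs them on literal strings instead of files. The header line and the file name are made up for illustration. The assumption is that each data file starts with a line beginning with the country name followed by a comma, and that each file name begins with an upper-case country code followed by a dot.

```r
library(stringr)

# Hypothetical first line of a data file and a hypothetical file path
header <- "Australia, Death rates (period 1x1), Last modified (hypothetical)"
fname  <- "data/Mx_1x1/AUS.Mx_1x1.txt"

# Lazy match: everything up to and including the first comma,
# then drop the comma itself
country <- str_remove(str_extract(header, ".+?,"), ",")

# A run of capital letters that is immediately followed by a literal dot.
# The lookahead (?=\\.) checks for the dot without including it in the match.
ccode <- str_extract(fname, "[:upper:]+((?=\\.))")

country
ccode
```

The lookahead is what stops the capital M in `Mx_1x1` from being picked up: it is followed by an `x`, not a dot.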
First we’re going to make a little tibble of country codes, names, and associated file paths.
filenames <- dir(path = here(path),
                 pattern = "*.txt",
                 full.names = TRUE)

countries <- tibble(country = map_chr(filenames, get_country_name),
                    cname = map_chr(country, shorten_name),
                    ccode = map_chr(filenames, make_ccode),
                    path = filenames)

countries

# A tibble: 49 x 4
   country    cname     ccode path
   <chr>      <chr>     <chr> <chr>
 1 Australia  australia AUS   /Users/kjhealy/Documents/data/misc/lexi…
 2 Austria    austria   AUT   /Users/kjhealy/Documents/data/misc/lexi…
 3 Belgium    belgium   BEL   /Users/kjhealy/Documents/data/misc/lexi…
 4 Bulgaria   bulgaria  BGR   /Users/kjhealy/Documents/data/misc/lexi…
 5 Belarus    belarus   BLR   /Users/kjhealy/Documents/data/misc/lexi…
 6 Canada     canada    CAN   /Users/kjhealy/Documents/data/misc/lexi…
 7 Switzerla… switzerl… CHE   /Users/kjhealy/Documents/data/misc/lexi…
 8 Chile      chile     CHL   /Users/kjhealy/Documents/data/misc/lexi…
 9 Czechia    czechia   CZE   /Users/kjhealy/Documents/data/misc/lexi…
10 East Germ… east_ger… DEUTE /Users/kjhealy/Documents/data/misc/lexi…
# … with 39 more rows
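One thing worth noticing in that call is that `tibble()` builds its columns in order, so `cname` can be computed from `country`, a column defined just before it in the same call. Here is a small sketch of the same pattern. It uses two hypothetical file names and base R string functions rather than the helpers from the post, so treat the names as placeholders.

```r
library(tibble)
library(purrr)

# Hypothetical file paths, for illustration only
files <- c("data/Mx_1x1/AUS.Mx_1x1.txt",
           "data/Mx_1x1/BEL.Mx_1x1.txt")

lookup <- tibble(
  path  = files,
  # Strip the directory, then everything from the first dot onward
  ccode = map_chr(path, ~ sub("\\..*$", "", basename(.x))),
  # This column refers to ccode, which was created one line up
  lower = tolower(ccode)
)

lookup
```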
Next we ingest the data as a nested column, clean it a little, and subset it to the countries whose mortality series actually reach back to the relevant time period.
mortality <- countries %>%
  mutate(data = map(path, ~ read_table(., skip = 2, na = "."))) %>%
  unnest(cols = c(data)) %>%
  clean_names() %>%
  mutate(age = as.integer(recode(age, "110+" = "110"))) %>%
  select(-path) %>%
  nest(data = c(year:total))

## Subset to flu years / countries
flu <- mortality %>%
  unnest(cols = c(data)) %>%
  group_by(country) %>%
  filter(min(year) < 1918)

flu

# A tibble: 298,923 x 8
# Groups:   country [14]
   country cname   ccode  year   age  female    male   total
   <chr>   <chr>   <chr> <dbl> <int>   <dbl>   <dbl>   <dbl>
 1 Belgium belgium BEL    1841     0 0.152   0.187   0.169
 2 Belgium belgium BEL    1841     1 0.0749  0.0741  0.0745
 3 Belgium belgium BEL    1841     2 0.0417  0.0398  0.0408
 4 Belgium belgium BEL    1841     3 0.0255  0.0233  0.0244
 5 Belgium belgium BEL    1841     4 0.0185  0.0171  0.0178
 6 Belgium belgium BEL    1841     5 0.0139  0.0124  0.0132
 7 Belgium belgium BEL    1841     6 0.0128  0.0102  0.0115
 8 Belgium belgium BEL    1841     7 0.0109  0.00800 0.00944
 9 Belgium belgium BEL    1841     8 0.00881 0.00701 0.00789
10 Belgium belgium BEL    1841     9 0.00814 0.00696 0.00754
# … with 298,913 more rows
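The grouped `filter()` at the end is easy to misread. Because the data is grouped by country, `min(year) < 1918` is evaluated once per country, and a country keeps either all of its rows or none of them. It does not drop the individual rows from 1918 onward. A toy sketch with invented country labels and years (not real data) shows the behavior:

```r
library(dplyr)
library(tibble)

# Made-up example: country "A" has records from before 1918, "B" does not
toy <- tibble(
  country = c("A", "A", "A", "B", "B"),
  year    = c(1900, 1918, 1930, 1950, 1960)
)

kept <- toy %>%
  group_by(country) %>%
  # One TRUE/FALSE per group, recycled across every row in that group
  filter(min(year) < 1918) %>%
  ungroup()

kept
```

All three rows for "A" survive, including the one from 1930, while "B" disappears entirely.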
For the purposes of labeling an upcoming plot, we’re going to make a little dummy dataset.
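As a rough sketch of how this kind of labeling trick can work in ggplot2: if a text layer gets its own small data frame, and that data frame only contains the facet value of the first panel, the label is drawn in that panel and nowhere else. Everything below is an assumption made for illustration, including the names `toy`, `label_df`, and `agegrp`, the label text, and its coordinates. The plotted values are flat placeholder lines, not mortality rates.

```r
library(ggplot2)

# Placeholder series, NOT real mortality rates: two panels, flat lines
toy <- data.frame(
  year   = rep(1900:1929, times = 2),
  agegrp = rep(c("Age 10", "Age 20"), each = 30),
  female = rep(c(0.002, 0.004), each = 30)
)

# Hypothetical label data: a single row whose facet value is "Age 10",
# so the text layer has nothing to draw in any other panel
label_df <- data.frame(
  year   = 1918,
  agegrp = "Age 10",
  female = 0.003,
  label  = "1918"
)

p <- ggplot(toy, aes(x = year, y = female)) +
  geom_line() +
  # This layer uses label_df instead of toy; x and y are inherited
  geom_text(data = label_df, aes(label = label)) +
  facet_wrap(~ agegrp)

p
```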
And now we filter the data to look only at female mortality between 1900 and 1929 for a series of specific ages: every decade from 10 years old to 60 years old. We’ll use that dummy dataset to label the first (but only the first) panel in the faceted plot we’re going to draw.
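The filtering step described here can be sketched as follows. To keep the example self-contained it runs on a placeholder grid of years and ages rather than the real `flu` table, and the `female` column is filler, not a mortality rate. It also assumes that "between 1900 and 1929" includes both endpoints, and the `agegrp` column is a hypothetical name for a facet label.

```r
library(dplyr)

# Placeholder grid, NOT real data: every year/age combination,
# with a constant filler value in the female column
toy <- expand.grid(year = 1890:1935, age = 0:110) %>%
  as_tibble() %>%
  mutate(female = 0.001)

flu_sub <- toy %>%
  filter(year > 1899, year < 1930,                      # 1900 to 1929 inclusive
         age %in% seq(from = 10, to = 60, by = 10)) %>% # ages 10, 20, ..., 60
  mutate(agegrp = paste("Age", age)) %>%                # hypothetical facet label
  select(year, age, agegrp, female)

flu_sub
```

With 30 years and 6 ages, the result has one row per year for each of the six age groups, which is the shape a six-panel faceted plot wants.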