2020

May 23

Get Apple's Mobility Data

I’ve been maintaining covdata, an R package with a variety of COVID-related datasets in it. That means I’ve been pulling down updated files from various sources every couple of days. Most of these files are at static locations: their internal structure may change occasionally, and a few have moved once or twice since I started looking at them, but the URLs are generally stable. Apple’s Mobility Data is an exception. The URL for the CSV file changes daily, and not just by incrementing the date or something like that. Instead, the file path is a function of whatever version the web CMS is on, and its versioning moves around. Worse, the webpage is dynamically generated in JavaScript when it’s requested, which means we can’t easily scrape it and just look for the URL embedded in the “Download the Data” button.

I resigned myself to doing the update manually for a bit, and then got stuck in the weeds of driving a headless browser from within R so that it could execute the JavaScript and find the URL. But this was a huge pain. When I lamented my situation on Twitter, David Cabo pointed out to me that there’s an index.json file at a stable location that contains the information needed to generate each day’s URL. Here’s how to do just that, and then pull the data into a tibble.

The index.json file is just a small piece of JSON metadata. It looks like this:

{"basePath":"/covid19-mobility-data/2008HotfixDev38/v3",
 "mobilityDataVersion":"2008HotfixDev38:2020-05-21",
 "regions":{"en-us":{"jsonPath":"/en-us/applemobilitytrends.json",
                     "localeNamesPath":"/en-us/locale-names.json",
                     "csvPath":"/en-us/applemobilitytrends-2020-05-21.csv",
                     "initialPath":"/en-us/initial-data.json",
                     "shards":{"defaults":"/en-us/shards/defaults.json"}}}}

So, we grab this file (whose URL we know) and extract the basePath and csvPath values that together point to the data:

get_apple_target <- function(cdn_url = "https://covid19-static.cdn-apple.com",
                             json_file = "covid19-mobility-data/current/v3/index.json") {
  tf <- tempfile(fileext = ".json")
  curl::curl_download(paste0(cdn_url, "/", json_file), tf)
  json_data <- jsonlite::fromJSON(tf)
  paste0(cdn_url, json_data$basePath, json_data$regions$`en-us`$csvPath)
}

## > get_apple_target()
## [1] "https://covid19-static.cdn-apple.com/covid19-mobility-data/2008HotfixDev38/v3/en-us/applemobilitytrends-2020-05-21.csv"

Then we can grab the data itself, with this function:

get_apple_data <- function(url = get_apple_target(),
                             fname = "applemobilitytrends-",
                             date = stringr::str_extract(get_apple_target(), "\\d{4}-\\d{2}-\\d{2}"),
                             ext = "csv",
                             dest = "data-raw/data",
                             save_file = c("n", "y")) {

  save_file <- match.arg(save_file)
  message("target: ", url)

  destination <- fs::path(here::here(dest),
                          paste0("apple_mobility", "_daily_", date), ext = ext)

  tf <- tempfile(fileext = paste0(".", ext))
  curl::curl_download(url, tf)

  ## We don't save the file by default
  switch(save_file,
         y = fs::file_copy(tf, destination),
         n = NULL)

  janitor::clean_names(readr::read_csv(tf))
}

This will pull the data into a tibble, which you can then clean further (e.g., put into long format) as desired.

apple <- get_apple_data()

## target: https://covid19-static.cdn-apple.com/covid19-mobility-data/2008HotfixDev38/v3/en-us/applemobilitytrends-2020-05-21.csv
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   geo_type = col_character(),
##   region = col_character(),
##   transportation_type = col_character(),
##   alternative_name = col_character(),
##   `sub-region` = col_character(),
##   country = col_character(),
##   `2020-05-11` = col_logical(),
##   `2020-05-12` = col_logical()
## )
## See spec(...) for full column specifications.

apple

## # A tibble: 3,625 x 136
##   geo_type region transportation_… alternative_name sub_region country x2020_01_13 x2020_01_14 x2020_01_15
##   <chr>    <chr>  <chr>            <chr>            <chr>      <chr>         <dbl>       <dbl>       <dbl>
## 1 country… Alban… driving          NA               NA         NA              100        95.3       101. 
## 2 country… Alban… walking          NA               NA         NA              100       101.         98.9
## 3 country… Argen… driving          NA               NA         NA              100        97.1       102. 
## 4 country… Argen… walking          NA               NA         NA              100        95.1       101. 
## 5 country… Austr… driving          AU               NA         NA              100       103.        104. 
## 6 country… Austr… transit          AU               NA         NA              100       102.        101. 
## 7 country… Austr… walking          AU               NA         NA              100       101.        102. 
## 8 country… Austr… driving          Österreich       NA         NA              100       101.        104. 
## 9 country… Austr… walking          Österreich       NA         NA              100       102.        106. 
##10 country… Belgi… driving          België|Belgique  NA         NA              100       101.        107. 
## # … with 3,615 more rows, and 127 more variables: x2020_01_16 <dbl>, x2020_01_17 <dbl>, x2020_01_18 <dbl>,
## #   x2020_01_19 <dbl>, x2020_01_20 <dbl>, x2020_01_21 <dbl>, x2020_01_22 <dbl>, x2020_01_23 <dbl>,
## #   x2020_01_24 <dbl>, x2020_01_25 <dbl>, x2020_01_26 <dbl>, x2020_01_27 <dbl>, x2020_01_28 <dbl>,
## #   x2020_01_29 <dbl>, x2020_01_30 <dbl>, x2020_01_31 <dbl>, x2020_02_01 <dbl>, x2020_02_02 <dbl>,
## #   x2020_02_03 <dbl>, x2020_02_04 <dbl>, x2020_02_05 <dbl>, x2020_02_06 <dbl>, x2020_02_07 <dbl>,
## #   x2020_02_08 <dbl>, x2020_02_09 <dbl>, x2020_02_10 <dbl>, x2020_02_11 <dbl>, x2020_02_12 <dbl>,
## #   x2020_02_13 <dbl>, x2020_02_14 <dbl>, x2020_02_15 <dbl>, x2020_02_16 <dbl>, x2020_02_17 <dbl>,
## #   x2020_02_18 <dbl>, x2020_02_19 <dbl>, x2020_02_20 <dbl>, x2020_02_21 <dbl>, x2020_02_22 <dbl>,
## #   x2020_02_23 <dbl>, x2020_02_24 <dbl>, x2020_02_25 <dbl>, x2020_02_26 <dbl>, x2020_02_27 <dbl>,
## #   x2020_02_28 <dbl>, x2020_02_29 <dbl>, x2020_03_01 <dbl>, x2020_03_02 <dbl>, x2020_03_03 <dbl>,
## #   x2020_03_04 <dbl>, x2020_03_05 <dbl>, x2020_03_06 <dbl>, x2020_03_07 <dbl>, x2020_03_08 <dbl>,
## #   x2020_03_09 <dbl>, x2020_03_10 <dbl>, x2020_03_11 <dbl>, x2020_03_12 <dbl>, x2020_03_13 <dbl>,
## #   x2020_03_14 <dbl>, x2020_03_15 <dbl>, x2020_03_16 <dbl>, x2020_03_17 <dbl>, x2020_03_18 <dbl>,
## #   x2020_03_19 <dbl>, x2020_03_20 <dbl>, x2020_03_21 <dbl>, x2020_03_22 <dbl>, x2020_03_23 <dbl>,
## #   x2020_03_24 <dbl>, x2020_03_25 <dbl>, x2020_03_26 <dbl>, x2020_03_27 <dbl>, x2020_03_28 <dbl>,
## #   x2020_03_29 <dbl>, x2020_03_30 <dbl>, x2020_03_31 <dbl>, x2020_04_01 <dbl>, x2020_04_02 <dbl>,
## #   x2020_04_03 <dbl>, x2020_04_04 <dbl>, x2020_04_05 <dbl>, x2020_04_06 <dbl>, x2020_04_07 <dbl>,
## #   x2020_04_08 <dbl>, x2020_04_09 <dbl>, x2020_04_10 <dbl>, x2020_04_11 <dbl>, x2020_04_12 <dbl>,
## #   x2020_04_13 <dbl>, x2020_04_14 <dbl>, x2020_04_15 <dbl>, x2020_04_16 <dbl>, x2020_04_17 <dbl>,
## #   x2020_04_18 <dbl>, x2020_04_19 <dbl>, x2020_04_20 <dbl>, x2020_04_21 <dbl>, x2020_04_22 <dbl>,
## #   x2020_04_23 <dbl>, x2020_04_24 <dbl>, …
##
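
To tidy it up further, here is a minimal sketch of pivoting the per-day columns into long format. It assumes apple is the tibble returned by get_apple_data() above, with janitor-cleaned names like x2020_01_13; apple_long is just an illustrative name.

## A minimal sketch: reshape the per-day columns into long format.
## Assumes `apple` is the tibble returned by get_apple_data() above.
library(dplyr)
library(tidyr)

apple_long <- apple %>%
  pivot_longer(
    cols = starts_with("x2020"),   # one column per day
    names_to = "date",
    names_prefix = "x",
    values_to = "index"
  ) %>%
  mutate(date = as.Date(date, format = "%Y_%m_%d"))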

May 21

The Kitchen Counter Observatory

Every day begins in the same way. I get up. I make my coffee. I look at the data. Everything about this is absurd. To begin with, there’s the absurdity that everyone with a job like mine faces each day. Locked down at home with the kids, trying to get things done, unable to properly teach, write, or think. The household is like a little spacecraft, drifting in the void. Occasionally you venture outside to get supplies, or to check the shields. I find the days are speeding up now, because even though things drag from moment to moment, each twenty-four hour period is essentially identical. It reminds me of when my children were newborns. It’s a daily slog that, in retrospect, fuses into a gray blob almost impossible to recall in any sort of differentiated way.

Far better, of course, to have a mild case of lockdown ennui than to be in the situation of those directly fighting the pandemic, those whose health or livelihood has been devastated by it, or those who carry on out in the world, working to fulfil essential roles. I see some of them individually, at my door or in my social media. I see them in the aggregate in the data. There’s so much data. People working at international agencies, universities, newspapers, magazines, and state and local governments put out more each day, trying to capture the scale and scope of the pandemic. And it’s not just official agencies and businesses, either. One of the best sources of daily information on the pandemic in the United States is being run by a rapidly-assembled team of freelance journalists and volunteers. The COVID Tracking Project was brought into existence by the realization that the Centers for Disease Control were failing to provide the sort of daily updates on case counts and deaths that was part of their reason for existing.

City Driving Data from Apple

Driving activity over the past three months.

With a laptop, some free software, and a cup of coffee, I can examine what ought to seem like a staggering amount of information. Here, for example, is a picture showing what driving patterns have looked like every day in one hundred American cities over the past four months. As if that were a reasonable thing to be able to know while confined to your house! I drew it using information that Apple has been releasing to help researchers quantify the scope of the lockdown around the world. At this point, the full dataset has about half a million observations in it. Google is putting out a similar resource, about four times as large, that lets you see how busy different kinds of places are around the world over the same time period. This sort of thing doesn’t count as “big data” anymore. Back when I was a graduate student, I spent three days in a library manually copying down a few hundred numbers from a long-shelved report about blood donors. Now I sit here at home, surveying the scope of what’s being inflicted on people across the country and around the world as this disease spreads.

People sometimes think (or complain) that working with quantitative data like this inures you to the reality of the human lives that lie behind the numbers. Numbers and measures are crude; they pick up the wrong things; they strip out the meaning of what’s happening to real people; they make it easy to ignore what can’t be counted. There’s something to those complaints. But it’s mostly a lazy critique. In practice, I find that far from distancing you from questions of meaning, quantitative data forces you to confront them. The numbers draw you in. Working with data like this is an unending exercise in humility, a constant compulsion to think through what you can and cannot see, and a standing invitation to understand what the measures really capture—what they mean, and for whom. Those regular spikes in the driving data are the pulse of everyday life as people go out to have a good time at the weekend. That peak there is the Mardi Gras parade in New Orleans. That bump in Detroit was a Garth Brooks concert. Right across the country, that is the sudden shock of the shutdown the second weekend in March. It was a huge collective effort to buy time that, as it turns out, the federal government has more or less entirely wasted. And now through May here comes the gradual return to something like the baseline level of activity from January, proceeding much more quickly in some cities than in others.

I sit at my kitchen-counter observatory and look at the numbers. Before my coffee is ready, I can quickly pull down a few million rows of data courtesy of a national computer network originally designed by the government to be disaggregated and robust, because they were convinced that was what it would take for communication to survive a nuclear war. I can process it using software originally written by academics in their spare time, because they were convinced that sophisticated tools should be available to everyone for free. Through this observatory I can look out without even looking up, surveying the scale and scope of the country’s ongoing, huge, avoidable failure. Everything about this is absurd.

May 9

Covid Concept Generator

To save everyone some time, here’s a generator for the next five years of conceptual advances in social theory. Choose once at random from each column to secure your contribution.

Column 1           Column 2
Sequenced          Stratification
Algorithmic        Differences
Automated          Capital
Robust             Contagion
COVID              Masking
Epidemiologic      Others
Viral              Politics
Rhizomatic         Inequality
Infectious         Sexualities
Compartmentalized  Classification
Pandemic           Causality
Epizootic          Discrimination
Transmissible      Polarization
Leucocyte          Paradox
Intersectional     Bodies
Corona             Disparities
Liquid             Isomorphism
Genomic            Populism
Nucleotide         Interdependence
Masked             Colorism
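
If you would rather let R do the choosing, here is a quick sketch that samples one term from each column of the table above:

## Draw one term at random from each column.
column_1 <- c("Sequenced", "Algorithmic", "Automated", "Robust", "COVID",
              "Epidemiologic", "Viral", "Rhizomatic", "Infectious",
              "Compartmentalized", "Pandemic", "Epizootic", "Transmissible",
              "Leucocyte", "Intersectional", "Corona", "Liquid", "Genomic",
              "Nucleotide", "Masked")
column_2 <- c("Stratification", "Differences", "Capital", "Contagion",
              "Masking", "Others", "Politics", "Inequality", "Sexualities",
              "Classification", "Causality", "Discrimination", "Polarization",
              "Paradox", "Bodies", "Disparities", "Isomorphism", "Populism",
              "Interdependence", "Colorism")

paste(sample(column_1, 1), sample(column_2, 1))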

April 28

New Orleans and Normalization

My post about Apple’s mobility data from a few days ago has been doing the rounds. (People have been very kind.) Unsurprisingly, one of the most thoughtful responses came from Dr. Drang, who wrote up a great discussion about the importance of choosing the right baseline if you’re going to be indexing change with respect to some time. His discussion of Small Multiples and Normalization is really worth your while.

Dr. Drang’s eye was caught by the case of Seattle, where the transit series was odd in a way that was related to Apple’s arbitrary choice of January 13th as the baseline for its series:

One effect of this normalization choice is to make the recent walking and driving requests in Seattle look higher than they should. Apple’s scores suggest that they are currently averaging 50–65% of what they were pre-COVID, but those are artificially high numbers because the norm was set artificially low.

A better way to normalize the data would be to take a week’s average, or a few weeks’ average, before social distancing and scale all the data with that set to 100.

I’ve been continuing to update my covdata package for R as Apple, Google, and other sources release more data. This week, Apple substantially expanded the number of cities and regions it is providing data for. The number of cities in the dataset went up from about 90 to about 150, for example. As I was looking at that data this afternoon, I saw that one of the new cities was New Orleans. Like Seattle, it’s an important city in the story of COVID-19 transmission within its region. And, as it turns out, even more so than Seattle, its series in this particular dataset is warped by the choice of start date. Here are three views of the New Orleans data: the raw series for each mode, the trend component of an STL time series decomposition, and the remainder component of the decomposition. (The methods and code are the same as previously shown.)
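
For reference, here is a compressed sketch of that pipeline. It assumes the apple_mobility table from covdata and the tsibble/feasts tooling used in the April 23rd post below, and it assumes the city is labeled “New Orleans” in the data.

## Sketch: STL decomposition of the New Orleans series, one model per mode.
library(dplyr)
library(tsibble)
library(feasts)   # STL(); attaches fabletools for model() and components()

nola_stl <- apple_mobility %>%
  filter(geo_type == "city", region == "New Orleans") %>%
  select(region, transportation_type, date, index) %>%
  as_tsibble(index = date, key = c(region, transportation_type)) %>%
  model(STL(index)) %>%
  components()

## nola_stl now holds the trend and remainder series shown in the figures
## below, along with the weekly seasonal component.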

Raw New Orleans series

The New Orleans series as provided by Apple. Click or touch to zoom in.

New Orleans trend component

The trend component of the New Orleans series. Click or touch to zoom in.

New Orleans remainder component

The remainder component of the New Orleans series. Click or touch to zoom in.

Two things are evident right away. First, New Orleans has a huge spike in foot-traffic (and other movement around town) the weekend before Mardi Gras, and on Shrove Tuesday itself. The spike is likely accentuated by the tourist traffic. As I noted before, because Apple’s data is derived from the use of Maps for directions, the movements of people who know their way around town aren’t going to show up.

The second thing that jumps out about the series is that for most of January and February, the city is way, way below its notional baseline. How can weekday foot traffic, in particular, routinely be 75 percentage points below the January starting point?

The answer is that on January 13th, Clemson played LSU in the College Football Playoff National Championship at the Superdome in New Orleans. (LSU won 42-25.) This presumably brought a big influx of visitors to town, many of whom were using their iPhones to direct themselves around the city. Because Apple chose January 13th as its baseline day, this unusually busy Monday was set as the “100” mark against which subsequent activity was indexed. Again, as with the strange case of European urban transit, a naive analysis, or even a “sophisticated” one where the researcher did not bother to look at the data first, might easily be led up the garden path.

Dr. Drang has already said most of what I’d say at this point about the value of checking the sanity of one’s starting point (and unlike me, he says it in Python) so I won’t belabor the point. You can see, though, just how huge Mardi Gras is in New Orleans. Were the data properly normalized, the Fat Tuesday spike would be far, far higher than most of the rest of the dataset.
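
For what it’s worth, here is a hedged sketch of the re-indexing Dr. Drang suggests, applied to New Orleans. The reference week starting January 20th is purely an illustrative choice (any quiet stretch before Carnival and the shutdown would do), as is the index_renorm name.

## Sketch: re-index each mode's series to the mean of a pre-distancing
## reference week instead of Apple's January 13th value.
library(dplyr)
library(lubridate)

ref_week <- seq(ymd("2020-01-20"), ymd("2020-01-26"), by = "day")

nola_renorm <- apple_mobility %>%
  filter(geo_type == "city", region == "New Orleans") %>%
  group_by(transportation_type) %>%
  mutate(baseline = mean(index[date %in% ref_week], na.rm = TRUE),
         index_renorm = 100 * index / baseline) %>%
  ungroup()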

April 23

Apple's COVID Mobility Data

Update

I’ve added a GitHub repository containing the code needed to reproduce the graphs in this post, as what’s shown here isn’t self-contained.

Apple recently released a batch of mobility data in connection with the COVID-19 pandemic. The data is aggregated from requests for directions in Apple Maps and is provided at the level of whole countries and also for a selection of large cities around the world. I folded the dataset into the covdata package for R that I’ve been updating, as I plan to use it this Fall in a course I’ll be teaching. Here I’ll take a quick look at some of the data. Along the way—as it turns out—I end up reminding myself of a lesson I’ve learned before about making sure you understand your measure before you think you understand what it is showing.

Apple released time series data for countries and cities for each of three modes of getting around: driving, public transit, and walking. The series begins on January 13th and, at the time of writing, continues down to April 20th. The mobility measures for every country or city are indexed to 100 at the beginning of the series, so trends are relative to that baseline. We don’t know anything about the absolute volume of usage of the Maps service.

Here’s what the data look like:

> apple_mobility
# A tibble: 39,500 x 5
   geo_type       region  transportation_type date       index
   <chr>          <chr>   <chr>               <date>     <dbl>
 1 country/region Albania driving             2020-01-13 100  
 2 country/region Albania driving             2020-01-14  95.3
 3 country/region Albania driving             2020-01-15 101. 
 4 country/region Albania driving             2020-01-16  97.2
 5 country/region Albania driving             2020-01-17 104. 
 6 country/region Albania driving             2020-01-18 113. 
 7 country/region Albania driving             2020-01-19 105. 
 8 country/region Albania driving             2020-01-20  94.4
 9 country/region Albania driving             2020-01-21  94.1
10 country/region Albania driving             2020-01-22  93.5
# … with 39,490 more rows

The index is the measured outcome, tracking relative usage of directions for each mode of transportation. Let’s take a look at the data for New York.
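
One housekeeping note before the code: as the update above says, the snippets in this post aren’t fully self-contained. Roughly the following packages are in play, along with the is_max() and my.colors() helpers used below, which come from the companion GitHub repository rather than from any package.

## Approximate setup for the code below; see the companion repository for
## the exact version, including the is_max() and my.colors() helpers.
library(tidyverse)   # dplyr, ggplot2, tidyr, and friends
library(lubridate)   # as_date()
library(timeDate)    # isWeekend(), isHoliday(), as.timeDate(), listHolidays()
library(ggrepel)     # geom_text_repel()
library(tsibble)     # as_tsibble()
library(feasts)      # STL(); attaches fabletools for model() and components()
library(covdata)     # the apple_mobility table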

raw_ny <- apple_mobility %>%
  filter(region == "New York City") %>%
  select(region:index) %>%
  rename(mode = transportation_type) %>%
  mutate(mode = tools::toTitleCase(mode),
         weekend = isWeekend(date),
         holiday = isHoliday(as.timeDate(date), listHolidays())) %>%
  mutate(max_day = ifelse(is_max(index), date, NA),
         max_day = as_date(max_day))

p_raw_ny <- ggplot(raw_ny, mapping = aes(x = date, y = index,
                                      group = mode, color = mode)) +
  geom_vline(data = subset(raw_ny, holiday == TRUE),
             mapping = aes(xintercept = date),
             color = my.colors("bly")[5], size = 2.9, alpha = 0.1) +
  geom_hline(yintercept = 100, color = "gray40") +
  geom_line() +
  geom_text_repel(aes(label = format(max_day, format = "%a %b %d")),
                  size = rel(2), nudge_x = 1, show.legend = FALSE) +
  scale_color_manual(values = my.colors("bly")) +
  labs(x = "Date", y = "Relative Mobility",
       color = "Mode",
       title = "New York City's relative trends in activity. Baseline data with no correction for weekly seasonality",
       subtitle = "Data are indexed to 100 for usage on January 13th 2020. Weekends shown as vertical bars. Date with highest relative activity index labeled.\nNote that in Apple's data 'Days' are defined as Midnight to Midnight PST.",
       caption = "Data: Apple. Graph: @kjhealy") +
  theme(legend.position = "top")

p_raw_ny

Relative Mobility in New York City. Touch or click to zoom.

As you can see, we have three series. The weekly pulse of activity is immediately visible as people do more or less walking, driving, and taking the subway depending on what day it is. Remember that the data is based on requests for directions. So on the one hand, taxis and Ubers might be making that sort of request every trip. But people living in New York do not require turn-by-turn or step-by-step directions in order to get to work. They already know how to get to work. Even if overall activity is down at the weekends, requests for directions go up as people figure out how to get to restaurants, social events, or other destinations. On the graph here I’ve marked the highest relative value of requests for directions, which is for foot-traffic on February 22nd. I’m not interested in that particular date for New York, but when we look at more than one city it might be useful to see how the maximum values vary.

The big COVID-related drop-off in mobility clearly comes in mid-March. We might want to see just that trend, removing the “noise” of daily variation. When looking at time series, we often want to decompose the series into components, in order to see some underlying trend. There are many ways to do this, and many decisions to be made if we’re going to be making any strong inferences from the data. Here I’ll just keep it straightforward and use some of the very handy tools provided by the tidyverts (sic) packages for time-series analysis. We’ll use an STL decomposition to break the series into trend, seasonal, and remainder components. In this case the “season” is a week rather than a month or a calendar quarter. The trend is a locally-weighted regression fitted to the data, net of seasonality. The remainder is the residual left over on any given day once the underlying trend and “normal” daily fluctuations have been accounted for. Here’s the decomposition for New York, with the remainder component plotted below.

resids_ny <- apple_mobility %>%
  filter(region == "New York City") %>%
  select(region:index) %>%
  rename(mode = transportation_type) %>%
  mutate(mode = tools::toTitleCase(mode)) %>%
  as_tsibble(key = c(region, mode)) %>%
  model(STL(index)) %>%
  components() %>%
  mutate(weekend = isWeekend(date),
         holiday = isHoliday(as.timeDate(date), listHolidays())) %>%
  as_tibble() %>%
  mutate(max_day = ifelse(is_max(remainder), date, NA),
         max_day = as_date(max_day))

p_resid_ny <- ggplot(resids_ny, aes(x = date, y = remainder, group = mode, color = mode)) +
  geom_vline(data = subset(resids_ny, holiday == TRUE),
             mapping = aes(xintercept = date),
             color = my.colors("bly")[5], size = 2.9, alpha = 0.1) +
  geom_line(size = 0.5) +
  geom_text_repel(aes(label = format(max_day, format = "%a %b %d")),
                  size = rel(2), nudge_x = 1, show.legend = FALSE) +
  scale_color_manual(values = my.colors("bly")) +
  labs(x = "Date", y = "Remainder", color = "Mode",
       title = "New York City, Remainder component for activity data",
       subtitle = "Weekends shown as vertical bars. Date with highest remainder component labeled.\nNote that in Apple's data 'Days' are defined as Midnight to Midnight PST.",
       caption = "Data: Apple. Graph: @kjhealy") +
  theme(legend.position = "top")
  
 p_resid_ny 

Remainder component of the New York series. Touch or click to zoom.

We can make a small multiple graph showing the raw data (or the components, as we please) for all the cities in the dataset if we like:

p_base_all <- apple_mobility %>%
  filter(geo_type == "city") %>%
  select(region:index) %>%
  rename(mode = transportation_type) %>%
  ggplot(aes(x = date, y = index, group = mode, color = mode)) +
  geom_line(size = 0.5) +
  scale_color_manual(values = my.colors("bly")) +
  facet_wrap(~ region, ncol = 8) +
  labs(x = "Date", y = "Trend",
       color = "Mode",
       title = "All Modes, All Cities, Base Data",
       caption = "Data: Apple. Graph: @kjhealy") +
  theme(legend.position = "top")

p_base_all

Data for all cities. Touch or click to zoom.

This isn’t the sort of graph that’s going to look great on your phone, but it’s useful for getting some overall sense of the trends. Beyond the sharp declines everywhere—with slightly different timings, something that’d be worth looking at separately—a few other things pop out. There’s a fair amount of variation across cities by mode of transport and also by the intensity of the seasonal component. Some sharp spikes are evident, too, not always on the same day or by the same mode of transport. We can take a closer look at some of the cities of interest on this front.

focus_on <- c("Rio de Janeiro", "Lyon", "Bochum - Dortmund", "Dusseldorf",
              "Barcelona", "Detroit", "Toulouse", "Stuttgart",
              "Cologne", "Hamburg", "Cairo", "Lille")

raw_ts <- apple_mobility %>%
  filter(geo_type == "city") %>%
  select(region:index) %>%
  rename(mode = transportation_type) %>%
  mutate(mode = tools::toTitleCase(mode),
         weekend = isWeekend(date),
         holiday = isHoliday(as.timeDate(date), listHolidays())) %>%
  filter(region %in% focus_on) %>%
  group_by(region) %>%
  mutate(max_day = ifelse(is_max(index), date, NA),
         max_day = as_date(max_day))
         
ggplot(raw_ts, mapping = aes(x = date, y = index,
                                      group = mode, color = mode)) +
  geom_vline(data = subset(raw_ts, holiday == TRUE),
             mapping = aes(xintercept = date),
             color = my.colors("bly")[5], size = 1.5, alpha = 0.1) +
  geom_hline(yintercept = 100, color = "gray40") +
  geom_line() +
  geom_text_repel(aes(label = format(max_day, format = "%a %b %d")),
                  size = rel(2), nudge_x = 1, show.legend = FALSE) +
  scale_color_manual(values = my.colors("bly")) +
  facet_wrap(~ region, ncol = 2) +
  labs(x = "Date", y = "Relative Mobility",
       color = "Mode",
       title = "Relative trends in activity, selected cities. No seasonal correction.",
       subtitle = "Data are indexed to 100 for each city's usage on January 13th 2020. Weekends shown as vertical bars.\nDate with highest relative activity index labeled.\nNote that in Apple's data 'Days' are defined as Midnight to Midnight PST.",
       caption = "Data: Apple. Graph: @kjhealy") +
  theme(legend.position = "top")         

Selected cities only. Touch or click to zoom.

Look at all those transit peaks on February 17th. What’s going on here? At this point, we could take a look at the residual or remainder component of the series rather than looking at the raw data, so we can see if something interesting is happening.

resids <- apple_mobility %>%
  filter(geo_type == "city") %>%
  select(region:index) %>%
  rename(mode = transportation_type) %>%
  mutate(mode = tools::toTitleCase(mode)) %>%
  filter(region %in% focus_on) %>%
  as_tsibble(key = c(region, mode)) %>%
  model(STL(index)) %>%
  components() %>%
  mutate(weekend = isWeekend(date),
         holiday = isHoliday(as.timeDate(date), listHolidays())) %>%
  as_tibble() %>%
  group_by(region) %>%
  mutate(max_day = ifelse(is_max(remainder), date, NA),
         max_day = as_date(max_day))
         
ggplot(resids, aes(x = date, y = remainder, group = mode, color = mode)) +
  geom_vline(data = subset(resids, holiday == TRUE),
             mapping = aes(xintercept = date),
             color = my.colors("bly")[5], size = 1.5, alpha = 0.1) +
  geom_line(size = 0.5) +
  geom_text_repel(aes(label = format(max_day, format = "%a %b %d")),
                  size = rel(2), nudge_x = 1, show.legend = FALSE) +
  scale_color_manual(values = my.colors("bly")) +
  facet_wrap(~ region, ncol = 2) +
  labs(x = "Date", y = "Remainder", color = "Mode",
       title = "Remainder component for activity data (after trend and weekly components removed)",
       subtitle = "Weekends shown as vertical bars. Date with highest remainder component labeled.\nNote that in Apple's data 'Days' are defined as Midnight to Midnight PST.",
       caption = "Data: Apple. Graph: @kjhealy") +
  theme(legend.position = "top")         

Remainder components only. Touch or click to zoom.

We can see that there’s a fair amount of correspondence between the spikes in activity, but it’s not clear what the explanation is. For some cities things seem straightforward. Rio de Janeiro’s huge spike in foot traffic corresponds to the Carnival parade around the week of Mardi Gras. As it turns out—thanks to some local informants for this—the same is true of Cologne, where Carnival season (Fasching) is also a big thing. But that doesn’t explain the spikes that repeatedly show up for February 17th in a number of German and French provincial cities. It’s a week too early. And why specifically in transit requests? What’s going on there? Initially I speculated that it might be connected to events like football matches or something like that, but that didn’t seem very convincing, because those happen week in, week out, and if it were an unusual event (like a final) we wouldn’t see it across so many cities. A second possibility was some widely-shared calendar event that would cause a lot of people to start riding public transit. The beginning or end of school holidays, for example, seemed like a plausible candidate. But if that were the case, why didn’t we see it in other, larger cities in these countries? And are France and Germany on the same school calendars? This isn’t around Easter, so it seems unlikely.

After wondering aloud about this on Twitter, the best candidate for an explanation came from Sebastian Geukes. He pointed out that the February 17th spikes coincide with Apple rolling out expanded coverage of many European cities in the Maps app. That Monday marks the beginning of public transit directions becoming available to iPhone users in these cities. And so, unsurprisingly, the result is a surge in people using Maps for that purpose, in comparison to when it wasn’t a feature. I say “unsurprisingly”, but of course it took a little while to figure this out! And I didn’t figure it out myself, either. It’s an excellent illustration of a rule of thumb I wrote about a while ago in a similar context.

As a rule, when you see a sharp change in a long-running time-series, you should always check to see if some aspect of the data-generating process changed—such as the measurement device or the criteria for inclusion in the dataset—before coming up with any substantive stories about what happened and why. This is especially the case for something susceptible to change over time, but not to extremely rapid fluctuations. … As Tom Smith, the director of the General Social Survey, likes to say, if you want to measure change, you can’t change the measure.

In this case, there’s a further wrinkle. I probably would have been quicker to twig what was going on had I looked a little harder at the raw data rather than moving to the remainder component of the time series decomposition. Having had my eye caught by Rio’s big Carnival spike I went to look at the remainder component for all these cities and so ended up focusing on that. But if you look again at the raw city trends you can see that the transit data series (the blue line) spikes up on February 17th but then sticks around afterwards, settling in to a regular presence, at quite a high relative level in comparison to its previous non-existence. And this of course is because people have begun to use this new feature regularly. If we’d had raw data on the absolute levels of usage in transit directions this would likely have been clear more quickly.

The tendency to launch right into what social scientists call the “Storytime!” phase of data analysis when looking at some graph or table of results is really strong. We already know from other COVID-related analysis how tricky and indeed dangerous it can be to mistakenly infer too much from what you think you see in the data. (Here’s a recent example.) Taking care to understand what your measurement instrument is doing really does matter. In this case, I think, it’s all the more important because with data of the sort that Apple (and also Google) have released, it’s fun to just jump into it and start speculating. That’s because we don’t often get to play with even highly aggregated data from sources like this. I wonder if, in the next year or so, someone doing an ecological, city-level analysis of social response to COVID-19 will inadvertently get caught out by the change in the measure lurking in this dataset.