R on kieranhealy.org

gssr is now two packages: gssr and gssrdoc

Mon, 15 Apr 2024 16:18:57 -0400

Summary

My gssr package is now two packages: gssr and gssrdoc. They’re also available as binary packages via R-Universe which means they will install much faster.

The GSS is a big survey with a big codebook. Distributing it as an R package poses a few challenges. It’s too big for CRAN, of course, but that’s fine because CRAN is not a repository for datasets in any case. For some time, my gssr package has bundled the main data file, the panel datasets, and functions for getting the file for a particular year directly from NORC. Recently, I started integrating the codebook—or at least, summaries of every variable in the 1972-2022 data file—into the package. It’s a handy feature. It lets you look up GSS variables as if they were R functions:

Looking up a GSS variable

The main downside to doing this is that it makes a large package even larger. In addition, it takes a fair amount of time to install from source because more than 6,500 variables have to be documented during the installation. Providing binary packages would be much better. rOpenSci’s R-Universe provides a package-building service that rests on a bunch of GitHub Actions. But the resource constraints of GitHub’s runners meant that building the source package failed on Ubuntu specifically, so I couldn’t use the service. To get around this I have split the package in two. There’s now gssr, which has the datasets (and the ability to fetch yearly datasets) exactly as before, and gssrdoc, which provides the integrated help. They are fully independent of one another. If you install both, you get exactly what gssr used to give you by itself. I think splitting them like this is worth it just because R-Universe can now build binaries of each package, which means installation is much faster and you can use install.packages(). To install both, do:

# Install 'gssr' from the 'kjhealy' universe
install.packages('gssr', repos =
  c('https://kjhealy.r-universe.dev', 'https://cloud.r-project.org'))

# Also recommended: install 'gssrdoc' as well
install.packages('gssrdoc', repos =
  c('https://kjhealy.r-universe.dev', 'https://cloud.r-project.org'))

You can of course permanently add my or any other R-Universe repo to the default list of repos that install.packages() will search by using options() either in a project or in your .Rprofile. The R-Universe help repo has some additional details.
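
As a minimal sketch of the options() approach, something like the following could go in a project script or in your .Rprofile. The element names (kjhealy, CRAN) are just illustrative labels, not anything R-Universe requires.

```r
# Put the R-Universe repo ahead of a CRAN mirror in the default repo list.
# The element names ("kjhealy", "CRAN") are illustrative labels.
options(repos = c(
  kjhealy = "https://kjhealy.r-universe.dev",
  CRAN    = "https://cloud.r-project.org"
))

# install.packages() will now search both repos without a repos argument:
# install.packages(c("gssr", "gssrdoc"))
getOption("repos")
```

Setting this in .Rprofile makes it apply to every session; setting it at the top of a project script keeps it local to that project.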

Note that if you install both packages you can just load library(gssr), but if you don’t want to load gssrdoc you can still query it at the console with e.g. ??polviews or ?gssrdoc::fefam.

Daily Average Sea Surface Temperature Animation

Fri, 12 Apr 2024 07:13:47 -0400

Yesterday evening I gave a talk about data visualization to Periodic Tables, a Science Cafe run by Misha Angrist. It was a lot of fun! Amongst other things, I made an animation of the NOAA Daily Sea Surface Temperature Graph from the other week. Here it is:

Here’s the static graph.

Global mean sea surface temperature 1981-2024

And because the hardy perennial of whether, for the sake of honesty and not Lying With Graphs, you should always have your y-axis go to zero also came up, I made a zero-baseline version of the average temperature graph.

Mean global sea surface temperature with a zero baseline on the y-axis

I’ve added these to the GitHub repo. In making the animation, I found a nice little wrinkle that let me put a ticking version of the year in the title even though year is not the frame_along driving the transition_reveal() that makes the animation. If I get a chance I’ll write this up separately.

Make Your Own NOAA Sea Temperature Graph

Thu, 04 Apr 2024 08:06:14 -0400

Sea-surface temperatures in the North Atlantic have been in the news recently as they continue to break records. While there are already a number of excellent summaries and graphs of the data, I thought I’d have a go at making some myself. The starting point is the detailed data made available by the National Centers for Environmental Information, part of NOAA. As always, the sheer volume of high-quality data agencies like this make available to the public is astonishing.

Getting the Data

The specific dataset is the NOAA 0.25-degree Daily Optimum Interpolation Sea Surface Temperature (OISST), Version 2.1, which takes a global network of daily temperature observations (from things like buoys and platforms, but also satellites), and then interpolates and aggregates them to a regular spatial grid of observations at 0.25 degrees resolution.

The data is available as daily global observations going back to September 1st, 1981. Each day’s data is available as a single file in subdirectories organized by year-month. It’s all here. Each file is about 1.6MB in size. There are more than fifteen thousand of them.

We can make a folder called raw in our project and then get all the data, preserving its subdirectory structure, with a wget command like this:

wget --no-parent -r -l inf --wait 5 --random-wait 'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/'

This tries to be polite to NOAA’s servers. (I think the webserver is also lightly throttled in any case, but it never hurts to be nice.) The switches --no-parent -r -l inf tell wget not to move upwards in the folder hierarchy, but to recurse downwards indefinitely. The --wait 5 --random-wait switches jointly enforce a five-second wait between requests, randomly varied between 0.5 and 1.5 times the wait period. Downloading the files this way will take several days of real time, though of course much less in actual file transfer time.
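
If you only want a handful of days rather than the whole archive, the year-month directory layout makes the file URLs predictable. Here is a rough sketch of a helper that builds the URL for a given date. The function name oisst_url() is hypothetical, and the filename pattern is assumed from the finalized files in the archive (e.g. 198405/oisst-avhrr-v02r01.19840527.nc); very recent preliminary files may be named differently.

```r
# Hypothetical helper: build the URL of the daily OISST file for a date,
# assuming the <YYYYMM>/oisst-avhrr-v02r01.<YYYYMMDD>.nc naming pattern.
oisst_url <- function(date) {
  date <- as.Date(date)
  base <- paste0("https://www.ncei.noaa.gov/data/",
                 "sea-surface-temperature-optimum-interpolation/",
                 "v2.1/access/avhrr")
  sprintf("%s/%s/oisst-avhrr-v02r01.%s.nc",
          base,
          format(date, "%Y%m"),     # year-month subdirectory
          format(date, "%Y%m%d"))   # date stamp in the filename
}

oisst_url("1984-05-27")
```

The result can be handed to download.file() or to wget for a one-off fetch.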

The netCDF data format

The data are netCDF files, an interesting and in fact quite nice self-documenting binary file format. These files have regular arrays of data of some n-dimensional size, e.g. latitude by longitude, with measures that can be thought of as being stacked on the array. (E.g. a grid of points with a measure at each point for surface temperature, sea ice extent, etc, etc). So you have the potential to have big slabs of data that you can then summarize by aggregating on some dimension or other.

The ncdf4 package can read them in R, though as it turns out we won’t use this package for the analysis. Here’s what one file looks like:

ncdf4::nc_open(all_fnames[1000])

File /Users/kjhealy/Documents/data/misc/noaa_ncei/raw/www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198405/oisst-avhrr-v02r01.19840527.nc (NC_FORMAT_NETCDF4):

     4 variables (excluding dimension variables):
        short anom[lon,lat,zlev,time]   (Chunking: [1440,720,1,1])  (Compression: shuffle,level 4)
            long_name: Daily sea surface temperature anomalies
            _FillValue: -999
            add_offset: 0
            scale_factor: 0.00999999977648258
            valid_min: -1200
            valid_max: 1200
            units: Celsius
        short err[lon,lat,zlev,time]   (Chunking: [1440,720,1,1])  (Compression: shuffle,level 4)
            long_name: Estimated error standard deviation of analysed_sst
            units: Celsius
            _FillValue: -999
            add_offset: 0
            scale_factor: 0.00999999977648258
            valid_min: 0
            valid_max: 1000
        short ice[lon,lat,zlev,time]   (Chunking: [1440,720,1,1])  (Compression: shuffle,level 4)
            long_name: Sea ice concentration
            units: %
            _FillValue: -999
            add_offset: 0
            scale_factor: 0.00999999977648258
            valid_min: 0
            valid_max: 100
        short sst[lon,lat,zlev,time]   (Chunking: [1440,720,1,1])  (Compression: shuffle,level 4)
            long_name: Daily sea surface temperature
            units: Celsius
            _FillValue: -999
            add_offset: 0
            scale_factor: 0.00999999977648258
            valid_min: -300
            valid_max: 4500

     4 dimensions:
        time  Size:1   *** is unlimited *** 
            long_name: Center time of the day
            units: days since 1978-01-01 12:00:00
        zlev  Size:1 
            long_name: Sea surface height
            units: meters
            actual_range: 0, 0
            positive: down
        lat  Size:720 
            long_name: Latitude
            units: degrees_north
            grids: Uniform grid from -89.875 to 89.875 by 0.25
        lon  Size:1440 
            long_name: Longitude
            units: degrees_east
            grids: Uniform grid from 0.125 to 359.875 by 0.25

    37 global attributes:
        title: NOAA/NCEI 1/4 Degree Daily Optimum Interpolation Sea Surface Temperature (OISST) Analysis, Version 2.1 - Final
        source: ICOADS, NCEP_GTS, GSFC_ICE, NCEP_ICE, Pathfinder_AVHRR, Navy_AVHRR
        id: oisst-avhrr-v02r01.19840527.nc
        naming_authority: gov.noaa.ncei
        summary: NOAAs 1/4-degree Daily Optimum Interpolation Sea Surface Temperature (OISST) (sometimes referred to as Reynolds SST, which however also refers to earlier products at different resolution), currently available as version v02r01, is created by interpolating and extrapolating SST observations from different sources, resulting in a smoothed complete field. The sources of data are satellite (AVHRR) and in situ platforms (i.e., ships and buoys), and the specific datasets employed may change over time. At the marginal ice zone, sea ice concentrations are used to generate proxy SSTs.  A preliminary version of this file is produced in near-real time (1-day latency), and then replaced with a final version after 2 weeks. Note that this is the AVHRR-ONLY DOISST, available from Oct 1981, but there is a companion DOISST product that includes microwave satellite data, available from June 2002
        cdm_data_type: Grid
        history: Final file created using preliminary as first guess, and 3 days of AVHRR data. Preliminary uses only 1 day of AVHRR data.
        date_modified: 2020-05-08T19:05:13Z
        date_created: 2020-05-08T19:05:13Z
        product_version: Version v02r01
        processing_level: NOAA Level 4
        institution: NOAA/National Centers for Environmental Information
        creator_url: https://www.ncei.noaa.gov/
        creator_email: oisst-help@noaa.gov
        keywords: Earth Science > Oceans > Ocean Temperature > Sea Surface Temperature
        keywords_vocabulary: Global Change Master Directory (GCMD) Earth Science Keywords
        platform: Ships, buoys, Argo floats, MetOp-A, MetOp-B
        platform_vocabulary: Global Change Master Directory (GCMD) Platform Keywords
        instrument: Earth Remote Sensing Instruments > Passive Remote Sensing > Spectrometers/Radiometers > Imaging Spectrometers/Radiometers > AVHRR > Advanced Very High Resolution Radiometer
        instrument_vocabulary: Global Change Master Directory (GCMD) Instrument Keywords
        standard_name_vocabulary: CF Standard Name Table (v40, 25 January 2017)
        geospatial_lat_min: -90
        geospatial_lat_max: 90
        geospatial_lon_min: 0
        geospatial_lon_max: 360
        geospatial_lat_units: degrees_north
        geospatial_lat_resolution: 0.25
        geospatial_lon_units: degrees_east
        geospatial_lon_resolution: 0.25
        time_coverage_start: 1984-05-27T00:00:00Z
        time_coverage_end: 1984-05-27T23:59:59Z
        metadata_link: https://doi.org/10.25921/RE9P-PT57
        ncei_template_version: NCEI_NetCDF_Grid_Template_v2.0
        comment: Data was converted from NetCDF-3 to NetCDF-4 format with metadata updates in November 2017.
        sensor: Thermometer, AVHRR
        Conventions: CF-1.6, ACDD-1.3
        references: Reynolds, et al.(2007) Daily High-Resolution-Blended Analyses for Sea Surface Temperature (available at https://doi.org/10.1175/2007JCLI1824.1). Banzon, et al.(2016) A long-term record of blended satellite and in situ sea-surface temperature for climate monitoring, modeling and environmental studies (available at https://doi.org/10.5194/essd-8-165-2016). Huang et al. (2020) Improvements of the Daily Optimum Interpolation Sea Surface Temperature (DOISST) Version v02r01, submitted.Climatology is based on 1971-2000 OI.v2 SST. Satellite data: Pathfinder AVHRR SST and Navy AVHRR SST. Ice data: NCEP Ice and GSFC Ice.

This is the file for May 27th, 1984. As you can see, there’s a lot of metadata. Each variable is admirably well-documented. The key information for our purposes is that we have a grid of 1440 by 720 lat-lon points. There are two additional dimensions—time and elevation (zlev)—but these are both just 1 for each particular file, because every file is observations at elevation zero on a particular day. There are four measures at each point: sea surface temperature anomalies, the standard deviation of the sea surface temperature estimate, sea ice concentration (as a percentage), and sea surface temperature (in degrees Celsius). So we have four bits of data for each grid point on our 1440 * 720 grid, which makes for just over 4.1 million data points per day since 1981.
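
The arithmetic here, and the way the stored values map to physical units, can be sketched in a few lines of base R. The variables are stored as short integers with a scale_factor, an add_offset, and a _FillValue, and the usual netCDF convention is that the physical value is the stored value times the scale factor plus the offset. The unpack() helper is a hypothetical name for illustration, and it rounds the scale factor to 0.01. Readers such as ncdf4 generally apply this conversion for you, so you should not normally need to do it by hand.

```r
# Grid arithmetic from the metadata above
n_lon  <- 1440   # 0.25-degree steps around 360 degrees of longitude
n_lat  <- 720    # 0.25-degree steps across 180 degrees of latitude
n_vars <- 4      # anom, err, ice, sst

cells_per_layer <- n_lon * n_lat             # 1036800
values_per_day  <- cells_per_layer * n_vars  # 4147200

# Hypothetical helper: turn stored short integers into physical units
# using the scale_factor / add_offset / _FillValue attributes.
unpack <- function(raw, scale_factor = 0.01, add_offset = 0, fill = -999) {
  raw[raw == fill] <- NA            # fill values mark missing cells
  raw * scale_factor + add_offset
}

# The sst valid range of -300 to 4500 corresponds to -3 to 45 degrees Celsius
unpack(c(-300, 4500, -999))
```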

Processing the Data

We read in the filenames and see how many we have:

all_fnames <- fs::dir_ls(here("raw"), recurse = TRUE, glob = "*.nc")
length(all_fnames)
#> [1] 15549

What we want to do is read in all this data and aggregate it so that we can take, for instance, the global average for each day and plot that trend for each year. Or perhaps we want to do that for specific regions of the globe, either defined directly by us in terms of some latitude and longitude polygon, or taken from the coordinates of some conventional division of the world’s oceans and seas into named areas.

Our tool of choice is the terra package, which is designed specifically for this kind of data. It has a number of methods for conveniently aggregating and cutting into arrays of geospatial data. The ncdf4 package has a lot of useful features, too, but for the specific things we want to do terra’s toolkit is quicker. One thing it can do, for example, is naturally aggregate over-time layers into single “bricks” of data, and then quickly slice, summarize, or calculate on these arrays.

So, let’s chunk our filenames into units of 25 days or so. This will make the multi-file operation we’re about to perform run faster, because we can read in and operate on a raster of 25 days at once instead of doing the same thing on 25 separate rasters. There’s probably an optimal chunk size, but I didn’t search too hard for it.

## This one gives you an unknown number of chunks each with approx n elements
chunk <- function(x, n) split(x, ceiling(seq_along(x)/n))
chunked_fnames <- chunk(all_fnames, 25)
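
To see what the chunking helper does, here is a small check on plain integer vectors standing in for the filenames. The vector lengths are just stand-ins; 15549 matches the file count reported above.

```r
# Same helper as above: split x into consecutive groups of (at most) n elements
chunk <- function(x, n) split(x, ceiling(seq_along(x) / n))

# A toy vector of 60 elements gives two full chunks and one partial one
toy_chunks <- chunk(1:60, 25)
lengths(toy_chunks)

# With 15549 elements and n = 25 we get ceiling(15549 / 25) chunks,
# and the last chunk holds whatever is left over
big_chunks <- chunk(seq_len(15549), 25)
length(big_chunks)
length(big_chunks[[length(big_chunks)]])
```

Only the final chunk can be shorter than n, which is why the comment in the original says "approx n elements".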

Next, we write a function to process a raster file that terra creates. It calculates the area-weighted means of the layer variables. We have to weight our mean temperature calculation by area (instead of just directly taking the average of all the points) because the area of each degree-denominated grid cell gets smaller the closer you get to the poles. (This is because, some current views notwithstanding, the Earth is round.)

layerinfo <- tibble(
  num = c(1:4),
  raw_name = c("anom_zlev=0", "err_zlev=0",
               "ice_zlev=0", "sst_zlev=0"),
  name = c("anom", "err",
           "ice", "sst"))

process_raster <- function(fnames, crop_area = c(-80, 0, 0, 60), layerinfo = layerinfo) {

  tdf <- terra::rast(fnames) |>
    terra::rotate() |>   # Convert 0 to 360 lon to -180 to +180 lon
    terra::crop(crop_area) # Manually crop to a defined box.  Default is roughly N. Atlantic lat/lon box

  wts <- terra::cellSize(tdf, unit = "km") # For scaling. Because the Earth is round.

  # global() calculates a quantity for the whole grid on a particular SpatRaster
  # so we get one weighted mean per file that comes in
  out <- data.frame(date = terra::time(tdf),
                    means = terra::global(tdf, "mean", weights = wts, na.rm=TRUE))
  out$var <- rownames(out)
  out$var <- gsub("_.*", "", out$var)
  out <- reshape(out, idvar = "date",
                 timevar = "var",
                 direction = "wide")

  colnames(out) <- gsub("weighted_mean\\.", "", colnames(out))
  out
}
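
To get a feel for why the area weighting matters, here is a toy base R comparison. It is a sketch under simplifying assumptions, not what terra does internally: the weight of each cell is approximated by the cosine of its latitude (which is roughly how cell area shrinks on a sphere for an equal-angle grid), and the temperature profile is invented purely for illustration.

```r
# Cell-centre latitudes of a 0.25-degree grid, as in the file metadata
lat <- seq(-89.875, 89.875, by = 0.25)

# Approximate relative cell area: proportional to cos(latitude)
w <- cos(lat * pi / 180)

# A made-up temperature profile: warm at the equator, cold at the poles
temp <- 28 * cos(lat * pi / 180)

unweighted <- mean(temp)               # treats every latitude band equally
weighted   <- weighted.mean(temp, w)   # downweights the small polar cells

c(unweighted = unweighted, weighted = weighted)
```

In this toy case the unweighted mean comes out colder than the weighted one, because the many small polar cells count for as much as the large tropical ones.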

For a single file, this gives us one number for each variable:

# World box (60S to 60N)
world_crop_bb <- c(-180, 180, -60, 60)

process_raster(all_fnames[10000], crop_area = world_crop_bb)
#>                   date       anom       err       ice      sst
#> anom_zlev=0 2009-01-20 0.01397327 0.1873972 0.6823713 20.26344

For 25 filenames, 25 rows:

process_raster(chunked_fnames[[1]], crop_area = world_crop_bb)
#>                      date       anom       err       ice      sst
#> anom_zlev=0    1981-09-01 -0.1312008 0.2412722 0.4672954 20.12524
#> anom_zlev=0.1  1981-09-02 -0.1383695 0.2483428 0.4933853 20.11629
#> anom_zlev=0.2  1981-09-03 -0.1419441 0.2583364 0.4807980 20.11095
#> anom_zlev=0.3  1981-09-04 -0.1434012 0.2627574 0.5125643 20.10772
#> anom_zlev=0.4  1981-09-05 -0.1527941 0.2520100 0.4889709 20.09655
#> anom_zlev=0.5  1981-09-06 -0.1590382 0.2421610 0.5253917 20.08851
#> anom_zlev=0.6  1981-09-07 -0.1603969 0.2406726 0.4959906 20.08539
#> anom_zlev=0.7  1981-09-08 -0.1530743 0.2437756 0.5203092 20.09094
#> anom_zlev=0.8  1981-09-09 -0.1503720 0.2483605 0.5062930 20.09187
#> anom_zlev=0.9  1981-09-10 -0.1532902 0.2574440 0.5275545 20.08718
#> anom_zlev=0.10 1981-09-11 -0.1409007 0.2548919 0.5111582 20.09779
#> anom_zlev=0.11 1981-09-12 -0.1459493 0.2438222 0.5395167 20.09097
#> anom_zlev=0.12 1981-09-13 -0.1540702 0.2341866 0.5259677 20.08107
#> anom_zlev=0.13 1981-09-14 -0.1719063 0.2322755 0.5650545 20.06144
#> anom_zlev=0.14 1981-09-15 -0.1879679 0.2319289 0.5357815 20.04363
#> anom_zlev=0.15 1981-09-16 -0.2021128 0.2330142 0.5718586 20.02638
#> anom_zlev=0.16 1981-09-17 -0.2163771 0.2371551 0.5434053 20.00766
#> anom_zlev=0.17 1981-09-18 -0.2317916 0.2366315 0.5757664 19.98781
#> anom_zlev=0.18 1981-09-19 -0.2321086 0.2388878 0.5458579 19.98307
#> anom_zlev=0.19 1981-09-20 -0.2478310 0.2388981 0.5682817 19.96289
#> anom_zlev=0.20 1981-09-21 -0.2477164 0.2366739 0.5428888 19.95858
#> anom_zlev=0.21 1981-09-22 -0.2315305 0.2369557 0.5636612 19.97033
#> anom_zlev=0.22 1981-09-23 -0.2079270 0.2401278 0.5423280 19.98950
#> anom_zlev=0.23 1981-09-24 -0.1803567 0.2397868 0.5666913 20.01262
#> anom_zlev=0.24 1981-09-25 -0.1704838 0.2376401 0.5437584 20.01805

We need to do this for all the files so we get a complete dataset. We take advantage of the futureverse to parallelize the operation, because doing this with 15,000 files is going to take a bit of time. Then at the end we clean it up a little bit.

season <-  function(in_date){
  br = yday(as.Date(c("2019-03-01",
                      "2019-06-01",
                      "2019-09-01",
                      "2019-12-01")))
  x = yday(in_date)
  x = cut(x, breaks = c(0, br, 366))
  levels(x) = c("Winter", "Spring", "Summer", "Autumn", "Winter")
  x
}

world_df <- future_map(chunked_fnames, process_raster,
                       crop_area = world_crop_bb) |>
  list_rbind() |>
  as_tibble() |>
  mutate(date = ymd(date),
         year = lubridate::year(date),
         month = lubridate::month(date),
         day = lubridate::day(date),
         yrday = lubridate::yday(date),
         season = season(date))

world_df
#> # A tibble: 15,549 × 10
#>    date         anom   err   ice   sst  year month   day yrday season
#>    <date>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 
#>  1 1981-09-01 -0.131 0.241 0.467  20.1  1981     9     1   244 Summer
#>  2 1981-09-02 -0.138 0.248 0.493  20.1  1981     9     2   245 Autumn
#>  3 1981-09-03 -0.142 0.258 0.481  20.1  1981     9     3   246 Autumn
#>  4 1981-09-04 -0.143 0.263 0.513  20.1  1981     9     4   247 Autumn
#>  5 1981-09-05 -0.153 0.252 0.489  20.1  1981     9     5   248 Autumn
#>  6 1981-09-06 -0.159 0.242 0.525  20.1  1981     9     6   249 Autumn
#>  7 1981-09-07 -0.160 0.241 0.496  20.1  1981     9     7   250 Autumn
#>  8 1981-09-08 -0.153 0.244 0.520  20.1  1981     9     8   251 Autumn
#>  9 1981-09-09 -0.150 0.248 0.506  20.1  1981     9     9   252 Autumn
#> 10 1981-09-10 -0.153 0.257 0.528  20.1  1981     9    10   253 Autumn
#> # ℹ 15,539 more rows
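
One thing visible in the output above is that September 1st is labelled Summer. That follows from how cut() builds its intervals: they are closed on the right by default, so a day that sits exactly on a break (day 244 here) falls into the interval that ends there. The breaks are also day-of-year numbers taken from a non-leap year. If that matters for your purposes, one alternative is to assign seasons by month number. This is a sketch of that alternative, not the function used above, and season_by_month() is a hypothetical name.

```r
# cut() intervals are right-closed by default: (152,244] then (244,335]
br <- c(60, 152, 244, 335)   # yday() of Mar 1, Jun 1, Sep 1, Dec 1 in 2019
x <- cut(c(243, 244, 245), breaks = c(0, br, 366))
levels(x) <- c("Winter", "Spring", "Summer", "Autumn", "Winter")
as.character(x)

# Hypothetical alternative: look the season up from the month number
season_by_month <- function(in_date) {
  lookup <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
              "Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter")
  lookup[as.integer(format(as.Date(in_date), "%m"))]
}

season_by_month(c("1981-08-31", "1981-09-01", "2024-02-29", "2024-12-01"))
```

None of the graphs below depend on the season column, so the choice only matters if you go on to summarize by season.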

Now we have a time series for each of the variables daily from 1981 to yesterday.

Calculating values for all the world’s seas and oceans

We can do a little better though. What if we wanted to get these average values for the seas and oceans of the world? For that we’d need a map defining the conventional boundaries of those areas of water, which we’d then need to convert to raster format. After that, we’d slice up our global raster by the ocean and sea boundaries, and calculate averages for those areas.

I took the maritime boundaries from the IHO Sea Areas Shapefile.

seas <- sf::read_sf(here("raw", "World_Seas_IHO_v3"))

seas
#> Simple feature collection with 101 features and 10 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -180 ymin: -85.5625 xmax: 180 ymax: 90
#> Geodetic CRS:  WGS 84
#> # A tibble: 101 × 11
#>    NAME                   ID    Longitude Latitude min_X  min_Y max_X  max_Y   area MRGID                  geometry
#>    <chr>                  <chr>     <dbl>    <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl> <dbl>        <MULTIPOLYGON [°]>
#>  1 Rio de La Plata        33        -56.8   -35.1  -59.8 -36.4  -54.9 -31.5  3.18e4  4325 (((-54.94302 -34.94791, …
#>  2 Bass Strait            62A       146.    -39.5  144.  -41.4  150.  -37.5  1.13e5  4366 (((149.9046 -37.54325, 1…
#>  3 Great Australian Bight 62        133.    -36.7  118.  -43.6  146.  -31.5  1.33e6  4276 (((143.5325 -38.85535, 1…
#>  4 Tasman Sea             63        161.    -39.7  147.  -50.9  175.  -30    3.34e6  4365 (((159.0333 -30, 159.039…
#>  5 Mozambique Channel     45A        40.9   -19.3   32.4 -26.8   49.2 -10.5  1.39e6  4261 (((43.38218 -11.37021, 4…
#>  6 Savu Sea               48o       122.     -9.48 119.  -10.9  125.   -8.21 1.06e5  4343 (((124.5562 -8.223565, 1…
#>  7 Timor Sea              48i       128.    -11.2  123.  -15.8  133.   -8.18 4.34e5  4344 (((127.8623 -8.214911, 1…
#>  8 Bali Sea               48l       116.     -7.93 114.   -9.00 117.   -7.01 3.99e4  4340 (((115.7522 -7.143594, 1…
#>  9 Coral Sea              64        157.    -18.2  141.  -30.0  170.   -6.79 4.13e6  4364 (((168.4912 -16.79469, 1…
#> 10 Flores Sea             48j       120.     -7.51 117.   -8.74 123.   -5.51 1.03e5  4341 (((120.328 -5.510677, 12…
#> # ℹ 91 more rows

Then we rasterize the polygons with a function from terra:

## Rasterize the seas polygons using one of the nc files
## as a reference grid for the rasterization process
one_raster <- all_fnames[1]
seas_vect <- terra::vect(seas)
tmp_tdf_seas <- terra::rast(one_raster)["sst"] |>
  rotate()
seas_zonal <- rasterize(seas_vect, tmp_tdf_seas, "NAME")

Now we can use this data as the grid to do zonal calculations on our data raster. To use it in a parallelized calculation we need to wrap it, so that it can be found by the processes that future_map() will spawn. We write a new function to do the zonal calculation. It’s basically the same as the global one above.

seas_zonal_wrapped <- wrap(seas_zonal)

process_raster_zonal <- function(fnames) {

  d <- terra::rast(fnames)
  wts <- terra::cellSize(d, unit = "km") # For scaling

  layer_varnames <- terra::varnames(d) # vector of layers
  date_seq <- rep(terra::time(d)) # vector of dates

  # New colnames for use post zonal calculation below
  new_colnames <- c("sea", paste(layer_varnames, date_seq, sep = "_"))

  # Better colnames
  tdf_seas <- d |>
    terra::rotate() |>   # Convert 0 to 360 lon to -180 to +180 lon
    terra::zonal(unwrap(seas_zonal_wrapped), mean, na.rm = TRUE)
  colnames(tdf_seas) <- new_colnames

  # Reshape to long
  tdf_seas |>
    tidyr::pivot_longer(-sea,
                        names_to = c("measure", "date"),
                        values_to = "value",
                        names_pattern ="(.*)_(.*)") |>
    tidyr::pivot_wider(names_from = measure, values_from = value)

}
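
Conceptually, a zonal calculation groups the cells of the data raster by the label each cell has in the zone raster, and then summarizes each group. The following base R sketch shows that idea on a toy vector of six cells, with and without area weights. It only illustrates the grouping logic; it is not how terra implements zonal(), and all the values are made up.

```r
# Six toy cells: a zone label, a temperature, and a cell area for each
zone <- c("A", "A", "A", "B", "B", NA)   # NA = cell outside every zone
sst  <- c(10, 12, NA, 20, 22, 30)        # NA = missing temperature
area <- c(1, 1, 1, 2, 1, 1)              # relative cell areas

# Plain zonal mean: group by zone, average, ignore missing temperatures
zonal_mean <- tapply(sst, zone, mean, na.rm = TRUE)

# Area-weighted zonal mean: same grouping, but weight each cell by its area
ok <- !is.na(sst) & !is.na(zone)
zonal_wmean <- tapply(which(ok), zone[ok],
                      function(i) weighted.mean(sst[i], area[i]))

zonal_mean
zonal_wmean
```

Zone A is unaffected because its two usable cells have equal areas, while zone B's weighted mean is pulled towards the larger cell.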

And we feed our chunked vector of filenames to it:

## Be patient
seameans_df <- future_map(chunked_fnames, process_raster_zonal) |>
  list_rbind() |>
  mutate(date = ymd(date),
         year = lubridate::year(date),
         month = lubridate::month(date),
         day = lubridate::day(date),
         yrday = lubridate::yday(date),
         season = season(date))

write_csv(seameans_df, file = here("data", "oceans_means_zonal.csv"))
save(seameans_df, file = here("data", "seameans_df.Rdata"), compress = "xz")

seameans_df
#> # A tibble: 1,570,449 × 11
#>    sea          date         anom   err   ice   sst  year month   day yrday season
#>    <chr>        <date>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 
#>  1 Adriatic Sea 1981-09-01 -0.737 0.167    NA  23.0  1981     9     1   244 Summer
#>  2 Adriatic Sea 1981-09-02 -0.645 0.176    NA  23.1  1981     9     2   245 Autumn
#>  3 Adriatic Sea 1981-09-03 -0.698 0.176    NA  22.9  1981     9     3   246 Autumn
#>  4 Adriatic Sea 1981-09-04 -0.708 0.248    NA  22.9  1981     9     4   247 Autumn
#>  5 Adriatic Sea 1981-09-05 -1.05  0.189    NA  22.5  1981     9     5   248 Autumn
#>  6 Adriatic Sea 1981-09-06 -1.02  0.147    NA  22.4  1981     9     6   249 Autumn
#>  7 Adriatic Sea 1981-09-07 -0.920 0.141    NA  22.4  1981     9     7   250 Autumn
#>  8 Adriatic Sea 1981-09-08 -0.832 0.140    NA  22.5  1981     9     8   251 Autumn
#>  9 Adriatic Sea 1981-09-09 -0.665 0.162    NA  22.6  1981     9     9   252 Autumn
#> 10 Adriatic Sea 1981-09-10 -0.637 0.268    NA  22.5  1981     9    10   253 Autumn
#> # ℹ 1,570,439 more rows

Now we have properly-weighted daily averages for every sea and ocean since September 1981. Time to make some pictures.

Global Sea Surface Mean Temperature Graph

To make a graph of the global daily mean sea surface temperature we can use the world_df object we made. The idea is to put the temperature on the y-axis, the day of the year on the x-axis, and then draw a separate line for each year. We highlight 2023 and (the data to date for) 2024. And we also draw a ribbon underlay showing plus or minus two standard deviations of the global mean.

colors <- ggokabeito::palette_okabe_ito()

## Labels for the x-axis
month_labs <- seameans_df |>
  filter(sea == "North Atlantic Ocean",
         year == 2023,
         day == 15) |>
  select(date, year, yrday, month, day) |>
  mutate(month_lab = month(date, label = TRUE, abbr = TRUE))


## Average and sd ribbon data
world_avg <- world_df |>
  filter(year > 1981 & year < 2012) |>
  group_by(yrday) |>
  filter(yrday != 366) |>
  summarize(mean_8211 = mean(sst, na.rm = TRUE),
            sd_8211 = sd(sst, na.rm = TRUE)) |>
  mutate(fill = colors[2],
         color = colors[2])

## Flag years of interest
out_world <- world_df |>
  mutate(year_flag = case_when(
    year == 2023 ~ "2023",
    year == 2024 ~ "2024",
    .default = "All other years"))


out_world_plot <- ggplot() +
  geom_ribbon(data = world_avg,
              mapping = aes(x = yrday,
                            ymin = mean_8211 - 2*sd_8211,
                            ymax = mean_8211 + 2*sd_8211,
                            fill = fill),
              alpha = 0.3,
              inherit.aes = FALSE) +
  geom_line(data = world_avg,
            mapping = aes(x = yrday,
                          y = mean_8211,
                          color = color),
            linewidth = 2,
            inherit.aes = FALSE) +
  scale_color_identity(name = "Mean Temp. 1982-2011, ±2SD", guide = "legend",
                       breaks = unique(world_avg$color), labels = "") +
  scale_fill_identity(name = "Mean Temp. 1982-2011, ±2SD", guide = "legend",
                      breaks = unique(world_avg$fill), labels = "") +
  ggnewscale::new_scale_color() +
  geom_line(data = out_world,
            mapping = aes(x = yrday, y = sst, group = year, color = year_flag),
            inherit.aes = FALSE) +
  scale_color_manual(values = colors[c(1,6,8)]) +
  scale_x_continuous(breaks = month_labs$yrday, labels = month_labs$month_lab) +
  scale_y_continuous(breaks = seq(19.5, 21.5, 0.5),
                     limits = c(19.5, 21.5),
                     expand = expansion(mult = c(-0.05, 0.05))) +
  geom_line(linewidth = rel(0.7)) +
  guides(
    x = guide_axis(cap = "both"),
    y = guide_axis(minor.ticks = TRUE, cap = "both"),
    color = guide_legend(override.aes = list(linewidth = 2))
  ) +
  labs(x = "Month", y = "Mean Temperature (°Celsius)",
       color = "Year",
       title = "Mean Daily Global Sea Surface Temperature, 1981-2024",
       subtitle = "Latitudes 60°N to 60°S; Area-weighted NOAA OISST v2.1 estimates",
       caption = "Kieran Healy / @kjhealy") +
  theme(axis.line = element_line(color = "gray30", linewidth = rel(1)),
        plot.title = element_text(size = rel(1.9)))

ggsave(here("figures", "global_mean.png"), out_world_plot, height = 7, width = 10, dpi = 300)

Global mean sea surface temperature 1981-2024

The North Atlantic

We can slice out the North Atlantic by name from seameans_df and make its graph in much the same way. For variety we can color most of the years blue, to lean into the “Great Wave off Kanagawa” (or Rockall?) vibe.

out_atlantic <- seameans_df |>
  filter(sea == "North Atlantic Ocean") |>
  mutate(year_flag = case_when(
    year == 2023 ~ "2023",
    year == 2024 ~ "2024",
    .default = "All other years"
  )) |>
  ggplot(aes(x = yrday, y = sst, group = year, color = year_flag)) +
  geom_line(linewidth = rel(1.1)) +
  scale_x_continuous(breaks = month_labs$yrday, labels = month_labs$month_lab) +
  scale_color_manual(values = colors[c(1,6,2)]) +
  guides(
    x = guide_axis(cap = "both"),
    y = guide_axis(minor.ticks = TRUE, cap = "both"),
    color = guide_legend(override.aes = list(linewidth = 2))
  ) +
  labs(x = "Month", y = "Mean Temperature (Celsius)",
       color = "Year",
       title = "Mean Daily Sea Surface Temperature, North Atlantic Ocean, 1981-2024",
       subtitle = "Gridded and weighted NOAA OISST v2.1 estimates",
       caption = "Kieran Healy / @kjhealy") +
  theme(axis.line = element_line(color = "gray30", linewidth = rel(1)))

ggsave(here("figures", "north_atlantic.png"), out_atlantic, height = 7, width = 10, dpi = 300)

The North Atlantic.

All the Seas

Finally we can of course go crazy with facets and just draw everything.

## All the world's oceans and seas
out <- seameans_df |>
  mutate(year_flag = case_when(
    year == 2023 ~ "2023",
    year == 2024 ~ "2024",
    .default = "All other years")) |>
  ggplot(aes(x = yrday, y = sst, group = year, color = year_flag)) +
  geom_line(linewidth = rel(0.5)) +
  scale_x_continuous(breaks = month_labs$yrday, labels = month_labs$month_lab) +
  scale_color_manual(values = colors[c(1,6,2)]) +
  guides(
    x = guide_axis(cap = "both"),
    y = guide_axis(minor.ticks = TRUE, cap = "both"),
    color = guide_legend(override.aes = list(linewidth = 1.4))
  ) +
  facet_wrap(~ reorder(sea, sst), axes = "all_x", axis.labels = "all_y") +
  labs(x = "Month of the Year", y = "Mean Temperature (Celsius)",
       color = "Year",
       title = "Mean Daily Sea Surface Temperatures, 1981-2024",
       subtitle = "Area-weighted 0.25° grid estimates; NOAA OISST v2.1; IHO Sea Boundaries",
       caption = "Data processed with R; Figure made with ggplot by Kieran Healy / @kjhealy") +
  theme(axis.line = element_line(color = "gray30", linewidth = rel(1)),
        strip.text = element_text(face = "bold", size = rel(1.4)),
        plot.title = element_text(size = rel(1.525)),
        plot.subtitle = element_text(size = rel(1.1)))

ggsave(here("figures", "all_seas.png"), out, width = 40, height = 40, dpi = 300)

All the world’s oceans and seas, because why not.

The full code for the data processing and the graphs is available on GitHub.

gssr Update

Mon, 01 Apr 2024 07:50:45 -0400

Update (April 15th 2024)

gssr is now two packages: gssr and gssrdoc. They’re also available as binary packages via R-Universe which means they will install much faster. See this post for details.

NORC released version 2a of the 1972-2022 General Social Survey cumulative file. I’ve updated {gssr}, an R package that makes it more convenient for R users to work with GSS Data. One handy feature of {gssr} is that it lets you see documentation for individual GSS variables as R help pages.

Details on every GSS variable are available in the R help system.

gssr is a data package, bundling several datasets into a convenient format. The relatively large size of the data in the package means it is not suitable for hosting on CRAN, the core R package repository.

Install direct from GitHub

You can install gssr from GitHub with:

remotes::install_github("kjhealy/gssr")

Load the package:

library(gssr)
#> Package loaded. To attach the GSS data, type data(gss_all) at the console.
#> For the codebook, type data(gss_dict).
#> For the panel data and documentation, type e.g. data(gss_panel08_long) and data(gss_panel_doc).
#> For help on a specific GSS variable, type ?varname at the console.

Single GSS years

You can quickly get the data for any single GSS year by using gss_get_yr() to download the data file from NORC and put it directly into a tibble.

gss18 <- gss_get_yr(2018)
#> Fetching: https://gss.norc.org/documents/stata/2018_stata.zip

gss18
#> # A tibble: 2,348 × 1,068
#>    year         id wrkstat   hrs1        hrs2        evwork      wrkslf  wrkgovt
#>    <dbl+lbl> <dbl> <dbl+lbl> <dbl+lbl>   <dbl+lbl>   <dbl+lbl>   <dbl+l> <dbl+l>
#>  1 2018          1 3 [with … NA(i) [iap]    41       NA(i) [iap] 2 [som… 2 [pri…
#>  2 2018          2 5 [retir… NA(i) [iap] NA(i) [iap]     1 [yes] 2 [som… 2 [pri…
#>  3 2018          3 1 [worki…    40       NA(i) [iap] NA(i) [iap] 2 [som… 2 [pri…
#>  4 2018          4 1 [worki…    40       NA(i) [iap] NA(i) [iap] 2 [som… 2 [pri…
#>  5 2018          5 5 [retir… NA(i) [iap] NA(i) [iap]     1 [yes] 2 [som… 2 [pri…
#>  6 2018          6 5 [retir… NA(i) [iap] NA(i) [iap]     1 [yes] 2 [som… 2 [pri…
#>  7 2018          7 1 [worki…    35       NA(i) [iap] NA(i) [iap] 2 [som… 1 [gov…
#>  8 2018          8 1 [worki…    89 [89+… NA(i) [iap] NA(i) [iap] 2 [som… 2 [pri…
#>  9 2018          9 1 [worki…    40       NA(i) [iap] NA(i) [iap] 1 [sel… 2 [pri…
#> 10 2018         10 1 [worki…    40       NA(i) [iap] NA(i) [iap] 2 [som… 2 [pri…
#> # ℹ 2,338 more rows
#> # ℹ 1,060 more variables: occ10 <dbl+lbl>, prestg10 <dbl+lbl>,
#> #   prestg105plus <dbl+lbl>, indus10 <dbl+lbl>, marital <dbl+lbl>,
#> #   martype <dbl+lbl>, divorce <dbl+lbl>, widowed <dbl+lbl>,
#> #   spwrksta <dbl+lbl>, sphrs1 <dbl+lbl>, sphrs2 <dbl+lbl>, spevwork <dbl+lbl>,
#> #   cowrksta <dbl+lbl>, cowrkslf <dbl+lbl>, coevwork <dbl+lbl>,
#> #   cohrs1 <dbl+lbl>, cohrs2 <dbl+lbl>, spwrkslf <dbl+lbl>, …

The GSS data comes in a labelled format, mirroring the way it is encoded for Stata and SPSS platforms. The numeric codes are the content of the column cells. The labeling information is stored as an attribute of the column.
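
To see what that means in practice, here is a small sketch using haven directly. The column name, codes, and labels are made up for illustration and are not copied from the GSS file.

```r
library(haven)

# A toy labelled column; codes and labels are invented for illustration
wrkstat_toy <- labelled(
  c(1, 5, 1, 2),
  labels = c("working full time" = 1, "working part time" = 2, "retired" = 5)
)

# The cells themselves hold the numeric codes
as.numeric(wrkstat_toy)

# The value labels are stored in the "labels" attribute of the column
attr(wrkstat_toy, "labels")

# Converting to a factor swaps each code for its label
wrkstat_fct <- as_factor(wrkstat_toy)
wrkstat_fct
```

This is why the workflow converts categorical variables with as_factor() before relabeling them: the labels are metadata until you ask for them.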

Here’s a typical workflow for getting the data ready:

suppressPackageStartupMessages({
  library(tidyverse)
  library(survey)
  library(srvyr)
})


library(gssr)
#> Package loaded. To attach the GSS data, type data(gss_all) at the console.
#> For the codebook, type data(gss_dict).
#> For the panel data and documentation, type e.g. data(gss_panel08_long) and data(gss_panel_doc).
#> For help on a specific GSS variable, type ?varname at the console.

## Fn to capitalize strings nicely (from chartr)
capwords <- function(x, strict = FALSE) {
  cap <- function(x) paste(toupper(substring(x, 1, 1)),
                           {x <- substring(x, 2); if(strict) tolower(x) else x},
                           sep = "", collapse = " " )
  sapply(strsplit(x, split = " "), cap, USE.NAMES = !is.null(names(x)))
}

## The variables we want
cont_vars <- c("year", "id", "ballot", "age")
cat_vars <- c("race", "sex", "fefam")
wt_vars <- c("vpsu", "vstrat", "wtssps")
my_vars <- c(cont_vars, cat_vars, wt_vars)

## Get and clean up the 2018 data
gss_fam <- gss_get_yr(2018) |>
  select(all_of(my_vars)) |>
  mutate(
    # Convert all missing to NA
    across(everything(), haven::zap_missing),
    # Convert all weight vars to numeric
    across(all_of(wt_vars), as.numeric),
    # Convert year to numeric
    year = as.integer(year),
    # Make all categorical variables factors and relabel nicely
    across(all_of(cat_vars), forcats::as_factor),
    across(all_of(cat_vars), \(x) forcats::fct_relabel(x, capwords, strict = TRUE)),
    fefam = forcats::fct_recode(fefam, NULL = "IAP", NULL = "DK", NULL = "NA"),
    young = ifelse(age < 26, "Yes", "No"),
    fefam_d = forcats::fct_recode(fefam,
                                  Agree = "Strongly Agree",
                                  Disagree = "Strongly Disagree"),
    fefam_n = recode(fefam_d, "Agree" = 0, "Disagree" = 1))
#> Fetching: https://gss.norc.org/documents/stata/2018_stata.zip

## Take a look
gss_fam |>
  select(-vpsu, -vstrat) |>
  relocate(c(fefam_d, fefam_n), .after = fefam)
#> # A tibble: 2,348 × 11
#>     year    id ballot       age   race  sex   fefam fefam_d fefam_n wtssps young
#>    <int> <dbl> <dbl+lbl>    <dbl> <fct> <fct> <fct> <fct>     <dbl>  <dbl> <chr>
#>  1  2018     1 1 [ballot a] 43    White Male  Disa… Disagr…       1  1.91  No   
#>  2  2018     2 3 [ballot c] 74    White Fema… <NA>  <NA>         NA  0.915 No   
#>  3  2018     3 2 [ballot b] 42    White Male  Disa… Disagr…       1  0.609 No   
#>  4  2018     4 2 [ballot b] 63    White Fema… Disa… Disagr…       1  0.642 No   
#>  5  2018     5 3 [ballot c] 71    Black Male  <NA>  <NA>         NA  0.396 No   
#>  6  2018     6 1 [ballot a] 67    White Fema… Disa… Disagr…       1  0.529 No   
#>  7  2018     7 3 [ballot c] 59    Black Fema… <NA>  <NA>         NA  1.61  No   
#>  8  2018     8 3 [ballot c] 43    White Male  <NA>  <NA>         NA  0.672 No   
#>  9  2018     9 2 [ballot b] 62    White Fema… Stro… Disagr…       1  0.594 No   
#> 10  2018    10 2 [ballot b] 55    White Male  Disa… Disagr…       1  0.482 No   
#> # ℹ 2,338 more rows

## Put in a survey object
options(survey.lonely.psu = "adjust")
options(na.action="na.pass")

gss_svy <- gss_fam |>
  mutate(stratvar = interaction(year, vstrat)) |>
  as_survey_design(id = vpsu,
                   strata = stratvar,
                   weights = wtssps,
                   nest = TRUE)

gss_svy
#> Stratified 1 - level Cluster Sampling design (with replacement)
#> With (156) clusters.
#> Called via srvyr
#> Sampling variables:
#>  - ids: vpsu
#>  - strata: stratvar
#>  - weights: wtssps
#> Data variables: year (int), id (dbl), ballot (dbl+lbl), age (dbl+lbl), race
#>   (fct), sex (fct), fefam (fct), vpsu (dbl), vstrat (dbl), wtssps (dbl), young
#>   (chr), fefam_d (fct), fefam_n (dbl), stratvar (fct)
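
Once a design object exists, weighted estimates come from srvyr's group_by() and summarize() verbs. The sketch below uses a tiny invented dataset (two strata with two PSUs each) instead of the real GSS, so the numbers mean nothing. Only the column names are borrowed from the code above.

```r
library(dplyr)
library(srvyr)

# Toy data: every value here is invented for illustration
toy <- tibble(
  vstrat  = c(1, 1, 1, 1, 2, 2, 2, 2),
  vpsu    = c(1, 1, 2, 2, 1, 1, 2, 2),
  wtssps  = c(1.2, 0.8, 1.0, 1.0, 0.5, 1.5, 1.1, 0.9),
  sex     = c("Male", "Female", "Male", "Female",
              "Male", "Female", "Male", "Female"),
  fefam_n = c(1, 1, 0, 1, 1, 0, 0, 1)
)

# Declare the design: PSUs nested within strata, plus sampling weights
toy_svy <- toy |>
  as_survey_design(ids = vpsu,
                   strata = vstrat,
                   weights = wtssps,
                   nest = TRUE)

# Weighted proportion coded 1 (disagree) by sex, with a confidence interval
est <- toy_svy |>
  group_by(sex) |>
  summarize(prop_disagree = survey_mean(fefam_n, na.rm = TRUE, vartype = "ci"))

est
```

With the real data, the same group_by() and summarize() step would be applied to gss_svy.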

The Cumulative Data File

The GSS cumulative data file is large. It is not loaded by default when you invoke the package. (That is, gssr does not use R’s “lazy loading” facility. The data file is too big to do this without error.) To load one of the datasets, first load the library and then use data() to make the data available. For example, load the cumulative GSS file like this:

data(gss_all)

This will take a moment. Once it is ready, the gss_all object is available to use in the usual way:

gss_all
#> # A tibble: 72,390 × 6,694
#>    year         id wrkstat    hrs1        hrs2        evwork      occ   prestige
#>    <dbl+lbl> <dbl> <dbl+lbl>  <dbl+lbl>   <dbl+lbl>   <dbl+lbl>   <dbl> <dbl+lb>
#>  1 1972          1 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 205   50      
#>  2 1972          2 5 [retire… NA(i) [iap] NA(i) [iap]     1 [yes] 441   45      
#>  3 1972          3 2 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 270   44      
#>  4 1972          4 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap]   1   57      
#>  5 1972          5 7 [keepin… NA(i) [iap] NA(i) [iap]     1 [yes] 385   40      
#>  6 1972          6 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 281   49      
#>  7 1972          7 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 522   41      
#>  8 1972          8 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 314   36      
#>  9 1972          9 2 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 912   26      
#> 10 1972         10 1 [workin… NA(i) [iap] NA(i) [iap] NA(i) [iap] 984   18      
#> # ℹ 72,380 more rows
#> # ℹ 6,686 more variables: wrkslf <dbl+lbl>, wrkgovt <dbl+lbl>,
#> #   commute <dbl+lbl>, industry <dbl+lbl>, occ80 <dbl+lbl>, prestg80 <dbl+lbl>,
#> #   indus80 <dbl+lbl>, indus07 <dbl+lbl>, occonet <dbl+lbl>, found <dbl+lbl>,
#> #   occ10 <dbl+lbl>, occindv <dbl+lbl>, occstatus <dbl+lbl>, occtag <dbl+lbl>,
#> #   prestg10 <dbl+lbl>, prestg105plus <dbl+lbl>, indus10 <dbl+lbl>,
#> #   indstatus <dbl+lbl>, indtag <dbl+lbl>, marital <dbl+lbl>, …

In addition to the integrated help, information about the variables is also contained in the gss_dict object:

data(gss_dict)
gss_dict
#> # A tibble: 6,663 × 12
#>      pos variable label     missing var_doc_label value_labels var_text years   
#>    <int> <chr>    <chr>       <int> <chr>         <chr>        <chr>    <list>  
#>  1     1 year     gss year…       0 gss year for… [NA(d)] don… None     <NULL>  
#>  2     2 wrkstat  labor fo…      36 labor force … [1] working… 1. Last… <tibble>
#>  3     3 hrs1     number o…   30830 number of ho… [89] 89+ ho… 1a. If … <tibble>
#>  4     4 hrs2     number o…   70989 number of ho… [89] 89+ ho… 1b. If … <tibble>
#>  5     5 evwork   ever wor…   46944 ever work as… [1] yes; [2… 1c. If … <tibble>
#>  6     6 occ      r's cens…   48123 r's census o… [NA(d)] don… 2a. Wha… <tibble>
#>  7     7 prestige r's occu…   48123 r's occupati… [NA(d)] don… 2a. Wha… <tibble>
#>  8     8 wrkslf   r self-e…    4041 r self-emp o… [1] self-em… 2e. (Ar… <tibble>
#>  9     9 wrkgovt  govt or …   44311 govt or priv… [1] governm… 2f. (Ar… <tibble>
#> 10    10 commute  travel t…   71060 travel time … [97] 97+ mi… 2g. Abo… <tibble>
#> # ℹ 6,653 more rows
#> # ℹ 4 more variables: var_yrtab <list>, col_type <chr>, var_type <chr>,
#> #   var_na_codes <chr>

There are also a few convenience functions. For example, to see which years some questions were asked, use gss_which_years():

gss_all |> 
  gss_which_years(c(industry, indus80, wrkgovt, commute)) |> 
  print(n = Inf)

## # A tibble: 34 × 5
##    year      industry indus80 wrkgovt commute
##    <dbl+lbl> <lgl>    <lgl>   <lgl>   <lgl>  
##  1 1972      TRUE     FALSE   FALSE   FALSE  
##  2 1973      TRUE     FALSE   FALSE   FALSE  
##  3 1974      TRUE     FALSE   FALSE   FALSE  
##  4 1975      TRUE     FALSE   FALSE   FALSE  
##  5 1976      TRUE     FALSE   FALSE   FALSE  
##  6 1977      TRUE     FALSE   FALSE   FALSE  
##  7 1978      TRUE     FALSE   FALSE   FALSE  
##  8 1980      TRUE     FALSE   FALSE   FALSE  
##  9 1982      TRUE     FALSE   FALSE   FALSE  
## 10 1983      TRUE     FALSE   FALSE   FALSE  
## 11 1984      TRUE     FALSE   FALSE   FALSE  
## 12 1985      TRUE     FALSE   TRUE    FALSE  
## 13 1986      TRUE     FALSE   TRUE    TRUE   
## 14 1987      TRUE     FALSE   FALSE   FALSE  
## 15 1988      TRUE     TRUE    FALSE   FALSE  
## 16 1989      TRUE     TRUE    FALSE   FALSE  
## 17 1990      TRUE     TRUE    FALSE   FALSE  
## 18 1991      FALSE    TRUE    FALSE   FALSE  
## 19 1993      FALSE    TRUE    FALSE   FALSE  
## 20 1994      FALSE    TRUE    FALSE   FALSE  
## 21 1996      FALSE    TRUE    FALSE   FALSE  
## 22 1998      FALSE    TRUE    FALSE   FALSE  
## 23 2000      FALSE    TRUE    TRUE    FALSE  
## 24 2002      FALSE    TRUE    TRUE    FALSE  
## 25 2004      FALSE    TRUE    TRUE    FALSE  
## 26 2006      FALSE    TRUE    TRUE    FALSE  
## 27 2008      FALSE    TRUE    TRUE    FALSE  
## 28 2010      FALSE    TRUE    TRUE    FALSE  
## 29 2012      FALSE    FALSE   TRUE    FALSE  
## 30 2014      FALSE    FALSE   TRUE    FALSE  
## 31 2016      FALSE    FALSE   TRUE    FALSE  
## 32 2018      FALSE    FALSE   TRUE    FALSE  
## 33 2021      FALSE    FALSE   FALSE   FALSE  
## 34 2022      FALSE    FALSE   FALSE   FALSE
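
The idea behind a table like this can be sketched with plain dplyr: a question counts as asked in a year if at least one respondent has a non-missing value for it. This is an assumption about the logic, shown on invented toy data, and not the package's actual implementation.

```r
library(dplyr)

# Toy data: values are invented; NA means the question was not asked
toy <- tibble(
  year     = c(1972, 1972, 1988, 1988, 2018, 2018),
  industry = c(10, 20, 30, NA, NA, NA),
  indus80  = c(NA, NA, 5, 6, NA, NA)
)

# For each year, is there any non-missing answer to each question?
asked <- toy |>
  group_by(year) |>
  summarize(across(c(industry, indus80), \(x) any(!is.na(x))))

asked
```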

Pi Day Circles

Thu, 14 Mar 2024 07:30:03 -0400

Some Lissajous animations for Pi Day. Made with R, ggplot, and gganimate.

And the really not very efficient code that made them:

library(tidyverse)
library(gganimate)
library(transformr)

df_base <- tibble(
  id = seq(1, 1000, 1),
  t_vals = seq(0, 2 * pi, length.out = 1000))


circles <- function(t) {
  x01 <- cos(t * 1)
  y01 <- sin(t * 1)

  x02 <- cos(t * 2)
  y02 <- sin(t * 2)

  x03 <- cos(t * 3)
  y03 <- sin(t * 3)

  x04 <- cos(t * 4)
  y04 <- sin(t * 4)

  x05 <- cos(t * 5)
  y05 <- sin(t * 5)

  x06 <- cos(t * 6)
  y06 <- sin(t * 6)

  x07 <- cos(t * 7)
  y07 <- sin(t * 7)

  x08 <- cos(t * 8)
  y08 <- sin(t * 8)

  x09 <- cos(t * 9)
  y09 <- sin(t * 9)

  x10 <- cos(t * 10)
  y10 <- sin(t * 10)


  tibble(
    tick = seq_along(t),
    x01, x02, x03, x04, x05, x06, x07, x08, x09, x10,
    y01, y02, y03, y04, y05, y06, y07, y08, y09, y10
    )
}

df_out <- circles(t = df_base$t_vals)

df <- bind_cols(df_base, df_out) |>
  select(id, tick, everything()) |>
  pivot_longer(x01:x10, names_to = "x_group", values_to = "x") |>
  pivot_longer(y01:y10, names_to = "y_group", values_to = "y") |>
  mutate(x_group = str_remove(x_group, "x"),
         y_group = str_remove(y_group, "y")) |>
  unite("group_id", x_group, y_group, remove = FALSE)

out <- df |>
  ggplot(aes(x = x, y = y, color = group_id, group = group_id)) +
  geom_point(size = 3) +
  geom_path() +
  facet_grid(x_group ~ y_group) +
  coord_equal() +
  guides(color = "none") +
  theme_void() +
  transition_reveal(tick) +
  ease_aes("linear")



animate(out, duration = 30, fps = 24, height = 1080, width = 1080,
        renderer = ffmpeg_renderer())

anim_save(filename = "lissajous-fixed-lg-2.webm",
          height = 1080, width = 1080)
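
As a sketch of a more compact route to the same long table, the hundred curves can be generated by crossing the time steps with the two sets of frequencies. The names a and b for the x and y frequencies are introduced here for illustration. This uses base R only.

```r
t_vals <- seq(0, 2 * pi, length.out = 1000)

# One row per time step and (x frequency, y frequency) pair
df_alt <- expand.grid(tick = seq_along(t_vals), a = 1:10, b = 1:10)

# Each curve is (cos(a * t), sin(b * t))
df_alt$x <- cos(df_alt$a * t_vals[df_alt$tick])
df_alt$y <- sin(df_alt$b * t_vals[df_alt$tick])

# Zero-padded labels in the same style as the pivoted column names
df_alt$x_group  <- sprintf("%02d", df_alt$a)
df_alt$y_group  <- sprintf("%02d", df_alt$b)
df_alt$group_id <- paste(df_alt$x_group, df_alt$y_group, sep = "_")

nrow(df_alt)
```

The resulting data frame has the columns the plotting code expects (tick, x, y, x_group, y_group, group_id), so it should be usable in place of df.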

Dorling Cartograms

Wed, 06 Dec 2023 18:40:38 -0500

I was writing some examples for next semester’s dataviz class and shared one of them—a Dorling Cartogram—on the socials medias. Some people don’t like cartograms, some people do like cartograms; in conclusion, we live in a world of contrasts.

Also, some people asked for the code. So here it is, fwiw, after the pictures. These are not the most polished figures, but that is kind of the point, as we go through them in class and ~~indoctrinate students in the inflexible ideology of Cultural Marxism~~ discuss them like reasonable people and so on.

Percent Black by County

Percent Non-Hispanic White by County

Percent Asian by County

Percent Hispanic by County

And the code:

## Dorling Cartogram example with US Census data
## Requires you sign up for a free Census API key
## https://api.census.gov/data/key_signup.html
##

## Required packages
library(tidyverse)
library(tidycensus)
library(sf)
library(cartogram)
library(colorspace)

## Setup
options(tigris_use_cache = TRUE)

## Do this
census_api_key("YOUR API KEY HERE")
## or, to install in your .Rprofile follow the instructions at
## https://walker-data.com/tidycensus/reference/census_api_key.html

pop_names <- tribble(
  ~varname, ~clean,
  "B01003_001", "pop",
  "B01001B_001", "black",
  "B01001A_001", "white",
  "B01001H_001", "nh_white",
  "B01001I_001", "hispanic",
  "B01001D_001", "asian"
)

## Get the data
fips_pop <- get_acs(geography = "county",
                    variables = pop_names$varname,
                    cache_table = TRUE) |>
  left_join(pop_names, join_by(variable == varname)) |> 
  mutate(variable = clean) |> 
  select(-clean, -moe) |>
  pivot_wider(names_from = variable, values_from = estimate) |>
  rename(fips = GEOID, name = NAME) |>
  mutate(prop_pop = pop/sum(pop),
         prop_black = black/pop,
         prop_hisp = hispanic/pop,
         prop_white = white/pop,
         prop_nhwhite = nh_white/pop,
         prop_asian = asian/pop)

fips_map <- get_acs(geography = "county",
                    variables = "B01001_001",
                    geometry = TRUE,
                    shift_geo = FALSE,
                    cache_table = TRUE) |>
  select(GEOID, NAME, geometry) |>
  rename(fips = GEOID, name = NAME)


pop_cat_labels <- c("<5", as.character(seq(10, 95, 5)), "100")

counties_sf <- fips_map |>
  left_join(fips_pop, by = c("fips", "name")) |>
  mutate(black_disc = cut(prop_black*100,
                          breaks = seq(0, 100, 5),
                          labels = pop_cat_labels,
                          ordered_result = TRUE),
         hisp_disc = cut(prop_hisp*100,
                         breaks = seq(0, 100, 5),
                         labels = pop_cat_labels,
                         ordered_result = TRUE),
         nhwhite_disc = cut(prop_nhwhite*100,
                            breaks = seq(0, 100, 5),
                            labels = pop_cat_labels,
                            ordered_result = TRUE),
         asian_disc = cut(prop_asian*100,
                          breaks = seq(0, 100, 5),
                          labels = pop_cat_labels,
                          ordered_result = TRUE)) |>
  sf::st_transform(crs = 2163)


## Now we have
counties_sf

## Create the circle-packed version
## Be patient
county_dorling <- cartogram_dorling(x = counties_sf,
                                    weight = "prop_pop",
                                    k = 0.2, itermax = 100)


## Now draw the maps

## Black
out_black <- county_dorling |>
  filter(!str_detect(name, "Alaska|Hawaii|Puerto|Guam")) |>
  ggplot(aes(fill = black_disc)) +
  geom_sf(color = "grey30", size = 0.1) +
  coord_sf(crs = 2163, datum = NA) +
  scale_fill_discrete_sequential(palette = "YlOrBr",
                                 na.translate=FALSE) +
  guides(fill = guide_legend(title.position = "top",
                             label.position = "bottom",
                             nrow = 1)) +
  labs(
    subtitle = "Bubble size corresponds to County Population",
    caption = "Graph: @kjhealy. Source: Census Bureau / American Community Survey",
    fill = "Percent Black by County") +
  theme(legend.position = "top",
        legend.spacing.x = unit(0, "cm"),
        legend.title = element_text(size = rel(1.5), face = "bold"),
        legend.text = element_text(size = rel(0.7)),
        plot.title = element_text(size = rel(1.4), hjust = 0.15))

ggsave("figures/dorling-bl.png", out_black, height = 10, width = 12)

## Hispanic
out_hispanic <- county_dorling |>
  filter(!str_detect(name, "Alaska|Hawaii|Puerto|Guam")) |>
  ggplot(aes(fill = hisp_disc)) +
  geom_sf(color = "grey30", size = 0.1) +
  coord_sf(crs = 2163, datum = NA) +
  scale_fill_discrete_sequential(palette = "SunsetDark", na.translate=FALSE) +
  guides(fill = guide_legend(title.position = "top",
                             label.position = "bottom",
                             nrow = 1)) +
  labs(fill = "Percent Hispanic by County",
       subtitle = "Bubble size corresponds to County Population",
       caption = "Graph: @kjhealy. Source: Census Bureau / American Community Survey") +
  theme(legend.position = "top",
        legend.spacing.x = unit(0, "cm"),
        legend.title = element_text(size = rel(1.5), face = "bold"),
        legend.text = element_text(size = rel(0.7)),
        plot.title = element_text(size = rel(1.4), hjust = 0.15))

ggsave("figures/dorling-hs.png", out_hispanic, height = 10, width = 12)

## NH White
out_white <- county_dorling |>
  filter(!str_detect(name, "Alaska|Hawaii|Puerto|Guam")) |>
  ggplot(aes(fill = nhwhite_disc)) +
  geom_sf(color = "grey30", size = 0.1) +
  coord_sf(crs = 2163, datum = NA) +
  scale_fill_discrete_sequential(palette = "BluYl", na.translate=FALSE) +
  guides(fill = guide_legend(title.position = "top",
                             label.position = "bottom",
                             nrow = 1)) +
  labs(fill = "Percent Non-Hispanic White by County",
       subtitle = "Bubble size corresponds to County Population",
       caption = "Graph: @kjhealy. Source: Census Bureau / American Community Survey") +
  theme(legend.position = "top",
        legend.spacing.x = unit(0, "cm"),
        legend.title = element_text(size = rel(1.5), face = "bold"),
        legend.text = element_text(size = rel(0.7)),
        plot.title = element_text(size = rel(1.4), hjust = 0.15))

ggsave("figures/dorling-nhw.png", out_white, height = 10, width = 12)

## Asian
out_asian <- county_dorling |>
  filter(!str_detect(name, "Alaska|Hawaii|Puerto|Guam")) |>
  ggplot(aes(fill = asian_disc)) +
  geom_sf(color = "grey30", size = 0.1) +
  coord_sf(crs = 2163, datum = NA) +
  scale_fill_discrete_sequential(palette = "Purple-Ora", na.translate=FALSE) +
  guides(fill = guide_legend(title.position = "top",
                             label.position = "bottom",
                             nrow = 1)) +
  labs(fill = "Percent Asian by County",
       subtitle = "Bubble size corresponds to County Population",
       caption = "Graph: @kjhealy. Source: Census Bureau / American Community Survey") +
  theme(legend.position = "top",
        legend.spacing.x = unit(0, "cm"),
        legend.title = element_text(size = rel(1.5), face = "bold"),
        legend.text = element_text(size = rel(0.7)),
        plot.title = element_text(size = rel(1.4), hjust = 0.15))

ggsave("figures/dorling-asian.png", out_asian, height = 10, width = 12)
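
One detail of the binning step is worth checking by hand. With its defaults, cut() builds intervals that are open on the left and closed on the right, so a county at exactly 0 percent falls outside the first bin and becomes NA. The sketch below uses a handful of invented percentages to show this, along with the include.lowest = TRUE option that would put zeros in the first bin if that is what you want.

```r
pop_cat_labels <- c("<5", as.character(seq(10, 95, 5)), "100")

# Invented percentages, chosen to hit the edges of the bins
pct <- c(0, 2.5, 5, 5.1, 50, 100)

# Default behavior: intervals are (0,5], (5,10], ..., (95,100]
binned <- cut(pct,
              breaks = seq(0, 100, 5),
              labels = pop_cat_labels,
              ordered_result = TRUE)

# Same bins, but the first interval becomes [0,5]
binned_incl <- cut(pct,
                   breaks = seq(0, 100, 5),
                   labels = pop_cat_labels,
                   include.lowest = TRUE,
                   ordered_result = TRUE)

data.frame(pct,
           default = as.character(binned),
           include_lowest = as.character(binned_incl))
```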

gssr Update

Sat, 02 Dec 2023 11:25:12 -0500

Update (April 15th 2024)

gssr is now two packages: gssr and gssrdoc. They’re also available as binary packages via R-Universe which means they will install much faster. See this post for details.

The General Social Survey, or GSS, is one of the cornerstones of US public opinion research and one of the most-analyzed datasets in Sociology. My colleague Steve Vaisey aptly describes it as the Hubble Space Telescope of American social science. It is routinely used in research, in teaching, and as a reference point in discussions about changes in American society since the early 1970s. It is also a model of open, public data. The National Opinion Research Center already provides many excellent tools for working with the data, and has long made it freely available to researchers. Casual users of the GSS can examine the GSS Data Explorer, and social scientists can download complete datasets directly. At present, the GSS is provided to researchers in a variety of commercial formats: Stata (.dta), SAS, and SPSS (.sav). It’s not too difficult to get the data into R using the Haven package, but it can be a little annoying to have to do it repeatedly, or across projects. After doing it one too many times, a few years ago I got tired of it and I made an R package instead, gssr. Full details are available at the gssr homepage.

GSS ‘fefam’ variable trends over time

This update to the gssr package (version 0.4) provides the GSS Cumulative Data File (1972-2022), three GSS Three Wave Panel Data Files (for panels beginning in 2006, 2008, and 2010, respectively), and the 2020 panel file. This version also integrates survey codebook information about variables directly into R’s help system, so that variables can be looked up in the help browser or from the console with ?, as if they were functions or other documented objects.

GSS ‘fefam’ variable information inside R’s help system.

The gssr package makes the GSS a little more accessible to users of R, the free software environment for statistical computing. In a small way it helps make the GSS even more open than it already is.

Flipbookr for Quarto

Thu, 10 Aug 2023 11:13:02 -0400

{flipbookr} is an R package written by Gina Reynolds. It’s very useful for teaching. It was developed for use with .Rmd files and Xaringan, and presently does not work with Quarto. I hacked up a version of Flipbookr that does work with Quarto. Using it with Xaringan should be exactly the same as before. Right now it’s incomplete. I’ve just focused on getting the main user-facing function, chunk_reveal(), to work. But this is also most of what the package does. Here’s a proof-of-concept Quarto presentation showing what’s working right now. You can go directly to the slides if you prefer.

The Naming of Stats

Mon, 19 Jun 2023 10:03:35 -0400

The Naming of Stats is a difficult matter,
     It isn’t just one of your holiday games;
You may think at first I’m as mad as a hatter
When I tell you, a stat must have THREE DIFFERENT NAMES.
First of all are the names where usage is informal,
     Such as Median, Estimate, Average, or Range,
Such as Variance, Quartile, or else Standard Normal
     All of them sensible everyday names.
There are fancier names that may be better-tasting,
     Some for the frequentists, some for the Bayes:
Such as Skew, or Kurtosis, Metropolis–Hastings—
     But all of them sensible everyday names.
But I tell you, a stat needs a name that’s obscurer,
     A name that’s misleading, and hard to construe,
Else how can it keep on confusing the reader,
     Or frustrate professors, or pass peer review?
Of names of this kind, there are many examples,
     Like Hierarchical, Robust, or Omega-hat,
Such as Marginal, Confidence, or just Weighted Sample,
     Names that always belong to more than one stat.
But above and beyond there’s still one name left over,
     And that is the name that you never will guess;
The thing that no human research can discover—
     But THE STAT MIGHT JUST KNOW, and be made to confess.
When you notice a stat getting quite widely cited,
     The reason, I tell you, is always because:
Its user’s engaged, or enraged, or excited
     At the prospect of pinning the root of all laws:
          That inviolable, friable
          Unidentifiable
Deep and inscrutable singular Cause.

Assault Deaths in the OECD 1960-2020

Thu, 30 Mar 2023 07:57:44 -0400

While we’re redoing some classics, here is the time series of assault deaths in the United States and eighteen other OECD countries from 1960 to 2020. Again, this is the sort of plot that you could choose to draw in a variety of ways depending on what it was that you wanted to emphasize. Code and data for this way of doing it are available on GitHub.

Assault deaths in the OECD, 1960-2020.