NYTimes Map How-to

There was a recent email thread in the IsoStat listserv about a cool visualization that recently came out in the New York Times showing COVID-19 cases over time. This sparked a discussion about whether this was possible to recreate in R with ggplot, so of course I gave it a try!

library(tidycensus)
library(tidyverse)
library(geofacet)
library(zoo)

The plot shows cases per 100,000 by state, so I first needed to pull population data. To do that I used the tidycensus package. (If you don’t have an API key, you can get one here)

census_api_key("YOUR API KEY")

I pulled the population by state from 2019.

pop <- get_acs(geography = "state", variables = "B01003_001", year = 2019)

Then I pulled the cases in from the New York Times GitHub repo.

cases <- read_csv("https://github.com/nytimes/covid-19-data/raw/master/us-states.csv")

These need to be wrangled a bit:

  • The data come in as cumulative cases, and we want cases per day, so I create a new variable case for this purpose
  • There is a weirdo data point in Missouri on March 8th (it looks like there were 50,000 cases!) so I just removed that
  • I merged in the state populations that I pulled from the census
  • I created a 7 day rolling average
  • I created a variable for 7 day average per 100,000 people - this is the main variable used in the plot
  • I filtered to the range used in the original visualization - from Februrary 1st to April 4th
  • I merged in state abbreviations to make the plot easier to read
d <- cases %>%
  group_by(state) %>%
  mutate(case = c(cases[1], diff(cases))) %>%
  ungroup() %>%
  filter(!(date == as.Date("2021-03-08") & state == "Missouri")) %>% 
  left_join(pop, by = c("fips" = "GEOID")) %>%
  group_by(state) %>%
  arrange(date) %>%
  mutate(
    case_7 = rollmean(case, k = 7, fill = NA),
    case_per_100 = (case_7 / estimate) * 100000) %>%
  ungroup() %>%
  filter(date > as.Date("2021-01-31"), date < as.Date("2021-04-05"))

states <- tibble(state = state.name,
                 state_ = state.abb) %>%
   add_row(state = "District of Columbia", state_ = "DC")

d <- left_join(d, states, by = "state") %>%
  filter(!is.na(state_))

This plot had a neat feature that it filled in the area from the lowest point onward; to replicate this I found the date with the minimum cases per 100,000 and created a variable col to indicate any date after this point.

d <- d %>% 
  group_by(state) %>%
  slice_min(case_per_100) %>%
  slice(1) %>%
  mutate(min_date = date) %>%
  select(min_date, state) %>%
  left_join(d, by = "state") %>%
  mutate(col = ifelse(date >= min_date, "yes", "no"))

Now time to plot! The x-axis is date, the y-axis is case_per_100 and voila!

ggplot(d, aes(x = date, y = case_per_100)) +
  geom_line(color = "#BE2D22") + 
  geom_area(aes(alpha = col), fill = "#BE2D22") +
  scale_alpha_discrete(range = c(0, 0.7)) +
  facet_geo(~state_) + 
  theme_minimal() +
  labs(x = "", 
       y = "",
       title = "Cases per 100,000",
       subtitle = "Feb 1 - Apr 4, Red area indicates rise since lowest point of 2021",
       caption = "Note: Shows seven-day average") +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        legend.position = "none")


Lucy D'Agostino McGowan image
Lucy D'Agostino McGowan

Currently excited about: observational study methods, translational research, BB-8