get_station_data <- function(station_id){
  station_url <- paste0(
    "https://www1.ncdc.noaa.gov/pub/data/normals/1981-2010/products/auxiliary/station/",
    station_id, ".normals.txt"
  )
  file_lines <- readLines(url(station_url))

  # find the line that has monthly temp maxes
  loc_of_temp <- file_lines %>% str_detect('mly-tmax-normal') %>% which()

  months <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')

  # strip out white space and split into a vector of 12 months
  file_lines[loc_of_temp] %>%
    strsplit('\\s+') %>%
    .[[1]] %>%
    .[-c(1, 2)] %>% {
      data_frame(temp = ., month = months, station_id = station_id)
    }
}
The United States of Seasons
I think my favorite detail about this map is the little splotch that is the Smoky Mountains on the western edge of North Carolina. Having spent a good amount of time there, I can say they really do have a different climate than the surrounding area.
The other day my girlfriend and I were talking about places we would like to live after grad school, and one of the things that came up was how ‘seasonal’ each location is. She grew up in Long Beach, California, which essentially has no seasons, whereas I grew up near Ann Arbor, Michigan, which very much has seasons. This got me thinking: what does the country look like in the context of its seasonality? By ‘seasonality’ I mean how big a shift in the weather there is over the year.
Being into data and maps, I figured I would try to investigate this in a data-driven way. The final result is the map above, but here I will walk you through the process that got me from question to map.
The data
The data I wanted were the average (high) temperatures over the year at different locations around the US. It turns out there are some nice APIs for getting this, such as Wunderground’s, but they all ended up requiring me to pay if I wanted anything close to the level of detail I was after. Luckily, we have a federal government agency that likes to track things related to climate and weather, the National Oceanic and Atmospheric Administration (NOAA), and they have oodles of data.
After some digging I discovered what I was looking for: U.S. Climate Normals, a dataset for ~7,000 stations around the country that contains ‘normal’ or average measures of temperature, wind, etc. for the period 1981-2010.
Looking in their HTTPS data source, they had a nice readme.txt. After skimming this document I found that what I was looking for was in products/auxiliary/station. There was a problem, though. It is briefly alluded to by the line in the readme corresponding to these files: “Users wanting to visually inspect normals for just one or a few stations should use the files in this subdirectory.”
“Visually inspect”? I don’t want that, I want to read it into R! Let’s see what the problem could be…
products/auxiliary/station/CQC00914801.normals.txt
Station Name: MP ROTA AP
GHCN Daily ID: CQC00914801
Latitude: 14.1717
Longitude: 145.2428
Elevation: 179.2m
Temperature-Related Normals
Monthly JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
-----------------------------------------------------------------------------------------------------------
mly-tmax-normal 836R 832R 839R 852R 865R 873R 867R 865R 867R 865R 859R 848R
mly-tavg-normal 788R 782R 790R 801R 811R 817R 811R 808R 808R 808R 806R 800R
mly-tmin-normal 740R 733R 741R 750R 756R 761R 754R 750R 748R 751R 753R 752R
mly-dutr-normal 96R 99R 98R 102R 108R 112R 113R 115R 119R 114R 105R 95R
mly-cldd-normal 428R 371R 434R 453R 498R 501R 498R 488R 473R 490R 468R 465R
mly-htdd-normal 0R 0R 0R 0R 0R 0R 0R 0R 0R 0R 0R 0R
mly-tmax-stddev 8R 14R 13R 8R 6R 8R 6R 5R 14R 7R 7R 8R
mly-tavg-stddev 5R 11R 10R 5R 4R 6R 4R 5R 9R 5R 5R 6R
mly-htdd-base60 0R 0R 0R 0R 0R 0R 0R 0R 0R 0R 0R 0R
...
Turns out these files are just a bunch of tables concatenated into a single text file. Not exactly something read_csv can handle. All I want, though, is that top line: mly-tmax-normal. So we can go old school, read the file line by line, and extract just that line.
The get_station_data function defined at the top of this post does exactly that: it takes a given station ID, downloads the text file corresponding to that station, and extracts the monthly temperatures into a nice clean dataframe. Let’s test it out on the station we looked at two seconds ago.
get_station_data('CQC00914801') %>%
  knitr::kable()
temp | month | station_id |
---|---|---|
836R | Jan | CQC00914801 |
832R | Feb | CQC00914801 |
839R | Mar | CQC00914801 |
852R | Apr | CQC00914801 |
865R | May | CQC00914801 |
873R | Jun | CQC00914801 |
867R | Jul | CQC00914801 |
865R | Aug | CQC00914801 |
867R | Sep | CQC00914801 |
865R | Oct | CQC00914801 |
859R | Nov | CQC00914801 |
848R | Dec | CQC00914801 |
The weird letters in the temp column correspond to the quality of the measurement. Some stations hadn’t been up for all the years, so their averages were not as exact. We ignore this here, but a more rigorous analysis should probably look into them a bit more.
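If you did want a quick look at those flags rather than ignoring them, a minimal sketch (this uses the all_station_data frame assembled a little further down) could be:

# tally the quality flags attached to the temperature values (sketch; relies on all_station_data built below)
all_station_data %>%
  mutate(quality = str_extract(temp, '[A-Z]')) %>%
  count(quality, sort = TRUE)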
Wonderful! Now all we need is a list of the unique station IDs and we’re good to go! Luckily that is in the data as well: station-inventories/allstations.txt contains all the info on each station, along with its geographic position. Let’s grab it so we can get scraping!
<- "https://www1.ncdc.noaa.gov/pub/data/normals/1981-2010/station-inventories/allstations.txt"
info_loc
# taken from readme.txt
<- c(
column_names 'station_id','lat','lon','elevation',
'state','name','gsnflag', 'hcnflag',
'wmoid', 'method' )
<- read_table(info_loc, col_names = column_names) %>%
station_info select(station_id, lat, lon, state)
%>%
station_info head() %>%
::kable(align = c('l','r', 'r', 'r')) knitr
station_id | lat | lon | state |
---|---|---|---|
AQC00914000 | -14.3167 | -170.7667 | AS |
AQW00061705 | -14.3306 | -170.7136 | AS |
CAW00064757 | 44.2325 | -79.7811 | ON |
CQC00914080 | 15.2136 | 145.7497 | MP |
CQC00914801 | 14.1717 | 145.2428 | MP |
CQC00914855 | 15.1189 | 145.7294 | MP |
Now we just need to loop through all the IDs and amass a nice dataset of their temperatures over the year. We are going to do this in a loop rather than something like purrr::map_df because we can more easily handle errors and log progress.
stations <- station_info$station_id

# initialize dataframe for first station
all_station_data <- get_station_data(stations[1])

# vector to hold the stations we failed to get data for
bad_stations <- c()

# go!
for(station in stations[-1]){
  print(station)
  tryCatch({
    all_station_data <- all_station_data %>%
      bind_rows(get_station_data(station))
  }, error = function(e) {
    # <<- so the assignment reaches the vector outside the handler function
    bad_stations <<- c(bad_stations, station)
  })
}
This loop takes a long time to run: with each query taking ~1-2 seconds and ~7,000 unique stations to get through, I let it sit overnight. Amazingly, not a single station failed to download. For a long-running script hitting things over the internet, that is practically unheard of.
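For the curious, a roughly equivalent purrr approach using safely() would look something like the sketch below; I didn’t run it this way, it is just here to show the alternative.

# hypothetical purrr-based alternative to the loop above
safe_get <- purrr::safely(get_station_data)
results <- purrr::map(stations, safe_get)
all_station_data <- purrr::map_df(results, "result")                    # stack the successful pulls
bad_stations <- stations[purrr::map_lgl(results, ~ !is.null(.x$error))] # ids that errored out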
# Exactly 90,000 rows.
all_station_data %>% dim()
## [1] 90000     3
So we want a measure of seasonality. We will define this as how big of a delta there is between the warmest and coldest months for each station. (The normals files report temperatures in tenths of a degree Fahrenheit, which is why the swing gets divided by 10 below.)
swing_by_station <- all_station_data %>%
  mutate(
    quality = str_extract(temp, '[A-Z]'),
    temp = as.numeric(str_replace(temp, '[A-Z]', ''))
  ) %>%
  group_by(station_id) %>%
  summarise(
    warmest_month = month[which.max(temp)],
    coldest_month = month[which.min(temp)],
    swing = (max(temp) - min(temp))/10
  ) %>%
  right_join(station_info, by = 'station_id') %>%
  filter((lon < -60 & lon > -130 & lat > 20) & !is.na(swing)) # we only want the mainland US

swing_by_station %>%
  head() %>%
  knitr::kable()
station_id | warmest_month | coldest_month | swing | lat | lon | state |
---|---|---|---|---|---|---|
CAW00064757 | Jul | Jan | 51.9 | 44.2325 | -79.7811 | ON |
USC00010063 | Aug | Jan | 39.2 | 34.2553 | -87.1814 | AL |
USC00010160 | Jul | Jan | 35.5 | 32.9453 | -85.9481 | AL |
USC00010178 | Jul | Jan | 36.6 | 33.1272 | -88.1550 | AL |
USC00010252 | Jul | Jan | 30.8 | 31.3072 | -86.5225 | AL |
USC00010260 | Jul | Jan | 40.4 | 34.9092 | -87.2747 | AL |
We’re basically done now! Let’s just plot this and call it a day… Unfortunately it doesn’t work that easily. Let’s see why.
ggplot(swing_by_station, aes(x = lon, y = lat, color = swing)) +
geom_point(alpha = 0.2)
Unsurprisingly, our data are unevenly distributed around the country. We want a nice smooth map of the swing, so we need to do some interpolation. Enter the geospatial side of R.
Geospatial interpolation
What we need to do is fit a model that can predict average swing given a lat-lon pair, along with generating a high-density evenly-spaced grid to run that model on.
Gridding it
To get the grid we will download a US shapefile from the Census’ TIGER repository and use the rgdal library to read it in and exclude the locations we don’t want to look at.
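I won’t dwell on the download itself, but fetching and unzipping the shapefile is a couple of lines of base R; the URL below is my guess at the matching Census cartographic boundary file, so treat it as a sketch.

# fetch and unzip the 2016 cartographic-boundary states shapefile (URL assumed, not verified here)
shp_url <- "https://www2.census.gov/geo/tiger/GENZ2016/shp/cb_2016_us_state_500k.zip"
download.file(shp_url, destfile = "cb_2016_us_state_500k.zip")
unzip("cb_2016_us_state_500k.zip", exdir = "us_shapefile")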
library(sp)
library(rgdal)
library(gstat)
library(maptools)
<- c(
not_wanted 'Alaska', 'American Samoa', 'Puerto Rico', 'Guam',
'Commonwealth of the Northern Mariana Islands United States Virgin Islands',
'Commonwealth of the Northern Mariana Islands',
'United States Virgin Islands', 'Hawaii')
<- rgdal::readOGR("../../media/seasons/us_shapefile/", "cb_2016_us_state_500k") %>%
us subset(!(NAME %in% not_wanted))
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/lucymcgowan/Dropbox (Wake Forest University)/wonderland-db/lfod/media/seasons/us_shapefile", layer: "cb_2016_us_state_500k"
## with 56 features
## It has 9 fields
## Integer64 fields read as strings: ALAND AWATER
plot(us)
Looks right to me! Now let’s calculate a grid of points over this.
# make a big grid that corresponds to the bounding box around our shapefile
grid_us <- makegrid(us, n = 5000) %>%
  SpatialPoints(proj4string = CRS(proj4string(us))) %>%
  # subset the grid such that it only has points that fall inside of our states
  .[us, ]

plot(grid_us, type = 'p')
In the final product I bumped up the resolution (n) way higher to make the result smoother
Next we need to fit a model for interpolation. Here I use one of the most common techniques for this: inverse distance weighting. The task of fitting a model to these data is actually a fascinating one that I am not giving nearly enough weight in this post. Here we only use lat and lon as inputs, but we could easily make it more realistic by adding in elevation, etc. Some models use 2-d Gaussian processes to fit flexible surfaces, some use splines; it’s a vast world and one I would love to dig into more!
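For intuition, the core of inverse distance weighting is just a weighted average of the station values, with weights that fall off with distance; the toy function below illustrates the idea (it is not what gstat::idw() does internally, which handles projections and efficiency properly).

# toy IDW predictor for a single grid point: weights fall off as 1/distance^idp
idw_toy <- function(grid_lon, grid_lat, station_lon, station_lat, station_swing, idp = 2){
  d <- sqrt((station_lon - grid_lon)^2 + (station_lat - grid_lat)^2) # naive planar distances
  w <- 1 / d^idp                                                     # closer stations count for more
  sum(w * station_swing) / sum(w)                                    # weighted average of swings
}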
# convert the data to a spatial dataframe
sp::coordinates(swing_by_station) = ~lon + lat

# make sure that the projection matches the grid we've built
proj4string(swing_by_station) <- CRS(proj4string(us))

# fit a basic inverse distance model
idw_model <- gstat::idw(
  formula = swing ~ 1,
  locations = swing_by_station,
  newdata = grid_us,
  idp = 2)
## [inverse distance weighted interpolation]

# extract predictions from the interpolation model
interpolated_results <- as.data.frame(idw_model) %>% {   # output comes back with generic column names
    names(.)[1:3] <- c("lon", "lat", "swing")            # give names to the modeled variables
    .
  } %>%
  select(lon, lat, swing)

interpolated_results %>%
  head() %>%
  knitr::kable()
lon | lat | swing |
---|---|---|
-80.80 | 25.34 | 18.65760 |
-80.26 | 25.34 | 18.35323 |
-81.34 | 25.88 | 16.24864 |
-80.80 | 25.88 | 18.56805 |
-80.26 | 25.88 | 15.06180 |
-98.62 | 26.42 | 31.26166 |
We’re in business! We’ve now got an evenly spaced grid of points we can plot in a standard raster way!
ggplot(interpolated_results, aes(x = lon, y = lat)) +
geom_raster( aes(fill = swing)) +
xlim(-125, -65) + ylim(24, 51) +
theme_void() +
labs(fill = "Temp swing in degrees") +
borders('state', alpha = 0.1, size = 0.1)
Looks good, but it could use some improvement. The gradations are rather smooth and thus hard to see. Let’s bin the swings into 5-degree intervals and change the color palette to something a bit more appropriate. I chose 5-degree intervals here because I figured that was about the limit of sensitivity for the average person. Totally arbitrary though.
interpolated_results %>%
  mutate(swing_in_5s = round(swing/5)*5) %>%
ggplot(aes(x = lon, y = lat)) +
geom_raster( aes(fill = swing_in_5s)) +
scale_fill_distiller(palette = 'YlGn', direction = 1) +
xlim(-125, -65) + ylim(24, 51) +
theme_void() +
labs(fill = "Temp swing in degrees") +
borders('state', alpha = 0.1, size = 0.1)
There it is! For the plot at the top of the post I reran the grid with a much higher n and also tweaked some settings in the legend/scales to make the plot look a bit better. Hopefully now, when you have some interesting geospatial data that is not evenly distributed, you will know how to deal with it!
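The figure code below uses a data frame called interpolated_swings that I haven’t shown being built; it is presumably the same interpolation pipeline rerun on that denser grid, roughly like the sketch here (the n is a placeholder, not the value used for the final map).

# sketch: rerun the interpolation on a much denser grid to get interpolated_swings (n is a placeholder)
grid_us_fine <- makegrid(us, n = 100000) %>%
  SpatialPoints(proj4string = CRS(proj4string(us))) %>%
  .[us, ]

interpolated_swings <- gstat::idw(
    formula = swing ~ 1,
    locations = swing_by_station,
    newdata = grid_us_fine,
    idp = 2) %>%
  as.data.frame() %>%
  select(lon = 1, lat = 2, swing = 3) %>%
  mutate(swing_in_5s = round(swing/5)*5)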
Code for top figure.
library(RColorBrewer)  # for brewer.pal()

guide_tinker = guide_legend(
  title.position = "top",
  label.position = "bottom",
  label.hjust = 0.5,
  direction = "horizontal",
  keywidth = 1,
  nrow = 1)

colourCount = interpolated_swings$swing_in_5s %>% unique() %>% length()
palette = colorRampPalette(brewer.pal(9, "YlGnBu"))(colourCount)

interpolated_swings %>%
  mutate(swing_in_5s = factor(round(swing_in_5s))) %>%
ggplot(aes(x = lon, y = lat)) +
geom_raster( aes(fill = swing_in_5s)) +
scale_fill_manual(values = palette,
guide = guide_tinker) +
xlim(-130,-67) + ylim(24,50) +
theme_void() +
theme(
text = element_text(family = 'Montserrat'),
legend.justification = c(0,0),
legend.position = c(0,0.02),
legend.title = element_text(size = 10),
legend.text = element_text(size = 8),
legend.box.background = element_rect(fill = '#f0f0f0', color = NA)
  ) +
  labs(
title = "The United States of Seasons",
subtitle = "Difference between the hottest and coldest days of the year",
fill = "Temp swing in degrees") +
borders('state', alpha = 0.1, size = 0.1)