Bringing the family together: Finding the center of geographic points in R

My husband’s family throws a family reunion every year and this year we’ve been tasked with co-planning it. We were trying to decide on the best location for everyone, so I embarked on a mission to find the center of all of our residences.

library(tidyverse)
library(leaflet)

Geocoding the locations

I began by putting together a quick .csv of the locations and number of family members in each.

df <- read_csv("locations.csv")
DT::datatable(df)

To get the longitude and latitude for each location, I used the opencage package. This 📦 requires an API key – you can get a free one from their website. For tips on adding environment variables to your .Renviron, check out Jenny Bryan’s Happy Git and GitHub for the useR. If you add this key to your .Renviron as OPENCAGE_KEY the package will automatically locate and use this.

Maëlle Salmon has a blog post for everything! I based my geocoding on the methods she used in her post on the R-Ladies global tour. The open_forward() function will give multiple results for each call, along with a confidence score for each. We can sort by the confidence score and take the top result. Since I just want to grab the longitude and latitude for the most likely location, I have made a small function to grab that, based on Maëlle’s blog post.

geocode_family <- function(city) {
  opencage::opencage_forward(city)$results %>%
    filter(components._type %in% c("city", "state")) %>%
    arrange(desc(confidence)) %>%
    select(lat = geometry.lat, lon = geometry.lng) %>%
    slice(1)
}
df <- df %>%
  bind_cols(map_df(.$city, geocode_family))

Finding the center

It turns out there are many ways to find the center of geographic locations. The easiest way is to average all of the longitude points and average all of the latitude points. This can optionally be weighted by the number of people at each point by taking a weighted mean. I call this the geographic_average().

geographic_average <- function(lon, lat, weight = NULL) {
  if (is.null(weight)) {
    weight <- rep(1, length(lon))
  }
  lon <- weighted.mean(lon, w = weight)
  lat <- weighted.mean(lat, w = weight)
  data.frame(lon = lon, lat = lat)
}

A bit of a more complex version uses vector algebra – first converting the points to radians, then sticking them on a Cartesian plane. This was well explained on Ask Dr. Math. I call this the geographic_midpoint().

geographic_midpoint <- function(lon, lat, weight = NULL) {
  if (is.null(weight)) {
    weight <- rep(1, length(lon))
  }
  # degrees to radians
  lat <- lat * pi / 180
  lon <- lon * pi / 180
  # cartesian coordinates
  x <- cos(lat) * cos(lon)
  y <- cos(lat) * sin(lon)
  z <- sin(lat)
  # weighted mean
  x <- weighted.mean(x, w = weight)
  y <- weighted.mean(y, w = weight)
  z <- weighted.mean(z, w = weight)
  # convert to lat and lon
  lon <- atan2(y, x) * 180 / pi
  hyp <- sqrt(x * x + y * y)
  lat <- atan2(z, hyp) * 180 / pi
  
  data.frame(lon = lon, lat = lat)
}

A final way to do this is using the geosphere package along with the centroid function. This calculates the center by projecting the points onto a spherical polygon and finding the centroid.

I’ll use all three methods to demonstrate their difference. Some of my husband’s family live in Europe, so I will calculate the center both with and without them.

df_nospain <- df %>%
  filter(city != "Madrid, Spain")

center <- list(
  "Unweighted Average" = geographic_average(
    lon = df$lon,
    lat = df$lat),
  "Weighted Average" = geographic_average(
    lon = df$lon,
    lat = df$lat,
    weight = df$number),
  "Unweighted Midpoint" = geographic_midpoint(
    lon = df$lon,
    lat = df$lat),
  "Weighted Midpoint" = geographic_midpoint(
    lon = df$lon, 
    lat = df$lat,
    weight = df$number),
  "Centroid" = data.frame(geosphere::centroid(
    df[, c("lon", "lat")])),
  "Unweighted Average (No Spain)" = geographic_average(
    lon = df_nospain$lon,
    lat = df_nospain$lat),
  "Weighted Average (No Spain)" = geographic_average(
    lon = df_nospain$lon,
    lat = df_nospain$lat,
    weight = df_nospain$number),
  "Unweighted Midpoint (No Spain)" = geographic_midpoint(
    lon = df_nospain$lon, 
    lat = df_nospain$lat),
  "Weighted Midpoint (No Spain)" = geographic_midpoint(
    lon = df_nospain$lon,
    lat = df_nospain$lat,
    weight = df_nospain$number),
  "Centroid (No Spain)" = data.frame(geosphere::centroid(
    df_nospain[, c("lon", "lat")]))
)
center <- bind_rows(center, .id = "calculation")

Visualizing

Now that we’ve calculated the various “center” points, we can plot them using leaflet. Since I have so many different things to visualize, I am going to use a color palette to help me choose the marker colors! You can install this package from GitHub with devtools::install_github("EmilHvitfeldt/paletteer") What a great excuse to try out Emil Hvitfeldt’s new paletteer 📦!

cols <- paletteer::paletteer_d("LaCroixColoR", 
                               "PassionFruit", 
                               n = 12, 
                               type = "continuous")
leaflet() %>%
  addTiles() %>%  
  setView(lng = -82, lat = 38, zoom = 04) %>%
  addCircleMarkers(lng = df$lon,
                   lat = df$lat,
                   color = cols[1],
                   label = df$city,
                   radius = 5) %>%
  addCircleMarkers(lng = center$lon,
                   lat = center$lat,
                   color = cols[2:12],
                   label = center$calculation,
                   radius = 8)

The various methods seemed to yield very similar results when not including Spain, but otherwise they are noticeably different. Without Spain, it looks like if we want the next family reunion to be equidistant (as the crow flies 🐦), then we will be somewhere in West Virginia or southern Ohio. If we incorporate a weight based on how many relatives live in each location, it looks like we’ll end up in Wayne National Forest (coincidentally, we held the reunion here 4 years ago!). If we include the European relatives, using the average or midpoint methods, we could end up in Pennsylvania, Maryland, or Delaware; otherwise we will be taking a cruise 🛳 across the Atlantic!


Lucy D'Agostino McGowan image
Lucy D'Agostino McGowan

Currently excited about: observational study methods, translational research, BB-8

comments powered by Disqus