Here we are going to look at several diagnostic plots that are helpful when attempting to answer a causal question. They can be used to visualize the target population, balance, and treatment effect heterogeneity.
I’ve simulated data to demonstrate the utility of the various plots. In each simulation, we have four pretreatment variables: var1, var2, var3, and var4, a treatment, t, and an outcome, y. I have also fit a propensity score model for each and calculated ATE, ATT, and overlap weights (unsure what these are? Check out this post on propensity score weighting). I will also create three propensity score matched cohorts, each with a different caliper.
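If the weighting schemes are unfamiliar, their definitions are short enough to sketch directly. Here is a base R version of the standard closed forms (a simplified sketch of what the propensity package computes, not its actual implementation):

```r
# Standard propensity score weights, written out by hand.
# p: estimated propensity score; t: treatment indicator (0/1).
ps_weights <- function(p, t) {
  data.frame(
    # ATE: inverse probability of the treatment actually received
    w_ate = ifelse(t == 1, 1 / p, 1 / (1 - p)),
    # ATT: treated units keep weight 1; controls are reweighted
    # to resemble the treated
    w_att = ifelse(t == 1, 1, p / (1 - p)),
    # ATO (overlap): bounded between 0 and 1, so no extreme weights
    w_ato = ifelse(t == 1, 1 - p, p)
  )
}

# A treated unit with p = 0.8 gets w_ate = 1.25, w_att = 1, w_ato = 0.2
ps_weights(p = c(0.8, 0.2), t = c(1, 0))
```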
I use the following packages:
library(tidyverse)
library(propensity)
library(halfmoon)
library(MatchIt)
For brevity I’ve hidden the simulation code, but if you would like to see it to reproduce the analyses yourself, just click the Code toggle.
set.seed(8)
n <- 1000

## Main Data 
var1 <- rnorm(n, sd = 0.25)
var2 <- rnorm(n, sd = 0.25)
var3 <- rnorm(n, sd = 0.25)
var4 <- rnorm(n, sd = 0.25)
e_x <- 1 / (1 + exp(-(var1 + var2 + var3 + var4)))
t <- rbinom(n, 1, e_x)
y1 <- 0.5 * (var1 + var2 + var3 + var4) + rnorm(n)
y0 <- -0.5 * (var1 + var2 + var3 + var4) + rnorm(n)
y_obs <- t * y1 + (1 - t) * y0

data <- data.frame(
  y = y_obs,
  t = factor(t),
  var1,
  var2,
  var3,
  var4
)

data <- data |>
  mutate(
    p = glm(t ~ var1 + var2 + var3 + var4, data = data, family = binomial) |>
      predict(type = "response"),
    w_ate = wt_ate(p, t, .treated = 1),
    w_ato = wt_ato(p, t, .treated = 1),
    w_att = wt_att(p, t, .treated = 1)
  )

match_1 <- matchit(t ~ var1 + var2 + var3 + var4, data = data, caliper = 0.1)
match_2 <- matchit(t ~ var1 + var2 + var3 + var4, data = data, caliper = 0.01)
match_3 <- matchit(t ~ var1 + var2 + var3 + var4, data = data, caliper = 0.001)
matches <- bind_matches(data, match_1, match_2, match_3)
## Nonlinear data 
e_x <- 1 / (1 + exp(-(var1 + var2 + var3 + 2 * I(var4 < 0.5) + 6 * I(var4 > 0.5))))
t <- rbinom(n, 1, e_x)
y1 <- 1 + var1 + var2 + var3 + var4
y0 <- var1 + var2 + var3 + var4
y_obs <- t * y1 + (1 - t) * y0

data_nonlinear <- data.frame(
  y = y_obs,
  t = factor(t),
  var1,
  var2,
  var3,
  var4
)

data_nonlinear <- data_nonlinear |>
  mutate(
    p = glm(t ~ var1 + var2 + var3 + var4, data = data_nonlinear, family = binomial) |>
      predict(type = "response"),
    w_ate = wt_ate(p, t, .treated = 1),
    w_ato = wt_ato(p, t, .treated = 1),
    w_att = wt_att(p, t, .treated = 1)
  )
## Positivity violation data 
set.seed(2186)
var4 <- rgamma(n, 1, 1.5)
e_x <- 1 / (1 + exp(-(var1 + var2 + var3 + var4 + 0.2 * var4^2 + 0.001 * var4^3)))
t <- rbinom(n, 1, e_x)

data_positivity <- data.frame(
  t = factor(t),
  var1,
  var2,
  var3,
  var4
)

data_positivity <- data_positivity |>
  mutate(
    p = glm(t ~ var1 + var2 + var3 + poly(var4, 3), data = data_positivity, family = binomial) |>
      predict(type = "response"),
    w_ate = wt_ate(p, t, .treated = 1),
    w_ato = wt_ato(p, t, .treated = 1),
    w_att = wt_att(p, t, .treated = 1)
  )
Targeting different causal estimands will yield different target populations (for a longer discussion of target populations, see this post and check out this awesome preprint by Noah Greifer and Liz Stuart). For example, if you are interested in answering a question with the treated group in mind, an estimand that estimates the average treatment effect among the treated (ATT) will be appropriate. Targeting this estimand will lead to selecting unexposed individuals who match the characteristics of the treated population (whether via matching to these individuals or upweighting them in the sample). Mirrored histograms can be a nice way to visualize the distribution of your target population after incorporating the propensity score when either matching or weighting is used. These basic plots are simply histograms of the propensity score, stratified by exposure; “mirroring” these histograms above and below the x-axis can make it easier to compare regions of overlap.
Let’s take a look at an example. We can use the halfmoon package to create these.
The cobalt package is another excellent tool that can create many of these plots in R.
Below is the basic implementation of this mirrored histogram prior to incorporating the propensity score. On the top half of the visualization, we see the distribution of the propensity score in the treated group (blue); the bottom half displays the distribution among the controls (orange). Looking at this plot, I see good overlap (i.e. the two distributions overlap), and I do not see evidence of positivity violations.
ggplot(data, aes(p, fill = t, group = t)) +
  geom_mirror_histogram(bins = 30) +
  annotate("label", 0.5, -10, label = "control", color = "orange") +
  annotate("label", 0.5, 10, label = "treated", color = "cornflowerblue") +
  scale_y_continuous(labels = abs) +
  labs(x = "propensity score",
       y = "count") +
  theme(legend.position = "none")
Now let’s incorporate the propensity score. First, let’s see what this plot looks like if our target population is the whole population, meaning we are interested in estimating the ATE. I have added the following line of code to the ggplot layers below: geom_mirror_histogram(bins = 30, aes(weight = w_ate), alpha = 0.5). Now I can see the pseudopopulation that is created after applying the propensity score weight. Notice that the shapes of the distributions match between the two groups (this is what the ATE weight is trying to do!). Looking at the figure below, I can also conclude that there aren’t any extreme weights.
ggplot(data, aes(p, fill = t, group = t)) +
  geom_mirror_histogram(bins = 30, fill = "grey") +
  geom_mirror_histogram(bins = 30, aes(weight = w_ate), alpha = 0.5) +
  annotate("label", 0.5, -10, label = "control", color = "orange") +
  annotate("label", 0.5, 10, label = "treated", color = "cornflowerblue") +
  scale_y_continuous(labels = abs) +
  labs(x = "propensity score",
       y = "count") +
  theme(legend.position = "none")
These plots can be useful as a pedagogical tool to give a sense of how different target estimands lead to different target populations. For example, let’s see what the pseudopopulation looks like after using the ATT weight. Notice in the figure below that the “weighted” pseudopopulation in the treated arm exactly overlaps the actual distribution of the treated observations – this is exactly what an ATT weight does: everyone in the treated group receives a weight of 1. Now look at the bottom half of the figure – the distribution of the propensity scores in the control group now matches that of the treated. In the regions where there are fewer treated observations (where the propensity score is lower), the control observations are downweighted, and in the regions where there are more treated observations (where the propensity score is higher), they are upweighted.
ggplot(data, aes(p, fill = t, group = t)) +
  geom_mirror_histogram(bins = 30, fill = "grey") +
  geom_mirror_histogram(bins = 30, aes(weight = w_att), alpha = 0.5) +
  annotate("label", 0.5, -10, label = "control", color = "orange") +
  annotate("label", 0.5, 10, label = "treated", color = "cornflowerblue") +
  scale_y_continuous(labels = abs) +
  labs(x = "propensity score",
       y = "count") +
  theme(legend.position = "none")
Finally, let’s see how an overlap (ATO) weight compares. Notice in the figure below that all observations appear to be downweighted – the overlap weights are bounded by 0 and 1 (which means they have nice variance properties; there is no risk of an extreme weight!). Also notice the shape of the distribution – it matches between the two groups.
ggplot(data, aes(p, fill = t, group = t)) +
  geom_mirror_histogram(bins = 30, fill = "grey") +
  geom_mirror_histogram(bins = 30, aes(weight = w_ato), alpha = 0.5) +
  annotate("label", 0.5, -10, label = "control", color = "orange") +
  annotate("label", 0.5, 10, label = "treated", color = "cornflowerblue") +
  scale_y_continuous(labels = abs) +
  labs(x = "propensity score",
       y = "count") +
  theme(legend.position = "none")
I like to compare the overlap weights to the distribution after matching with a caliper, as they are both used to estimate a similar estimand. Here, I have created three matched cohorts, each with an increasingly smaller caliper. We can think of matching as an extreme form of weighting, where an observation receives a weight of 1 if it is in the matched cohort and 0 otherwise. I have created a dataset called matches that has three columns with these indicators: match_1, match_2, and match_3 (you can see the code above by clicking the Code toggle in the Setup section).
Propensity score matching with a caliper means that you only consider matches within a prespecified distance of each other. Smaller calipers will result in fewer matches.
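To make the caliper idea concrete, here is a toy base R sketch (hypothetical propensity scores, and a simplification: matchit() matches greedily and, by default, applies the caliper in standard-deviation units of the distance measure):

```r
# Can each treated unit find a control within the caliper?
# (Hypothetical propensity scores, chosen for illustration.)
p_treated <- c(0.40, 0.70)
p_control <- c(0.405, 0.65)

matches_within <- function(caliper) {
  sapply(p_treated, function(pt) any(abs(pt - p_control) <= caliper))
}

matches_within(0.1)    # TRUE TRUE: both treated units can be matched
matches_within(0.01)   # TRUE FALSE: only the first (its nearest control is 0.005 away)
matches_within(0.001)  # FALSE FALSE: no matches survive
```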
ggplot(matches, aes(p, fill = t, group = t)) +
  geom_mirror_histogram(bins = 30, fill = "grey") +
  geom_mirror_histogram(bins = 30, aes(weight = match_1), alpha = 0.5) +
  annotate("label", 0.5, -10, label = "control", color = "orange") +
  annotate("label", 0.5, 10, label = "treated", color = "cornflowerblue") +
  scale_y_continuous(labels = abs) +
  labs(x = "propensity score",
       y = "count") +
  theme(legend.position = "none")
ggplot(matches, aes(p, fill = t, group = t)) +
  geom_mirror_histogram(bins = 30, fill = "grey") +
  geom_mirror_histogram(bins = 30, aes(weight = match_2), alpha = 0.5) +
  annotate("label", 0.5, -10, label = "control", color = "orange") +
  annotate("label", 0.5, 10, label = "treated", color = "cornflowerblue") +
  scale_y_continuous(labels = abs) +
  labs(x = "propensity score",
       y = "count") +
  theme(legend.position = "none")
ggplot(matches, aes(p, fill = t, group = t)) +
  geom_mirror_histogram(bins = 30, fill = "grey") +
  geom_mirror_histogram(bins = 30, aes(weight = match_3), alpha = 0.5) +
  annotate("label", 0.5, -10, label = "control", color = "orange") +
  annotate("label", 0.5, 10, label = "treated", color = "cornflowerblue") +
  scale_y_continuous(labels = abs) +
  labs(x = "propensity score",
       y = "count") +
  theme(legend.position = "none")
For demonstration purposes, let’s see what one of these plots looks like in a dataset that doesn’t have perfect overlap. Whoa! Look at that weight in the figure below. This is an example where we see a possible positivity violation (some observations have propensity scores very close to 1) and extreme weights (check out that control with a weight > 500!).
ggplot(data_positivity, aes(p, fill = t, group = t)) +
  geom_mirror_histogram(bins = 30) +
  geom_mirror_histogram(bins = 30, aes(weight = w_ate), alpha = 0.5) +
  annotate("label", 0.6, -20, label = "control", color = "orange") +
  annotate("label", 0.6, 20, label = "treated", color = "cornflowerblue") +
  scale_y_continuous(labels = abs) +
  labs(x = "propensity score",
       y = "count") +
  theme(legend.position = "none")
Ok, once we’ve checked out the target population, we can see how balanced the exposure groups are after incorporating the propensity score.
A common way to look at balance is the standardized mean difference (SMD). This will tell us whether the (standardized) means are balanced between the treatment groups (we often target an absolute standardized mean difference less than 0.1 as a rule of thumb). We can use the tidy_smd function to calculate the standardized mean differences between the exposure groups in our example dataset. Let’s see how they compare across the matched datasets.
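For intuition, the SMD itself is simple to compute by hand. Here is a base R sketch using one common convention for the denominator (the square root of the average of the two group variances; tidy_smd's exact conventions may differ):

```r
# Standardized mean difference between two groups, by hand.
smd <- function(x, group) {
  m <- tapply(x, group, mean)  # group means
  v <- tapply(x, group, var)   # group variances
  unname((m[2] - m[1]) / sqrt((v[1] + v[2]) / 2))
}

# Means 2 and 12, both variances 1, so the SMD is 10
smd(c(1, 2, 3, 11, 12, 13), c(0, 0, 0, 1, 1, 1))
```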
matches_smd <- tidy_smd(
  matches,
  var1:var4,
  .group = t,
  .wts = c(match_1, match_2, match_3)
)
A nice way to visualize these is a Love Plot (named for Thomas Love, who was one of the first folks to use them). The halfmoon package has a geom_love that will help create this as a layer in a ggplot, or you can use the shorthand love_plot. Below we can see that all three matched sets achieve balance, with standardized mean differences for all pretreatment variables below the 0.1 rule of thumb. Each of the different “matches” denotes a different caliper (from largest, match_1, to smallest, match_3). We see here that using a smaller caliper seems to help balance var2 at the expense of var4 and var1 compared to the larger calipers.
ggplot(matches_smd, aes(abs(smd), variable, group = method, color = method)) +
  geom_love() +
  labs(x = "Absolute standardized mean difference")
Let’s look at another dataset. I have simulated data I am calling data_nonlinear. Let’s check out the Love Plot for this data. This time I will use our propensity score weights.
weighted_smd <- tidy_smd(
  data_nonlinear,
  var1:var4,
  .group = t,
  .wts = c(w_ate, w_att, w_ato)
)
ggplot(weighted_smd, aes(x = abs(smd), y = variable, group = method, color = method)) +
  geom_love() +
  labs(x = "Absolute standardized mean difference")
Great! It looks like any of our weighting choices will achieve balance on the mean. Check out the green line (the overlap weights) – the standardized mean differences are exactly 0! This is a feature of this weight: if the propensity score is fit using logistic regression, any variable included in the model will be perfectly balanced on the mean – COOL! BUT, as you may have guessed from the name of this dataset, the mean does not tell the whole story. These variables are continuous, so being balanced on the mean does not guarantee that the whole distribution is balanced. To examine the distribution of a variable across treatment groups, we can use an empirical cumulative distribution function (eCDF) plot.
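That exact-balance property is easy to verify numerically. Here is a small self-contained check (simulated data, not the dataset above): fit a logistic propensity model, form the overlap weights directly, and compare the weighted covariate means; the logistic score equations force them to be identical.

```r
set.seed(1)
x <- rnorm(200)
t <- rbinom(200, 1, plogis(x))

# Propensity score from logistic regression, overlap weights by hand
p <- fitted(glm(t ~ x, family = binomial))
w <- ifelse(t == 1, 1 - p, p)

# The ATO-weighted means match exactly (up to glm's convergence tolerance)
weighted.mean(x[t == 1], w[t == 1])
weighted.mean(x[t == 0], w[t == 0])
```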
The geom_ecdf function in the halfmoon package allows you to visualize weighted empirical CDFs. Let’s first look at the unweighted eCDF for var4. We are going to plot the range of var4 values on the x-axis and the proportion of var4 values that are less than or equal to the given x-value on the y-axis (the empirical CDF), stratified by treatment. Looking at the figure below, we see gaps between the lines (the two lines do not overlap, implying that the distributions differ).
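Under the hood, a weighted eCDF is just a normalized cumulative sum of weights. Here is a base R sketch of the computation (my reading of what geom_ecdf draws, not halfmoon's actual code):

```r
# Weighted empirical CDF: at each sorted x value, the cumulative
# share of the total weight at or below that value.
weighted_ecdf <- function(x, w) {
  ord <- order(x)
  data.frame(x = x[ord], F = cumsum(w[ord]) / sum(w))
}

# The middle value carries half the total weight,
# so F steps through 0.25, 0.75, 1 at x = 1, 2, 3
weighted_ecdf(x = c(3, 1, 2), w = c(1, 1, 2))
```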
ggplot(data_nonlinear, aes(x = var4, group = t, color = t)) +
  geom_ecdf() +
  labs(y = "Proportion <= x") +
  theme(legend.position = "none")
Now we can compare this to the weighted eCDF to see if the propensity score weighting improves the balance – I’ll use the overlap weights for demonstration purposes. Hmm. We can see in the plot below that the lines cross (they are balanced on the mean; we knew that from the Love Plot), but there are still pretty large gaps across other portions of the distribution. This suggests that there are some nonlinear effects that the propensity score model is failing to capture.
ggplot(data_nonlinear, aes(x = var4, group = t, color = t)) +
  geom_ecdf(aes(weights = w_ato)) +
  labs(y = "Proportion <= x") +
  theme(legend.position = "none")
Let’s try refitting our propensity score model with a spline on the var4 variable, and then recreate our plot.
data_nonlinear <- data_nonlinear |>
  mutate(p = glm(t == 1 ~ var1 + var2 + var3 + splines::ns(var4, 4),
                 data = data_nonlinear, family = binomial) |>
           predict(type = "response"),
         w_ato = wt_ato(p, t, .treated = 1))
ggplot(data_nonlinear, aes(x = var4, group = t, color = t)) +
  geom_ecdf(aes(weights = w_ato)) +
  labs(y = "Proportion <= x") +
  theme(legend.position = "none")
Much better!
Here is the last visual! This is a quick plot that can help explore possible treatment heterogeneity.
In this example dataset, the average treatment effect is 0. Let’s show that. There are lots of ways to estimate it; for example, we can use the ATE weights.
lm(y ~ t, data = data, weights = w_ate)

Call:
lm(formula = y ~ t, data = data, weights = w_ate)

Coefficients:
(Intercept)           t1  
    0.02095      0.05364  
Awesome! Now let’s create a plot to see if this effect is constant across the covariate space. One way to summarize the “covariate space” is the propensity score! This simple plot has the propensity score on the x-axis and the outcome on the y-axis. We then stratify by the treatment and look at a smoothed line in both groups.
ggplot(data, aes(x = p, y = y, color = t)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", formula = "y ~ x")
The lines cross! This indicates treatment effect heterogeneity (in this particular case, when the propensity score is greater than 0.5 there is a positive treatment effect, and when it is less than 0.5 there is a negative one).
You can find a longer post about these plots here.
Just to see what it looks like when there is not a heterogeneous treatment effect, let’s check out the data_nonlinear dataset (where I simulated a constant effect). Notice below that the lines don’t cross; the gap between them is constant across the covariate space.
ggplot(data_nonlinear, aes(x = p, y = y, color = t)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = "y ~ x")
So there you have it! Four visualizations to add to your tool kit when attempting to answer causal questions.
We initially wrote this up as part of this special issue; you can find the paper here: D’Agostino McGowan, D’Agostino, and D’Agostino (2023), doi.org/10.1353/obs.2023.0008. Another special aspect of this paper is that it is the first (and only!) paper I have written with both my father and grandfather ❤️
Here are the lines from the original paper:
Since covariance adjustment is effectively adjustment for the linear discriminant, plots of the responses y and/or residuals y - ŷ, where ŷ is the value of y predicted from the regression model used in the covariance adjustment, versus the linear discriminant are useful in identifying nonlinear or nonparallel response surfaces, as well as extrapolations, which might distort the estimate of the average treatment effect. Furthermore, such a plot is a bivariate display of multivariate adjustment, and as such might be useful for general presentation.
Generally, plots of responses and residuals from covariance analysis against the propensity score are more appropriate than against the discriminant, unless of course the covariates are multivariate normal with common covariance matrix in which case the propensity score is a monotone function of the discriminant.
So simple! They are just suggesting, for continuous outcomes, plotting the outcome (or residuals) against the propensity score. If you use a linear discriminant to estimate the propensity score, this is basically a residuals vs. fits plot (stratified by treatment group). Let’s look at an example.
Let’s generate some data to see how this works. We have a standard normal confounder, x:
library(tidyverse)
set.seed(8)
n <- 1000
x <- rnorm(n)
The “true” propensity score, that is, the probability that t = 1 given x, is as follows:
e_x <- exp(x) / (1 + exp(x))
t <- rbinom(n, 1, e_x)
The effect of the treatment, t, is heterogeneous: it depends on x, as follows:
y1 <- 0.5 * x + rnorm(n)
y0 <- -0.5 * x + rnorm(n)
y_obs <- t * y1 + (1 - t) * y0
Ok, now we have our observed data, let’s put it all together:
data <- data.frame(
  y = y_obs,
  t = t,
  x = x
)
Let’s begin by estimating our causal effect by adjusting for x directly in a model. Since x is standard normal, this is actually going to be just fine.
On when covariate adjustment will work, from Rosenbaum & Rubin 1983: Cases where covariance adjustment has been seen to perform quite poorly are precisely those cases in which the linear discriminant is not a monotone function of the propensity score, so that covariance adjustment is implicitly adjusting for a poor approximation to the propensity score. In the case of univariate x, the linear discriminant is a linear function of x, whereas the propensity score may not be a monotone function of x if the variances of x in the treated and control groups are unequal.
model_noint <- lm(y ~ t + x, data)
model_noint

Call:
lm(formula = y ~ t + x, data = data)

Coefficients:
(Intercept)            t            x  
   0.226609     0.003062     0.012806  
Looks like the treatment effect controlling for x is ~0. Let’s look at some diagnostics; for example, we can look at the residuals vs. fits plot.
In an unfortunate oversight, in the paper I call this a biased estimate of the causal effect – it isn’t biased (it is the actual average treatment effect!); it just isn’t terribly interesting, as it doesn’t take the interaction into account.
ggplot(data, aes(x = fitted(model_noint), y = residuals(model_noint))) +
  geom_point() +
  geom_smooth(method = "loess", formula = "y ~ x", se = FALSE, span = 1) +
  labs(x = "Fitted", y = "Residuals")
Ok, looks ok – maybe suggests a slightly nonlinear relationship. Now let’s do what Rosenbaum & Rubin suggest and examine this same plot stratified by treatment group.
ggplot(data, aes(x = fitted(model_noint), y = residuals(model_noint), color = factor(t))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", formula = "y ~ x", se = FALSE, span = 1) +
  labs(x = "Fitted", y = "Residuals", color = "treatment")
Whoa! Look at that heterogeneity. Looks like that “average” treatment effect really only represents those in the middle of the distribution of the confounder, x (which of course is exactly how we created the variable!).
We can fit the “correct” model, including the interaction and examine the plot again.
model_int <- lm(y ~ t + x + t * x, data)
model_int

Call:
lm(formula = y ~ t + x + t * x, data = data)

Coefficients:
(Intercept)            t            x          t:x  
    0.01437      0.02939     -0.48962      0.98383  
Now let’s recreate the plot.
ggplot(data, aes(x = fitted(model_int), y = residuals(model_int), color = factor(t))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", formula = "y ~ x", se = FALSE, span = 1) +
  labs(x = "Fitted", y = "Residuals", color = "treatment")
Beautiful!
Instead of using covariate adjustment, we could estimate our causal effect using propensity scores.
Note: here I am using logistic regression to fit the propensity score model; if I had used LDA instead and adjusted for that, we would get the exact same answer as in the previous section.
Let’s begin by fitting a propensity score model:
data <- data |>
  mutate(p = glm(t ~ x, data = data, family = binomial) |>
           predict(type = "response")
  )
We can examine the distribution of the propensity score. Let’s use the halfmoon package to create a mirrored histogram 😎
Malcolm Barrett, Travis Gerke, and I have been working on a new suite of R packages for helping complete common causal inference tasks, check out our most recent posts here: https://rcausal.github.io/rcausalblog/
library(halfmoon)
ggplot(data, aes(p, fill = factor(t))) +
  geom_mirror_histogram(bins = 30) +
  labs(x = "propensity score",
       fill = "treatment")
Great! Looks good.
Now let’s estimate our average treatment effect. We can use the propensity package to calculate ATE weights:
data <- data |>
  mutate(wt_ate = propensity::wt_ate(p, t, .treated = 1))

model <- lm(y ~ t, data = data, weights = wt_ate)
model

Call:
lm(formula = y ~ t, data = data, weights = wt_ate)

Coefficients:
(Intercept)            t  
    0.05113      0.02105  
Great (although, again, we know this isn’t representative of most people in this population since we’ve induced some serious heterogeneity!). Now let’s create the plot suggested in the 1983 paper, examining the outcome versus the propensity score.
ggplot(data, aes(x = p, y = y, color = factor(t))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", formula = "y ~ x", se = FALSE, span = 1) +
  labs(x = "propensity score", color = "treatment")
Whoa! Look at that effect heterogeneity! It looks like the only place where the average treatment effect (~0) is representative is where the propensity score is 0.5; when it is lower the effect is negative, and when it is higher the effect is positive.
So there you have it! A simple plot to give a little more information than a single number summary (the average treatment effect). When estimating causal effects using covariate adjustment, a sensible and straightforward diagnostic plot to use is the residuals versus fits plot stratified by treatment assignment. If using a propensity score (as in this example) you can look at the outcome versus the propensity score. As mentioned in the paper, ideally these plots would be generated during the exploratory phase of the modeling process and once in the confirmatory phase the correct relationship between the treatment, confounders, and outcome would be well understood allowing the correct model to be prespecified. Perhaps this can be added to the set of routine diagnostic tools used when assessing propensity score + outcome models in the future.
Without doing any of the math, I’d guess the assumption of correctly spec the model also has a bigger impact in the CC analysis.
You need correct spec in MI, twice, but trade off that potential bias for higher prec.
This is a great question! I am going to investigate via a small simulation (so the answer could be “it depends”, but at least we will know how it seems to work in this very simple case) 😆.
Ok, so here I have some predictor, x, that is missing 50% of the time, dependent on c_x and c_y. The right imputation model would have c_x; the right outcome model needs c_y. Unfortunately, we only have access to one, which we will try to use in our imputation model (and outcome model). Let’s see whether two (wrong) models are better than one!
A “correct” model will be one that estimates that the coefficient for x is 1.
library(tidyverse)
library(broom)
library(mice)
n <- 1000
set.seed(928)
data <- tibble(
  c_x = rnorm(n, sd = 0.71),
  x = c_x + rnorm(n, sd = 0.71),
  c_y = rnorm(n),
  y = x + c_y + rnorm(n),
  noise = rnorm(n),
  x_miss = rbinom(n, 1, 1 / (1 + exp(-(c_x + c_y)))),
  x_obs = ifelse(
    x_miss,
    NA,
    x
  )
)
c_x
Ok first let’s look at the whole dataset.
mod_full_c_x <- lm(y ~ x + c_x, data = data) |>
  tidy(conf.int = TRUE) |>
  filter(term == "x") |>
  select(estimate, conf.low, conf.high)
mod_full_c_x
# A tibble: 1 × 3
estimate conf.low conf.high
<dbl> <dbl> <dbl>
1 1.01 0.877 1.14
This checks out! c_x basically does nothing for us here, but because c_y is not actually a confounder (it only informs the missingness, which hasn’t come into play yet, and y), we are just fine estimating our “wrong” model on the fully observed data. Now let’s do the “complete case” analysis.
data_cc <- na.omit(data)

mod_cc_c_x <- lm(y ~ x + c_x, data = data_cc) |>
  tidy(conf.int = TRUE) |>
  filter(term == "x") |>
  select(estimate, conf.low, conf.high)
mod_cc_c_x
# A tibble: 1 × 3
estimate conf.low conf.high
<dbl> <dbl> <dbl>
1 0.999 0.812 1.19
This does fine! Now let’s do some imputation. I am going to use the mice package.
imp_data_c_x <- mice(
  data,
  m = 5,
  method = "norm.predict",
  formulas = list(x_obs ~ c_x),
  print = FALSE
)
Ok let’s compare how this model does “alone”.
mod_imp_c_x <- with(imp_data_c_x, lm(y ~ x_obs)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)
mod_imp_c_x
estimate conf.low conf.high
1 1.026666 0.9042858 1.149046
Great! This was the right model, so we would expect this to perform well.
Now what happens if we additionally adjust for c_x in the outcome model:
mod_double_c_x <- with(imp_data_c_x, lm(y ~ x_obs + c_x)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)
mod_double_c_x
estimate conf.low conf.high
1 0.9991868 0.7968509 1.201523
The right imputation model with the wrong outcome model is fine!
c_y
Ok first let’s look at the whole dataset.
mod_full_c_y <- lm(y ~ x + c_y, data = data) |>
  tidy(conf.int = TRUE) |>
  filter(term == "x") |>
  select(estimate, conf.low, conf.high)
mod_full_c_y
# A tibble: 1 × 3
estimate conf.low conf.high
<dbl> <dbl> <dbl>
1 1.01 0.946 1.08
Looks good! Now let’s do the “complete cases” analysis.
mod_cc_c_y <- lm(y ~ x + c_y, data = data_cc) |>
  tidy(conf.int = TRUE) |>
  filter(term == "x") |>
  select(estimate, conf.low, conf.high)
mod_cc_c_y
# A tibble: 1 × 3
estimate conf.low conf.high
<dbl> <dbl> <dbl>
1 1.01 0.909 1.11
Great! It works. This shows that, as long as we have the right outcome model, we can do a complete case analysis even if the data are missing not at random (cool!). Now let’s do some imputation.
imp_data_c_y <- mice(
  data,
  m = 5,
  method = "norm.predict",
  formulas = list(x_obs ~ c_y),
  print = FALSE
)
Ok let’s compare how this model does “alone”.
mod_imp_c_y <- with(imp_data_c_y, lm(y ~ x_obs)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)
mod_imp_c_y
estimate conf.low conf.high
1 0.6888255 0.527796 0.849855
Oh no, very bad! The wrong imputation model is worse than the complete case analysis, by a lot: this estimate is off by about 0.31. Does conditioning on c_y help us at all?
mod_double_c_y <- with(imp_data_c_y, lm(y ~ x_obs + c_y)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)
mod_double_c_y
estimate conf.low conf.high
1 1.009655 0.8887281 1.130582
Phew, the wrong imputation model with the wrong outcome model is back to being fine.
Ok, what if we just had our useless variable, noise?
mod_full_noise <- lm(y ~ x + noise, data = data) |>
  tidy(conf.int = TRUE) |>
  filter(term == "x") |>
  select(estimate, conf.low, conf.high)
mod_full_noise
# A tibble: 1 × 3
estimate conf.low conf.high
<dbl> <dbl> <dbl>
1 0.992 0.898 1.09
This is fine! c_x and c_y aren’t confounders, so we can estimate the coefficient for x without them. noise doesn’t do anything, but it also doesn’t hurt. What about the complete case analysis?
mod_cc_noise <- lm(y ~ x + noise, data = data_cc) |>
  tidy(conf.int = TRUE) |>
  filter(term == "x") |>
  select(estimate, conf.low, conf.high)
mod_cc_noise
# A tibble: 1 × 3
estimate conf.low conf.high
<dbl> <dbl> <dbl>
1 0.887 0.748 1.03
Oops! We’ve got bias (as expected!): the estimate is off by ~0.11.
What if we build the (wrong) imputation model?
imp_data_noise <- mice(
  data,
  m = 5,
  method = "norm.predict",
  formulas = list(x_obs ~ noise),
  print = FALSE
)
Ok let’s compare how this model does “alone”.
mod_imp_noise <- with(imp_data_noise, lm(y ~ x_obs)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)
mod_imp_noise
estimate conf.low conf.high
1 0.8807755 0.7217368 1.039814
This is also wrong (womp womp!). What if we try two wrong models?
mod_double_noise <- with(imp_data_noise, lm(y ~ x_obs + noise)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)
mod_double_noise
estimate conf.low conf.high
1 0.8865363 0.7275635 1.045509
Nope 😢. Two wrong models here are not better than one! It’s worse! Womp womp.
Let’s put these all together:
bind_rows(
  mod_full_c_x,
  mod_full_c_y,
  mod_full_noise,
  mod_cc_c_x,
  mod_cc_c_y,
  mod_cc_noise,
  mod_imp_c_x,
  mod_imp_c_y,
  mod_imp_noise,
  mod_double_c_x,
  mod_double_c_y,
  mod_double_noise
) |>
  mutate(
    mod = factor(
      c("Full data with c_x",
        "Full data with c_y",
        "Full data with noise",
        "Complete case with c_x",
        "Complete case with c_y",
        "Complete case with noise",
        "Imputation with c_x",
        "Imputation with c_y",
        "Imputation with noise",
        "Two models with c_x",
        "Two models with c_y",
        "Two models with noise"),
      levels = c("Full data with c_x",
                 "Complete case with c_x",
                 "Imputation with c_x",
                 "Two models with c_x",
                 "Full data with c_y",
                 "Complete case with c_y",
                 "Imputation with c_y",
                 "Two models with c_y",
                 "Full data with noise",
                 "Complete case with noise",
                 "Imputation with noise",
                 "Two models with noise")
    ),
    mod = fct_rev(mod)
  ) -> to_plot
ggplot(to_plot, aes(x = estimate, xmin = conf.low, xmax = conf.high, y = mod)) +
  geom_pointrange() +
  geom_vline(xintercept = 1, lty = 2)
So there you have it, two wrong models are rarely better than one.
In writing this post, I found that I was getting biased results even when correctly specifying my imputation model if I used the {mice} defaults (which is why the code above specifies norm.predict for the method, forcing it to use linear regression, as the data were generated). I didn’t understand why this was happening until some helpful friends on Twitter explained it (thank you Rebecca, Julian, and Mario). I’ll show you what is happening and then give a quick explanation. Let’s redo the imputation models using the defaults:
imp_default_c_x <- mice(
  data,
  m = 5,
  formulas = list(x_obs ~ c_x),
  print = FALSE
)
mod_imp_c_x <- with(imp_default_c_x, lm(y ~ x_obs)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)
mod_imp_c_x
estimate conf.low conf.high
1 0.7353276 0.6194273 0.8512279
Bad!
mod_double_c_x <- with(imp_default_c_x, lm(y ~ x_obs + c_x)) |>
pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_double_c_x
estimate conf.low conf.high
1 0.4602637 0.30023 0.6202974
Even worse!!
imp_default_c_y <- mice(
data,
m = 5,
formulas = list(x_obs ~ c_y),
print = FALSE)
mod_imp_c_y <- with(imp_default_c_y, lm(y ~ x_obs)) |>
pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_imp_c_y
estimate conf.low conf.high
1 0.3405789 0.1126282 0.5685295
YIKES!
mod_double_c_y <- with(imp_default_c_y, lm(y ~ x_obs + c_y)) |>
pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_double_c_y
estimate conf.low conf.high
1 0.5128878 0.3216547 0.7041208
Better since we are conditioning on c_y
(but still bad!)
imp_default_noise <- mice(
data,
m = 5,
formulas = list(x_obs ~ noise),
print = FALSE)
mod_imp_noise <- with(imp_default_noise, lm(y ~ x_obs)) |>
pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_imp_noise
estimate conf.low conf.high
1 0.4076967 0.2352155 0.5801779
EEK!
mod_double_noise <- with(imp_default_noise, lm(y ~ x_obs + noise)) |>
pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_double_noise
estimate conf.low conf.high
1 0.4124791 0.2497624 0.5751959
Just as bad…
Let’s put those in the original plot:
bind_rows(
mod_full_c_x,
mod_full_c_y,
mod_full_noise,
mod_cc_c_x,
mod_cc_c_y,
mod_cc_noise,
mod_imp_c_x,
mod_imp_c_y,
mod_imp_noise,
mod_double_c_x,
mod_double_c_y,
mod_double_noise
) |>
mutate(
mod = factor(c("Full data with c_x",
"Full data with c_y",
"Full data with noise",
"Complete case with c_x",
"Complete case with c_y",
"Complete case with noise",
"Default Imputation with c_x",
"Default Imputation with c_y",
"Default Imputation with noise",
"Two models with c_x",
"Two models with c_y",
"Two models with noise" ),
levels = c("Full data with c_x",
"Complete case with c_x",
"Default Imputation with c_x",
"Two models with c_x",
"Full data with c_y",
"Complete case with c_y",
"Default Imputation with c_y",
"Two models with c_y",
"Full data with noise",
"Complete case with noise",
"Default Imputation with noise",
"Two models with noise" )),
mod = fct_rev(mod),
) -> to_plot
ggplot(to_plot, aes(x = estimate, xmin = conf.low, xmax = conf.high, y = mod)) +
geom_pointrange() +
geom_vline(xintercept = 1, lty = 2)
AHH! This makes me so scared of imputation!!
Rebecca Andridge’s tweet finally helped me see why this is happening. The way the missing data is generated, larger values of c_x
have a higher probability of missingness, and for particularly high values of c_x
that probability is almost 1.
ggplot(data, aes(x = x, y = y, color = factor(x_miss))) +
geom_point() +
geom_vline(xintercept = 2.31, lty = 2) +
labs(color = "missing")
Take a look at the plot above. We have no nonmissing x
values greater than 2.3. The way predictive mean matching (the default {mice}
method) works is that it finds the observation(s) with the closest predicted value to the observation that is missing a data point and gives you that nonmissing observation’s value. So here we are essentially truncating our distribution at 2.3, since that is the highest value observed. Any value that would have been higher will necessarily be too small instead of the right value (this is different from the linear regression method used in the first part of this post, which allows you to extrapolate). This is supposed to be a less biased approach, since it doesn’t allow you to extrapolate beyond the bounds of your observed data, but it can actually induce bias when you have pockets of missingness with no observed x
s (which I would argue might happen frequently!). Here is an example of one of the imputed datasets; notice nothing is above that 2.3 line!
ggplot(complete(imp_default_c_x), aes(x = x_obs, y = y, color = factor(x_miss))) +
geom_point() +
scale_x_continuous(limits = c(-2.8, 3.1)) +
geom_vline(xintercept = 2.31, lty = 2) +
labs(color = "imputed")
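To make the truncation concrete, here is a toy, hand-rolled version of the PMM idea (a sketch only; {mice}’s actual implementation differs, for example by sampling among several candidate donors): each missing value borrows an observed value whose model prediction is closest, so no imputed value can ever exceed the largest observed donor.

```r
# Toy predictive-mean-matching sketch (NOT {mice} itself):
# missing values are filled with the observed value whose
# model-based prediction is closest.
set.seed(1)
n <- 200
c_x <- rnorm(n)
x <- c_x + rnorm(n)
x_obs <- ifelse(x > quantile(x, 0.8), NA, x) # high values go missing

obs <- !is.na(x_obs)
fit <- lm(x_obs ~ c_x, data = data.frame(x_obs, c_x), subset = obs)
pred <- predict(fit, newdata = data.frame(c_x)) # predictions for every row

# each missing row donates the observed x whose prediction is nearest
imputed <- sapply(which(!obs), function(i) {
  x_obs[obs][which.min(abs(pred[obs] - pred[i]))]
})

# the imputations are capped at the largest observed value
max(imputed) <= max(x_obs, na.rm = TRUE)
```

This is exactly the truncation at the 2.3 line in the plot above: the donor pool bounds every imputation.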
Including y in the imputation models
Including y
in the imputation model is definitely recommended, as was hammered home for me by the wonderful Frank Harrell, but I’m not sure this recommendation has permeated through the field yet (although this paper reiterating this result just came out yesterday, so maybe it has!).
Let’s see how that improves our imputation models:
imp_y_c_x <- mice(
data,
m = 5,
formulas = list(x_obs ~ c_x + y),
print = FALSE)
mod_imp_c_x <- with(imp_y_c_x, lm(y ~ x_obs)) |>
pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_imp_c_x
estimate conf.low conf.high
1 1.034138 0.9109341 1.157343
Beautiful!
mod_double_c_x <- with(imp_y_c_x, lm(y ~ x_obs + c_x)) |>
pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_double_c_x
estimate conf.low conf.high
1 1.08074 0.8772936 1.284186
A bit worse, but not bad!
imp_y_c_y <- mice(
data,
m = 5,
formulas = list(x_obs ~ c_y + y),
print = FALSE)
mod_imp_c_y <- with(imp_y_c_y, lm(y ~ x_obs)) |>
pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_imp_c_y
estimate conf.low conf.high
1 0.9298631 0.8137157 1.046011
Not bad!! A bit biased but way better.
mod_double_c_y <- with(imp_y_c_y, lm(y ~ x_obs + c_y)) |>
pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_double_c_y
estimate conf.low conf.high
1 1.01812 0.9415964 1.094644
Love it, looks great after conditioning on c_y
imp_y_noise <- mice(
data,
m = 5,
formulas = list(x_obs ~ noise + y),
print = FALSE)
mod_imp_noise <- with(imp_y_noise, lm(y ~ x_obs)) |>
pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_imp_noise
estimate conf.low conf.high
1 0.9671593 0.8580147 1.076304
Oo lala, even does well when we don’t have the right model (this makes sense because we are using y
!)
mod_double_noise <- with(imp_y_noise, lm(y ~ x_obs + noise)) |>
pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_double_noise
estimate conf.low conf.high
1 0.9697639 0.8612499 1.078278
Let’s put those in the original plot:
bind_rows(
mod_full_c_x,
mod_full_c_y,
mod_full_noise,
mod_cc_c_x,
mod_cc_c_y,
mod_cc_noise,
mod_imp_c_x,
mod_imp_c_y,
mod_imp_noise,
mod_double_c_x,
mod_double_c_y,
mod_double_noise
) |>
mutate(
mod = factor(c("Full data with c_x",
"Full data with c_y",
"Full data with noise",
"Complete case with c_x",
"Complete case with c_y",
"Complete case with noise",
"Imputation with c_x and y",
"Imputation with c_y and y",
"Imputation with noise and y",
"Two models with c_x",
"Two models with c_y",
"Two models with noise" ),
levels = c("Full data with c_x",
"Complete case with c_x",
"Imputation with c_x and y",
"Two models with c_x",
"Full data with c_y",
"Complete case with c_y",
"Imputation with c_y and y",
"Two models with c_y",
"Full data with noise",
"Complete case with noise",
"Imputation with noise and y",
"Two models with noise" )),
mod = fct_rev(mod),
) -> to_plot
ggplot(to_plot, aes(x = estimate, xmin = conf.low, xmax = conf.high, y = mod)) +
geom_pointrange() +
geom_vline(xintercept = 1, lty = 2)
Pretty good! Maybe I’m feeling a little better about imputation. Maybe. But it had better include the outcome (which I’ll admit feels very weird for my little causal brain).
For some reason, when I tried to answer this question, my first instinct was to try to decide whether the data were missing at random, but it turns out this might not be the right first question! Why? Complete case analysis will give us unbiased estimates even if the data are not missing at random. Excuse me? Yes. When is this the case? Roughly, when the missingness does not depend on the outcome itself, conditional on the variables included in a correctly specified outcome model.
When I was thinking through this, I found it helped to work up a few short simulations, so here we go.
First let’s generate our “true” values. Here the true model is y = x + ε: the coefficient on x is 1, and c is independent noise that affects neither x nor y.
library(tidyverse)
n <- 1000000
set.seed(1)
data <- tibble(
x = rnorm(n),
c = rnorm(n),
y = x + rnorm(n)
)
Let’s force x
to be missing under three scenarios:
missing completely at random
missing at random (dependent on c)
missing not at random (dependent on the value of x itself)
data <- data |>
mutate(
x_obs_mcar = ifelse(
rbinom(n, 1, 0.5),
NA,
x
),
x_obs_mar = ifelse(
rbinom(n, 1, 1 / (1 + exp(-c))),
NA,
x
),
x_obs_mnar = ifelse(
rbinom(n, 1, 1 / (1 + exp(-x))),
NA,
x
)
)
data |>
summarise(mcar_miss = mean(is.na(x_obs_mcar)),
mar_miss = mean(is.na(x_obs_mar)),
mnar_miss = mean(is.na(x_obs_mnar)))
# A tibble: 1 × 3
mcar_miss mar_miss mnar_miss
<dbl> <dbl> <dbl>
1 0.501 0.500 0.501
In all three cases ~50% of the data are missing.
Let’s see how these impact the ability to predict y
(and the estimation of the coefficient on x).
lm(y ~ x_obs_mcar, data = data)
Call:
lm(formula = y ~ x_obs_mcar, data = data)
Coefficients:
(Intercept) x_obs_mcar
0.00104 0.99964
Ok, the missing completely at random x
is estimated correctly when we do a complete case analysis (that is, when we listwise delete every row that has missing data). This checks out!
Let’s check out the missing conditionally at random x
. Note that this is missing at random after conditioning on c
(although we aren’t actually doing anything with c
at this point, so it might as well be missing not at random…)
lm(y ~ x_obs_mar, data = data)
Call:
lm(formula = y ~ x_obs_mar, data = data)
Coefficients:
(Intercept) x_obs_mar
0.000482 0.999835
Whoa! Prediction is still correct and the coefficients themselves are still unbiased even if we do complete case analysis.
Ok, what about missing not at random?
lm(y ~ x_obs_mnar, data = data)
Call:
lm(formula = y ~ x_obs_mnar, data = data)
Coefficients:
(Intercept) x_obs_mnar
7.23e-05 9.99e-01
Still ok! No bias in the coefficients, and the predicted values are unbiased even though the predictor was missing not at random (and we did a complete case analysis).
Why is this the case? Travis Gerke pointed out this excellent paper that has a beautiful plot that demonstrates why this happens. I am going to simulate some smaller data than what I have above to demonstrate the point. Here, the missingness is deterministic (and definitely not random!) – if x
is greater than 0, it is missing; otherwise it isn’t. Again, this renders ~50% of our data missing.
data <- tibble(
x = seq(-3, 3, by = 0.1),
y = x + rnorm(61),
x_miss = ifelse(x > 0, "yes", "no")
)
ggplot(data, aes(x = x, y = y, color = x_miss)) +
geom_point() +
geom_line(aes(x = x, y = predict(lm(y ~ x, data = data)))) +
geom_vline(xintercept = 0, lty = 2) +
scale_color_manual(values = c("cornflowerblue", "orange")) +
geom_label(aes(x = -2, y = -3), label = "observed data", color = "cornflowerblue") +
geom_label(aes(x = 2, y = 3), label = "missing data") +
theme_minimal() +
theme(legend.position = "none")
Maybe this is a somewhat obvious result: despite the fact that the missingness was not at random, it had nothing to do with the outcome, y
. Let’s see what would happen if c
were related to y
.
set.seed(1)
data <- tibble(
x = rnorm(n),
c = rnorm(n),
y = x + c + rnorm(n)
)
Let’s say we have some data that are not missing at random. The probability that x
is missing depends both on its own value (MNAR!) and on the value of c
.
data <- data |>
mutate(x_obs = ifelse(
rbinom(n, 1, 1 / (1 + exp(-(x + c)))),
NA,
x
)
)
This results in ~50% of the values of x
being missing:
data |>
count(miss_x = is.na(x_obs))
# A tibble: 2 × 2
miss_x n
<lgl> <int>
1 FALSE 499277
2 TRUE 500723
So what happens if we try to predict y
? Let’s try it first with the full data as a benchmark:
right_mod <- lm(y ~ x + c, data = data)
right_mod
Call:
lm(formula = y ~ x + c, data = data)
Coefficients:
(Intercept) x c
0.000162 1.000106 0.999592
Beautiful! We get the correct estimates (and thus the correct predictions for y
).
Now let’s take a look at the complete case analysis:
data_cc <- na.omit(data)
cc_mod <- lm(y ~ x + c, data = data_cc)
cc_mod
Call:
lm(formula = y ~ x + c, data = data_cc)
Coefficients:
(Intercept) x c
0.00128 1.00032 1.00156
Would you look at that. Correct coefficients! How could we break that? By misspecifying the model, for example by not adjusting for c
:
lm(y ~ x, data = data_cc)
Call:
lm(formula = y ~ x, data = data_cc)
Coefficients:
(Intercept) x
0.419 0.849
But is this a terribly interesting case? If we don’t have a way to predict the missing x
values (MNAR!) we couldn’t impute them anyway, so we would be in trouble, imputation or not!
This is the same as above, but I found it helpful to frame it as an inference question. What if I have some treatment, x
, that is randomly assigned, but is missing based on some factor c
that is related to my outcome of interest, y
?
set.seed(1)
data <- tibble(
x = rbinom(n, 1, 0.5), # randomly assigned exposure
c = rnorm(n), # problem variable
y = x + c + rnorm(n), # true treatment effect is 1
x_obs = ifelse(rbinom(n, 1, 1 / (1 + exp(-(x + c)))),
NA, x) # missing based on x and c
)
First of all, in the absence of any missing data we don’t actually need to adjust for c in order to get an unbiased estimate for x
because c
is not a confounder (yay!).
lm(y ~ x, data = data)
Call:
lm(formula = y ~ x, data = data)
Coefficients:
(Intercept) x
0.000368 0.998409
BUT! With missing data, we do have a problem:
data_cc <- na.omit(data)
lm(y ~ x, data = data_cc)
Call:
lm(formula = y ~ x, data = data_cc)
Coefficients:
(Intercept) x
0.413 0.825
Oops! Now we are seeing an effect of 0.825 when the true effect of the exposure is 1 (because we did a bad thing and just deleted all the rows with missing values). In the past, this is where I would say “and therefore you should do imputation!” BUT instead, just adjust for c
; that is, correctly specify your outcome model and all will be well:
lm(y ~ x + c, data = data_cc)
Call:
lm(formula = y ~ x + c, data = data_cc)
Coefficients:
(Intercept) x c
0.00149 1.00241 1.00161
set.seed(1)
data <- tibble(
x = rbinom(n, 1, 0.5), # randomly assigned exposure
c = rnorm(n), # problem variable
y = x + c + x * c + rnorm(n), # true treatment effect varies by c
x_obs = ifelse(rbinom(n, 1, 1 / (1 + exp(-(x + c)))),
NA, x) # missing based on x and c
)
Complete case analysis is fine as long as the model for the outcome is correctly specified:
data_cc <- na.omit(data)
lm(y ~ x + c + x*c, data = data_cc)
Call:
lm(formula = y ~ x + c + x * c, data = data_cc)
Coefficients:
(Intercept) x c x:c
0.00122 1.00149 1.00228 0.99824
Here I have an additional factor c2
that just influences y
:
set.seed(1)
data <- tibble(
x = rbinom(n, 1, 0.5), # randomly assigned exposure
c = rnorm(n), # problem variable
c2 = rnorm(n), # some other thing that only influences y
y = x + c + c2 + rnorm(n), # true treatment effect is 1
x_obs = ifelse(rbinom(n, 1, 1 / (1 + exp(-(x + c)))),
NA, x) # missing based on x and c
)
Complete case analysis is fine as long as the model for the outcome includes the things that influence both the exposure and the outcome (confounders) and the things that influence both the missing data mechanism and the outcome (in this case just c
), at least if all we care about is the effect of x
on y
:
data_cc <- na.omit(data)
lm(y ~ x + c, data = data_cc)
Call:
lm(formula = y ~ x + c, data = data_cc)
Coefficients:
(Intercept) x c
0.000207 1.000115 0.999957
Ok, this is where things get a bit trickier (and it is not uncommon to find yourself here, for example with loss to follow-up!)
data <- tibble(
x = rnorm(n),
c = rnorm(n),
y = x + c + rnorm(n),
x_obs = ifelse(
rbinom(n, 1, 1 / (1 + exp(-(x + c)))), # x is missing not at random (based on the values of x and c)
NA,
x
),
y_obs = ifelse(
rbinom(n, 1, 1 / (1 + exp(-(x + c)))), # y is missing (conditionally) at random
NA,
y
)
)
With both of these missingness patterns, ~66% of our rows have at least one missing value. AND YET we get unbiased results when we do complete case analysis:
data_cc <- na.omit(data)
lm(y ~ x + c, data_cc)
Call:
lm(formula = y ~ x + c, data = data_cc)
Coefficients:
(Intercept) x c
0.00441 1.00213 1.00114
SO when is it a problem? If y
is missing, and y
is missing not at random, then you are in trouble (but I will note that you would also be in trouble in this case if you wanted to do imputation, so this is really just yet another example of a case where statistics cannot save you from everything!). Let’s look at that:
data <- tibble(
x = rnorm(n),
c = rnorm(n),
y = x + c + rnorm(n),
x_obs = ifelse(
rbinom(n, 1, 1 / (1 + exp(-(x + c)))), # x is missing not at random (based on the values of x and c)
NA,
x
),
y_obs = ifelse(
rbinom(n, 1, 1 / (1 + exp(-(x + c + y)))), # y is missing not at random
NA,
y
)
)
Womp womp. Here we have a problem: when we do complete case analysis, even with the correctly specified model, we get the wrong answer:
data_cc <- na.omit(data)
lm(y ~ x + c, data_cc)
Call:
lm(formula = y ~ x + c, data = data_cc)
Coefficients:
(Intercept) x c
0.352 0.849 0.848
Let’s make another of those cool graphs like they did in that awesome paper explaining missing data in terms of causal inference. Now I am going to make y
missing not at random, as opposed to x
like above.
data <- tibble(
x = seq(-3, 3, by = 0.1),
y = x + rnorm(61),
y_miss = ifelse(y > 0, "yes", "no")
)
data_cc <- data[data$y_miss == "no", ]
library(geomtextpath)
ggplot(data, aes(x = x, y = y, color = y_miss)) +
geom_point() +
geom_textline(color = "black",
aes(x = x, y = predict(lm(y ~ x, data = data))),
label = "truth", hjust = 0.75) +
geom_textline(data = data_cc,
aes(x = x, y = predict(lm(y ~ x, data = data_cc))),
label = "wrong") +
geom_hline(yintercept = 0, lty = 2) +
scale_color_manual(values = c("cornflowerblue", "orange")) +
geom_label(aes(x = 2, y = 3), label = "missing data") +
geom_label(aes(x = -2, y = -3), label = "observed data", color = "cornflowerblue") +
theme_minimal() +
theme(legend.position = "none")
This post is about bias, but I would be remiss not to mention the precision that complete case analyses sacrifice. Complete case analysis really is “throwing away” data, so the standard errors of these estimates will be larger than they would have been had we observed the full data set. BUT these standard errors are correct out of the box (which is not true of something like single imputation, for example!).
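As a quick sketch of that precision cost (simulated here under MCAR, where complete case analysis is clearly unbiased), the complete-case standard error is inflated by roughly the square root of the inverse fraction of rows kept:

```r
# Under MCAR, complete-case estimates are unbiased but less precise:
# dropping ~50% of rows inflates the standard error by about sqrt(2).
set.seed(1)
n <- 10000
x <- rnorm(n)
y <- x + rnorm(n)
x_cc <- ifelse(rbinom(n, 1, 0.5) == 1, NA, x) # ~50% missing completely at random

se_full <- summary(lm(y ~ x))$coefficients["x", "Std. Error"]
se_cc <- summary(lm(y ~ x_cc))$coefficients["x_cc", "Std. Error"] # lm drops NA rows

se_cc / se_full # roughly sqrt(2), since we kept about half the data
```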
This application has a neural network fit to a dataset with one predictor, x, and one outcome, y. The network has one hidden layer with three activations. You can click a “Play” button to watch how the neural network fits across 300 epochs. You can also click on the nodes of the neural network diagram to highlight each of the individual activations across the application (along with their corresponding coefficients).
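The app is interactive, but an analogous fit can be sketched in R with the {nnet} package (an assumption on my part; the app may well use a different implementation): one predictor, one hidden layer with three units, and a linear output.

```r
# A single-hidden-layer network with three units, analogous to the app's model
library(nnet)
set.seed(1)
x <- seq(-3, 3, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.1) # a nonlinear target
fit <- nnet(x = matrix(x), y = y, size = 3, linout = TRUE, trace = FALSE, maxit = 300)
# each prediction is a weighted sum of three sigmoid activations of x
preds <- predict(fit, matrix(x))
```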
The figures and tables are from our recent preprint: https://arxiv.org/pdf/2304.02683.pdf
Given a single dataset with 3 variables (exposure, outcome, and a covariate z), how can statistics help you decide whether to adjust for z? It can’t! For example, here the correlation between z and the exposure in all 4 datasets is 0.7!
So if stats can’t help, what can we do? Well, the best thing is just to know the data generating mechanism, but that is hard! An easier solution is to make sure to have time-varying measurements and only adjust for pre-exposure covariates! This solves the problem in 3 of the 4 datasets!
The one it doesn’t solve is M-bias, but as our podcast episode title suggests (M-Bias: Much Ado About Nothing?) this may be much ado about nothing (give it a listen to find out why!) Also… credit to ChatGPT for our episode title 😂
Malcolm Barrett, Travis Gerke, and I have a preprint with details: https://arxiv.org/pdf/2304.02683.pdf
Also the quartets
package includes the datasets if you’d like to play with it yourselves!
I also learned last month about another awesome “causal quartet” by Gelman, Jessica Hullman, and Lauren Kennedy that focuses on treatment heterogeneity so I could use help coming up with a new name for ours!
We conducted a randomized controlled trial to assess whether disclosing elements of uncertainty in an initial public health statement will change the likelihood that participants will accept new, different advice that arises as more evidence is uncovered.
We came up with a hypothetical health scenario and began by asking participants “how likely are you to sanitize your mobile phone?”
(26% said they were likely / very likely)
298 participants were randomized to treatment and 298 control.
controls saw statement 1A, a (hypothetical) recommendation from public health experts that you don’t need to sanitize your mobile phone
treated saw 1B, the same recommendation with added transparent information
After seeing these statements they were asked again their likelihood to sanitize their mobile phone. Then all participants were shown a second, 🆕 statement — this new statement said that, based on new information, public health experts now recommend you should sanitize your phone
All participants were again asked after seeing this second statement how likely they were to sanitize their phone (this is our primary endpoint ✅)
We fit proportional odds models stratified by the baseline likelihood to agree with the final advice.
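A hedged sketch of that model (the data and variable names below are invented for illustration; they are not the trial’s actual ones): fit a proportional odds model with MASS::polr within each baseline stratum, then exponentiate the treatment coefficient to get an odds ratio.

```r
# Illustrative only: fake ordinal outcome, randomized arm, baseline strata
library(MASS)
set.seed(1)
fake <- data.frame(
  agree_final = factor(sample(1:5, 500, replace = TRUE), ordered = TRUE),
  arm = sample(c("control", "transparent"), 500, replace = TRUE),
  baseline = sample(c("less likely", "more likely"), 500, replace = TRUE)
)
# one proportional odds model per baseline stratum
fits <- lapply(split(fake, fake$baseline), function(d) {
  polr(agree_final ~ arm, data = d, Hess = TRUE)
})
# exponentiated coefficients are the stratum-specific odds ratios
lapply(fits, function(f) exp(coef(f)))
```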
Among participants who were more likely to agree with the final recommendation at baseline, those who were initially shown uncertainty had a 46% lower odds of being more likely to agree with the final recommendation compared to those who were not (OR: 0.54, 95% CI: 0.27-1.03).
Among participants who were less likely to agree with the final recommendation at baseline, those who were initially shown uncertainty had 1.61 times the odds of being more likely to agree with the final recommendation compared to those who were not (OR: 1.61, 95% CI: 1.15-2.25)
This has implications for public health leaders when assessing how to communicate a recommendation, suggesting communicating uncertainty influences whether someone will adhere to a future recommendation. Presenting public health recommendations with transparency can both ⬆️ and ⬇️ adherence to future recommendations in the presence of changing evidence, depending on the individuals’ baseline likelihood to do what is being recommended prior to seeing the recommendation.
Many public health recommendations, particularly those being made during an ever-changing pandemic, have a high probability of changing in the future, making this result highly relevant. For example, the recommendation to 😷 wear masks will come and go depending on the prevalence and impact of the infectious disease of concern in the community. When initially communicating this new recommendation to the public, an explanation of the reasoning for the recommendation as well as any potential uncertainty may have resulted in ⬆️ adherence among those who would not have been likely to wear a mask in the recommended settings.
12% of US respondents in March 2020 reported masking to protect themselves. At the time, the recommendation from the CDC was: “If you are NOT sick: You do not need to wear a face mask unless you are caring for someone who is sick (and they are not able to wear a face mask)”. Based on our results, if the initial recommendation to avoid wearing face masks had been presented with the uncertainty communicated, we might expect some of this 12% to be less likely to follow the recommendation — however! 88% were not likely to wear masks a priori; our results suggest this population may have ultimately had higher compliance with the subsequent recommendation to wear face masks had the initial communication been made with transparency explaining the “why” behind it.
This study has lots of limitations (it was only hypothetical!), but I think it is a good first step toward quantifying how the way we communicate can impact future decisions.
It is super easy to set up a Quarto website. To get the basic template, you can run the following in your terminal:
quarto create-project mysite --type website
You can find lots of details about how to customize your site in the Quarto Docs. The rest of this post will cover a few things that made the transition smooth for us.
In Hugo, my .Rmd files were in the following folder under the main project: content/post
. These were often individual files, rather than nested in folders. For my Quarto site, I wanted them in a folder called posts
and I wanted each post to have its own folder with content in a file called index.qmd
nested within. I wrote a quick R script to help me do this.
library(fs)
# pull all .Rmd files from my blog
files_rmd <- list.files(
"~/livefreeordichotomize/content/post",
pattern = "*.Rmd")
# remove the .Rmd for the folders
folders <- gsub(".Rmd", "", files_rmd)
# pull again, with full names
full_files <- list.files(
"~/livefreeordichotomize/content/post",
pattern = "*.Rmd",
full.names = TRUE)
# create folders
dir_create(glue::glue("posts/{folders}"))
# copy the .Rmd files into a new folder, named according to the old file name
purrr::walk2(
full_files,
folders,
~file_copy(.x, glue::glue("posts/{.y}/index.qmd")))
Since we have rerouted where many of the files are, I set up a 404 page that will allow readers to quickly find a post if they have an old link. To do this, I created a file at the top of the project called 404.qmd
containing the following:
---
title: Page Not Found
listing:
  contents: posts
  type: table
  sort: "date"
---
This will create a searchable table listing of all of the previous posts, allowing readers to quickly find the link they are looking for.
UPDATE: You can add the old links under an aliases
parameter in each post’s YAML.
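For example, a post’s front matter might look like this (the old path shown is illustrative, not one of this site’s actual URLs):

```yaml
---
title: "My post"
aliases:
  - /post/old-slug-for-this-post/
---
```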
The tipr R package has some new features! And a new and improved API!
tipr
is an R package that allows you to conduct sensitivity analyses for unmeasured confounders. Why might you want to do that? Well, as it turns out, the assumption of “no unmeasured confounders” is integral to any estimation of a causal effect. This assumption is untestable, so often the best we can do is examine how far off our estimates would be should an unmeasured confounder exist; hence, sensitivity analyses!
Installing tipr
You can install the CRAN version by running the following:
install.packages("tipr")
library(tipr)
The package comes with a few example data sets. For example, the dataframe exdata_rr
is simulated data that can be used to estimate the effect of a binary exposure on a binary outcome, estimated via a risk ratio. This data set has 4 columns:
exdata_rr
# A tibble: 2,000 × 4
.unmeasured_confounder measured_confounder exposure
<dbl> <dbl> <dbl>
1 0.716 0.890 0
2 0.134 0.249 0
3 0.238 0.104 0
4 0.0286 0.165 0
5 0.598 0.812 0
6 0.00419 0.563 0
7 0.960 0.559 0
8 1.15 1.14 0
9 0.374 1.23 0
10 1.80 0.495 0
# ℹ 1,990 more rows
# ℹ 1 more variable: outcome <int>
Using this data, we could estimate the exposure-outcome relationship using the measured confounder as follows:
mod <- glm(
outcome ~ exposure + measured_confounder,
data = exdata_rr,
family = poisson)
## calculate the risk ratio by exponentiating
## the coefficient
coef(mod) %>%
exp()
(Intercept) exposure measured_confounder
0.03656318 1.49477100 2.42566275
We observe a risk ratio of 1.5 for the exposure after adjusting for the measured confounder. We can then get a confidence interval for this output. Note that here we are using a generalized linear model with a log link (via the Poisson family) to estimate this risk ratio. When estimating the risk ratio using this method, it is important to estimate the variability using robust standard errors (a sandwich estimator). In R, you can use the sandwich
and lmtest
packages to do this.
lmtest::coefci(mod, vcov = sandwich::vcovHC) %>%
exp()
2.5 % 97.5 %
(Intercept) 0.02778992 0.04810614
exposure 1.10497223 2.02207828
measured_confounder 2.13761161 2.75252986
Our observed effect, after adjusting for our measured confounder is a risk ratio of 1.5 (95% CI: 1.1, 2.0).
Let’s assume our unmeasured confounder is normally distributed with a mean of 0.5 in the exposed group and 0 in the unexposed (and unit variance in both) resulting in a mean difference of 0.5. We can use this to solve for the relationship between the unmeasured confounder and outcome needed to “tip” the analysis (that is needed to make the observed effect, 1.5, cross 1).
To do this, we are going to use the tip_with_continuous
function. We will set the effect_observed
to 1.5 and the exposure_confounder_effect
to 0.5.
tip_with_continuous(
effect_observed = 1.5,
exposure_confounder_effect = 0.5
)
ℹ The observed effect (1.5) WOULD be tipped by 1 unmeasured
confounder with the following specifications:
• estimated difference in scaled means between the
unmeasured confounder in the exposed population and
unexposed population: 0.5
• estimated relationship between the unmeasured confounder
and the outcome: 2.25
# A tibble: 1 × 5
effect_adjusted effect_observed exposure_confounder_effect
<dbl> <dbl> <dbl>
1 1 1.5 0.5
# ℹ 2 more variables: confounder_outcome_effect <dbl>,
# n_unmeasured_confounders <dbl>
The output is a data frame with 5 variables – in this case, we are interested in the confounder_outcome_effect
column, as this tells us the magnitude of the relationship between an unmeasured confounder and outcome needed to tip this analysis. This results in a confounderoutcome effect of 2.25, meaning that a hypothetical unobserved continuous confounder with a mean difference of 0.5
would need a relationship of at least 2.25
with the outcome to tip the analysis at the point estimate.
Alternatively, you could look at a range of potential values for the exposure_confounder_effect
and plot the relationship.
tip_df <- tip_with_continuous(
effect_observed = 1.5,
exposure_confounder_effect = seq(0.1, 1, by = 0.1),
verbose = FALSE
)
We could then plot these results:
library(ggplot2)
ggplot(tip_df,
aes(x = exposure_confounder_effect,
y = confounder_outcome_effect)) +
geom_point() +
geom_line() +
labs(x = "Exposure - unmeasured confounder effect",
y = "Unmeasured confounder - outcome effect")
The line represents the values needed for the unobserved confounder to tip this relationship.
Since this data was simulated, we can calculate what the actual effect is.
mod_actual <- glm(
outcome ~ exposure + measured_confounder + .unmeasured_confounder,
data = exdata_rr,
family = poisson)
coef(mod_actual) %>%
exp()
(Intercept) exposure
0.02450901 0.92108511
measured_confounder .unmeasured_confounder
2.43654796 2.41680059
lmtest::coefci(mod_actual, vcov = sandwich::vcovHC) %>%
exp()
2.5 % 97.5 %
(Intercept) 0.01801732 0.03333966
exposure 0.68914975 1.23107900
measured_confounder 2.12823267 2.78952863
.unmeasured_confounder 2.10602770 2.77343223
The actual risk ratio is 0.9 (95% CI: 0.7, 1.2) (so null!). The actual relationship between the unmeasured confounder and the outcome is 2.4. We can also calculate the actual exposure-unmeasured confounder effect:
exdata_rr %>%
dplyr::group_by(exposure) %>%
dplyr::summarise(m = mean(.unmeasured_confounder))
# A tibble: 2 × 2
exposure m
<dbl> <dbl>
1 0 0.0438
2 1 0.547
The actual difference is 0.5
. Returning to our plot, we can see that this point is to the right of the “tipping” bound, indicating that this unmeasured confounder is “large” enough to tip our result (which is exactly what we saw! Before adjusting for this, we had a risk ratio of 1.5, after adjusting we observe a “tip” (crossing the null, 1) to 0.9).
ggplot(tip_df,
aes(x = exposure_confounder_effect,
y = confounder_outcome_effect)) +
geom_point() +
geom_line() +
annotate(
"point",
x = 0.5,
y = 2.4,
size = 2,
shape = "square",
color = "red"
) +
labs(x = "Exposure - unmeasured confounder effect",
y = "Unmeasured confounder - outcome effect")
The functions in the tipr package follow a unified grammar. The function names follow this form: {action}_{effect}_with_{what}
.
For example, to adjust (action
) a coefficient (effect
) with a binary unmeasured confounder (what
), we use the function adjust_coef_with_binary()
.
Below is a copy of the table included in a recent JOSS article about this package.
Table 1. Grammar of tipr functions.

| Category | Function term | Use |
|----------|---------------|-----|
| action | adjust | These functions adjust observed effects, requiring both the unmeasured confounder-exposure relationship and the unmeasured confounder-outcome relationship to be specified. |
| | tip | These functions tip observed effects. Only one relationship, either the unmeasured confounder-exposure relationship or the unmeasured confounder-outcome relationship, needs to be specified. |
| effect | coef | These functions specify an observed coefficient from a linear, log-linear, logistic, or Cox proportional hazards model. |
| | rr | These functions specify an observed relative risk. |
| | or | These functions specify an observed odds ratio. |
| | hr | These functions specify an observed hazard ratio. |
| what | continuous | These functions specify an unmeasured standardized Normally distributed confounder. These functions will include the parameters exposure_confounder_effect and confounder_outcome_effect. |
| | binary | These functions specify an unmeasured binary confounder. These functions will include the parameters exposed_confounder_prev, unexposed_confounder_prev, and confounder_outcome_effect. |
| | r2 | These functions specify an unmeasured confounder parameterized by the percent of variation in the exposure / outcome explained by the unmeasured confounder. These functions will include the parameters confounder_exposure_r2 and outcome_exposure_r2. |
You can find full documentation here: lucymcgowan.github.io/tipr/
GMT: geometric mean titer. This is a measure of the antibody titers. We use the geometric mean because this data is quite skewed (it is also why you typically see it plotted on the log scale). For those of you who ❤️ math, the equation for the geometric mean is just (x₁ · x₂ ⋯ xₙ)^(1/n), equivalently exp(mean(log(x))).
The primary endpoint of these trials is the geometric mean titer ratio, that is, the ratio between the geometric mean antibody concentration in the younger age groups and the 16-25 year olds’ geometric mean antibody concentration.
According to the recent write-up from the 5-11 trial in the New England Journal of Medicine, the trials have been set up with two measures of success:
The second criterion was originally set by Pfizer to require that the point estimate of the GMT ratio clear a lower value; however, after their data lock the FDA requested this be changed to the higher threshold. At first glance this may seem to make sense (we often want to hold our pediatric trials to a high standard of efficacy), but it turns out this change has statistical implications that shift the target in ways that are nonstandard for noninferiority trials.
What do I mean? If we believe that the distribution of antibody concentration in children is exactly the same as what we observed in 16-25 year olds, we would expect this second criterion to fail 50% of the time. Why? When doing any trial, we are observing a sample, not the whole population. We expect a certain amount of uncertainty in our estimates. Here is a small example. Below I have generated 10 samples of 250 people from a log-normal distribution with a mean of 1142.5 (log mean of 7.04) and a log standard deviation of 0.8. I am comparing this to an observed sample of 253 individuals drawn from the exact same distribution (incidentally, there were 253 16-25 year olds that led to the “benchmark” of a geometric mean of 1142.5).
Want to try it yourself? You can generate samples from a log-normal distribution in R like this: rlnorm(253, meanlog = 7.040974, sdlog = 0.8). You can then compare the geometric mean ratio across several samples generated from the same distribution.
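Here is one way to sketch that comparison (the meanlog of 7.040974 and sdlog of 0.8 are the values quoted above; the seed is arbitrary and the geo_mean helper is mine):

```r
set.seed(1)  # arbitrary seed, just for reproducibility

# geometric mean: exp of the mean of the logs
geo_mean <- function(x) exp(mean(log(x)))

benchmark  <- rlnorm(253, meanlog = 7.040974, sdlog = 0.8)
new_sample <- rlnorm(253, meanlog = 7.040974, sdlog = 0.8)

# GMT ratio between two samples drawn from the same distribution;
# it bounces around 1 from sample to sample
geo_mean(new_sample) / geo_mean(benchmark)
```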
Notice in the above plot that even though the true GMT ratio between these groups should be 1 (they were drawn from the exact same distribution!), when using the point estimate as the threshold for success, we “failed” 50% of the time. It was a coin toss whether this trial succeeded or failed by this criterion.
The standard error is just the standard deviation divided by the square root of the sample size: SE = sd / sqrt(n).
This may not be completely intuitive at first glance, but in fact we can show that the probability of “succeeding” under this criteria is driven by how much better the antibody concentration is in children compared to the benchmark – it is linked to the standard error.
This standard error multiplier is just the critical value derived from a standard normal distribution at a given quantile. You can calculate it for any success probability in R using the qnorm() function. For example, to find the standard error multiplier we’d need in order to have at least a 70% chance of succeeding by this criterion, you would run qnorm(0.7), revealing we’d need to be at least about 0.52 standard errors better than the target.
In order to have a probability of success greater than 80%, for example, the children’s antibody response would need to be at least qnorm(0.8) ≈ 0.84 standard errors better than the 16-25 year olds’. To have a probability of success greater than 90%, it would need to be at least qnorm(0.9) ≈ 1.28 standard errors better.
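These multipliers come straight from qnorm():

```r
# standard error multipliers needed for a given success probability,
# under the point-estimate criterion discussed above
qnorm(0.5)  # 0: truth exactly at the threshold is a coin toss
qnorm(0.7)  # ~0.52 standard errors better
qnorm(0.8)  # ~0.84 standard errors better
qnorm(0.9)  # ~1.28 standard errors better
```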
As many have pointed out, it is not uncommon for pediatric trials to be held to a higher standard, often requiring efficacy beyond what is required of adults due to an appropriate caution against intervention in an often vulnerable group. I fully believe that the regulators that requested this had every best intention in doing so. In this particular case, however, this type of threshold can potentially lead to the opposite effect. By requiring the younger children to mount a higher antibody response than the older cohort in order to pass regulatory hurdles, we may be inadvertently pushing towards higher dosing, for example.
It is possible that the choice of 1 was made with the full understanding that the truth had to be better than 1 to pass that threshold with any amount of confidence, but it is important that everyone who contributes to decisions about thresholds understands both the math and the rationale for the choice.
Is this why the 2-4 year old vaccine failed previously? It’s not totally clear since the data hasn’t been released, however based on the tidbits we’ve gotten from media reports, I don’t think so. The New York Times reported, for example, that the 2-4 vaccine only elicited 60% of the response compared to the 16-25 year olds, suggesting that it would have failed by the lower bound criterion alone. So why does this matter? Presumably, these thresholds will be used to compare the post-3rd dose response to the 16-25 year olds as well – does it make sense to require the under 5s to have a stronger antibody response than the 16-25 year olds? Especially with no other option for protection via a vaccine for this age group, I would say no.
The standard error is just the standard deviation divided by the square root of the sample size: SE = sd / sqrt(n).
In statistics speak, this uncertainty is often quantified as a standard error, a measure that takes into account both the standard deviation of our outcome of interest and the number of people we have measured this outcome on. What do I mean? Here is an example.
GMT: geometric mean titer. This is a measure of the antibody titers. We use the geometric mean because this data is quite skewed (it is also why you typically see it plotted on the log scale). For those of you who ❤️ math, the equation for the geometric mean is just (x₁ · x₂ ⋯ xₙ)^(1/n), equivalently exp(mean(log(x))).
Let’s say we want to know whether a new intervention is not inferior to a benchmark intervention with respect to some measure; in this case we are going to use the geometric mean titer (GMT), a common measure of antibody response. In this first example, the truth is that these two interventions yield the same GMT, that is, they are equivalent, so the new intervention is not inferior to the benchmark intervention. Let’s say we have a sample of 253 people who received the benchmark intervention. The truth is that the GMT is 1142.5 with a log standard deviation of 0.8. We won’t see these exact results in our benchmark intervention group since we’re just looking at one sample of 253. The plot below shows what our 253 folks could look like if we were to collect these data 10 different times. The circles represent the GMT in each sample and the lines show the 95% confidence intervals.
Note: If we were trying to conduct a hypothesis test to establish whether the mean was different from 1142, and we collected these same 10 samples, the 4th sample would result in a Type 1 error, that is we’d reject the null hypothesis that the mean is equal to 1142 even though that is the truth. When using 95% confidence intervals, we expect to see a Type 1 error 5% of the time.
In reality, we would only see one of these results. Notice how some of these estimates differ from what we know is the truth (a GMT of 1142), for example the 4th sample had a geometric mean of 980 – this is expected. In fact, based on what we know about confidence intervals, if we repeat this process 100 times, we’d expect about 5 of these intervals to not contain 1142 at all. Notice the width of the confidence intervals is generally the same. This is determined by the standard deviation (which here is 0.8) and the sample size (253).
Ok, so let’s say we observed the 10th sample in the plot above (GMT: 1146.5, 95% CI: 1045.5, 1257.2). We want to show that some new intervention yields a GMT that is not inferior (not worse) than this benchmark intervention. One way we could do that is to collect data on the new intervention, compare it to the old, and set some criteria for “success”. What would make this new intervention not inferior? Let’s see what happens if we set our criterion for success to be that the geometric mean titer among our sample of those in the new intervention group is not less than the geometric mean we observed in our benchmark intervention sample. Another way to look at this is the ratio between our two geometric means: the GMT in the new intervention group divided by the GMT in the old intervention group. If this ratio is equal to one, they are the same; if the ratio is less than 1, the new intervention is deemed “worse” (inferior); and if the ratio is greater than 1, the new intervention is better. Another way to state this criterion is that the GMT ratio needs to be greater than or equal to 1.
I am going to generate some new intervention samples; remember, we only observe one of these in reality. I am generating them from the same distribution as the old intervention, so the truth here is that the two are equal (and thus also noninferior). Remember that to generate these samples we also need to know the sample size. Let’s assume the sample size is the same (253).
The plot above shows 10 random samples from our new intervention. Remember that these were generated exactly like our old intervention, so the truth is that they are equivalent (and thus it is also true that the new intervention is not inferior, not worse, than the old). Notice here, 50% of the samples “failed” (in orange in the plot above) by our criterion that the geometric mean (the point estimate) of our new intervention had to be greater than or equal to our observed geometric mean in the old intervention. This is exactly what we’d expect to happen! Because there is some random chance involved (we are only looking at a sample), the probability of falling on either side of a point estimate is going to be a coin toss when the “truth” is centered there.
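We can check the coin-toss claim by simulation. A hedged sketch, using the distribution parameters quoted above and a geo_mean helper of my own:

```r
set.seed(42)  # arbitrary seed
geo_mean <- function(x) exp(mean(log(x)))

# simulate many trials where both groups come from the same truth,
# and "success" means the point estimate of the ratio is >= 1
successes <- replicate(5000, {
  bench <- rlnorm(253, meanlog = 7.040974, sdlog = 0.8)
  new   <- rlnorm(253, meanlog = 7.040974, sdlog = 0.8)
  geo_mean(new) / geo_mean(bench) >= 1
})
mean(successes)  # hovers around 0.5: a coin toss
```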
Often, the criterion for noninferiority trials is set based on a lower bound threshold instead. For example, we could say we want to have some certainty that the new intervention is not some percent worse than the old. Maybe we set our lower bound to 0.67. This would mean that in order to be deemed a “success” the lower bound of the confidence interval for the geometric mean ratio between the new and old interventions would need to exceed 0.67. Looking at the plot above, this seems easy! However, if we assume the same standard deviation (0.8) and same sample size (253), we can achieve this threshold even if the new intervention is statistically significantly worse. Let’s see that.
Now I am going to assume that the truth is that the new intervention is only 75% as good as the old, meaning the true ratio between the two is 0.75. This would equate to being around 6 standard errors worse than equal, so pretty bad. Let’s generate some samples under this assumption and see how we do.
Our lower bound threshold of 0.67 “caught” our inferior samples ~70% of the time here (in orange), but notice there were three samples that would be considered “successful” by this metric (blue), even though the full confidence interval falls below 1. How likely something like this is to occur depends on two things:

- the standard deviation in the new intervention group
- the sample size in the new intervention group

Often clinical trials need to prespecify their sample size, so we can guard against one of these, but the standard deviation is often just a guess. For example, what if the standard deviation in the new intervention group was much smaller, say half of the standard deviation in the old intervention group? Let’s see what that looks like.
Notice in the plot above with the smaller standard deviations, the confidence intervals shrunk! Now we have 60% successes, all with upper bounds less than 1 😱. This is why often a second criterion is set to guard against this possibility. In the past, I have often seen this as a criterion on the upper bound – for example, you could require that the upper bound not be less than 1 to exclude the possibility of a “success” when the new intervention is in fact statistically significantly worse. Why not only set an upper bound criterion? If we did that, we could have the opposite scenario where the true standard deviation is much larger than expected, so despite the upper bound being above one, the confidence interval spans a very large range (like below!).
Including both criteria would guard against this. Let’s look at that across all of the scenarios we’ve explored so far.
Remember, I’m showing you 10 different possible sampling scenarios generated under the same “truth”. In reality, we’d only observe one of these. The reason I’m showing 10 is so you can see the probability of success or failure based on the criteria and the truth.
Beautiful! Now we are failing when we should be, regardless of our standard deviation, and succeeding when we should be (when the truth is that the two groups are the same). What we’ve seen so far is similar to how the immunobridging Pfizer trial was designed to evaluate vaccines in kids. Because Pfizer didn’t think they would have enough time or participants to accurately estimate the efficacy of the vaccine in young children, they instead designed a trial to compare their antibody response to the antibody response seen in 16-25 year olds, for whom they were able to demonstrate efficacy. The thinking being: if the antibody levels in kids are comparable to those in 16-25 year olds, then hopefully the efficacy will be comparable too. The issue, of course, is what they mean by comparable (or not inferior). Originally they designed this with two thresholds: one on the lower bound of the confidence interval and one on the point estimate. This is similar to scenario 4, but there are important differences, which brings me to scenario 5!
Fixing a noninferiority bound at a point estimate introduces some complications. By not taking the upper bound into account, we don’t fully guard against seeing a statistically significantly worse result that “succeeds” by our criteria. For example, consider the scenario where the “truth” is that the new intervention is 85% as good as the old. This passes our definition of “noninferiority”, but we could end up with many cases where the new intervention is statistically significantly worse but still “successful”. Many of these decisions need to be grounded both in the statistics and the science. If it is deemed scientifically acceptable that generating 85% of an antibody response is ok, then this is a fine threshold.
This will succeed 90% of the time, but notice that for 8 of the 9 “successes” the confidence interval indicates that the intervention is statistically significantly worse.
It turns out, after Pfizer finished their 5-11 trial, the FDA requested that they move this second threshold to be more stringent. Instead of needing the point estimate to clear the original value, they now needed the point estimate to be at least 1.
This is where things get tricky! Recall from the very first scenario we looked at that setting a threshold based on the point estimate like this will fail 50% of the time even if the truth is that the two groups are exactly equal. By imposing this new threshold, the FDA is essentially requiring that the antibody response in children be superior to that in 16-25 year olds, not just not inferior. In addition, setting the threshold like this essentially renders the lower bound moot unless the variability in antibody response is much larger in young children compared to 16-25 year olds. In fact, in order for the lower bound threshold to be invoked while the point estimate threshold succeeds, the actual standard deviation in the young children’s antibody titers would need to be about 4 times that observed in the older cohort. This threshold (the requirement on the point estimate) basically sets a large “weight” on the younger antibody response needing to be much stronger than that observed in the older cohort. Let’s take the 2-4 year olds as an example.
If we think that an antibody response exactly equal to that seen in 16-25 year olds is sufficient to deem this vaccine noninferior, then this is not the right threshold, as a trial tested this way will fail 50% of the time. That is, the chance of failing is a coin toss, even if the two groups are exactly equal. Below shows this again. Remember, for a given trial we are only lucky enough to see one of these outcomes in the plot below; I am showing 10 so you can see the variation based on the sampling variability.
Statistically, in order to have an 85% chance of “success”, the truth would need to be that the antibody response in kids is ~1 standard error higher compared to 16-25 year olds. This is based on basic normal theory. It turns out if a variable is log-normally distributed, the geometric mean is equivalent to the median of the distribution, which maps to the mean of the underlying normal distribution. Therefore, we can use basic normal theory to understand the impact of requiring the point estimate of a distribution to be greater than some value. Assuming that the truth is that this point estimate (the GMT ratio) is x standard errors above 1, and you take a sample from a log-normally distributed population, the probability that the GMT ratio you observe is at least 1 is equal to pnorm(x). For example, if the truth is that the GMT ratio is 1.645 standard errors above 1, and you ran this trial a bunch of times, you would expect a “success” 95% of the time. (The 5% of the time that you saw a “failure” would be considered a Type II error.)
So what does this mean? Well, it depends on what the standard deviation ends up being for the 2-4 year old antibody response. If it is the same as the 16-25 year olds’, in order to avoid making a Type II error more than 15% of the time, the geometric mean of the antibody response in 2-4 year olds would need to be 5-6% higher than what was observed in 16-25 year olds. If their standard deviation is actually larger this will increase; for example, if it was 2, the true antibody response in 2-4 year olds would need to be ~14% higher.
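Those numbers can be reproduced with a few lines of R. A sketch, assuming the standard error is computed from the quoted log standard deviation (0.8) and sample size (253), and using qnorm(0.85) because we want no more than a 15% Type II error rate:

```r
n <- 253
z <- qnorm(0.85)     # ~1.04 standard errors needed for an 85% chance of success

# if the younger cohort's log sd matches the older cohort's (0.8):
se_same <- 0.8 / sqrt(n)
exp(z * se_same)     # ~1.05: the true GMT needs to be ~5% higher

# if their log sd is actually 2:
se_larger <- 2 / sqrt(n)
exp(z * se_larger)   # ~1.14: the true GMT needs to be ~14% higher
```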
And now, a message from President Joe Biden. pic.twitter.com/Q8TglFNBlF
— Saturday Night Live  SNL (@nbcsnl) January 16, 2022
I pulled the domestic box office data from the-numbers.com and put it in spiderman.csv (if you want to code along, you can do the same! My .csv has two columns: date and domestic_box_office) and US COVID-19 cases from the NY Times GitHub repository. Here’s a little code-along.
library(tidyverse)
spiderman <- read_csv("spiderman.csv",
  col_types = cols(
    date = col_date(format = "%m/%d/%y"),
    domestic_box_office = col_double()
  ))
cases <- read_csv("https://github.com/nytimes/covid-19-data/raw/master/us.csv")
# get daily cases from cumulative cases
cases <- cases %>%
  mutate(case = c(cases[1], diff(cases)))
d <- spiderman %>%
  left_join(cases)
Now, let’s look at the relationship between daily domestic box office and new COVID-19 cases in the US.
Warning: The following is a very silly analysis. It is a joke. Please do not try at home!
lm(case ~ domestic_box_office, data = d)
Call:
lm(formula = case ~ domestic_box_office, data = d)
Coefficients:
(Intercept) domestic_box_office
3.559e+05 -3.691e-03
Oh my heavens! Not only does it not look like ticket sales for Spider-Man: No Way Home are increasing COVID-19 cases…it looks like the opposite. According to our very silly model, every $1,000 increase in domestic box office sales results in a decrease of 3-4 COVID-19 cases! Pandemic solved! We just need to buy more Spider-Man: No Way Home tickets!! Since we’re looking at a silly analysis, let’s do something horrible, shall we? A double y-axis!
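Sanity-checking that interpretation of the slope (the coefficient value comes from the model output above; the negative sign reflects the decrease described):

```r
slope <- -3.691e-03  # cases per dollar of box office, from the silly model
slope * 1000         # cases per $1,000 of box office: roughly -3.7
```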
I just learned about the geomtextpath package from @timelyportfolio on Twitter, it’s amazing!
library(geomtextpath)
d %>%
pivot_longer(cols = c(domestic_box_office, case)) %>%
filter(!is.na(value)) %>%
mutate(value = ifelse(name == "case", value * 120, value),
name = ifelse(name == "domestic_box_office",
"Spiderman daily domestic box office",
"US daily COVID-19 cases")) %>%
ggplot(aes(x = date, y = value, color = name)) +
geom_point(alpha = 0.2) +
geom_textsmooth(aes(label = name), hjust = 0.05,
method = "loess", formula = y ~ x) +
scale_y_continuous("Spiderman Daily Domestic Box Office",
label = scales::dollar,
sec.axis = sec_axis(
~ . / 120,
label = scales::comma,
name = "Daily new COVID-19 cases in the US")
) +
theme_minimal() +
theme(legend.position = "none") +
labs(caption = "@LucyStats") +
scale_x_date("Date", limits = c(as.Date("2021-12-07"), NA))
The actual SNL sketch claimed there was a one-week lag: that Spider-Man sales caused cases 7 days later. Let’s check that too!
cases <- cases %>%
  mutate(case_lag = lead(case, 7))
spiderman %>%
left_join(cases) %>%
pivot_longer(cols = c(domestic_box_office, case_lag)) %>%
filter(!is.na(value)) %>%
mutate(value = ifelse(name == "case_lag", value * 120, value),
name = ifelse(name == "domestic_box_office",
"Spiderman daily domestic box office",
"US daily COVID-19 cases (7 days later)")) %>%
ggplot(aes(x = date, y = value, color = name)) +
geom_point(alpha = 0.2) +
geom_textsmooth(aes(label = name), hjust = .95,
method = "loess", formula = y ~ x) +
scale_y_continuous("Spiderman Daily Domestic Box Office",
label = scales::dollar,
sec.axis = sec_axis(
~ . / 120,
label = scales::comma,
name = "Daily new COVID-19 cases in the US (7 days later)")
) +
theme_minimal() +
theme(legend.position = "none") +
labs(caption = "@LucyStats") +
scale_x_date("Date", limits = c(as.Date("2021-12-07"), NA))
Well there you have it. It looks like SNL was onto something, but got it backwards! Just kidding.
So, new claim! A higher Spider-Man: No Way Home domestic box office would lead to fewer COVID-19 cases. Let’s talk about how we can think about this claim from a causal perspective if it was presented as I have done here. There are lots of ways to introduce causal thinking; in my class, one of the first ways I like to get my students thinking about causal assumptions is via Hill’s Criteria. These are 9 things to think about when trying to assess whether an association is causal; it is not meant to be a checklist (and the list is not exhaustive!), but I find it is a nice way to dip your toe into causal thinking.
The first criterion is the strength of the association. Our goofy linear model suggested that 16% of the variability in the number of daily cases in the US could be explained by the Spider-Man: No Way Home domestic box office. Is this strong? The coefficient in our linear model suggests that every $1,000 increase in domestic box office sales results in a decrease of 3-4 COVID-19 cases! That sounds pretty strong? I’m not sure this criterion is helping me assess whether this relationship is causal, so I’m going to move on to the next one and rate strength as 🤷.
Have other studies shown a similar association? I’m going to say no…consistency is going to get rated as 👎.
Specificity means the exposure of interest (in this case Spider-Man domestic box office) only causes one thing (and that thing is a decrease in COVID-19 cases in this case). I’m going to go with no again. 👎.
Does the cause precede the effect? Kind of? It looks like sales are decreasing rapidly from Dec 7th through Dec 20 and cases appear to dramatically increase thereafter. But wait! Could the opposite be true? What if the surge in COVID-19 cases is causing a decrease in box office sales? (Actually there may be some merit to this, but also we can just look at box office trends and see that they tend to decrease over time, since lots of excited people go the first weekend and fewer as time goes on, but I DIGRESS!) I’m going to give this a 🤷.
Do we see a dose effect? Indeed we do! As domestic box office decreases, COVID-19 cases increase. So can we infer the opposite: if domestic box office sales were to increase, would we see fewer COVID-19 cases? From our little window of data it looks like we can’t know (but reality check: if we had started the analysis from the day before, we’d actually have box office going from 0 to $121,964,712 with no precipitous drop in COVID-19 cases, so I’m going to give this one a 👎).
Is this plausible? No, it’s silly. 👎
Is there a coherent argument to be made? Again, no. 👎
Was there a randomized trial // some attempt at estimating a causal effect to explore this relationship? Nope. 👎
Is there strong evidence for a similar exposure (something like SpiderMan box office sales) to cause a similar outcome (a decrease in COVID19 cases in the US)? Nope. 👎
Ok, so using these criteria, let’s assess the overall likelihood of a causal effect:
Strength 🤷
Consistency 👎
Specificity 👎
Temporality 🤷
Biological gradient 👎
Plausibility 👎
Coherence 👎
Experiment 👎
Analogy 👎
I’m going to go ahead and rate this one 👎 not likely causal! Sorry! Looks like increasing box office sales of Spider-Man: No Way Home is not likely to get us out of the pandemic.
Vaccine effectiveness is a relative measure: it tells us how protected you will be relative to an unvaccinated person. Even with delta, this looks ok for infections (and very good for severe illness).
🤒 if an unvaccinated person has a 10% chance of getting sick
💉 and we think the vaccine effectiveness is 60%
💪 a vaccinated person’s chance of getting sick is only 4%
Why? Here’s the math:
Vaccine effectiveness is:
(risk unvaxed - risk vaxed) / (risk unvaxed)
(.10 - .04) / .10 = 0.6
or
Risk for vaccinated is:
risk unvaxed - (vaccine effectiveness x risk unvaxed)
.10 - (.60 x .10) = .04
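The arithmetic above as a small R sketch (the helper names are mine, just for illustration):

```r
# vaccine effectiveness from the two risks
ve <- function(risk_unvax, risk_vax) (risk_unvax - risk_vax) / risk_unvax

# risk for the vaccinated implied by a given effectiveness
risk_vax <- function(risk_unvax, effectiveness) risk_unvax * (1 - effectiveness)

ve(0.10, 0.04)        # 0.6
risk_vax(0.10, 0.60)  # 0.04
risk_vax(0.25, 0.60)  # 0.10: same VE, but higher baseline risk
```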
But what if there is more virus around / fewer mitigation efforts in place?
With more virus around / fewer mitigation efforts in place, everyone’s risk may increase by 2.5x:
🤒 an unvaccinated person has a 25% chance of getting sick
💉 we still think the vaccine effectiveness is 60%
💪 a vaccinated person’s chance of getting sick is 10%
In these scenarios, the vaccine effectiveness didn’t change, but the risk for the vaccinated increased because the overall risk increased. In fact in Scenario 2 the risk to the vaccinated person was equal to the risk of the unvaccinated person in Scenario 1!
Why does this matter? Breakthrough cases are going to be more frequent for this exact reason. It doesn’t mean the vaccine isn’t effective; it means everyone’s baseline risk is increasing because COVID-19 is just more prevalent everywhere.
There is hope, though! There are many things we can do to bring that baseline risk back down!
💉 get more people vaccinated
😷 mask up
🧪 make liberal use of testing
⚠️ use caution when possible until things improve
If you see something like “x% of the sick/hospitalized/deceased were vaccinated”, the better the vaccine uptake, the scarier this number will seem! It is using the wrong denominator. For example, here is a scenario with 90% vaccination where 4 people got sick: 2 vaccinated, 2 unvaccinated.
In this scenario, 50% of the sick were vaccinated, but this is the wrong metric to look at! It doesn’t take into account that 90% of the population is vaccinated (yay!). (Even Bill Gates made this mistake when he said he’d rather encounter a shark than a mosquito: https://twitter.com/BillGates/status/1118196606975787008.)
This is called flipping the conditional (and is a problem that we see all the time!). Instead of looking at the probability of being vaccinated given you are sick, you want to look at the probability of being sick given you are vaccinated.
What you need to do is look at the rates among the vaccinated and unvaccinated separately, and then compare them. Here 11% of the vaccinated got sick, 100% of the unvaccinated got sick.
We calculate vaccine efficacy as
(risk among unvaccinated - risk among vaccinated) / risk among unvaccinated
so in this case, 89% (yay!)
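The same numbers in R (a population of 20 people with 90% vaccinated is my reading of the scenario above):

```r
n_vax <- 18; n_unvax <- 2   # 90% of 20 people vaccinated
sick_vax <- 2; sick_unvax <- 2

# wrong metric: share of the sick who were vaccinated
sick_vax / (sick_vax + sick_unvax)    # 0.5

# right metric: risk within each group, then compare
risk_vax   <- sick_vax / n_vax        # ~0.11
risk_unvax <- sick_unvax / n_unvax    # 1
(risk_unvax - risk_vax) / risk_unvax  # ~0.89 vaccine effectiveness
```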
So in sum, denominators matter! When scrolling past headlines, be sure to think about what denominators are in play!
library(tidycensus)
library(tidyverse)
library(geofacet)
library(zoo)
The plot shows cases per 100,000 by state, so I first needed to pull population data. To do that I used the tidycensus package. (If you don’t have an API key, you can get one here)
census_api_key("YOUR API KEY")
I pulled the population by state from 2019.
pop <- get_acs(geography = "state", variables = "B01003_001", year = 2019)
Then I pulled the cases in from the New York Times GitHub repo.
cases <- read_csv("https://github.com/nytimes/covid-19-data/raw/master/us-states.csv")
These need to be wrangled a bit. I calculate daily cases from the cumulative counts, creating a new variable case for this purpose:
d <- cases %>%
  group_by(state) %>%
  mutate(case = c(cases[1], diff(cases))) %>%
  ungroup() %>%
  filter(!(date == as.Date("2021-03-08") & state == "Missouri")) %>%
  left_join(pop, by = c("fips" = "GEOID")) %>%
  group_by(state) %>%
  arrange(date) %>%
  mutate(
    case_7 = rollmean(case, k = 7, fill = NA),
    case_per_100 = (case_7 / estimate) * 100000) %>%
  ungroup() %>%
  filter(date > as.Date("2021-01-31"), date < as.Date("2021-04-05"))
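If you don’t want the zoo dependency, the same centered 7-day rolling mean can be sketched in base R with stats::filter (note: not dplyr::filter):

```r
x <- 1:10

# centered 7-day moving average; NA at the edges, like rollmean(..., fill = NA)
roll7 <- as.numeric(stats::filter(x, rep(1 / 7, 7), sides = 2))
roll7
```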
states <- tibble(state = state.name,
                 state_ = state.abb) %>%
  add_row(state = "District of Columbia", state_ = "DC")
d <- left_join(d, states, by = "state") %>%
  filter(!is.na(state_))
This plot had a neat feature: it filled in the area from the lowest point onward. To replicate this, I found the date with the minimum cases per 100,000 and created a variable col to indicate any date after this point.
d <- d %>%
group_by(state) %>%
slice_min(case_per_100) %>%
slice(1) %>%
mutate(min_date = date) %>%
select(min_date, state) %>%
left_join(d, by = "state") %>%
mutate(col = ifelse(date >= min_date, "yes", "no"))
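The pattern above (flag every date on or after each state's minimum) can be seen on a tiny toy dataset; toy, mins, and flagged are made-up names for illustration:

```r
library(dplyr)

toy <- tibble::tibble(
  state = c("A", "A", "A", "B", "B", "B"),
  date  = c(1, 2, 3, 1, 2, 3),
  case  = c(3, 1, 2, 5, 6, 4)
)

# Date of the minimum case count within each state
mins <- toy %>%
  group_by(state) %>%
  slice_min(case, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(state, min_date = date)

# Flag every row on or after that state's minimum
flagged <- toy %>%
  left_join(mins, by = "state") %>%
  mutate(col = ifelse(date >= min_date, "yes", "no"))

flagged$col
#> [1] "no"  "yes" "yes" "no"  "no"  "yes"
```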
Now time to plot! The x-axis is date, the y-axis is case_per_100, and voila!
ggplot(d, aes(x = date, y = case_per_100)) +
geom_line(color = "#BE2D22") +
geom_area(aes(alpha = col), fill = "#BE2D22") +
scale_alpha_discrete(range = c(0, 0.7)) +
facet_geo(~state_) +
theme_minimal() +
labs(x = "",
y = "",
title = "Cases per 100,000",
subtitle = "Feb 1 - Apr 4, Red area indicates rise since lowest point of 2021",
caption = "Note: Shows seven-day average") +
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
legend.position = "none")
Grided is an app that lets you define CSS-Grid layouts in a simple GUI, allowing you to see how your app will look while you define it. The app is hosted at nickstrayer.me/grided and the source is on GitHub.
I recently started a job at RStudio on the Shiny team. One of the things I was brought on to do is "help Shiny users build more beautiful apps." Because of this I have been working on a project that lets users lay out dashboard apps easily with the CSS-Grid specification. CSS-Grid is a super powerful way of naturally declaring layouts, but if you're like me, the initial setup of getting your app going can be a bit cumbersome. Because of this I almost always reach for a tool like Sarah Drasner's CSS grid generator or grid.layoutit.
Inspired by both of these resources, I began working on a similar tool that lets Shiny users draw their desired layout and then get the code for that layout. After working on this project for a while, however, I realized that what I had built had a few features I found useful compared to existing tools, so I decided to spin off a purely JS/CSS version.
The goal of Grided is to simplify the very first part of building a grid layout. You specify, using a (hopefully) straightforward GUI, how you want your app laid out, and you get code to implement the structure you defined.
The page starts with an empty grid layout. From here you can adjust the number of rows and columns along with the gap between them. As you do this, the faux app on the right updates in real time. Each column and row can be resized with a control form. r tufte::margin_note("A fun hidden feature is that when you use pixel-sizing, a drag interface appears for resizing.")
For instance, you can have a fixed-width sidebar.
Once you have your desired grid, you add your elements by clicking and dragging them on the grid. Once you have placed an element, you provide its ID. Messed up where you placed it? Just drag it around with the supplied interaction handles to move it. r tufte::margin_note("A little dashed box shows you the extent of your drag, while the element snaps to the grid.")
Decide you actually want a footer on your page after declaring a bunch of other stuff? No worries, simply add another row; your elements will stay where they are. Decide you don't want a footer? Just remove a row. If elements reside in the deleted row/column, you will be asked in a popup how you want to update them.
Happy with your layout and ready to get building your site? Click the “Get code” button in the top right and you will be provided the CSS and HTML code necessary to recreate your layout on your site.
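As an illustration of the kind of output you get (the element names here are made up, and the app's exact generated code may differ), the CSS for a header-plus-sidebar layout might look along these lines:

```css
.grid-container {
  display: grid;
  grid-template-columns: 250px 1fr;  /* fixed-width sidebar, flexible main area */
  grid-template-rows: 100px 1fr;
  gap: 1rem;
  grid-template-areas:
    "header header"
    "sidebar plot";
}
#header  { grid-area: header; }
#sidebar { grid-area: sidebar; }
#plot    { grid-area: plot; }
```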
r tufte::margin_note("For the Shiny users, the main version of the app will give you R/Shiny code instead of CSS and HTML.")
There are a few fun facts about the app:
There are a few (to be kind) limitations right now:
* Only the sizing units fr, rem, and px are supported. CSS-Grid has some cool fancy options like minmax() but I couldn't figure out how to build an interface like this for these dynamic layouts. r tufte::margin_note("If you can think of one, let me know!") The same applies to forms like repeat(auto-fit, ...).
Here are some examples of my DIY lightboard in action!
* How logging symptoms helps slow the virus spread
* Lucy D’Agostino McGowan discusses peer learning communities
* Law of Iterated Expectation
This summer, in an effort to spruce up my home office and online statistics courses, I was in search of a simple solution that would let me walk through equations remotely, as I normally would with a whiteboard in my office. Enter the lightboard! I love the aesthetic of these! They let you work through fun equations while still letting the viewer see your facial expressions, getting as close to an "in person" interaction as I've seen! The downside? They are super pricey! Some sites were quoting over $2,500 for the simplest models. Even DIY "kits" were coming in at over $1,000 - yikes! Then I came across this amazing YouTube video that lays out a simple DIY method that can be put together for much less; in fact, I was able to get my supplies together for less than $100. I put together a quick tweetorial outlining the steps and the supplies I used; I thought I'd copy the contents here, since a blog post is sometimes a bit easier to read!
r emo::ji("white_large_square")
1/4 x 24 x 36 plexiglass
r emo::ji("bulb")
16ft of LED strip lights
r emo::ji("clamp")
2 one-inch C-clamps
r emo::ji("book")
4 bookshelf L brackets
r emo::ji("pen")
neon markers
r emo::ji("bulb")
stick 3ft of the LED strip lights on a table
r emo::ji("point_up")
your plexiglass will sit on top of this
r emo::ji("sandwich")
put your plexiglass between two of the shelf brackets on one side
r emo::ji("clamp")
fasten with the C-clamp
r emo::ji("repeat")
repeat on the other side
r emo::ji("bulb")
wrap the LED strip lights around the remainder of the plexiglass
r emo::ji("point_up")
I used a little tape to hold it on
r emo::ji("selfie")
set up camera
r emo::ji("partying_face")
use fun markers
r emo::ji("left_right_arrow")
flip video to mirror image
1. Teacup Giraffes and Stats [link]
This is a series of fabulous self-paced interactive modules by Hasse Walum and Desirée De Leon. They begin by walking through the very basics of R and then teach basic statistics concepts such as measures of central tendency and variability, all with adorable illustrations of giraffes!
2. Swirl [link]
This is an R package that helps you learn R from within R. It’s an interactive course that runs from your R console!
3. RYouWithMe [link]
This is a series of online learning resources for using R geared toward beginners, by Lisa Williams, Jen Richmond, and Danielle Navarro as part of R-Ladies Sydney.
4. Adventures in R [link]
This is a college-level online course taught by Kelly Bodwin focusing on using R for statistics and data science.
5. R4DS Online Learning Community [link]
This is a learning community started by Jessie Mostipak that initially went through Garrett Grolemund and Hadley Wickham's R for Data Science book. The group has since expanded to broadly welcome folks interested in learning R and/or improving their skills! The community is centered around a Slack group (join here:) and you can join as a learner or a mentor.
6. Data Science: Foundations using R Specialization [link]
This is a Coursera course by the amazing JHU Data Science Lab team: Jeff Leek, Roger Peng, and Brian Caffo. This course will help you ask the right questions, manipulate data sets, and create visualizations to communicate results in R.
7. RStudio Primers [link]
These are interactive tutorials that teach everything from basic R skills to iterating and writing functions
8. Suggested by Giulio Centorame: Ready for R [link]
A course by Ted Laderas meant to be a gentle introduction to using R/RStudio in your daily work.
1. Creating dynamic dashboards with the shinydashboard R package [link]
An online course (by me!) on the shinydashboard package. It expects you to know a bit about R and Shiny, but is aimed mostly at beginner/intermediate users.
2. Text Mining with R: A Tidy Approach [link]
Rachael Tatman described this as a "code-along" - I love that description! This book by Julia Silge and David Robinson is a great introduction to text mining the tidy way!
3. MasterClass Featuring Tyler Morgan-Wall: 3D Mapping and Dataviz in R [link]
A video lecture by Tyler Morgan-Wall on making 3D maps in R with rayshader.
4. Suggested by Alireza Akhondiasl: Interactive web-based data visualization with R, plotly, and shiny [link]
A book by Carson Sievert describing how to create interactive visualizations in R.
5. Suggested by Lisa Lendway: Getting Started with Tidymodels [link]
A set of modules by RStudio focused on the tidymodels package.
6. Suggested by Lisa Lendway and Alex Hayes: Tidy machine learning in R [link]
A tutorial code-along by Rebecca Barter on machine learning focusing on the tidymodels package.
Did I miss one of your favorites? Please let me know!
Check out #1 in the Part 1 post. This post focuses on #2, in particular it focuses on competing risks. This criticism was made by Stefanos Bonovas and Daniele Piovani in a letter to the Editor a few days ago.
I believe the authors are interested in telling us about clinical improvement in this cohort of patients taking remdesivir; in particular, they want to estimate the cumulative incidence of clinical improvement by 28 days. For the purposes of their analysis, "clinical improvement" is defined as being discharged alive or having a decrease of 2 points or more on a 6-level ordinal scale of oxygen support:
They use a Kaplan-Meier plot to show this. Let's recreate it first.
I spent some time trying to recreate their analysis using the data from Figure 2, and I wasn't quite able to do it. So I've painstakingly pulled every number from Figure 3A 😅
library(tidyverse)
library(survival)
library(survminer)
library(cowplot)
library(cmprsk)
d <- read_csv("https://raw.githubusercontent.com/LucyMcGowan/nejm-grein-reanalysis/master/data/data-fig-2.csv")
fig_3 <- tibble(
time = c(4, 6, 6, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9,
10, 10, 10, 11, 11, 11, 11, 12, 12, 13,
13, 13, 13, 14, 14, 15, 15, 16, 16, 16,
16, 17, 17, 17, 17, 18, 18, 20, 22, 22,
23, 23, 23, 25, 26, 27, 28, 28, 29, 33),
event = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0,
1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1,
0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
0)
)
s <- survfit(
Surv(time = time, event = event) ~ 1,
data = fig_3
)
ggsurvplot(fit = s,
risk.table = TRUE,
break.time.by = 4,
risk.table.y.text = FALSE,
ylab = "Days",
tables.theme = theme_cleantable(),
fun = "event")
I haven’t quite figured out how to map these back to Figure 2, but it seems it replicates Figure 3A pretty well. Let’s estimate the cumulative incidence at 28 days:
s_df <- tibble(
time = s$time,
cumulative_incidence = 1 - s$surv
)
s_df %>%
filter(time == 28)
# A tibble: 1 × 2
time cumulative_incidence
<dbl> <dbl>
1 28 0.841
Looks very similar to the 84% reported in the initial paper. Cool, what's the problem? The main idea is to examine time to clinical improvement and assess, at 28 days, the cumulative incidence of improvement. In a typical survival analysis (like the one the authors did here), you follow a group of patients for a certain amount of time. If a patient has an event (in the analysis above, if they improve), you record them as such along with the time the event occurred; otherwise you censor them and record the final time they were observed. In the figure here we have 33 patients who improved and 20 who were censored. Sounds fine, what's the problem? A key assumption is that censoring is "noninformative": the patient is followed for a certain amount of time, never has the event, and then is no longer followed. Our best guess is that we know they didn't have the event up until the last day we saw them, so we censor them on that day. Maybe they'll have the event the next day, or maybe in a year. For the assumption to be met, patients who have been censored must be just as likely to have the event as those still being followed in the study. Let's pull up Figure 2 again.
7 of the patients in this cohort died. If someone dies, we know they are not going to improve later. This is not noninformative censoring!! Luckily there is a very straightforward way to deal with this in statistics: competing risks!
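A tiny simulated sketch of why this matters (made-up numbers, not the trial data): if we treat deaths as ordinary censoring, 1 minus the Kaplan-Meier estimate overstates the cumulative incidence of improvement relative to a competing-risks estimate.

```r
library(survival)
library(cmprsk)

# Toy cohort: 6 improve (event 1), 3 die (event 2), 1 censored (0)
time  <- c(2, 3, 4, 5, 6, 7, 3, 5, 6, 8)
event <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 0)

# Naive approach: treat the deaths as censored
km <- survfit(Surv(time, event == 1) ~ 1)
naive <- 1 - min(km$surv)

# Competing-risks approach: deaths are a separate event type
ci <- cuminc(time, event, cencode = 0)
cr <- max(ci$`1 1`$est)

naive > cr  # the naive estimate is larger
#> [1] TRUE
```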
In a competing risk analysis, we can separate out the death outcome from the remaining censored outcomes. We can then appropriately estimate the cumulative incidence of improving. I've recoded the 7 deaths (guessed from Figure 2) below.
fig_3_fixed <- tibble(
time = c(4, 6, 6, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9,
10, 10, 10, 11, 11, 11, 11, 12, 12, 13,
13, 13, 13, 14, 14, 15, 15, 16, 16, 16,
16, 17, 17, 17, 17, 18, 18, 20, 22, 22,
23, 23, 23, 25, 26, 27, 28, 28, 29, 33),
event = c(1, 1, 1, 1, 1, 1, 1, 2, 0, 0, 1, 1, 2,
1, 1, 0, 1, 1, 2, 0, 1, 1, 1, 1, 1, 0,
1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 0, 2, 1,
2, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
0)
)
Interested in learning more about competing risk analyses in R? Emily Zabor has an amazing tutorial on this
Let’s see how the plot looks now.
x <- cuminc(fig_3_fixed$time,
fig_3_fixed$event,
cencode = 0)
ggcompetingrisks(x,
conf.int = TRUE,
gnames = c("Improvement Improvement", "Death Death"),
ggtheme = theme_classic())
Ok, let’s calculate the cumulative incidence now, taking death into account.
x %>%
map_df(`[`, c("time", "est", "var"), .id = "id") %>%
filter(id %in% c("1 1"), time == 28) %>%
slice(2)
# A tibble: 1 × 4
id time est var
<chr> <dbl> <dbl> <dbl>
1 1 1 28 0.734 0.00612
The estimated cumulative incidence of clinical improvement by day 28 is 73%. Here are two plots that show the difference.
Why does this matter? It’s crucial that the questions we answer statistically match the ones we are posing in practice. Understanding the underlying assumptions of the models is so important!
It turns out really this is a story about Figure 3B. Why? Because 6/7 of the patients who died were on invasive oxygen support at baseline, so modeling this correctly has the largest impact on the Invasive line on this plot. Here they were examining the same outcome, stratifying by whether patient had invasive or noninvasive baseline oxygen support.
fig_3_fixed <- tibble(
time = c(4, 6, 6, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9,
10, 10, 10, 11, 11, 11, 11, 12, 12, 13,
13, 13, 13, 14, 14, 15, 15, 16, 16, 16,
16, 17, 17, 17, 17, 18, 18, 20, 22, 22,
23, 23, 23, 25, 26, 27, 28, 28, 29, 33),
event = c(1, 1, 1, 1, 1, 1, 1, 2, 0, 0, 1, 1, 2,
1, 1, 0, 1, 1, 2, 0, 1, 1, 1, 1, 1, 0,
1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 0, 2, 1,
2, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
0),
invasive = c(0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1,
0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1,
1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1,
1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1,
1)
)
fit <- survfit(
Surv(time, ifelse(event == 1, 1, 0)) ~ invasive,
data = fig_3_fixed
)
ggsurvplot(
fit = fit,
risk.table = TRUE,
break.time.by = 4,
risk.table.y.text = FALSE,
xlim = c(0, 36),
tables.theme = theme_cleantable(),
conf.int = TRUE,
fun = "event",
legend.labs = c("Noninvasive", "Invasive"))
Now what happens if we do a competing risks model? Let's see what that looks like now.
x <- cuminc(fig_3_fixed$time,
fig_3_fixed$event,
group = fig_3_fixed$invasive,
cencode = 0)
ggcompetingrisks(x,
conf.int = TRUE,
gnames = c("Noninvasive Improvement", "Invasive Improvement",
"Noninvasive Death", "Invasive Death"),
ggtheme = theme_classic(),
ylim = c(0, 1))
We wouldn’t expect the Noninvasive group to change much, because only one patient in this stratum died. Let’s focus instead on just the Invasive group to compare how that line changes with a proper analysis.
Basically, all this is to say that a competing risk analysis would have been more appropriate here. Hopefully this codethrough has been helpful!
This post focuses on the very neat figure!
Figure 2 in the original paper shows the changes in oxygen-support status from baseline for each of the 53 patients. This figure includes information about:
You can construct a whole dataset from this (and I did!); you can find it on my GitHub.
Below is code to recreate their Figure 2 using #rstats 😎.
library(tidyverse)
d <- read_csv("https://raw.githubusercontent.com/LucyMcGowan/nejm-grein-reanalysis/master/data/data-fig-2.csv")
long_dat <- d %>%
pivot_longer(day_1:day_36)
cats <- tibble(
value = 1:6,
cat = factor(c("Ambient air", "Low-flow oxygen", "High-flow oxygen", "NIPPV",
"Mechanical ventilation", "ECMO"),
levels = c("ECMO", "Mechanical ventilation", "NIPPV",
"High-flow oxygen", "Low-flow oxygen", "Ambient air"))
)
long_dat %>%
left_join(cats, by = "value") %>%
filter(!is.na(value)) %>%
mutate(day_oxy = as.numeric(gsub("day_", "", name)) - 1,
day_oxy = ifelse(day_oxy > 28, 28, day_oxy),
day = ifelse(day > 28, 28, day),
patient = factor(patient, levels = 53:1),
event = ifelse(event == "censor", NA, event)
) %>%
ggplot(aes(x = patient, y = day_oxy, fill = cat)) +
geom_segment(aes(x = patient, xend = patient,
y = 0, yend = day - 0.5), lty = 3) +
geom_tile(width = 0.5) +
scale_fill_manual("Oxygen support",
values = c("#7D3A2C", "#AA3B2F", "#D36446", "#DEA568",
"#F5D280", "#FCEEBC")) +
geom_point(aes(x = patient, y = day - 0.5, shape = event)) +
scale_shape_manual("Event", values = c(15, 5),
labels = c("Death", "Discharge", "")) +
guides(fill = guide_legend(override.aes = list(shape = NA), order = 1)) +
coord_flip() +
labs(y = "day", x = "") +
theme_classic()
I definitely applaud the authors for making this so accessible! Check out Part 2 to see a bit about how their statistics could be improved.
4. Comments and Google analytics
It is straightforward to incorporate comments on your blog – our old site used utterances. Likewise, it is simple to add Google Analytics. To use these in Quarto, we can add the following to the website's global yaml:
To see our full setup, check out our GitHub repo: github.com/LFOD/LFOD.github.io
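For example (a sketch of the relevant bit of _quarto.yml, assuming an utterances-backed setup; the repo name and tracking ID are placeholders, not our actual values):

```yaml
website:
  google-analytics: "UA-XXXXXXXXX-X"
  comments:
    utterances:
      repo: user/repo-name
```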
Have other questions? Feel free to leave them in the comments!