Bayes’ Theorem and the Probability of Having COVID-19

I’ve seen a few papers describing the characteristics of people who tested positive for SARS-CoV-2, and these are sometimes being interpreted as describing the probability of infection for people with those characteristics. Let’s talk about why that’s likely not true.

👉 Usually when thinking about estimating the prevalence of a disease, we use the sensitivity and specificity of the test to help us
👉 The calculations assume that everyone is equally likely to get tested, and with SARS-CoV-2 that is likely not the case

Let’s do some 💭 thought experiments. For these, my goal is to estimate the probability of being infected with 🦠 SARS-CoV-2 given you have 🧩 Disease X

For example,🧩 Disease X could be:

♥️ heart disease
😡 hypertension
➕ it could also be any subgroup (for example age, etc)

In these 💭 thought experiments, we don’t actually have perfect information about who is infected with 🦠 SARS-CoV-2, we just know among those who are 🧪 tested who has been infected with 🦠 SARS-CoV-2. This is really the crux of the matter.

For these 💭 thought experiments, assume that the current tests are perfect (that is there are 0 false positives and 0 false negatives)

☝️ Note that this is likely not the case: with the current testing framework, false positives (+) are unlikely, but false negatives (-) may be occurring.

We want the probability of being infected with SARS-CoV-2 given you have Disease X: P(🦠|🧩)

To get this, we need P(🧩|🦠) because based on Bayes’ Theorem we know:

P(🦠|🧩) = P(🧩|🦠)P(🦠) / P(🧩)

BUT, instead of P(🧩|🦠), we actually have P(🧩|🦠, 🧪) - the probability of having disease X given you have SARS-CoV-2 AND you were tested. So the crux of these thought experiments will be trying to get an accurate estimate of P(🧩|🦠) so that we can get back to P(🦠|🧩).
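To make the formula concrete, here is a minimal Python sketch of Bayes' Theorem. The function name, argument names, and the inflated 30% input are my own illustrative assumptions, not numbers from any paper. The point is only that the same formula gives a different answer when the first input is P(🧩|🦠, 🧪) rather than P(🧩|🦠).

```python
def bayes_posterior(p_x_given_inf, p_inf, p_x):
    """Return P(infected | disease X) from Bayes' Theorem."""
    return p_x_given_inf * p_inf / p_x


# Made-up inputs, for illustration only.
truth = bayes_posterior(p_x_given_inf=0.20, p_inf=0.50, p_x=0.20)

# Same formula, but with a hypothetical inflated share of disease X
# among people who were infected AND tested.
plugged_in = bayes_posterior(p_x_given_inf=0.30, p_inf=0.50, p_x=0.20)

print(f"using P(X | infected):         {truth:.2f}")
print(f"using P(X | infected, tested): {plugged_in:.2f}")
```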

Thought experiment 1️⃣: Best case scenario

Note: all of these numbers are made up!

🧩 20% of the population has disease X
🦠 50% are infected with SARS-CoV-2
❌ There is no relationship between disease X and SARS-CoV-2
🧪 People with disease X are just as likely to get tested as people without disease X

Why is thought experiment 1️⃣ a best case scenario?

It looks like:

🧪 50% have SARS-CoV-2 infection among those tested
🧪 Of those who tested positive, the prevalence of disease X is 20%
P(🦠|🧩) = 50%

✅ Reality (no relationship between disease X and SARS-CoV-2) matches what we see

Thought experiment 2️⃣: Oversampling scenario

🧩 20% of the population has disease X
🦠 50% have SARS-CoV-2 infection
❌ There is no relationship between disease X and SARS-CoV-2
🧪 People with disease X are 2x more likely to get tested than people without disease X

Why is thought experiment 2️⃣ bad?

It looks like:

🧪 50% have SARS-CoV-2 infection among those tested
🧪 Of those who tested positive for SARS-CoV-2, the prevalence of disease X is 33% 😱

❌ If we plug in what we see (P(🧩|🦠, 🧪) for P(🧩|🦠)), it looks like P(🦠|🧩) is 0.33 × 0.5 / 0.2 = 82.5%, when in reality it is 50%.
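Here is a rough sketch of thought experiment 2️⃣ using counts. The population size of 1,000 and the 10% baseline testing rate are assumptions I picked for illustration; the baseline rate cancels out, so only the 2x ratio matters.

```python
# Hypothetical population, using the made-up numbers from thought experiment 2.
population = 1000
p_x = 0.20                # share with disease X
p_inf = 0.50              # share infected, independent of disease X
test_rate_no_x = 0.10     # assumed baseline testing rate (cancels out)
test_rate_x = 2 * test_rate_no_x  # disease X doubles the chance of being tested

# People who are infected AND tested, split by disease X status.
tested_pos_x = population * p_x * p_inf * test_rate_x
tested_pos_no_x = population * (1 - p_x) * p_inf * test_rate_no_x

# What we observe: share of disease X among those who tested positive.
p_x_given_pos_tested = tested_pos_x / (tested_pos_x + tested_pos_no_x)

# Naively plugging the observed share into Bayes' Theorem.
naive = p_x_given_pos_tested * p_inf / p_x

print(f"P(X | infected, tested) = {p_x_given_pos_tested:.3f}")
print(f"naive P(infected | X)   = {naive:.3f}")
```

Without rounding, the observed share is 1/3 and the naive answer is about 83%; the 82.5% in the text comes from plugging in the rounded 0.33.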

Thought experiment 3️⃣: Undersampling scenario

🧩 20% of the population has disease X
🦠 50% have SARS-CoV-2 infection
❌ There is no relationship between disease X and SARS-CoV-2
🧪 People with disease X are 1/2 as likely to get tested as people without disease X

Why is thought experiment 3️⃣ bad?

It looks like:

🧪 50% have SARS-CoV-2 infection among those tested
🧪 Of those who tested positive for SARS-CoV-2, the prevalence of disease X is 11%

❌ If we plug in what we see (P(🧩|🦠, 🧪) for P(🧩|🦠)), it looks like P(🦠|🧩) is 0.11 × 0.5 / 0.2 = 27.5%, when in reality it is 50%.
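The first three thought experiments differ only in how much more (or less) likely people with disease X are to be tested. As a sketch, the naive estimate can be written as a function of that ratio. The function name and arguments are my own, and it assumes, as in these scenarios, that infection is unrelated to disease X and that testing depends only on disease X status.

```python
def naive_posterior(p_x, p_inf, relative_test_rate):
    """Naive P(infected | X) when P(X | infected, tested) is plugged into
    Bayes' Theorem in place of P(X | infected).

    Assumes infection is independent of disease X, and that people with
    disease X are `relative_test_rate` times as likely to be tested.
    """
    weight_x = p_x * relative_test_rate
    weight_no_x = 1 - p_x
    share_x_among_tested_pos = weight_x / (weight_x + weight_no_x)
    return share_x_among_tested_pos * p_inf / p_x


# Undersampling (1/2), equal testing (1), and oversampling (2).
for rate in (0.5, 1, 2):
    print(f"relative testing rate {rate}: {naive_posterior(0.2, 0.5, rate):.3f}")
```

Only the equal-testing case recovers the true 50%. The unrounded values for the other two are close to, but not exactly, the 27.5% and 82.5% quoted in the text, which use the rounded 11% and 33%.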

Thought experiment 4️⃣: Two problems scenario

🧩 20% of the population has disease X
🦠 56% have SARS-CoV-2 infection
✅ People with disease X are 1.6 times as likely to have SARS-CoV-2 infection: P(🦠|🧩) = 80%, versus 50% for people without disease X
🧪 People with disease X are 5 times as likely to get tested as people without disease X

Why is thought experiment 4️⃣ bad?

It looks like:

🦠🧪 66% have SARS-CoV-2 infection among those tested
🧩🧪 Of those who tested positive for SARS-CoV-2, the prevalence of disease X is 66%

❌ We’re getting both the prevalence of SARS-CoV-2 and its association with Disease X wrong
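To see where both 66% figures come from, here is a small sketch of thought experiment 4️⃣. The variable names are my own, and the 50% infection rate among people without disease X is the value implied by the 56% overall prevalence.

```python
p_x = 0.20                # share with disease X
p_inf_given_x = 0.80      # infection rate with disease X
p_inf_given_no_x = 0.50   # infection rate without disease X (implied by 56% overall)
relative_test_rate = 5    # disease X makes testing 5 times as likely

# Relative weight of each group among the tested.
w_x = p_x * relative_test_rate
w_no_x = 1 - p_x

# Relative weight of positives in each group among the tested.
pos_x = w_x * p_inf_given_x
pos_no_x = w_no_x * p_inf_given_no_x

true_prevalence = p_x * p_inf_given_x + (1 - p_x) * p_inf_given_no_x
observed_prevalence = (pos_x + pos_no_x) / (w_x + w_no_x)
observed_x_among_pos = pos_x / (pos_x + pos_no_x)

print(f"true P(infected)             = {true_prevalence:.3f}")
print(f"observed P(infected, tested) = {observed_prevalence:.3f}")
print(f"observed P(X | pos, tested)  = {observed_x_among_pos:.3f}")
```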

How can we fix this?

OKAY, scenarios finished. Hopefully this highlights why we can’t take the prevalence of characteristics among people who tested positive as the prevalence of those characteristics among everyone with a SARS-CoV-2 infection. Now, here are some tips for how we can correct the numbers.

Scenario 2️⃣: Oversampling by 2x

👉 take those with disease X that tested positive for SARS-CoV-2 and downweight them by a factor of 2.

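✅ the adjusted prevalence of Disease X among those that tested positive for SARS-CoV-2 is 0.5 / (0.5 + 2) = 0.5 / 2.5 = 0.2 (20%), since the observed 1:2 split of positives with and without disease X becomes 0.5:2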

P(🦠|🧩) = 50%

Scenario 3️⃣: Undersampling by 1/2

👉 take those with disease X that tested positive for SARS-CoV-2 and upweight them by a factor of 2.

✅ the adjusted prevalence of Disease X among those that tested positive for SARS-CoV-2 is 2 / (2 + 8) = 2 / 10 = 0.2 (20%), since the observed 1:8 split of positives with and without disease X becomes 2:8

P(🦠|🧩) = 50%
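The corrections for scenarios 2️⃣ and 3️⃣ are the same operation: divide the disease X positives by how much more (or less) likely that group was to be tested. A minimal sketch follows; the function name is hypothetical, and it assumes we somehow know the relative testing rate, which in practice is the hard part.

```python
def adjusted_share_x(pos_x, pos_no_x, relative_test_rate):
    """Share with disease X among positives, after dividing the disease X
    count by that group's relative probability of being tested."""
    reweighted_x = pos_x / relative_test_rate
    return reweighted_x / (reweighted_x + pos_no_x)


# Scenario 2: positives split 1:2 (33% disease X), oversampled by 2x.
scenario_2 = adjusted_share_x(pos_x=1, pos_no_x=2, relative_test_rate=2)

# Scenario 3: positives split 1:8 (11% disease X), undersampled by 1/2.
scenario_3 = adjusted_share_x(pos_x=1, pos_no_x=8, relative_test_rate=0.5)

# Plugging the adjusted share back into Bayes' Theorem: P(X | inf) * P(inf) / P(X).
for name, share in (("scenario 2", scenario_2), ("scenario 3", scenario_3)):
    print(f"{name}: adjusted P(X | infected) = {share:.2f}, "
          f"P(infected | X) = {share * 0.5 / 0.2:.2f}")
```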

Scenario 4️⃣: Two problems

For the prevalence of SARS-CoV-2 infections, correct by weighting to account for the probability of being tested in each subgroup. Here:

🧩 = disease X
❌🧩 = No disease X

P(🦠) = P(🦠 | 🧩) P(🧩) + P(🦠 | ❌🧩) P(❌🧩)

✅P(🦠) = ⅘ * 0.2 + ½ * 0.8 = 56%

Said another way, for calculating the overall prevalence of SARS-CoV-2, this is like downweighting the oversampled Disease X people (divide by 5).

✅ P(🦠) = (⅘ + 2) / (⅘ + 2 + ⅕ + 2) = 0.56

For calculating the prevalence of disease X among those with SARS-CoV-2 infections:

✅ P(🧩 | 🦠) = P(🦠 | 🧩) P(🧩) / P(🦠) = ⅘ * 0.2 / 0.56 = 0.285

Again, downweight the oversampled Disease X population (divide by 5).

✅ P(🧩 | 🦠) = ⅘ / (⅘ + 2) = 0.285

P(🦠 | 🧩) = 80%
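Putting the scenario 4️⃣ correction together as a sketch: the tested group from thought experiment 4️⃣ splits in the proportion 4 : 1 : 2 : 2 (disease X positive, disease X negative, no disease X positive, no disease X negative). The dictionary layout and names below are my own, and the correction again assumes the 5x testing ratio is known.

```python
# Tested counts in the proportion implied by thought experiment 4.
tested = {
    ("x", "pos"): 4,
    ("x", "neg"): 1,
    ("no_x", "pos"): 2,
    ("no_x", "neg"): 2,
}

# How much more likely each group was to be tested (assumed known).
relative_test_rate = {"x": 5, "no_x": 1}

# Downweight each cell by its group's relative testing rate.
weighted = {
    (group, result): count / relative_test_rate[group]
    for (group, result), count in tested.items()
}

total = sum(weighted.values())
positives = weighted[("x", "pos")] + weighted[("no_x", "pos")]

p_inf = positives / total                        # (4/5 + 2) / (4/5 + 2 + 1/5 + 2)
p_x_given_inf = weighted[("x", "pos")] / positives   # (4/5) / (4/5 + 2)
p_x = (weighted[("x", "pos")] + weighted[("x", "neg")]) / total

# Back to the quantity we wanted, via Bayes' Theorem.
p_inf_given_x = p_x_given_inf * p_inf / p_x

print(f"P(infected)     = {p_inf:.3f}")
print(f"P(X | infected) = {p_x_given_inf:.3f}")
print(f"P(infected | X) = {p_inf_given_x:.3f}")
```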

Hopefully this is somewhat helpful when reading about characteristics of those who are currently testing positive for SARS-CoV-2. As always, please let me know if there is something I’ve missed! 🙏


Lucy D'Agostino McGowan

Currently excited about: observational study methods, translational research, BB-8
