# Bayes’ Theorem and the Probability of Having COVID-19

I’ve seen a few papers describing the characteristics of people who tested positive for SARS-CoV-2, and this is sometimes being interpreted as describing the **probability of infection** for people with certain characteristics. Let’s talk about why that’s likely not true.

👉 Usually when thinking about estimating the prevalence of a disease, we use the **sensitivity** and **specificity** of the test to help us

👉 The calculations assume that everyone is equally likely to get tested, and with SARS-CoV-2 that is likely not the case

Let’s do some 💭 thought experiments. For these, my goal is to estimate the probability of being infected with 🦠 SARS-CoV-2 given you have 🧩 Disease X

For example, 🧩 Disease X could be:

♥️ heart disease

😡 hypertension

➕ it could also be any subgroup (for example age, etc)

In these 💭 thought experiments, we don’t actually have perfect information about who is infected with 🦠 SARS-CoV-2; we only know who has been infected with 🦠 SARS-CoV-2 among those who were 🧪 **tested**. This is really the crux of the matter.

For these 💭 thought experiments, assume that the current tests are *perfect* (that is, there are 0 false positives and 0 false negatives).

☝️ Note that this is likely not the case: with the current testing framework, false positives (+) are unlikely but false negatives (-) may be occurring.

We want the probability of being infected with SARS-CoV-2 given you have Disease X: P(🦠|🧩)

To get this, we need P(🧩|🦠) because based on Bayes’ Theorem we know:

P(🦠|🧩) = P(🧩|🦠)P(🦠) / P(🧩)

BUT, instead of P(🧩|🦠), we actually have P(🧩|🦠, 🧪) - the probability of having disease X given you have SARS-CoV-2 AND you were tested. So the crux of these thought experiments will be trying to get an accurate estimate of P(🧩|🦠) so that we can get back to P(🦠|🧩).

## Thought experiment 1️⃣: Best case scenario

Note: all of these numbers are made up!

🧩 20% of the population has disease X

🦠 50% are infected with SARS-CoV-2

❌ There is no relationship between disease X and SARS-CoV-2

🧪 People with disease X are just as likely to get tested as people without disease X

Why is thought experiment 1️⃣ a best case scenario?

It looks like:

🧪 50% have SARS-CoV-2 infection among those tested

🧪 Of those who tested positive, the prevalence of disease X is 20%

P(🦠|🧩) = 50%

✅ Reality (no relationship between disease X and SARS-CoV-2) matches what we see
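If it helps to see the arithmetic, here is a minimal Python sketch of the Bayes’ Theorem step using the made-up numbers from thought experiment 1️⃣. The function name `bayes` and the variable names are illustrative labels, not anything standard.

```python
def bayes(p_x_given_inf, p_inf, p_x):
    """Return P(infected | disease X) = P(X | infected) * P(infected) / P(X)."""
    return p_x_given_inf * p_inf / p_x


# Made-up numbers from thought experiment 1
p_x = 0.20                   # P(disease X) in the whole population
p_inf = 0.50                 # P(infected) in the whole population
p_x_given_inf_tested = 0.20  # prevalence of disease X among positive tests

# With equal testing, what we see among the tested stands in for P(X | infected)
p_inf_given_x = bayes(p_x_given_inf_tested, p_inf, p_x)
print(f"P(infected | disease X) = {p_inf_given_x:.0%}")
```

Because testing is equally likely in both groups here, the tested-positive prevalence is a fair stand-in for P(🧩|🦠), and the result lands on the true 50%.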

## Thought experiment 2️⃣: Oversampling scenario

🧩 20% of the population has disease X

🦠 50% have SARS-CoV-2 infection

❌ There is no relationship between disease X and SARS-CoV-2

🧪 People with disease X are **2x** as likely to get tested as people without disease X

Why is thought experiment 2️⃣ bad?

It looks like:

🧪 50% have SARS-CoV-2 infection among those tested

🧪 Of those who tested positive for SARS-CoV-2, the prevalence of disease X is 33% 😱

❌ If we plug in what we see (P(🧩|🦠, 🧪) for P(🧩|🦠)), it looks like P(🦠|🧩) is 82.5%, when in reality it is 50%.
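A rough Python sketch of where the 33% comes from, under the scenario’s made-up numbers. The helper `observed_among_tested` and its `test_ratio` parameter are hypothetical names introduced for this illustration. Note that with unrounded numbers the naive estimate comes out near 83%; the 82.5% above follows from rounding the prevalence to 33% first.

```python
def observed_among_tested(p_x, p_inf_given_x, p_inf_given_not_x, test_ratio):
    """Return (P(infected | tested), P(disease X | infected, tested)).

    test_ratio (hypothetical name): how many times as likely people with
    disease X are to be tested, relative to people without it.
    Assumes a perfect test, as in the thought experiments.
    """
    # Relative share of each group among the tested
    tested_x = p_x * test_ratio
    tested_not_x = 1 - p_x
    # Tested positives in each group
    pos_x = tested_x * p_inf_given_x
    pos_not_x = tested_not_x * p_inf_given_not_x
    p_inf_tested = (pos_x + pos_not_x) / (tested_x + tested_not_x)
    p_x_given_pos = pos_x / (pos_x + pos_not_x)
    return p_inf_tested, p_x_given_pos


p_x, p_inf = 0.20, 0.50
p_inf_tested, p_x_given_pos = observed_among_tested(p_x, 0.5, 0.5, test_ratio=2)

# Naive Bayes step: plug the tested-positive prevalence in for P(X | infected)
naive = p_x_given_pos * p_inf / p_x

print(f"P(infected | tested)      = {p_inf_tested:.1%}")
print(f"P(X | infected, tested)   = {p_x_given_pos:.1%}")
print(f"naive P(infected | X)     = {naive:.1%}")
```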

## Thought experiment 3️⃣: Undersampling scenario

🧩 20% of the population has disease X

🦠 50% have SARS-CoV-2 infection

❌ There is no relationship between disease X and SARS-CoV-2

🧪 People with disease X are **1/2** as likely to get tested as people without disease X

Why is thought experiment 3️⃣ bad?

It looks like:

🧪 50% have SARS-CoV-2 infection among those tested

🧪 Of those who tested positive for SARS-CoV-2, the prevalence of disease X is 11%

❌ If we plug in what we see (P(🧩|🦠, 🧪) for P(🧩|🦠)), it looks like P(🦠|🧩) is 27.5%, when in reality it is 50%.
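Another way to see this is to count heads in a pretend population. The sketch below assumes 10,000 people and a 10% testing rate for people without disease X; both are invented for illustration, and only the 1/2 testing ratio, the 20%, and the 50% come from the scenario. Unrounded, the naive estimate is about 27.8%; the 27.5% above comes from rounding the prevalence to 11% first.

```python
# Hypothetical population size; only the percentages come from the scenario
population = 10_000
n_x = population * 20 // 100      # people with disease X
n_not_x = population - n_x        # people without disease X

# Assumed testing rates (in percent): only their 1/2 ratio matters
test_pct_not_x = 10
test_pct_x = test_pct_not_x // 2  # half as likely to be tested

tested_x = n_x * test_pct_x // 100
tested_not_x = n_not_x * test_pct_not_x // 100

# No relationship between disease X and infection: 50% infected in both groups
pos_x = tested_x // 2
pos_not_x = tested_not_x // 2

p_inf_tested = (pos_x + pos_not_x) / (tested_x + tested_not_x)
p_x_given_pos = pos_x / (pos_x + pos_not_x)

# Naive Bayes step with P(infected) = 50% and P(X) = 20%
naive = p_x_given_pos * 0.50 / 0.20

print(f"tested: {tested_x} with X, {tested_not_x} without")
print(f"P(infected | tested)    = {p_inf_tested:.1%}")
print(f"P(X | infected, tested) = {p_x_given_pos:.1%}")
print(f"naive P(infected | X)   = {naive:.1%}")
```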

## Thought experiment 4️⃣: Two problems scenario

🧩 20% of the population has disease X

🦠 56% have SARS-CoV-2 infection

✅ People with disease X are 1.6 times as likely to have SARS-CoV-2 infection as people without it: P(🦠|🧩) = 80%, versus 50% for people without disease X

🧪 People with disease X are **5x** as likely to get tested as people without disease X

Why is thought experiment 4️⃣ bad?

It looks like:

🦠🧪 66% have SARS-CoV-2 infection among those tested

🧩🧪 Of those who tested positive for SARS-CoV-2, the prevalence of disease X is 66%

❌ We’re getting both the prevalence of SARS-CoV-2 **and** its association with Disease X wrong
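To put numbers on both errors at once, here is a small sketch using exact fractions so nothing gets lost to rounding. It assumes people without disease X have a 50% chance of infection, which is what makes the overall 56% work out; the variable names are illustrative.

```python
from fractions import Fraction as F

# Made-up numbers from thought experiment 4
p_x = F(1, 5)                # 20% have disease X
p_inf_given_x = F(4, 5)      # 80% infected among those with X
p_inf_given_not_x = F(1, 2)  # assumed: 50% infected among those without X
test_ratio = 5               # people with X are 5x as likely to be tested

# What is true in the whole population
p_inf_true = p_inf_given_x * p_x + p_inf_given_not_x * (1 - p_x)
p_x_given_inf_true = p_inf_given_x * p_x / p_inf_true

# What we see among the tested (perfect test assumed)
tested_x = p_x * test_ratio
tested_not_x = 1 - p_x
pos_x = tested_x * p_inf_given_x
pos_not_x = tested_not_x * p_inf_given_not_x
p_inf_seen = (pos_x + pos_not_x) / (tested_x + tested_not_x)
p_x_given_inf_seen = pos_x / (pos_x + pos_not_x)

print(f"P(infected):     true {float(p_inf_true):.1%}, seen {float(p_inf_seen):.1%}")
print(f"P(X | infected): true {float(p_x_given_inf_true):.1%}, "
      f"seen {float(p_x_given_inf_seen):.1%}")
```

Both observed quantities come out at two thirds, while the true values are 56% and roughly 29%.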

## How can we fix this?

OKAY, scenarios finished, so hopefully this highlights why we can’t take the prevalence of characteristics in the **tested positive** population as the prevalence of characteristics in the overall population with a SARS-CoV-2 infection. Now, here are tips for how we can correct the numbers.

Scenario 2️⃣: Oversampling by 2x

👉 take those with disease X that tested positive for SARS-CoV-2 and downweight them by a factor of 2.

✅ for every 1 person with Disease X who tested positive there are 2 without, so the adjusted prevalence of Disease X among those that tested positive for SARS-CoV-2 is 0.5 / (0.5 + 2) = 0.5 / 2.5 = 0.2 (20%)

P(🦠|🧩) = 50%

Scenario 3️⃣: Undersampling by 1/2

👉 take those with disease X that tested positive for SARS-CoV-2 and upweight them by a factor of 2.

✅ for every 1 person with Disease X who tested positive there are 8 without, so the adjusted prevalence of Disease X among those that tested positive for SARS-CoV-2 is 2 / (2 + 8) = 2 / 10 = 0.2 (20%)

P(🦠|🧩) = 50%
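The up- and downweighting for scenarios 2️⃣ and 3️⃣ can be sketched as one small function. This assumes we somehow know the testing ratio, which in practice is the hard part; `adjusted_prevalence` and `test_ratio` are hypothetical names used only here.

```python
def adjusted_prevalence(pos_x, pos_not_x, test_ratio):
    """Prevalence of disease X among positives after undoing unequal testing.

    Each positive person with disease X is given weight 1 / test_ratio,
    where test_ratio is how many times as likely they were to be tested.
    """
    weighted_x = pos_x / test_ratio
    return weighted_x / (weighted_x + pos_not_x)


# Scenario 2: 1 positive with X for every 2 without; X tested 2x as often
s2 = adjusted_prevalence(pos_x=1, pos_not_x=2, test_ratio=2)
# Scenario 3: 1 positive with X for every 8 without; X tested 1/2 as often
s3 = adjusted_prevalence(pos_x=1, pos_not_x=8, test_ratio=0.5)

for label, prev in [("Scenario 2", s2), ("Scenario 3", s3)]:
    # Bayes step with P(infected) = 50% and P(X) = 20%
    p_inf_given_x = prev * 0.50 / 0.20
    print(f"{label}: adjusted P(X | infected) = {prev:.0%}, "
          f"P(infected | X) = {p_inf_given_x:.0%}")
```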

Scenario 4️⃣: Two problems

For the prevalence of SARS-CoV-2 infections, correct by reweighting to account for the probability of being tested in each subgroup. Here:

🧩 = disease X

❌🧩 = No disease X

P(🦠) = P(🦠 | 🧩) P(🧩) + P(🦠 | ❌🧩) P(❌🧩)

✅ P(🦠) = ⅘ * 0.2 + ½ * 0.8 = 56%

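Said another way, for calculating the overall prevalence of SARS-CoV-2, this is like downweighting the oversampled Disease X people (divide by 5). Picture 9 tested people: 5 with Disease X (4 infected, 1 not) and 4 without (2 infected, 2 not). Dividing the Disease X counts by 5 gives: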

✅ P(🦠) = (⅘ + 2) / (⅘ + 2 + ⅕ + 2) = 0.56

For calculating the prevalence of disease X among those with SARS-CoV-2 infections:

✅ P(🧩 | 🦠) = P(🦠 | 🧩) P(🧩) / P(🦠) = ⅘ * 0.2 / 0.56 ≈ 0.286

Again, downweight the oversampled Disease X population (divide by 5).

✅ P(🧩 | 🦠) = ⅘ / (⅘ + 2) ≈ 0.286

P(🦠 | 🧩) = 80%
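Here is the scenario 4️⃣ correction as a small weighted table in Python. The relative counts (4, 1, 2, 2) are worked out from the scenario’s made-up numbers: 5 tested people with Disease X for every 4 without, 80% and 50% infected respectively. The dictionary layout and names are just one illustrative way to set this up.

```python
# Relative counts among the tested in scenario 4 (derived, not real data)
tested = {
    ("x", "infected"): 4,
    ("x", "not infected"): 1,
    ("no x", "infected"): 2,
    ("no x", "not infected"): 2,
}

# Undo the 5x oversampling of people with disease X
weights = {"x": 1 / 5, "no x": 1.0}
weighted = {key: n * weights[key[0]] for key, n in tested.items()}
total = sum(weighted.values())

infected = sum(w for (_, status), w in weighted.items() if status == "infected")
with_x = sum(w for (group, _), w in weighted.items() if group == "x")

p_inf = infected / total                            # overall prevalence
p_x = with_x / total                                # recovers P(disease X)
p_x_given_inf = weighted[("x", "infected")] / infected
p_inf_given_x = p_x_given_inf * p_inf / p_x         # Bayes' theorem

print(f"P(infected)     = {p_inf:.1%}")
print(f"P(X | infected) = {p_x_given_inf:.1%}")
print(f"P(infected | X) = {p_inf_given_x:.1%}")
```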

Hopefully this is somewhat helpful when reading about characteristics of those who are currently testing positive for SARS-CoV-2. As always, please let me know if there is something I’ve missed! 🙏