
Extending the analogy: The boy who cried wolf was p-hacking!

During my postdoc with Jeff Leek, we worked on a few p-value, study design, and p-hacking “explainers”. Two of these were incorporated into TED-Ed cartoons (The totally ironically named (NOT BY ME) This one weird trick will help you spot clickbait and the less ironic Can you spot the problem with these headlines?), but the analogy written about here was never used, so here it is!
Author

Lucy D’Agostino McGowan

Published

August 24, 2019


r tufte::margin_note("In math-speak this can be written as $p = P(X \\ge x | H_0)$, the probability of seeing something as extreme or more extreme than what was observed, $x$, given the null hypothesis, $H_0$, is true. Here, we're assuming more extreme events are BIG, hence $\\ge$, this is known as a \"right tail\" event. Similarly, we could look to detect extremely small events, \"left tail\" events with $p = P(X \\le x | H_0)$ . Most often, we want to detect extremes in either direction, so a \"two tailed\" event. This is written as $p = 2 \\min \\{P(X \\le x | H_0), P(X \\ge x | H_0)\\}$ BUT I DIGRESS!")

What is a p-value

Let’s start off by discussing what a p-value is. Scientists often want to test whether a hypothesis is true. Many times this hypothesis testing is set up such that there is a “null” hypothesis, under which there is no relationship between the cause and effect being studied, and an “alternative” hypothesis, under which there is. A p-value is one way scientists assess the significance of their results. It is a number between 0 and 1 that gives the probability of observing a result as extreme or more extreme than the one you are seeing, given that the null hypothesis is true. Smaller p-values (often p-values less than 0.05) indicate evidence against the null hypothesis.
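To make this concrete, here is a quick sketch in Python (rather than the R used in the margin notes) of the “two tailed” p-value from the margin note above, $p = 2 \min \{P(X \le x | H_0), P(X \ge x | H_0)\}$, for a test statistic that is standard normal under the null. The 1.96 input is just an illustrative value.

```python
# A minimal sketch: a two-tailed p-value for a z-statistic,
# using only the standard library (math.erf gives the normal CDF).
import math

def normal_cdf(x):
    """P(Z <= x) for a standard normal Z."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def two_tailed_p(z):
    """p = 2 * min(P(Z <= z), P(Z >= z)) -- the two-tailed event."""
    return 2 * min(normal_cdf(z), 1 - normal_cdf(z))

print(round(two_tailed_p(1.96), 3))  # ~0.05, the classic cutoff
```

A z-statistic of 1.96 lands right at the conventional 0.05 threshold, which is why that number shows up so often in introductory statistics.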

What is p-hacking

There once was a village on a small hillside. Here, the villagers all took turns watching the sheep. A young boy named Peter was finally old enough to take a turn, so one night he went out and began his watch. Peter was a restless boy, and easily bored, so within a short time he decided he needed to create some excitement. “Wooooolf,” he cried, “WOOOLF!” The villagers all came running to his aid to help save the sheep, but when they arrived, there was no wolf to be found. Peter created what is known as a false positive. The villagers thought there was a wolf, when in fact there was not. This can happen in scientific studies when a scientist rejects a null hypothesis that is actually true. r tufte::margin_note(glue::glue("$H_0:$ {emo::ji('no_good_woman')} {emo::ji('wolf')}, $H_A:$ {emo::ji('wolf')}")) Here the “null hypothesis” is that there is not a wolf, and the alternative is that a wolf is present. r tufte::margin_note(glue::glue("In math-speak, the villagers first committed a Type I error, by thinking there WAS a {emo::ji('wolf')}, when in fact there was NOT.")) As you may know from the famous Aesop’s Fable, Peter does this again and again, causing the villagers to eventually grow tired of the constant false positives. r tufte::margin_note(glue::glue("This subsequent error is known as a Type II error, thinking there WAS NOT a {emo::ji('wolf')}, when in fact there WAS {emo::ji('scream')}")) When one day there really is a wolf, Peter cries “Woolf, WOOOLF,” and the villagers do nothing, as they expect that it is another false positive. This time, the villagers commit a false negative error. In a previously unknown version of this tale, the villagers set up a deal with Peter to lay out some ground rules. r tufte::margin_note("More math-speak, this 5% is known as $\\alpha$") Peter could only watch the sheep once a month and he could only cry wolf (a false positive) 5% of the time.
This is similar to how scientists try to control the chance of getting a false positive by setting up their statistical tests to have a fixed chance of rejecting the null hypothesis when it is true. These statistical tests often assume you are testing just one hypothesis. The first night this agreement was in place, if Peter cried “WOOLF!” it would be possible, but unlikely, that it was a false positive. Because Peter was so bored, however, he soon found ways to get around these rules. Peter didn’t tell the villagers which night he was watching the sheep, so he actually watched them every night. Although he still abided by the 5% rule, since he was going out every night the chance that there would be at least one false positive increased. Problems like this arise when scientists test many hypotheses. For example, if they test many hypotheses, all of which are truly null, each with a 5% chance of producing a false positive result, they will end up with an overall chance of getting a false positive across all tests much higher than 5%. In fact, they can test so many hypotheses that eventually a false positive error is all but inevitable! This is essentially p-hacking – testing many hypotheses until you see one that is significant.
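The arithmetic behind this inflation is simple enough to sketch in a few lines of Python (the numbers of tests below are just illustrative choices): if every test is run at a 5% level and all the null hypotheses are true, the chance of at least one false positive is $1 - (1 - 0.05)^n$ for $n$ independent tests.

```python
# A minimal sketch: how the chance of at least one false positive
# grows with the number of independent tests, each run at alpha = 0.05,
# when every null hypothesis is actually true.
alpha = 0.05

for n_tests in (1, 5, 20, 100):
    # P(at least one false positive) = 1 - P(no false positives at all)
    fwer = 1 - (1 - alpha) ** n_tests
    print(f"{n_tests:>3} tests -> {fwer:.0%} chance of a false positive")
```

With 20 tests the chance is already about 64%, and with 100 it is over 99% – Peter crying wolf every night instead of once a month.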

What are scientists doing about it

There are a couple of ways scientists are combating this potential issue. Instead of testing many things and only reporting the significant finding, scientists can use statistical techniques to account for all of the hypotheses they have tested. To keep scientists on track, there has also been a movement to pre-register study hypotheses. This would be like Peter agreeing to tell the villagers in advance which night he would be watching the sheep, to ensure that he wasn’t actually watching them every night. In science, this is a way to pre-specify the hypothesis that will be tested, preventing scientists from looking at the data many different ways until they find something to report. An additional way to combat p-hacking is to check whether multiple studies have shown the same result – in other words, is the result reproducible? In Peter’s case, this would be akin to another villager watching the sheep with him and both of them claiming to have seen a wolf. If we see the same result replicated elsewhere, we may be more likely to believe that we aren’t just seeing it by chance. The open science movement has been instrumental in helping with this cause, encouraging data and analysis steps to be made publicly available, eliminating the opportunity to surreptitiously p-hack. A final way to prevent p-hacking is to find ways to uplift scientists and incentivize them to conduct research without the pressure to come up with significant results. There are many think pieces and organizations working to achieve this mission. With Peter, perhaps we find a way for him to curb his boredom – maybe we buy him a Kindle so he can read instead of tormenting his fellow villagers.
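One of the best-known techniques for accounting for multiple tested hypotheses is the Bonferroni correction: divide the 5% threshold by the number of tests. Here is a small Python sketch; the p-values in the list are made up purely for illustration.

```python
# A minimal sketch of the Bonferroni correction: to keep the overall
# chance of a false positive near alpha, compare each p-value against
# alpha divided by the number of tests performed.
alpha = 0.05
p_values = [0.001, 0.020, 0.049]  # hypothetical results from 3 tests

threshold = alpha / len(p_values)  # 0.05 / 3, roughly 0.0167
for p in p_values:
    verdict = "significant" if p < threshold else "not significant"
    print(f"p = {p:.3f}: {verdict} at the corrected threshold {threshold:.4f}")
```

Note that 0.049 would have counted as “significant” against the naive 0.05 cutoff, but not once we account for the fact that three hypotheses were tested.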