The Replication Crisis: the Six P’s

In a clever bit of rhetoric, Professor Dorothy Bishop came up with “the four horsemen of irreproducibility”: publication bias, low statistical power, p-hacking, and HARKing. In an attempt at more complete coverage of the causes of the replication crisis, here I’m expanding on Dorothy’s four horsemen by adding two more causes and using different wording. This gives me six P’s of the replication crisis! Not super-catchy, but I think this is useful.

1. For me, P-hacking was always the first thing that came to mind as a reason that many published results don’t replicate. Ideally, when there is nothing to be found in a comparison (such as no real difference between two groups), the p < .05 criterion used in many sciences means that only 5% of studies will yield a false positive result. However, researchers hoping for a result will try all sorts of analyses to get the p-value below .05, partly because that makes the result much easier to publish. This is p-hacking, and it can greatly elevate the rate of false positives in the literature, as the simulation sketch below illustrates.

Substantial proportions of psychologists, criminologists, applied linguists and other sorts of researchers admit to p-hacking. Nevertheless, p-hacking may be responsible for only a minority of the failures to successfully replicate previous results. Three of the other p’s below also contribute to the rate of false positives, and while researchers have tried, it’s very hard to sort out their relative importance.
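To make that inflation concrete, here is a minimal simulation sketch. It is my own illustration in Python (using numpy and scipy), not something from the studies cited above, and the particular p-hacking move it models (testing five outcome measures and reporting whichever one “works”), like the sample size, is an arbitrary choice.

```python
# Illustrative sketch: how trying several analyses on null data
# inflates the false-positive rate beyond the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies = 5000
n_per_group = 30
n_outcomes = 5          # several outcome measures, each tested separately

false_pos_single = 0
false_pos_hacked = 0
for _ in range(n_studies):
    # No real group difference: both groups drawn from the same distribution.
    group_a = rng.normal(size=(n_outcomes, n_per_group))
    group_b = rng.normal(size=(n_outcomes, n_per_group))
    p_values = [stats.ttest_ind(a, b).pvalue for a, b in zip(group_a, group_b)]
    false_pos_single += p_values[0] < 0.05    # honest: one pre-specified outcome
    false_pos_hacked += min(p_values) < 0.05  # hacked: report whichever "works"

print(f"One pre-specified test: {false_pos_single / n_studies:.1%} false positives")
print(f"Best of {n_outcomes} outcomes:      {false_pos_hacked / n_studies:.1%} false positives")
# Typical output: roughly 5% versus roughly 23%.
```

Cherry-picking among outcomes is only one p-hacking move; optional stopping, dropping “outliers”, and trying different covariates work the same way, and combining them inflates the false-positive rate further still.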

2. Prevarication, which means lying, unfortunately is responsible for some proportion of the positive but false results in the literature. How important is it? Well, that’s very difficult to estimate. Within a psychology laboratory, it is possible to arrange things so that one can measure the rate at which people lie (for example, to win additional money in a study), which helps, but some of the most famous researchers to run such studies have, well, lied about their own findings. And we know that fraudsters work in many research areas, not just dishonesty research. In some areas of human endeavor, regular audits are conducted – but not in science.

3. Publication bias is the tendency of researchers to only publish findings that they find interesting, that were statistically significant, or that confirmed what they expected based on their theoretical perspective. In some fields this has colossally distorted the picture of reality in favor of researchers’ pet theories, and it has produced lots of papers about all sorts of phenomena that may not actually exist. Anecdotally, I have heard about psychology laboratories that used to run a dozen studies every semester and publish only the ones that yielded statistically significant results. Even in areas where researchers are always testing for something that truly exists (are there any such fields?), publication bias results in inflated estimates of the effect’s size.
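As a rough illustration of that last point, here is another sketch of mine (not an analysis of any real field): simulate many small studies of a genuine but modest effect, “publish” only the significant ones, and the published literature overstates the effect. The true effect size, sample size, and number of studies are arbitrary assumptions.

```python
# Illustrative sketch: publishing only significant results inflates
# effect-size estimates even when the underlying effect is real.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d = 0.3            # assumed true standardized effect (Cohen's d)
n_per_group = 25        # assumed small sample per group
n_studies = 5000

published, all_estimates = [], []
for _ in range(n_studies):
    a = rng.normal(loc=true_d, size=n_per_group)   # "treatment" group
    b = rng.normal(loc=0.0, size=n_per_group)      # "control" group
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d_hat = (a.mean() - b.mean()) / pooled_sd      # estimated effect size
    all_estimates.append(d_hat)
    if stats.ttest_ind(a, b).pvalue < 0.05:        # the "publishable" studies
        published.append(d_hat)

print(f"True effect size:              {true_d:.2f}")
print(f"Mean estimate, all studies:    {np.mean(all_estimates):.2f}")
print(f"Mean estimate, published only: {np.mean(published):.2f}")
# The published-only mean typically comes out at more than double the truth.
```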

4. Low statistical power. Most studies in psychology and neuroscience are underpowered, so even if the hypotheses being investigated are true, the chance that any particular study will yield statistically significant evidence for those hypotheses is small. Thus, researchers are used to studies not working, but to get a publication, they know they need a statistically significant result. This can drive them toward publication bias, as well as p-hacking. It also means that attempts to replicate published results often don’t yield a significant result even when the original result is real, making it difficult to resolve the uncertainty about what is real and what is not.
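To put a rough number on this, here is a sketch of mine; the “typical” effect size (d = 0.4) and sample size (20 per group) are assumptions for illustration, not estimates from any survey of the literature.

```python
# Illustrative sketch: the long-run "hit rate" (statistical power) of a
# two-group study when the effect is real but modest and the sample is small.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_d = 0.4            # assumed true effect size (Cohen's d)
n_per_group = 20        # assumed sample size per group
n_studies = 10000

hits = 0
for _ in range(n_studies):
    a = rng.normal(loc=true_d, size=n_per_group)
    b = rng.normal(loc=0.0, size=n_per_group)
    hits += stats.ttest_ind(a, b).pvalue < 0.05

print(f"Estimated power: {hits / n_studies:.0%}")
# Around 23%: even though the effect is real, roughly three out of four
# such studies (original or replication) will come up "non-significant".
```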

5. A particularly perverse practice that has developed in many sciences is pretending you predicted the results in advance. Also known as HARKing (Hypothesizing After the Results are Known), this gives readers much higher confidence in published phenomena and theories than they deserve. Infamously, the psychologist Daryl Bem gave students and fellow researchers the following advice:

There are two possible articles you can write: (1) the article you planned to write when you designed your study or (2) the article that makes the most sense now that you have seen the results. They are rarely the same, and the correct answer is (2).

If one follows this advice, with every study the goalpost is moved to match the interesting aspects of the data, even though pure chance is often the only cause of those interesting findings. It’s practices like this, together with publication bias and p-hacking, that are believed to be responsible for Bem’s apparent discovery that ESP is real, which he published in a prestigious social psychology journal.
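To see why post hoc “predictions” are so untrustworthy, here is one more sketch of mine (the numbers of participants and variables are arbitrary): in a batch of purely random variables, some pair will almost always look strikingly related, and HARKing amounts to writing the article around whichever pair that happens to be.

```python
# Illustrative sketch: with enough measured variables, pure noise
# always offers an "interesting" finding to hypothesize about after the fact.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_participants = 40
n_variables = 20                     # questionnaire items, demographics, etc.
data = rng.normal(size=(n_participants, n_variables))   # pure noise

best_p, best_pair, best_r = 1.0, None, 0.0
for i in range(n_variables):
    for j in range(i + 1, n_variables):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        if p < best_p:
            best_p, best_pair, best_r = p, (i, j), r

print(f"Most 'interesting' pair: {best_pair}, r = {best_r:.2f}, p = {best_p:.4f}")
# Often r is around 0.4 or more with p well below .01: a publishable-looking
# "effect" manufactured entirely by chance.
```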

6. Even when a scientific result reflects a true phenomenon rather than being spurious, it can be difficult for subsequent researchers to replicate that result. We already ran into this above with the fact that most published studies have low statistical power. Another factor is poor reporting practices (yes, I’m counting this as another ‘p’!). In their papers, researchers often do not describe their study in enough detail for other researchers to be able to duplicate what was done. For example, the Reproducibility Project: Cancer Biology initially aimed to replicate 193 experiments, but none of the experiments were described in sufficient detail in the original papers to enable the researchers to design protocols to repeat them, and for 32% of the associated papers, the authors never responded to inquiries or declined to share reagents, code, or data.

The six P’s don’t exhaust the reasons for poor reproducibility. Simple errors, for example, are another cause, and such errors are surely committed both by original researchers and by replicating researchers (although replication studies seem to be held to a higher standard by journal editors and reviewers than are original studies).

Many steps have been suggested to improve the dire situation that the six P’s (and more) have led to. At the places most relevant to science, however, such as journals and universities, these measures are often ignored or adopted only grudgingly, so there remains a long way to go.