What Reproducibility Crisis?

It’s not a great term, the “reproducibility crisis”.

Most researchers don’t think we are actually in a crisis, but I thought that by now practically everyone across the sciences had at least heard of it. In the last few years the reproducibility issue has been covered several times by Nature News, by major newspapers such as the New York Times, and by countless websites, often with truly crisis-level headlines. I first blogged about it in 2012, and it wasn’t new then.

The heart of the issue is that large-scale replication attempts usually fail to reproduce the findings of the original studies. Or at least, they fail to yield the statistically significant finding that the original study did, suggesting that the original study was underpowered and its authors got lucky, that the result was spurious, or that the effect of interest is not robust to the methodological variation involved in replication attempts. Across preclinical cancer research, economics, and experimental psychology, the results have been similarly depressing. As an editor I’ve been involved in two attempts to replicate individual studies, and both yielded null results (one on ego depletion, and one on how grammatical aspect affects judgments about a criminal).

There are certainly people (Harvard professors, in fact) who still say that “the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%”. But if you had asked me how many psychology academics believe that it is a significant problem, I would have said at least half. Experimental psychology in particular has seen a raft of large-scale replication attempts and very public failures to replicate. Before the Reproducibility Project that attempted to replicate 100 studies, there was Many Labs 1. Now Many Labs 2 has finished data collection and Many Labs 3 is in progress. The conversations around replications have reached near-meme levels of rhetoric. If people in any area should have heard of the reproducibility crisis by now, it’s psychology researchers.

I find out about a lot of replication attempts on twitter, with reproducibility news showing up on my feed on a near-daily basis. I was wondering whether the reproducibility crisis is much of a thing to your average psychology academic. Note: your average psychology academic is not on twitter.

In a seemingly self-defeating effort, I set up a poll on twitter of people not on twitter:

I was expecting about five responses. Maybe ten. But I got fifty-eight!

[Screenshot of the twitter poll results]

Let’s acknowledge that this is an unscientific sample with a lot of selection bias. Who knows how these 58 people got their datum? (I say “datum”, not data, because you can only vote once per twitter account).

We have 26% of people who have never heard of the reproducibility crisis, 40% who are skeptical that it’s a problem, and only 34% who think it’s a major problem.

It could be that people saw this as an entertaining opportunity to troll me, but I doubt it; I suspect most respondents actually had a real-world collegial interaction as a result of this tweet. They may have avoided colleagues they’d previously spoken to about reproducibility, and skipped the nose-to-the-grindstone types in favour of a colleague with an open office door. That would be a reasonably unbiased way to sample, but it’s also possible they went straight to the professor they know doesn’t keep up with the times.

With biases in mind, let’s consider the numbers. We’ve got 66%, 38 people, who have either never heard of the crisis or are skeptical that it’s a problem. The 95% confidence interval on that figure (adjusted Wald method) runs from 79% down to 55%. That’s a sobering lower limit.
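
For anyone who wants to check or reuse this, below is a minimal sketch of one common adjusted Wald (Agresti-Coull) calculation in Python; the exact bounds you get depend on which variant of the adjustment and what rounding is used.

```python
# A minimal sketch of an adjusted Wald (Agresti-Coull) interval for the
# 38-out-of-58 "never heard of it or skeptical" figure. Exact bounds depend
# on the variant of the adjustment used.
import math

def adjusted_wald(successes, n, z=1.96):
    # Shrink the observed proportion towards 0.5 by adding z^2/2 successes
    # and z^2 trials, then apply the ordinary Wald formula.
    p = (successes + z**2 / 2) / (n + z**2)
    half_width = z * math.sqrt(p * (1 - p) / (n + z**2))
    return p - half_width, p + half_width

print(adjusted_wald(38, 58))
```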

These people are out there, in significant numbers. We ought to keep this in mind when communicating with people about the latest replication failure or the latest call for publishing reform, be it greater disclosure in methods sections or preregistration. Many people will continue to see these things as burdensome solutions for a problem that may not exist. Myself, I like to think of preregistration as forcing people to keep their grubby little p-hacking hands from contaminating what could otherwise be a truth-revealing, beautiful bit of science.

The positive side is that a very large number of people have become convinced of the value of preregistration in a very short span of time. Already a new experimental psychology journal that publishes only preregistered studies has been created, and at many journals a new preregistered article type has popped up. Put in perspective, progress has indeed been quite rapid. Consider the molasses-like slog of the open access movement. It took decades of proselytising, explainers, and news for everyone to have a rough idea of what open access is. And still today you find people assuming that the only existing or viable route is author-pays (even though there are thousands of open-access journals that charge authors nothing). Open access is a complex issue, and so are reproducibility issues. We should expect it to take a long time to get very far with either.

 

tweets from ResBaz Sydney

problems with “controlling for” variables: quick notes

In science, we frequently see a comparison between two groups of people that differ on multiple demographic variables, say age, IQ, and income, on some dependent measure, say body mass index (BMI).

Results are often reported as “the groups had substantially different BMIs, after controlling for X and Y” using ANCOVA or multiple regression. We are given the impression that this analysis shows that the groups would have different BMIs even if they had the same levels of X and Y.

Is this conclusion justified? Maybe. Extensive thought and scrutiny of the data would be required to determine whether this is a reasonable inference. I was thinking of discussing this a bit in my undergraduate teaching, so I asked about it on twitter. A bunch of people both provided helpful responses and asked me to report back.

An obvious problem is that the two groups may differ on other things besides X and Y, many of which may not even have been measured. So the difference between the groups may be entirely attributable to those confounds. This post is about some less obvious problems. Here are some quick snippets from what people pointed me to.

First, from Miller & Chapman (2001), below. Thanks to @BrandesJanina for pointing me to this paper.

consider a data set in which two groups are older men and younger women, and gender is of interest as an independent variable, Grp. Using age as a covariate does indeed remove age variance. The problem is that, because age and gender are correlated in this data set, removing variance associated with Cov will also remove some (shared) variance due to Grp. Within this data set, there is no way to determine what values of DV men younger than those tested or women older than those tested would have provided. Far from “controlling for” age, the ANCOVA will systematically distort the gender variable. As in our presentation of Lord’s Paradox above, Grpres will not be a valid measure of the construct of gender….

Consider a data set consisting of children’s age, height, and weight. If we conduct an ANCOVA in which height is the covariate, age is the grouping variable, and weight is the dependent variable, we are attempting to ask whether younger and older children would differ in weight if they did not happen to differ in height. If the groups indeed do not differ on the covariate, this question can be asked. But if there is something about the construct of age in childhood that inherently involves differences in height, the question makes no sense, because then age with height partialed out would no longer be age. There is no way to “equate” older and younger children on height, because growth is an inherent (not chance or noise) differentiation of the two groups….

Cohen and Cohen (1983) provided the following extreme example: “Consider the fact that the difference in mean height between the mountains of the Himalayan and Catskill ranges, adjusting for differences in atmospheric pressure, is zero!” (p. 425), the point being that one has not in any sense “equated” the two mountain ranges by using atmospheric pressure as a covariate.

[Screenshot from Miller & Chapman (2001)]

-Miller & Chapman (2001)

Let’s go back to my opening example of a BMI difference between two groups, after “controlling for” variables statistically. What if one of those variables controlled for was age? Well, if the two groups were people who exercise and people who don’t, there is very likely variance shared by age and level of exercise, and age likely has a causal influence on exercise (by various routes), so the meaning of the exercise factor is unclear after age has been removed.
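
To make the shared-variance point concrete, here is a toy simulation (my own construction with invented numbers, not anything from Miller & Chapman or from a real study): age influences both who exercises and BMI, so the raw and “age-adjusted” group differences diverge, and the adjusted number is only trustworthy here because the toy model is exactly linear and age is measured without error.

```python
# A toy illustration (my own construction, invented numbers) of the
# shared-variance problem: age influences both who exercises and BMI, so the
# raw and "age-adjusted" exercise-group differences diverge.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300

age = rng.uniform(20, 70, n)
# In this toy world, older people are less likely to exercise.
exercises = (rng.uniform(0, 1, n) < 1 - (age - 20) / 60).astype(float)
# BMI rises with age and is 2 units lower for exercisers.
bmi = 22 + 0.10 * age - 2.0 * exercises + rng.normal(0, 2, n)

raw_diff = bmi[exercises == 1].mean() - bmi[exercises == 0].mean()
adjusted = sm.OLS(bmi, sm.add_constant(np.column_stack([exercises, age]))).fit()

print(raw_diff)            # raw gap mixes the exercise effect with the age gap
print(adjusted.params[1])  # "age-adjusted" gap, close to the simulated -2.0 here,
                           # but only because the model is exactly linear and
                           # age is measured without error
```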

The problem of measurement (un)reliability

From Westfall & Yarkoni (submitted):

Suppose we are given city statistics covering a four-month summer period, and observe that swimming pool deaths tend to increase on days when more ice cream is sold. As astute analysts, we immediately identify average daily temperature as a confound: on hotter days, people are more likely to both buy ice cream and visit swimming pools. Using multiple regression, we can statistically control for this confound, thereby eliminating the direct relationship between ice cream sales and swimming pool deaths.

Now consider the following twist. Rather than directly observing recorded daily temperatures, suppose we obtain self-reported Likert ratings of subjectively perceived heat levels. A simulated batch of 120 such observations is illustrated in Figure 1, with the reliability of the subjective heat ratings set to 0.40—a fairly typical level of reliability for a single item in psychology. Figure 2 illustrates what happens when the error-laden subjective heat ratings are used in place of the more precisely recorded daily temperatures. The simple relationship between ice cream sales and swimming pool deaths (Fig. 2A) is positive and substantial, r(118) = .49, p < .001. When controlling for the subjective heat ratings (Fig. 2B), the partial correlation between ice cream sales and swimming pool deaths is smaller, but remains positive and statistically significant, r(118) = .33, p < .001. Is the conclusion warranted that ice cream sales are a useful predictor of swimming pool deaths, over and above daily temperature? Obviously not. The problem is that subjective heat ratings are a noisy proxy for physical temperature, so controlling for the former does not equate observations on the latter. If we explicitly control for recorded daily temperatures (Fig. 2C), the spurious relationship is eliminated, as we would intuitively expect, r(118) = -.02, p = .81.

Given that most psychological measurements have considerable unreliability (a lack of perfect correlation with the construct they are trying to get at), the problem is very general. And it can lead both to spurious conclusions that a relationship exists and to spurious conclusions that one does not.
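
Here is a small simulation in the spirit of Westfall and Yarkoni’s ice cream example (my own quick sketch, not their code or their exact parameter values): partialling out a noisy proxy of the confound leaves a clearly positive relationship, while partialling out the confound itself removes it.

```python
# A toy re-creation (my own simulation, not Westfall & Yarkoni's code or
# parameter values) of the ice cream example: controlling for a noisy proxy of
# the confound does not remove the spurious relationship, but controlling for
# the confound itself does.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 120

temperature = rng.normal(25, 5, n)              # the true confound
ice_cream = temperature + rng.normal(0, 3, n)   # sales driven by temperature
deaths = temperature + rng.normal(0, 3, n)      # deaths also driven by temperature

# A heat rating whose reliability is about 0.4 (true variance / total variance)
ratings = temperature + rng.normal(0, 5 * np.sqrt(1 / 0.4 - 1), n)

def partial_r(x, y, z):
    """Correlation of x and y after regressing each on z."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return stats.pearsonr(rx, ry)

print(stats.pearsonr(ice_cream, deaths))          # substantial positive correlation
print(partial_r(ice_cream, deaths, ratings))      # still clearly positive
print(partial_r(ice_cream, deaths, temperature))  # near zero
```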

I do not use ANCOVA or GLMs in this way so I may have given a misleading impression with some of what I have written or quoted above. If so, I would love to be corrected.

Bayesian jokes

It’s the end of the year, and I’m indulging myself by posting these Bayesian jokes. The first two were inspired by the #AcademicNoir hashtag.

[embedded tweets with the jokes]

And one outside the noir domain:

[embedded tweet]

When retraction is not enough

A two-experiment study suggesting that “Sadness impairs color perception” was recently retracted from Psychological Science. But some colleagues and I don’t think the retraction goes far enough.

In the retraction notice, the authors suggested that after revising their second experiment to address the problems that they noted with it, they would seek to re-publish their original Experiment 1 with the revised Experiment 2.

But Experiment 1, and the basic methodology behind both experiments, is shoddy: there are more problems than just those mentioned in the retraction notice. Some of these problems are strange anomalies in the data, specific to Thorstenson et al.’s experiments. Other problems, while still serious, are not uncommon in this research area.

When the now-retracted paper first appeared, twitter exploded with criticism, and many documented the study’s problems extensively on their blogs. Five of us got together over email to write a letter to Psychological Science calling for retraction. But before submitting the letter, we contacted the first author, Chris Thorstenson. He eventually told us that he and his colleagues would retract the paper.

When we saw the retraction notice, we noticed that only a few of the problems with the experiments were mentioned. I have been vexed by studies of this ilk for more than three years, and would like to see a general improvement in this research area. So we revised our letter to highlight the additional problems not mentioned by Thorstenson et al. We hope that our revised letter will help Thorstenson et al., plus other researchers in this area, to improve their methods.

 

What just happened with open access at the Journal of Vision?

Vision researchers recently received an email from ARVO, the publisher of Journal of Vision, that begins:

On January 1, 2016, Journal of Vision (JOV) will become open access.

But in the view of most, JoV has been open access since its inception! It’s always been an author-pays, free access journal: all articles are published on its website and can be downloaded by anyone.

But free-to-download is not enough for open access, not according to the definition of open access formulated in Budapest in 2001. Open access means (according to this definition) the right not only to download but also to

distribute, … pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers

But JoV, which has always held the copyright of the articles it publishes, says that “All companies, commercial and nonprofit, should contact ARVO directly for permission to reprint articles or parts thereof”.

Starting in 2016, such permission won’t be required.

However, paying your $1,850 for standard publication in JoV in 2016 will not get you everything. The updated Budapest declaration recommends that journals use the CC BY license. But JoV’s publisher has instead chosen the license CC BY-NC-ND, meaning that articles cannot be used commercially (“non-commercial”) and that you can’t distribute bits of the article (“no derivatives”). Yet increasingly today, parts of science involve mining and remixing previously published data and content, which the ND clause of the license prohibits (unless you get special permission). Education and journalism require re-use of bits too; think about how many textbooks and articles on the web show just one figure (a “derivative”) or illustration from a scientific paper.

And while the non-commercial (NC) clause might sound rather harmless for spreading knowledge, it is sometimes unclear what non-commercial really covers. It may prevent a university, especially a private one, from distributing the article as part of course content that a student pays for (via their tuition).

For these reasons, CC BY is the way we should be going, which is why UK funders like the Wellcome Trust and RCUK require that researchers receiving their grants publish their articles under CC BY. To accommodate this, JoV, as part of its new policy, will license your article as CC BY, if you pay an additional fee of $500!

What ARVO has done here is only a small step forward for JoV, and unfortunately a rather confusing step. The bigger change has occurred with ARVO’s journal Investigative Ophthalmology & Visual Science (IOVS), which was previously accessible only via a subscription but, starting in 2016, will publish articles under CC BY-NC-ND and CC BY licenses.

As you can see, copyright is complicated. Researchers don’t have time to learn all this stuff. And that means recalcitrant publishers (not ARVO, I mean profiteers like Elsevier) can exploit this to obfuscate, complicate, and shift their policies to slow progress towards full open access.

Thanks to Tom Wallis and Irina Harris.

P.S. I think if ARVO had only been changing JoV‘s policies (rather than also the subscription journal IOVS) they wouldn’t have written “JoV will become open access” in that mass email. But because they did, it raised the issue of the full meaning of the term.

P.P.S. Partly because JoV is so expensive, at ECVP there’ll be a discussion of other avenues for open access publishing, such as PeerJ. Go! (I’ll be stuck in Sydney).

Yellow journalism and Manhattan murders

The headline screams “You’re 45% more likely to be murdered in de Blasio’s Manhattan”.

The evidence? Sixteen people have been killed so far this year in Manhattan, against only eleven over the same period last year.

Does this evidence indicate that you are more likely to be murdered, as the headline says? To find out, I tested whether a constant murder rate could explain the results. The probability of getting murdered over the same period last year was approximately 11 divided by Manhattan’s population, or 11/1,630,000 = 0.0000067, which is about 0.00067%.

Is it likely that with the same murder rate this period this year, one would get a number as high as 16 murders? Yes.

This can be seen by calculating a 95% confidence interval for the expected number of murders, given the observed count of 11. According to three different statistical methods, that interval spans roughly 5 to 20. That is, even with a constant murder rate, due to statistical fluctuations the number of murders over this period could easily have been as low as 5 or as high as 20. It’s just like flipping a coin 10 times: one may get 3 heads the first time and 6 the next, without the chance of a head changing.
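
For readers who want to check that figure, here is one standard way to compute such an interval (the exact Poisson interval via the chi-square distribution); it’s a quick sketch rather than a reproduction of the three methods in the linked code.

```python
# A sketch of an exact (Garwood) 95% confidence interval for a Poisson count,
# applied to the 11 murders observed over the period. Not necessarily one of
# the three methods used in the linked analysis.
from scipy import stats

def poisson_ci(count, alpha=0.05):
    lower = stats.chi2.ppf(alpha / 2, 2 * count) / 2 if count > 0 else 0.0
    upper = stats.chi2.ppf(1 - alpha / 2, 2 * (count + 1)) / 2
    return lower, upper

print(poisson_ci(11))  # roughly 5 to 20
```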

Doing this more properly means comparing the two rates directly.  I did this using three different methods, all of which found no significant difference.

The article also reports that the number of shooting incidents is higher this year, 50 instead of 31. Using the three different statistical methods again, this was (barely) significantly different. So here the journalist has a point. But this should be taken with a big grain of salt. Journalists are always looking for “news”, and if they repeatedly look at how many people have been murdered/shot, eventually they are guaranteed to find an apparent difference, because all possible statistical fluctuations will happen eventually.
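
A quick way to check both comparisons is the conditional (binomial) test for two Poisson counts: if the underlying rates are equal, then given the combined total, the split between the two years should look like a fair coin. This is just one reasonable method, sketched below, not a reproduction of the three methods in the linked code.

```python
# Conditional test comparing two Poisson counts: under equal rates, the count
# in one year, given the combined total, is Binomial(total, 0.5).
from scipy import stats

print(stats.binomtest(16, n=16 + 11, p=0.5).pvalue)  # murders: clearly not significant
print(stats.binomtest(50, n=50 + 31, p=0.5).pvalue)  # shootings: close to the .05 boundary
```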

The statistics and the code are here.

I only did all this and wrote this post because Hal Pashler saw someone tweet the NYPost piece. Hal knew I had previously looked into the statistics of proportions and asked whether the headline was justified. I invite others to disagree with my calculations if they have a better way of doing it. I don’t think different methods will give a very different result, however.