#academicNoir memes

When #AcademicNoir trended on Twitter, I had fun making a few memes about science publishing and the Registered Replication Reports that we started at Perspectives on Psychological Science.


“Who’s shaking you down?” I asked.

“Elsevier,” whispered the librarian.

I showed her the door. I still have to work in this town.


Any one of you go it alone, he’ll say you messed up. But if we first get him to approve the protocol, and then all run the replication together…


Well, Mr. P. Hacker, your luck is about to run out.

Down at the journals, they’re running a new game. They call it ‘preregistration’.


“How about we just delete these two data points?”

She looked shocked.

“Listen- you want to get this published, or not?”

They called him “Big Pharma”. Really shady character. Really knew how to make a data set go missing.

We know you’re in there. We’ve got your lab surrounded.

OK, I’ll come out.

Don’t move! Just email us the data. The *raw* data.

He had a good run, for a while. But then he got an email from . Asking for the raw data. His time was up.

“But you haven’t even seen my numbers yet!” she said.

“Just give Dr. Hacker a little time alone with them,” I told her.

The attentional blink and temporal selection


While our retinas process the world around us continuously with little to no interruption, certain mental processes are very limited. One way to show this is to present someone with a rapid stream of letters, one at a time in the center of the screen, at a rate of perhaps 10 letters per second. Give them the task of reporting one or more letters, with each target letter designated by a circle presented around it.

When only one letter is designated for report, the task is pretty easy. When two letters are designated for report and they are far apart in time (say, a full second apart), performance is also high. But when the second letter appears within several hundred milliseconds of the first, something strange happens. Often one feels one did not see the second ring or letter at all. This is the attentional blink (AB).

For the last twenty years, researchers have investigated which aspects of mental processing are impaired for the second, seemingly unseen target. For many years, the leading theory was that the second target is successfully selected by attention but that later stages are fully occupied with the first target, so that the second target decays before being fully processed.

More recent theories suggest that not only is post-selection target processing impaired, but selection itself is also impaired. After selecting the first target, the attempt to select the second target goes awry. 

The idea that selection is disrupted has had legs, empirically and theoretically. Yet we know surprisingly little about the temporal characteristics of selection even when it succeeds. Vul, Nieuwenstein, & Kanwisher (2008) made a start at remedying this by pointing out that temporal selection could be thought of as having two different aspects – latency and precision.

Latency is the delay in time between the appearance of the target and the stimulus that is ultimately sampled from the scene and reported. Precision is the variability of this latency on different occasions. The attentional blink might affect either, both, or neither of these parameters. To understand how it could be neither, consider that the blink might prevent selection independently of whether or not its time-course is affected.
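To make the two definitions concrete, here is a minimal sketch in Python. It assumes we already have, for each trial, the time offset between the target and the item that was actually reported. The offsets are invented for illustration, and the sketch ignores random guesses, which a real analysis would have to deal with first.

```python
from statistics import mean, stdev

# Hypothetical offsets (ms) between the target's onset and the onset of the
# item that was actually reported, one value per trial. Negative values mean
# the reported item appeared before the target.
offsets_ms = [0, 100, -100, 0, 100, 200, 0, 100]

latency_ms = mean(offsets_ms)     # average delay of the sampled item
precision_ms = stdev(offsets_ms)  # trial-to-trial variability of that delay

print(f"latency = {latency_ms:.1f} ms, precision (SD) = {precision_ms:.1f} ms")
```

Note that a smaller standard deviation means more precise selection, so "precision" here is really a measure of temporal spread.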

Vul, Nieuwenstein, & Kanwisher (2008) had a go at quantifying latency and precision. Their methodology was flawed (all the details are in our in-press manuscript), but their basic idea was sound. We applied it to six different datasets from five different labs, and found a similar pattern of results in each. I’ll tell you some of the highlights.

The basic experimental set-up is to present a rapid stream of letters or objects with two of the stimuli indicated as targets, e.g. by presenting a circle around them. The task in this case is to report the two stimuli that were circled.

Sometimes participants report both stimuli correctly and sometimes they get one or both wrong, and this varies with the amount of time between the targets. Traditionally, such percent-correct measures are where the data analysis ends. But Vul et al. had the bright idea of scrutinising the nature of participants’ incorrect responses. Some errors appear to be completely random, but frequently a participant will report the stimulus a few items before or after the target. These near-misses are conspicuous on the serial position error histogram, below, in which the response on every trial is coded relative to the target. If a participant reports the item two before the target, we add it to the “-2” bar of the histogram. If the item reported occurred three items after the target, it’s coded as “3”, and so on.
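The coding scheme just described can be sketched in a few lines of Python. The streams, target positions and responses are made up for illustration, and the sketch assumes each letter appears only once per stream. It is not the analysis code from the paper.

```python
from collections import Counter

def serial_position_errors(trials):
    """Code each response by its position in the stream relative to the target."""
    errors = []
    for stream, target_index, response in trials:
        if response in stream:  # skip responses that never appeared in the stream
            errors.append(stream.index(response) - target_index)
    return errors

# Hypothetical trials: (stream of letters, index of the target, reported letter)
trials = [
    ("KQZTBMRW", 3, "T"),  # correct report -> 0
    ("KQZTBMRW", 3, "Q"),  # two items before the target -> -2
    ("KQZTBMRW", 3, "R"),  # three items after the target -> 3
    ("HXPLDNGS", 4, "N"),  # one item after the target -> 1
    ("HXPLDNGS", 4, "D"),  # correct report -> 0
    ("HXPLDNGS", 4, "L"),  # one item before the target -> -1
]

histogram = Counter(serial_position_errors(trials))
for error in sorted(histogram):
    print(f"{error:+d} {'#' * histogram[error]}")
```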


The first thing that jumps out is that the error distribution is rather symmetrical, meaning that there are approximately as many items reported that occurred before the target as after (in the histogram pictured above, it looks somewhat positively skewed, but that’s largely because the mean falls between position zero and position one, while the histogram bars correspond to whole numbers).

The near-symmetry in time suggests that temporal selection works in a particular way. Before looking at data like these, we assumed that when the circle appeared, it triggered attentional sampling of a stimulus from the scene. If that occurred, however, one would expect a histogram with a substantial positive skew, as variability in time until sampling could widen the distribution only forward in time. Our finding of nearly as many reports from before the target as after hints that rather than the cue triggering the commencement of sampling from the stream, the stream is being sampled all along. Which item is ultimately reported may be determined by a subsequent process that binds the ring with one of the persisting representations evoked by the stimuli in the stream.

Which brings me to another remarkable thing: the temporal precision of selection is not affected by the blink. That is, the spread of the histogram, once random guesses were subtracted out, is the same for T1 and T2, for short lags and for long. In unpublished experiments, we found that regardless of stream presentation rate, it is around 80-100 ms (standard deviation of a fitted normal distribution), depending on the participant and the experiment. Looks like the binding process is something that attentional blink theories ought to theorise about.

Unlike the precision of selection, the latency was quite affected by the blink. At the intervals between the first and second target that yield a blink, selection was quite delayed: the histogram was centered on 1-2 items after the target. The exception was very short intervals, where it seemed only a single histogram was present, indicating that people manage to sample both targets from a single attentional episode (the explanation we favor for “lag-1 sparing”). This supports those theories that include a disruption to selection, although none have yet addressed temporal precision.

While the sheer number of studies and findings in the attentional blink literature is quite intimidating, a number of theorists have risen to the challenge, developing computational theories that purport to fit the most important empirical results (e.g. 1, 2, 3). As a result of such explicitness, many theories in this area have strong potential to be disconfirmed. One thing I’d like to understand is whether any can explain the contrast between our attentional blink results and some results we found previously (paper, blogpost).

In that previous work, we found that not just precision but also latency could be robust to the demands of encoding a second target. We presented two streams concurrently and on some trials presented a target in each stream. Both targets appeared at the same time. In those circumstances, participants’ performance decreased substantially in the two-target condition, but precision and latency were both unaffected. That the second target impaired performance by approximately as much as performance is impaired during an attentional blink, while the properties of selection were unaffected, seems to be a problem for theories that attribute the blink largely to the disruption of selection.

Patrick Goodbourn deserves most of the credit for this work, plus a badge, or maybe two, for putting everything together as open data and open materials (we actually are getting two badges, thanks to the policy of Psychological Science to reward open practices!).

Goodbourn PT, Martini P, Harris I, Livesey E, Barnett-Cowan M, & Holcombe AO (in press). Reconsidering temporal selection in the attentional blink. Psychological Science (postprint, data).

What Reproducibility Crisis?

It’s not a great term, the “reproducibility crisis”.

Most don’t think we are actually in a crisis, but I thought that by now practically everyone in various scientific fields had heard of it. In the last few years the reproducibility issue has been covered several times by Nature News, by major newspapers such as the New York Times, and by countless websites, often with truly crisis-level headlines. I first blogged about it in 2012, and it wasn’t new then.

The heart of the issue is that large-scale replication attempts usually fail to reproduce the findings of the original studies. Or at least, they fail to yield the statistically significant finding that the original study did, suggesting either that the original study did not have much statistical power and the original authors got lucky, or it’s a spurious result, or the effect of interest is not very robust to the methodological variation associated with replication attempts. Across preclinical cancer research, economics, and experimental psychology, the results have been similarly depressing. As an editor I’ve been involved in two attempts to replicate individual studies, and both yielded null results (one on ego depletion, and one on how grammatical aspect affects judgments about a criminal).

There are certainly people (Harvard professors, in fact) who still say that “the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%”. But if you had asked me how many psychology academics believe that it is a significant problem I would have said at least half. Experimental psychology in particular has seen a raft of large-scale replication attempts and very public failures to replicate. Before the Reproducibility Project that attempted to replicate 100 studies, there was Many Labs 1. Now, Many Labs 2 has finished data collection and Many Labs 3 is in process. The conversations around replications have reached near-meme levels of rhetoric. If people in any area should have heard of the reproducibility crisis by now, it’s psychology researchers.

I find out about a lot of replication attempts on twitter, with reproducibility news showing up on my feed on a near-daily basis. I was wondering whether the reproducibility crisis is much of a thing to your average psychology academic. Note: your average psychology academic is not on twitter.

In a seemingly self-defeating effort, I set up a poll on twitter of people not on twitter:

I was expecting about five responses. Maybe ten. But I got fifty-eight!


Let’s acknowledge that this is an unscientific sample with a lot of selection bias. Who knows how these 58 people got their datum? (I say “datum”, not data, because you can only vote once per twitter account).

We have 26% of people who have never heard of the reproducibility crisis, 40% who are skeptical that it’s a problem, and only 34% who think it’s a major problem.

It could be that people saw this as an entertaining opportunity to troll me. I doubt that and suspect that people actually had a real-world collegial interaction as a result of this tweet. They may have avoided colleagues who they’d previously spoken to about reproducibility. And they may have skipped the nose-to-the-grindstone types, continuing on to a colleague with an open office door. That would be pretty good, but it’s also possible they went straight to the prof they know doesn’t keep up with the times.

With biases in mind, let’s consider the numbers. We’ve got 66%, 38 people, who have either never heard of the crisis or are skeptical that it’s a problem. The 95% confidence interval on that figure (adjusted Wald method) runs from 79% down to 55%. That’s a sobering lower limit.
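For anyone who wants to play with the interval, here is a rough Python sketch of an adjusted Wald calculation. It uses the Agresti-Coull form of the adjustment, which may not be the exact variant or calculator behind the figures above, so the bounds it prints need not match them to the percentage point. The function name is my own.

```python
from math import sqrt

def adjusted_wald_interval(successes, n, z=1.96):
    """Agresti-Coull ("adjusted Wald") interval for a binomial proportion."""
    n_adj = n + z ** 2                        # add z^2 pseudo-observations
    p_adj = (successes + z ** 2 / 2) / n_adj  # half of them counted as successes
    half_width = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

# 38 of the 58 poll responses were "never heard of it" or "skeptical".
low, high = adjusted_wald_interval(38, 58)
print(f"{low:.0%} to {high:.0%}")
```

Whichever variant is used, the interval is wide, as you would expect from 58 responses.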

These people are out there, in significant numbers. We ought to keep this in mind when communicating with people about the latest replication failure or the latest call for publishing reform, be it for greater disclosure in methods sections or for preregistration. Many people will continue to see these things as burdensome solutions for a problem that may not exist. Myself, I like to think of preregistration as forcing people to keep their grubby little p-hacking hands from contaminating what could otherwise be a truth-revealing beautiful bit of science.

The positive side is that a very large number of people have become convinced of the value of preregistration in a very short span of time. Already a new experimental psychology journal that publishes only preregistered studies has been created, and at many journals, a new preregistered article type has popped up. Put in perspective, progress has indeed been quite rapid. Consider the molasses-like slog of the open access movement. It took decades of proselytising, explainers, and news for everyone to have a rough idea of what open access is. And still today you find people assuming that the only existing or viable route is author-pays (even though there are thousands of open-access journals that charge authors nothing). Open access is a complex issue, and so are reproducibility issues. It should take a long time to get very far with either.


tweets from ResBaz Sydney

problems with “controlling for” variables: quick notes

In science, we frequently see a comparison between two groups of people that differ on multiple demographic variables, say age, IQ, and income, investigating some dependent measure, say body mass index (BMI).

Results are often reported as “the groups had substantially different BMIs, after controlling for X and Y” using ANCOVA or multiple regression. We are given the impression that this analysis shows that the groups would have different BMIs even if they had the same levels of X and Y.

Is this conclusion justified? Maybe. Extensive thought and scrutiny of the data would be required to determine whether this is a reasonable inference. I was thinking of discussing this a bit in my undergraduate teaching, so I asked about it on twitter. A bunch of people both provided helpful responses and asked me to report back.

An obvious problem is that the two groups may differ on other things besides X and Y, many of which may not even have been measured. So the difference between the groups may be entirely attributable to those confounds. This post is about some less obvious problems. Here are some quick snippets from what people pointed me to.

First, from Miller & Chapman (2001), below. Thanks to @BrandesJanina for pointing me to this paper.

consider a data set in which two groups are older men and younger women, and gender is of interest as an independent variable, Grp. Using age as a covariate does indeed remove age variance. The problem is that, because age and gender are correlated in this data set, removing variance associated with Cov will also remove some (shared) variance due to Grp. Within this data set, there is no way to determine what values of DV men younger than those tested or women older than those tested would have provided. Far from “controlling for” age, the ANCOVA will systematically distort the gender variable. As in our presentation of Lord’s Paradox above, Grp_res will not be a valid measure of the construct of gender….

Consider a data set consisting of children’s age, height, and weight. If we conduct an ANCOVA in which height is the covariate, age is the grouping variable, and weight is the dependent variable, we are attempting to ask whether younger and older children would differ in weight if they did not happen to differ in height. If the groups indeed do not differ on the covariate, this question can be asked. But if there is something about the construct of age in childhood that inherently involves differences in height, the question makes no sense, because then age with height partialed out would no longer be age. There is no way to “equate” older and younger children on height, because growth is an inherent (not chance or noise) differentiation of the two groups….

Cohen and Cohen (1983) provided the following extreme example: “Consider the fact that the difference in mean height between the mountains of the Himalayan and Catskill ranges, adjusting for differences in atmospheric pressure, is zero!” (p. 425), the point being that one has not in any sense “equated” the two mountain ranges by using atmospheric pressure as a covariate.


-Miller & Chapman (2001)

Let’s go back to my opening example of a BMI difference between two groups, after “controlling for” variables statistically. What if one of those variables controlled for was age? Well, if the two groups were people who exercise and people who don’t, there is very likely variance shared by age and level of exercise, and age likely has a causal influence on exercise (by various routes), so the meaning of the exercise factor is unclear after age has been removed.

The problem of measurement (un)reliability

From Westfall & Yarkoni (submitted):

Suppose we are given city statistics covering a four-month summer period, and observe that swimming pool deaths tend to increase on days when more ice cream is sold. As astute analysts, we immediately identify average daily temperature as a confound: on hotter days, people are more likely to both buy ice cream and visit swimming pools. Using multiple regression, we can statistically control for this confound, thereby eliminating the direct relationship between ice cream sales and swimming pool deaths.

Now consider the following twist. Rather than directly observing recorded daily temperatures, suppose we obtain self-reported Likert ratings of subjectively perceived heat levels. A simulated batch of 120 such observations is illustrated in Figure 1, with the reliability of the subjective heat ratings set to 0.40, a fairly typical level of reliability for a single item in psychology. Figure 2 illustrates what happens when the error-laden subjective heat ratings are used in place of the more precisely recorded daily temperatures. The simple relationship between ice cream sales and swimming pool deaths (Fig. 2A) is positive and substantial, r(118) = .49, p < .001. When controlling for the subjective heat ratings (Fig. 2B), the partial correlation between ice cream sales and swimming pool deaths is smaller, but remains positive and statistically significant, r(118) = .33, p < .001. Is the conclusion warranted that ice cream sales are a useful predictor of swimming pool deaths, over and above daily temperature? Obviously not. The problem is that subjective heat ratings are a noisy proxy for physical temperature, so controlling for the former does not equate observations on the latter. If we explicitly control for recorded daily temperatures (Fig. 2C), the spurious relationship is eliminated, as we would intuitively expect, r(118) = -.02, p = .81.

Given that most psychological measurements have considerable unreliability (lack of perfect correlation with the construct they are trying to get at), the problem is very general. And it can lead both to spurious conclusions of a relationship and to spurious conclusions of a non-relationship.
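The ice cream example is easy to play with yourself. Below is a rough Python sketch of that kind of simulation. It is not Westfall & Yarkoni's code: the variable names, the unit-variance noise, the seed and the sample size are all my own assumptions. I use 5000 simulated days rather than 120 so the pattern is not swamped by sampling noise, and the noise on the heat rating is chosen to give a reliability of 0.40.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation between x and y after linearly partialling out z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 5000                # assumption: far more days than in the quoted example

# Temperature drives both outcomes; there is no direct link between them.
temperature = [rng.gauss(0, 1) for _ in range(n)]
ice_cream = [t + rng.gauss(0, 1) for t in temperature]
pool_deaths = [t + rng.gauss(0, 1) for t in temperature]

# Noisy proxy: noise variance 1.5 gives reliability 1 / (1 + 1.5) = 0.40.
heat_rating = [t + rng.gauss(0, sqrt(1.5)) for t in temperature]

r_simple = pearson(ice_cream, pool_deaths)
r_given_rating = partial_corr(ice_cream, pool_deaths, heat_rating)
r_given_temperature = partial_corr(ice_cream, pool_deaths, temperature)

print(f"simple r:                    {r_simple:.2f}")
print(f"controlling for heat rating: {r_given_rating:.2f}")
print(f"controlling for temperature: {r_given_temperature:.2f}")
```

Under these assumptions, controlling for the noisy rating should shrink the correlation but leave it clearly positive, while controlling for the true temperature should bring it close to zero.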

I do not use ANCOVA or GLMs in this way so I may have given a misleading impression with some of what I have written or quoted above. If so, I would love to be corrected.

Bayesian jokes

It’s the end of the year, and I’m indulging myself by posting these Bayesian jokes. The first two were inspired by the #AcademicNoir hashtag.

And one outside the noir domain:

When retraction is not enough

A study reporting two experiments and suggesting that “Sadness impairs color perception” was recently retracted from Psychological Science. But some colleagues and I don’t think the retraction goes far enough.

In the retraction notice, the authors suggested that after revising their second experiment to address the problems that they noted with it, they would seek to re-publish their original Experiment 1 with the revised Experiment 2.

But Experiment 1, and the basic methodology behind both experiments, is shoddy: there are more problems than just those mentioned in the retraction notice. Some of these problems are strange anomalies with the data, specific to Thorstenson et al.’s experiments. Other problems, while still significant, are not uncommon in this research area.

When the now-retracted paper first appeared, twitter exploded with criticism, and many documented the study’s problems extensively on their blogs. Five of us got together over email to write a letter to Psychological Science calling for retraction. But before submitting the letter, we contacted the first author, Chris Thorstenson. He eventually told us that he and his colleagues would retract the paper.

When we saw the retraction notice, we noticed that only a few of the problems with the experiments were mentioned. I have been vexed by studies of this ilk for more than three years, and would like to see a general improvement in this research area. So we revised our letter to highlight the additional problems not mentioned by Thorstenson et al. We hope that our revised letter will help Thorstenson et al., plus other researchers in this area, to improve their methods.