An open access fail

In this post I dissect the response by the editors of Cognition to a mass appeal for open access by the researcher community. I hope that my rather critical comments will improve understanding of the issues and help the community achieve better outcomes in the future.

Cognition is a scientific journal published by Elsevier that was traditionally available only by subscription. Some years ago, like most other Elsevier journals, Cognition became a “hybrid” journal: authors can make their particular paper open access for a fee, termed an APC or article processing charge. In the case of Cognition the APC is very high – $2150. As for the subscription fees, most universities subscribe to Cognition as part of a larger “big deal” package, the very high cost of which helps give Elsevier an operating profit margin of over 30% – well exceeding that of BMW, Google, and Apple, and earned off the public taxpayer’s tit rather than by developing new products or services.

Lingua was a prestigious linguistics journal published by Elsevier, in much the same situation as Cognition. Its editors, including the editor-in-chief, Johan Rooryck, told Elsevier that they’d like to transition to what they called “Fair Open Access”: making the journal open access rather than requiring an expensive subscription to read it, with an APC of 400 euros or less, CC-BY licensing of articles with copyright remaining with the authors, and full editorial control of the journal.

Fair Open Access is how journals really should be set up, with the publisher in the role of a service provider, not owner of the research articles (the content of which is typically almost entirely funded by university or government money). When Elsevier refused to agree to this model, Rooryck and the other editors walked. They started a new journal, Glossa, published by the non-profit Open Library of the Humanities (OLH). The OLH model easily exceeds the ambition of Fair Open Access: thanks to monetary contributions from over one hundred university libraries, Glossa charges authors no APC at all (although authors with open-access funds available to them are asked to optionally contribute 400 euros), and its content is free to read and re-use. Thanks to Rooryck’s leadership, and no doubt the community rallying together, all of the editors and the editorial board and many or all of the authors moved over to Glossa, bringing their prestige along with them.

Wouldn’t it be great if other scientific journals followed suit? David Barner (UCSD) and Jesse Snedeker (Harvard) of the editorial board of Cognition thought so. They appealed to the main Cognition editors to investigate the possibility of Fair Open Access. And they started a petition, which was signed by more than 1500 members of the Cognition community, including many famous researchers (such as Noam Chomsky, Nancy Kanwisher, and Liz Spelke) as well as not-so-famous ones (like me) who publish in Cognition.

The response by the editors of Cognition appeared as an editorial in the journal. The editors say that in response to Barner and Snedeker’s appeal and the associated petition, they polled the editorial board about “their satisfaction with the journal and their attitudes about the journal’s role in the open dissemination of science” and got a response rate of 60%. Sixty percent? This looks like a failure of leadership. Here are the editors asking critically important questions about the nature and future of the journal, and they get responses from not much more than half of the editorial board. Presumably every member of the editorial board does real work for the journal – handling the occasional manuscript when asked by the editors. If not, those board members should be asked to resign. So there’s essentially a 100% (eventual) response rate to editing requests, which involve a lot more work than answering questions about satisfaction with the journal and attitudes about its role in the open dissemination of science. Of course, I do not know how much of this reflects a failure of leadership by the editor-in-chief, real recalcitrance by the editorial board, or an intentionally weak effort by an editor-in-chief who doesn’t want to change anything.

The editorial continues:

While the editorial board expressed a range of opinions, most members were happy with the journal’s relationship with Elsevier.

I’d expect well-informed and public-minded, or even just university-minded, scholars to be less than happy. I’d expect them to resent how much money Elsevier sucks out of our universities as corporate profit, and to resent Elsevier’s ownership of the copyright to the research. Still, being “happy with the relationship” is an ambiguous statement; the editorial board members might still strongly support some of the planks of Fair Open Access.

After a list of the services that Elsevier provides (with no indication that those services couldn’t be provided by OLH or others), the editorial continues:

The poll also indicated striking consensus on the open access issue: The editorial board was happy with the journal’s mixed approach to dissemination, but it felt strongly that open access fees are too high. They felt that a substantial reduction in open access fees would make the option more attractive to authors, with the effect of increasing access around the world to scientific work published in the journal, work that is frequently publicly funded.

By now one can infer that the editors have already given up on (or never tried for) four out of five of the Fair Open Access points. More about that in my next post. But here, we do have a strong statement in support of the request for reasonable APCs.

the editors at Cognition approached Elsevier with a request to lower open access fees. A process of negotiation ensued with the result that Elsevier will start a fund to defray open-access costs for those authors with limited means of support.

OK, negotiation and compromise was to be expected. But what exactly is this fund? The editorial continues:


Authors whose articles are accepted after 1st May, 2016 can apply by requesting a form from the editorial office: cognition@elsevier.com.

Decisions to grant discounts are at the discretion of the Editor-in-Chief, in consultation with the Publisher.

Accepted authors always have the choice to publish their article as a subscription article at no cost (even after requesting an APC fee reduction), and the subscription option includes Green Open Access https://www.elsevier.com/about/open-science/open-access. Cognition has an embargo period of 12 months.

APC discounts must be requested within one week of acceptance, and will have no impact on the decision made by an editor whether or not to accept the associated paper.


What does this amount to? There’s no information about how large the discounts will be or how many are available. Elsevier will continue to charge an outrageous $2150 APC fee to most, with a completely unknown discount for some. I have it on good authority that the actual cost (not counting the contributions from university libraries that bring the author cost to zero) to publish an article open access for OLH is much less than $1000.

Is this discount of variable amount and unknown total extent a decent outcome of the editors’ (ostensible) attempt to fight for scholars’ interests? Let’s set aside the point that only one of the original five requests was put to Elsevier. Even with the one remaining, there is no information provided about the value of the limited concession they got.

As a signatory to the Fair Open Access petition and a researcher who’s published in Cognition, I’m very upset by both the outcome and the process. I’d expect many, perhaps most, of the other 1,650 signatories to be unhappy too.

Barner, Levy, and Snedeker have described their reaction. Yes, they too are unhappy. They consider the “reasonable compromise” with Elsevier (in the words of the editorial) to be not only unreasonable but also unethical:

While paying APCs to Elsevier might make individual articles publicly available, this is neither necessary, since there exist FREE ways to accomplish this same goal (see below), nor ethical, because it spends even more taxpayer dollars without significantly affecting the global problem of access.

To top it all off,  as if to say who’s really the boss, the editors’ editorial is not legally their own. As it says at the bottom of the article, it is “copyright Elsevier B.V., 2016”.

In my next post, I’ll try to consider what we should learn from all this, with a view towards new efforts. If you want to jump to a specific new effort, see the end-run action around Elsevier being promoted by Barner, Levy, and Snedeker.


#academicNoir memes

When #AcademicNoir trended on Twitter, I had fun making a few memes about science publishing and the Registered Replication Reports that we started at Perspectives on Psychological Science.


“Who’s shaking you down?” I asked.

“Elsevier,” whispered the librarian.

I showed her the door.  I still have to work in this town. 





Any one of you go it alone, he’ll say you messed up. But if we first get him to approve the protocol, and then all run the replication together…



Well, Mr. P. Hacker, your luck is about to run out.

Down at the journals, they’re running a new game. They call it ‘preregistration’.

 https://twitter.com/ceptional/status/663850868396044288



“How about we just delete these two data points?”

She looked shocked.

“Listen- you want to get this published, or not?”


They called him “Big Pharma”. Really shady character. Really knew how to make a data set go missing.



We know you’re in there. We’ve got your lab surrounded.

OK, I’ll come out.

Don’t move! Just email us the data. The *raw* data.



He had a good run, for a while. But then he got an email from . Asking for the raw data. His time was up.



“But you haven’t even seen my numbers yet!” she said.

“Just give Dr. Hacker a little time alone with them,” I told her.

The attentional blink and temporal selection


While our retinas process the world around us continuously with little to no interruption, certain mental processes are very limited. One way to show this is to present someone with a rapid stream of letters, one at a time in the center of the screen, at a rate of perhaps 10 letters per second. Give them the task of reporting one or more letters, with each target letter to report designated by presenting a circle around it.

When only one letter is designated for report, the task is pretty easy. When two letters are designated for report and they are far apart in time (say, a full second apart), performance is also high. But when the second letter appears within several hundred milliseconds of the first, something strange happens. Often one feels one did not see the second ring or letter at all. This is the attentional blink (AB).

For the last twenty years, researchers have investigated which aspects of mental processing are impaired for the second, seemingly unseen target. For many years, the leading theory was that the second target is successfully selected by attention but that later stages are fully occupied with the first target, so that the second target decays before being fully processed.

More recent theories suggest that not only is post-selection target processing impaired, but selection itself is also impaired. After selecting the first target, the attempt to select the second target goes awry. 

The idea that selection is disrupted has had legs, empirically and theoretically. Yet we know surprisingly little about the temporal characteristics of selection even when it succeeds. Vul, Nieuwenstein, & Kanwisher (2008) made a start at remedying this by pointing out that temporal selection could be thought of as having two different aspects – latency and precision.

Latency is the delay in time between the appearance of the target and the stimulus that is ultimately sampled from the scene and reported. Precision is the variability of this latency on different occasions. The attentional blink might affect either, both, or neither of these parameters. To understand how it could be neither, consider that the blink might prevent selection independently of whether or not its time-course is affected.

Vul, Nieuwenstein, & Kanwisher (2008) had a go at quantifying latency and precision. Their methodology was flawed (all the details are in our in-press manuscript), but their basic idea was sound. We applied it to six different datasets from five different labs, and found a similar pattern of results in each. I’ll tell you some of the highlights.

The basic experimental set-up is to present a rapid stream of letters or objects with two of the stimuli indicated as targets, e.g. by presenting a circle around them. The task in this case is to report the two stimuli that were circled.

Sometimes participants report both stimuli correctly and sometimes they get one or both wrong, and this varies with the amount of time between the targets. Traditionally, such percent-correct measures are where the data analysis ends. But Vul et al. had the bright idea of scrutinising the nature of participants’ incorrect responses. Some errors appear to be completely random, but frequently a participant will report the stimulus a few items before or after the target. These near-misses are conspicuous on the serial position error histogram, below, in which the response on every trial is coded relative to the target. If a participant reports the item two before the target, we add it to the “-2” bar of the histogram. If the item reported occurred three items after the target, it’s coded as “3”, and so on.

[Figure: an example serial position error histogram, with each response coded relative to the target’s position]
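To make the coding concrete, here is a minimal sketch in Python of the tallying described above. It is not our analysis code; the function name and the toy trials are invented for illustration.

```python
from collections import Counter

def serial_position_errors(trials):
    """Code each report relative to the target's position in the stream.

    Each trial is (stream, target_index, reported_item). An error of -2 means
    the participant reported the item presented two positions before the
    target; +3 means three positions after it; 0 is a correct report.
    """
    errors = []
    for stream, target_index, reported in trials:
        if reported in stream:  # ignore reports of items not in the stream
            errors.append(stream.index(reported) - target_index)
    return Counter(errors)

# Hypothetical toy data: the letter stream, the cued (target) position, the report.
trials = [
    ("ABCDEFGH", 4, "E"),  # correct report: error 0
    ("JKLMNPQR", 3, "N"),  # reported the item one position after the target: +1
    ("STUVWXYZ", 5, "V"),  # reported the item two positions before the target: -2
]
print(serial_position_errors(trials))  # Counter({0: 1, 1: 1, -2: 1})
```

Tallying these errors across many trials gives a histogram like the one above.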

The first thing that jumps out is that the error distribution is rather symmetrical, meaning that there are approximately as many items reported that occurred before the target as after (in the histogram pictured above, it looks somewhat positively skewed, but that’s largely because the mean falls between position zero and position one, while the histogram bars correspond to whole numbers).

The near-symmetry in time suggests that temporal selection works in a particular way. Before looking at data like these, we assumed that when the circle appeared, it triggered attentional sampling of a stimulus from the scene. If that were so, however, one would expect a histogram with a substantial positive skew, as variability in the time until sampling could widen the distribution only forward in time. Our finding of nearly as many reports from before the target as after hints that rather than the cue triggering the commencement of sampling from the stream, the stream is being sampled all along. Which item is ultimately reported may be determined by a subsequent process that binds the ring with one of the persisting representations evoked by the stimuli in the stream.

Which brings me to another remarkable thing: the temporal precision of selection is not affected by the blink. That is, the spread of the histogram, once random guesses were subtracted out, is the same for T1 and T2, for short lags and for long. In unpublished experiments, we found that regardless of stream presentation rate, it is around 80-100 ms (standard deviation of a fitted normal distribution), depending on the participant and the experiment. Looks like the binding process is something that attentional blink theories ought to theorise about.
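For readers who want a sense of how latency and precision can be estimated, below is a rough sketch of the kind of mixture fit described above: a guessing component plus a normal distribution whose mean is the latency and whose standard deviation is the precision. This is my own simplified illustration (treating position errors as continuous and guesses as uniform), not the model code from the paper; the parameter names, starting values, and number of positions are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_latency_precision(errors, n_positions=20):
    """Fit a guessing + Gaussian mixture to serial position errors (in items).

    With probability p_guess the report is a random item (uniform over the
    n_positions possible positions); otherwise it comes from a normal
    distribution with mean = latency and sd = precision. Multiply the
    item-based estimates by the stimulus onset asynchrony to get milliseconds.
    """
    errors = np.asarray(errors, dtype=float)

    def neg_log_lik(params):
        p_guess, latency, precision = params
        if not (0.0 <= p_guess <= 1.0) or precision <= 0:
            return np.inf  # reject parameter values outside their valid range
        lik = p_guess / n_positions + (1 - p_guess) * norm.pdf(errors, latency, precision)
        return -np.sum(np.log(lik))

    result = minimize(neg_log_lik, x0=[0.2, 0.5, 1.0], method="Nelder-Mead")
    return dict(zip(["p_guess", "latency", "precision"], result.x))
```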

Unlike the precision of selection, the latency was quite affected by the blink. At the intervals between the first and second target that yield a blink, selection was quite delayed. That is, the histogram was centered on 1-2 items after the target. Except for very short intervals, where it seemed only a single histogram was present, indicating that people manage to sample both targets from a single attentional episode (the explanation we favor for “lag-1 sparing”). This supports those theories that include a disruption to selection, although none have yet addressed temporal precision.

While the sheer number of studies and findings in the attentional blink literature is quite intimidating, a number of theorists have risen to the challenge, developing computational theories that purport to fit the most important empirical results (e.g. 1, 2, 3). As a result of such explicitness, many theories in this area have strong potential to be disconfirmed. One thing I’d like to understand is whether any can explain the contrast between our attentional blink results and some results we found previously (paper, blogpost).

In that previous work, we found that not just precision but also latency could be robust to the demands of encoding a second target. We presented two streams concurrently and on some trials presented a target in each stream. Both targets appeared at the same time. In those circumstances, participants’ performance decreased substantially in the two-target condition, but precision and latency were both unaffected. That the second target impaired performance by approximately as much as during an attentional blink, while the properties of selection were unaffected, seems to be a problem for theories that attribute the blink largely to the disruption of selection.

Patrick Goodbourn deserves most of the credit for this work, plus a badge, or maybe two, for putting everything together as open data and open materials (we actually are getting two badges, thanks to the policy of Psychological Science to reward open practices!).

Goodbourn PT, Martini P, Harris I, Livesey E, Barnett-Cowan M, & Holcombe AO (in press). Reconsidering temporal selection in the attentional blink. Psychological Science (postprint, data).

What Reproducibility Crisis?

It’s not a great term, the “reproducibility crisis”.

Most don’t think we are actually in a crisis, but I thought that by now practically everyone in various scientific fields had heard of it. In the last few years the reproducibility issue has been covered several times by Nature News, by major newspapers such as the New York Times, and by countless websites, often with truly crisis-level headlines. I first blogged about it in 2012, and it wasn’t new then.

The heart of the issue is that large-scale replication attempts usually fail to reproduce the findings of the original studies. Or at least, they fail to yield the statistically significant finding that the original study did, suggesting either that the original study was underpowered and its authors got lucky, that the result was spurious, or that the effect of interest is not very robust to the methodological variation associated with replication attempts. Across preclinical cancer research, economics, and experimental psychology, the results have been similarly depressing. As an editor I’ve been involved in two attempts to replicate individual studies, and both yielded null results (one on ego depletion, and one on how grammatical aspect affects judgments about a criminal).

There are certainly people (Harvard professors, in fact) who still say that “the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%”. But if you had asked me how many psychology academics believe that it is a significant problem I would have said at least half. Experimental psychology in particular has seen a raft of large-scale replication attempts and very public failures to replicate. Before the Reproducibility Project that attempted to replicate 100 studies, there was Many Labs 1. Now, Many Labs 2 has finished data collection and Many Labs 3 is in process. The conversations around replications have reached near-meme levels of rhetoric. If people in any area should have heard of the reproducibility crisis by now, it’s psychology researchers.

I find out about a lot of replication attempts on twitter, with reproducibility news showing up on my feed on a near-daily basis. I was wondering whether the reproducibility crisis is much of a thing to your average psychology academic. Note: your average psychology academic is not on twitter.

In a seemingly self-defeating effort, I set up a poll on twitter of people not on twitter:

I was expecting about five responses. Maybe ten. But I got fifty-eight!

[Screenshot of the Twitter poll results]

Let’s acknowledge that this is an unscientific sample with a lot of selection bias. Who knows how these 58 people got their datum? (I say “datum”, not data, because you can only vote once per twitter account).

We have 26% of people who have never heard of the reproducibility crisis, 40% who are skeptical that it’s a problem, and only 34% who think it’s a major problem.

It could be that people saw this as an entertaining opportunity to troll me. I doubt that and suspect that people actually had a real-world collegial interaction as a result of this tweet. They may have avoided colleagues who they’d previously spoken to about reproducibility. And they may have skipped the nose-to-the-grindstone types, continuing on to a colleague with an open office door. That would be pretty good, but it’s also possible they went straight to the prof they know doesn’t keep up with the times.

With biases in mind, let’s consider the numbers. We’ve got 66%, 38 people, who have either never heard of the crisis or are skeptical that it’s a problem. The 95% confidence interval on that figure (adjusted Wald method) runs from 79% down to 55%. That’s a sobering lower limit.
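For the curious, here is a quick sketch in Python of an adjusted Wald (Agresti–Coull) interval for that proportion. It is only an illustration; the exact bounds depend on which variant of the correction and what rounding you use, so they may differ slightly from the figures quoted above.

```python
import math

def adjusted_wald_ci(successes, n, z=1.96):
    """Adjusted Wald (Agresti-Coull) 95% confidence interval for a proportion."""
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)

low, high = adjusted_wald_ci(38, 58)  # 38 of the 58 respondents
print(f"95% CI: {low:.0%} to {high:.0%}")  # roughly 53% to 76% with this variant
```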

These people are out there, in significant numbers. We ought to keep this in mind when communicating with people about the latest replication failure or the latest call for publishing reform, be it greater disclosure in methods sections or preregistration. Many people will continue to see these things as burdensome solutions to a problem that may not exist. Myself, I like to think of preregistration as forcing people to keep their grubby little p-hacking hands from contaminating what could otherwise be a truth-revealing beautiful bit of science.

The positive side is that a very large number of people have become convinced of the value of preregistration in a very short span of time. Already a new experimental psychology journal that publishes only preregistered studies has been created, and at many journals, a new preregistered article type has popped up. Put in perspective, progress has indeed been quite rapid. Consider the molasses-like slog of the open access movement. It took decades of proselytising, explainers, and news for everyone to have a rough idea of what open access is. And still today you find people assuming that the only existing or viable route is author-pays (even though there are thousands of open-access journals that charge authors nothing). Open access is a complex issue, and so are reproducibility issues. It should take a long time to get very far with either.


tweets from ResBaz Sydney

problems with “controlling for” variables: quick notes

In science, we frequently see a comparison between two groups of people that differ on multiple demographic variables, say age, IQ, and income, on some dependent measure, say body mass index (BMI).

Results are often reported as “the groups had substantially different BMIs, after controlling for X and Y” using ANCOVA or multiple regression. We are given the impression that this analysis shows that the groups would have different BMIs even if they had the same levels of X and Y.

Is this conclusion justified? Maybe. Extensive thought and scrutiny of the data would be required to determine whether this is a reasonable inference. I was thinking of discussing this a bit in my undergraduate teaching, so I asked about it on twitter. A bunch of people both provided helpful responses and asked me to report back.

An obvious problem is that the two groups may differ on other things besides X and Y, many of which may not even have been measured. So the difference between the groups may be entirely attributable to those confounds. This post is about some less obvious problems. Here are some quick snippets from what people pointed me to.

First, from Miller & Chapman (2001), below. Thanks to @BrandesJanina for pointing me to this paper.

consider a data set in which two groups are older men and younger women, and gender is of interest as an independent variable, Grp. Using age as a covariate does indeed remove age variance. The problem is that, because age and gender are correlated in this data set, removing variance associated with Cov will also remove some (shared) variance due to Grp. Within this data set, there is no way to determine what values of DV men younger than those tested or women older than those tested would have provided. Far from “controlling for” age, the ANCOVA will systematically distort the gender variable. As in our presentation of Lord’s Paradox above, Grpres will not be a valid measure of the construct of gender….

Consider a data set consisting of children’s age, height, and weight. If we conduct an ANCOVA in which height is the covariate, age is the grouping variable, and weight is the dependent variable, we are attempting to ask whether younger and older children would differ in weight if they did not happen to differ in height. If the groups indeed do not differ on the covariate, this question can be asked. But if there is something about the construct of age in childhood that inherently involves differences in height, the question makes no sense, because then age with height partialed out would no longer be age. There is no way to “equate” older and younger children on height, because growth is an inherent (not chance or noise) differentiation of the two groups….

Cohen and Cohen (1983) provided the following extreme example: “Consider the fact that the difference in mean height between the mountains of the Himalayan and Catskill ranges, adjusting for differences in atmospheric pressure, is zero!” (p. 425), the point being that one has not in any sense “equated” the two mountain ranges by using atmospheric pressure as a covariate.


-Miller & Chapman (2001)

Let’s go back to my opening example of a BMI difference between two groups, after “controlling for” variables statistically. What if one of those variables controlled for was age? Well, if the two groups were people who exercise and people who don’t, there is very likely variance shared by age and level of exercise, and age likely has a causal influence on exercise (by various routes), so the meaning of the exercise factor is unclear after age has been removed.
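To see the “shared variance” problem in miniature, here is a toy simulation in Python (entirely made-up numbers, just for illustration): when group membership is partly caused by age, residualizing the group variable on age – which is effectively what ANCOVA does – leaves a variable that is no longer the original group construct.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
age = rng.normal(40, 10, n)
# Group membership (e.g., exerciser vs not) partly determined by age:
exercise = (age + rng.normal(0, 10, n) < 45).astype(float)

# Residualize the group variable on age, as the adjustment effectively does:
fitted = np.polyval(np.polyfit(age, exercise, 1), age)
exercise_resid = exercise - fitted

# Correlation well below 1: the age-adjusted "exercise" variable is no longer
# the original group variable, so conclusions about "exercise" become murky.
print(np.corrcoef(exercise, exercise_resid)[0, 1])
```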

The problem of measurement (un)reliability

From Westfall & Yarkoni (submitted):

Suppose we are given city statistics covering a four-month summer period, and observe that swimming pool deaths tend to increase on days when more ice cream is sold. As astute analysts, we immediately identify average daily temperature as a confound: on hotter days, people are more likely to both buy ice cream and visit swimming pools. Using multiple regression, we can statistically control for this confound, thereby eliminating the direct relationship between ice cream sales and swimming pool deaths.

Now consider the following twist. Rather than directly observing recorded daily temperatures, suppose we obtain self-reported Likert ratings of subjectively perceived heat levels. A simulated batch of 120 such observations is illustrated in Figure 1, with the reliability of the subjective heat ratings set to 0.40—a fairly typical level of reliability for a single item in psychology [1]. Figure 2 illustrates what happens when the error-laden subjective heat ratings are used in place of the more precisely recorded daily temperatures. The simple relationship between ice cream sales and swimming pool deaths (Fig. 2A) is positive and substantial, r(118) = .49, p < .001. When controlling for the subjective heat ratings (Fig. 2B), the partial correlation between ice cream sales and swimming pool deaths is smaller, but remains positive and statistically significant, r(118) = .33, p < .001. Is the conclusion warranted that ice cream sales are a useful predictor of swimming pool deaths, over and above daily temperature? Obviously not. The problem is that subjective heat ratings are a noisy proxy for physical temperature, so controlling for the former does not equate observations on the latter. If we explicitly control for recorded daily temperatures (Fig. 2C), the spurious relationship is eliminated, as we would intuitively expect, r(118) = -.02, p = .81.
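Their point is easy to see in a simulation. Below is a minimal sketch in Python in the spirit of their example – my own toy code, not theirs, with made-up parameter values and variable names – showing that controlling for a noisy proxy of the confound leaves a spurious partial correlation, while controlling for the confound itself removes it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120

# The true confound: daily temperature drives both ice cream sales and pool deaths.
temperature = rng.normal(25, 5, n)
ice_cream = 2.0 * temperature + rng.normal(0, 5, n)
deaths = 0.5 * temperature + rng.normal(0, 2, n)

# A noisy proxy: subjective heat ratings with reliability ~0.40
# (error variance chosen so var(true) / [var(true) + var(error)] = 0.40).
error_sd = np.sqrt(5**2 * (1 - 0.40) / 0.40)
heat_rating = temperature + rng.normal(0, error_sd, n)

def partial_r(x, y, z):
    """Correlation between x and y after regressing each on z."""
    x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
    y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(x_resid, y_resid)[0, 1]

print("simple r(ice cream, deaths):           ", np.corrcoef(ice_cream, deaths)[0, 1])
print("partial r, controlling for rating:     ", partial_r(ice_cream, deaths, heat_rating))
print("partial r, controlling for temperature:", partial_r(ice_cream, deaths, temperature))
```

Run it and the partial correlation controlling for the noisy rating should remain clearly positive, while controlling for the true temperature should leave essentially nothing.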

Given that most psychological measurements have considerable unreliability (lack of perfect correlation with the construct they are trying to get at), the problem is very general. And it can lead both to spurious conclusions of a relationship as well as spurious conclusions of a non-relationship.

I do not use ANCOVA or GLMs in this way so I may have given a misleading impression with some of what I have written or quoted above. If so, I would love to be corrected.

Bayesian jokes

It’s the end of the year, and I’m indulging myself by posting these Bayesian jokes. The first two were inspired by the #AcademicNoir hashtag.

And one outside the noir domain: