problems with “controlling for” variables: quick notes

In science, we frequently see a comparison between two groups of people that differ on multiple demographic variables, say age, IQ, and income, investigating some dependent measures, say body mass index (BMI). Results are frequently reported as “the groups had substantially different BMIs, after controlling for X and Y” using ANCOVA or multiple regression. We are given the impression that this analysis shows that the groups would have different BMIs even if they had the same levels of X and Y.

Is this inference valid? Maybe. Extensive thought and scrutiny of the data would be required to determine whether this is a reasonable inference. I was thinking of discussing this a bit in my undergraduate teaching, so I asked about it on twitter. A bunch of people both provided helpful responses and asked me to report back.

An obvious problem is that the two groups may differ on other things besides X and Y, many of which you may not have even been measured. So the difference between the groups may be entirely attributable to those confounds. This post is about some less obvious problems. Here are some quick snippets from what people pointed me to.

First, from Miller & Chapman (2001), below. Thanks to @BrandesJanina for pointing me to this paper.

consider a data set in which two groups are older men and younger women, and gender is of interest as an independent variable, Grp. Using age as a covariate does indeed remove age variance. The problem is that, because age and gender are correlated in this data set, removing variance associated with Cov will also remove some (shared) variance due to Grp. Within this data set, there is no way to determine what values of DV men younger than those tested or women older than those tested would have provided. Far from “controlling for” age, the ANCOV A will systematically distort the gender variable. As in our presentation of Lord’s Paradox above, GrPres will not be a valid measure of the construct of gender….

Consider a data set consisting of childrens’ age, height, and weight. If we conduct an ANCOVA in which height is the covariate, age is the grouping variable, and weight is the dependent variable, we are attempting to ask whether younger and older children would differ in weight if they did not happen to differ in height. If the groups indeed do not differ on the covariate, this question can be asked. But if there is something about the construct of age in childhood that inherently involves differences in height, the question makes no sense, because then age with height partialed out would no longer be age. There is no way to “equate” older and younger children on height, because growth is an inherent (not chance or noise) differentiation of the two groups….

Cohen and Cohen (1983) provided the following extreme example: “Consider the fact that the difference in mean height between the mountains of the Himalayan and Catskill ranges, adjusting for differences in atmospheric pressure, is zero!” (p. 425), the point being that one has not in any sense “equated” the two mountain ranges by using atmospheric pressure as a covariate.

Screen Shot 2016-02-03 at 06.56.25.png

-Miller & Chapman (2001)

Let’s go back to my opening example of a BMI difference between two groups, after “controlling for” variables statistically. What if one of those variables controlled for was age? Well, if the two groups were people who exercise and people who don’t, there is very likely variance shared by age and level of exercise, and age likely has a causal influence on exercise (by various routes), so the meaning of the exercise factor is unclear after age has been removed.

The problem of measurement (un)reliability

From Westfall & Yarkoni (submitted):

Suppose we are given city statistics covering a four-month summer period, and observe that swimming pool deaths tend to increase on days when more ice cream is sold. As astute analysts, we immediately identify average daily temperature as a confound: on hotter days, people are more likely to both buy ice cream and visit swimming pools. Using multiple regression, we can statistically control for this confound, thereby eliminating the direct relationship between ice cream sales and swimming pool deaths.

Now consider the following twist. Rather than directly observing recorded daily temperatures, suppose we obtain self-reported Likert ratings of subjectively perceived heat levels. A simulated batch of 120 such observations is illustrated in Figure 1, with the reliability of the subjective heat ratings set to 0.40—a fairly typical level of reliability for a single item in psychology1. Figure 2 illustrates what happens when the error-laden subjective heat ratings are used in place of the more precisely recorded daily temperatures. The simple relationship between ice cream sales and swimming pool deaths (Fig. 2A) is positive and substantial, r(118) = .49, p < .001. When controlling for the subjective heat ratings (Fig. 2B), the partial correlation between ice cream sales and swimming pool deaths is smaller, but remains positive and statistically significant, r(118) = .33, p < .001. Is the conclusion warranted that ice cream sales are a useful predictor of swimming pool deaths, over and above daily temperature? Obviously not. The problem is that subjective heat ratings are a noisy proxy for physical temperature, so controlling for the former does not equate observations on the latter. If we explicitly control for recorded daily temperatures (Fig. 2C), the spurious relationship is eliminated, as we would intuitively expect, r(118) = -.02, p = .81.

Given that most psychological measurements have considerable unreliability (lack of perfect correlation with the construct they are trying to get at), the problem is very general. And it can lead both to spurious conclusions of a relationship as well as spurious conclusions of a non-relationship.

I do not use ANCOVA or GLMs in this way so I may have given a misleading impression with some of what I have written or quoted above. If so, I would love to be corrected.

Bayesian jokes

It’s the end of the year, and I’m indulging myself by posting these Bayesian jokes. The first two were inspired by the #AcademicNoir hashtag.




And one outside the noir domain:



When retraction is not enough

A study suggesting that “Sadness impairs color perception” reporting two experiments was recently retracted from Psychological Science. But some colleagues and I don’t think the retraction goes far enough.

In the retraction notice, the authors suggested that after revising their second experiment to address the problems that they noted with it, they would seek to re-publish their original Experiment 1 with the revised Experiment 2.

But Experiment 1, and the basic methodology behind both experiments, is shoddy — there are more problems than just those mentioned in the retraction notice. Some of these problems are strange anomalies with the data, specific to Thorstenson et al’s experiments. Other problems, while still significant, are not uncommon in this research area.

When the now-retracted paper first appeared, twitter exploded with criticism, and many documented the study’s problems extensively on their blogs. Five of us got together over email to write a letter to Psychological Science calling for retraction. But before submitting the letter, we contacted the first author, Chris Thorstenson. He eventually told us that he and his colleagues would retract the paper.

When we saw the retraction notice, we noticed that only a few of the problems with the experiments were mentioned. I have been vexed by studies of this ilk for more than three years, and would like to see a general improvement in this research area. So we revised our letter to highlight the additional problems not mentioned by Thorste. We hope that our revised letter will help Thorstenson et al., plus other researchers in this area, to improve their methods.


What just happened with open access at the Journal of Vision?

Vision researchers recently received an email from ARVO, the publisher of Journal of Vision, that begins:

On January 1, 2016, Journal of Vision (JOV) will become open access.

But in the view of most, JoV has been open access since its inception! It’s always been an author-pays, free access journal: all articles are published on its website and can be downloaded by anyone.

But free-to-download is not enough for open access, not according to the definition of open access formulated in Budapest in 2001. Open access means (according to this definition) the right not only to download but also to

distribute, … pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers

But JoV, which has always held the copyright of the articles it publishes, says that “All companies, commercial and nonprofit, should contact ARVO directly for permission to reprint articles or parts thereof”.

Starting in 2016, such permission won’t be required.

However, paying your $1,850 for standard publication in JoV in 2016 will not get you everything. The updated Budapest declaration recommends that journals use the license CC-BY.  But JoV‘s publisher has instead chosen to use the license CC BY-NC-ND, , meaning that articles cannot be used commercially (“non-commercial”) and that you can’t distribute bits of the article (“no derivatives”). Yet increasingly today, parts of science involve mining and remixing previously-published data and content, which the ND clause of the license prohibits (unless you get special permission). Education and journalism requires re-use of bits too; think about how many textbooks and articles on the web show just one figure (a “derivative”) or illustration from a scientific paper.

And while the non-commercial, NC clause might sound rather harmless for spreading knowledge, it is sometimes unclear what non-commercial really covers. It may prevent a university, especially private universities, from distributing the article as part of course content that a student pays for (via their tuition).

For these reasons, CC BY is the way we should be going, which is why UK funders like the Wellcome Trust and RCUK require that researchers receiving grants from them publish their articles CC BY. To accommodate this, JoV as part of their new policy will license your article as CC BY, if you pay an additional fee of $500!

What ARVO has done here is only a small step forward for JoV, and unfortunately a rather confusing step. The bigger change has occurred with ARVO’s journal Investigative Ophthalmology and Vision Science (IOVS), which was only accessible via a subscription but starting in 2016, will be CC BY-NC-ND and CC BY.

As you can see, copyright is complicated. Researchers don’t have time to learn all this stuff. And that means recalcitrant publishers (not ARVO, I mean profiteers like Elsevier) can exploit this to obfuscate, complicate, and shift their policies to slow progress towards full open access.

Thanks to Tom Wallis and Irina Harris.

P.S. I think if ARVO had only been changing JoV‘s policies (rather than also the subscription journal IOVS) they wouldn’t have written “JoV will become open access” in that mass email. But because they did, it raised the issue of the full meaning of the term.

P.P.S. Partly because JoV is so expensive, at ECVP there’ll be a discussion of other avenues for open access publishing, such as PeerJ. Go! (I’ll be stuck in Sydney).

Yellow journalism and Manhattan murders

The headline screams “You’re 45% more likely to be murdered in de Blasio’s Manhattan”.

The evidence? Sixteen people have been killed so far this year in Manhattan, against only eleven over the same period last year.

Does this evidence indicate you are more likely to be murdered, as the headline says? To find out, I tested whether a constant murder rate could explain the results. The probability of getting murdered over the same period last year may be approximately 11/Manhattan’s population = 11/1,630,000 = 0.0000674 = .00674%.

Is it likely that with the same murder rate this period this year, one would get a number as high as 16 murders? Yes.

This can be seen by calculating the 95% confidence interval for 11/1,630,000, which according to 3 different statistical methods, spans 5 to 20. That is, even with a constant murder rate, due to statistical fluctuations, the murders over this period could easily have been as low as 5 or as high as 20.  Just like if one flips a coin 10 times, one may get 3 heads the first time and 6 the next, without the chance of a head changing.

Doing this more properly means comparing the two rates directly.  I did this using three different methods, all of which found no significant difference.

The article also reports that the number of shooting incidents is higher this year, 50 instead of 31. Using the three different statistical methods again, this was (barely) significantly different. So here the journalist has a point. But this should be taken with a big grain of salt. Journalists are always looking for “news”, and if they repeatedly look at how many people have been murdered/shot, eventually they are guaranteed to find an apparent difference, because all possible statistical fluctuations will happen eventually.

The statistics and the code are here.

I only did all this and wrote this post because Hal Pashler saw someone tweet the NYPost piece. Hal knew I had previously looked into the statistics of proportions and asked whether the headline was justified. I invite others to disagree with my calculations if they have a better way of doing it. I don’t think different methods will give a very different result, however.

Four Reasons to Oppose the Use of Elsevier’s Services for the Medical Journal of Australia

Elsevier has a history of unethical behaviour:
  1. Elsevier created fake medical journals to promote Merck products.
  2. Elsevier sponsored arms fairs for the international sale of weapons.
  3. Elsevier sponsored a bill that would have eliminated the NIH mandate that medical research be make freely available within 12 months of publication.
  4. Elsevier requires university and medical libraries to sign agreements that prevent them from reporting the exorbitant prices the libraries pay to subscribe to Elsevier’s journals.
Thanks to such practices, Elsevier makes an outrageous level of profit, 36% of revenue- higher than BMW and higher than the mining giant Rio TintoprofitChart. While researchers and research funders are attempting to transition medical and science publishing to an open access model, Elsevier seeks to hinder this transition. It is their corporate mandate to preserve the high level of profits they make by charging subscription fees for the articles that describe taxpayer-funded research.


Researchers ought to be using other providers, not channeling more money into Elsevier.


A “tell” for researcher innumeracy?

Evaluating scientists is hard work. Assessing quality requires digging deep into a researcher’s papers, scrutinising methodological details and the numbers behind the narrative. That’s why people look for shortcuts such as the number of papers a scientist has published or the impact factor of the journals published in.

When reading a job or grant application, I frequently wonder: Does this person really take their data seriously and listen to what it’s telling them, or are they just trying to churn out papers? It can be hard to tell. But I’ve noticed an unintentional tell in the use of numbers. Some people, when reporting numbers, habitually report far more decimal places than are warranted.

For example, Thomson/ISI reports its much-derided journal impact factors to three decimal places. This is unwarranted, an example of false precision, both because of the low counts of article numbers and citations typically involved, and because their variability year to year is high. One decimal place is plenty (and given how poor a metric impact factor is, I’d prefer that impact factor simply not be used).

When I see a CV with journal impact factor reported to three decimal places, I feel pushed toward the conclusion that the CV’s owner is not very numerate. So the reporting of impact factor is useful to me; not, however, in the way the researcher intended.

I don’t necessarily expect every researcher to fully understand the sizes, variability, and distribution of the numbers that go into impact factor, so I’m more concerned by how researchers report their own numbers. When to report all the decimal places calculated can be a subtle issue however, as full reporting of some numbers is important for reproducibility.

Bottom line, researchers should understand how summaries of data behave. Reporting numbers with faux precision is a bad sign.

For references on the issue of the third decimal place of impact factor:

UPDATE 8 May: Read this blog on the topic

Bar-Ilan, J. (2012). Journal report card. Scientometrics, 92, 249–260.

Mutz, R., & Daniel, H. D. (2012). The generalized propensity score methodology for estimating unbiased journal impact factors. Scientometrics, 92, 377–390.