In science, we frequently see comparisons between two groups of people that differ on multiple demographic variables, say age, IQ, and income, where the aim is to investigate some dependent measure, say body mass index (BMI).

Results are often reported as “the groups had substantially different BMIs, after controlling for X and Y” using ANCOVA or multiple regression. We are given the impression that this analysis shows that the groups would have different BMIs **even if** they had the same levels of X and Y.

Is this conclusion justified? Maybe. Extensive thought and scrutiny of the data would be required to determine whether this is a reasonable inference. I was thinking of discussing this a bit in my undergraduate teaching, so I asked about it on Twitter. A bunch of people both provided helpful responses and asked me to report back.

An obvious problem is that the two groups may differ on other things besides X and Y, many of which may not even have been measured. So the difference between the groups may be entirely attributable to those confounds. This post is about some less obvious problems. Here are some quick snippets from what people pointed me to.

First, from Miller & Chapman (2001), below. Thanks to @BrandesJanina for pointing me to this paper.

consider a data set in which two groups are older men and younger women, and gender is of interest as an independent variable, Grp. Using age as a covariate does indeed remove age variance. The problem is that, because age and gender are correlated in this data set, removing variance associated with Cov will also remove some (shared) variance due to Grp. Within this data set, there is no way to determine what values of DV men younger than those tested or women older than those tested would have provided. Far from “controlling for” age, the ANCOVA will systematically distort the gender variable. As in our presentation of Lord’s Paradox above, Grp_res will not be a valid measure of the construct of gender….

Consider a data set consisting of children’s age, height, and weight. If we conduct an ANCOVA in which height is the covariate, age is the grouping variable, and weight is the dependent variable, we are attempting to ask whether younger and older children would differ in weight if they did not happen to differ in height. If the groups indeed do not differ on the covariate, this question can be asked. But if there is something about the construct of age in childhood that inherently involves differences in height, the question makes no sense, because then age with height partialed out would no longer be age. There is no way to “equate” older and younger children on height, because growth is an inherent (not chance or noise) differentiation of the two groups….

Cohen and Cohen (1983) provided the following extreme example: “Consider the fact that the difference in mean height between the mountains of the Himalayan and Catskill ranges, adjusting for differences in atmospheric pressure, is zero!” (p. 425), the point being that one has not in any sense “equated” the two mountain ranges by using atmospheric pressure as a covariate.

-Miller & Chapman (2001)

Let’s go back to my opening example of a BMI difference between two groups, after “controlling for” variables statistically. What if one of those variables controlled for was age? Well, if the two groups were people who exercise and people who don’t, there is very likely variance shared by age and level of exercise, and age likely has a causal influence on exercise (by various routes), so the meaning of the exercise factor is unclear after age has been removed.
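To make the Cohen and Cohen example quoted above concrete, here is a minimal simulation sketch. Everything in it is invented for illustration: the heights are made-up numbers rather than real survey data, the linear pressure model is a toy approximation, and `ols_coefs` is just a small helper defined here. The point is only to show how a covariate that is inherently tied to the grouping variable can absorb almost all of the group difference.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # simulated peaks per range (arbitrary)

# 0 = Catskills, 1 = Himalayas; heights in metres are invented, not real data
group = np.repeat([0.0, 1.0], n)
height = np.where(group == 1.0, 7000.0, 1000.0) + rng.normal(0.0, 300.0, 2 * n)

# Toy linear pressure model (hPa): pressure is almost fully determined by height
pressure = 1013.0 - 0.09 * height + rng.normal(0.0, 2.0, 2 * n)


def ols_coefs(y, *predictors):
    """Least-squares coefficients for y ~ intercept + predictors."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


raw_diff = ols_coefs(height, group)[1]            # plain difference in mean height
adj_diff = ols_coefs(height, group, pressure)[1]  # difference "controlling for" pressure

print(f"raw difference:      {raw_diff:8.1f} m")
print(f"adjusted difference: {adj_diff:8.1f} m")
```

Under these assumptions the raw difference is thousands of metres while the adjusted difference is a tiny fraction of that, even though nobody would say the two ranges have been "equated".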

### The problem of measurement (un)reliability

From Westfall & Yarkoni (submitted):

Suppose we are given city statistics covering a four-month summer period, and observe that swimming pool deaths tend to increase on days when more ice cream is sold. As astute analysts, we immediately identify average daily temperature as a confound: on hotter days, people are more likely to both buy ice cream and visit swimming pools. Using multiple regression, we can statistically control for this confound, thereby eliminating the direct relationship between ice cream sales and swimming pool deaths.

Now consider the following twist. Rather than directly observing recorded daily temperatures, suppose we obtain self-reported Likert ratings of subjectively perceived heat levels. A simulated batch of 120 such observations is illustrated in Figure 1, with the reliability of the subjective heat ratings set to 0.40, a fairly typical level of reliability for a single item in psychology. Figure 2 illustrates what happens when the error-laden subjective heat ratings are used in place of the more precisely recorded daily temperatures. The simple relationship between ice cream sales and swimming pool deaths (Fig. 2A) is positive and substantial, r(118) = .49, p < .001. When controlling for the subjective heat ratings (Fig. 2B), the partial correlation between ice cream sales and swimming pool deaths is smaller, but remains positive and statistically significant, r(118) = .33, p < .001. Is the conclusion warranted that ice cream sales are a useful predictor of swimming pool deaths, over and above daily temperature? Obviously not. The problem is that subjective heat ratings are a noisy proxy for physical temperature, so controlling for the former does not equate observations on the latter. If we explicitly control for recorded daily temperatures (Fig. 2C), the spurious relationship is eliminated, as we would intuitively expect, r(118) = -.02, p = .81.

Given that most psychological measurements have considerable unreliability (lack of perfect correlation with the construct they are trying to get at), the problem is very general. And it can lead both to spurious conclusions of a relationship and to spurious conclusions of a non-relationship.
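The ice cream scenario is easy to re-create as a rough sketch. This is not Westfall & Yarkoni's code: the 0.7 loadings on temperature, the standardized variables, and the function names (`residualize`, `partial_corr`, `simulate`) are my own assumptions. Only the sample size of 120 and the reliability of 0.40 come from the quoted passage. Reliability is taken here to mean the share of variance in the rating that is due to true temperature.

```python
import numpy as np


def residualize(y, covariate):
    """Residuals of y after a least-squares fit on intercept + covariate."""
    X = np.column_stack([np.ones_like(covariate), covariate])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coefs


def partial_corr(x, y, covariate):
    """Correlation of x and y after removing the linear effect of covariate."""
    return np.corrcoef(residualize(x, covariate), residualize(y, covariate))[0, 1]


def simulate(n=120, reliability=0.40, seed=0):
    rng = np.random.default_rng(seed)
    temp = rng.normal(size=n)  # true daily temperature (standardized)
    # Temperature is the only link between the two outcomes (assumed loading 0.7)
    ice_cream = 0.7 * temp + np.sqrt(1 - 0.7**2) * rng.normal(size=n)
    pool_deaths = 0.7 * temp + np.sqrt(1 - 0.7**2) * rng.normal(size=n)
    # Noisy rating: var(temp) / var(rating) equals the requested reliability
    noise_sd = np.sqrt((1 - reliability) / reliability)
    rating = temp + noise_sd * rng.normal(size=n)
    r_simple = np.corrcoef(ice_cream, pool_deaths)[0, 1]
    r_given_rating = partial_corr(ice_cream, pool_deaths, rating)
    r_given_temp = partial_corr(ice_cream, pool_deaths, temp)
    return r_simple, r_given_rating, r_given_temp


r_simple, r_given_rating, r_given_temp = simulate()
print(f"simple r:                  {r_simple:.2f}")
print(f"partial r, given rating:   {r_given_rating:.2f}")
print(f"partial r, given true temp: {r_given_temp:.2f}")
```

With only 120 observations the exact numbers bounce around from seed to seed, but with a large `n` the pattern settles down: controlling for the noisy rating leaves a clearly positive partial correlation, while controlling for the true temperature brings it close to zero.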

I do not use ANCOVA or GLMs in this way so I may have given a misleading impression with some of what I have written or quoted above. If so, I would love to be corrected.

A related issue that might compound some of these concerns is what Paul Meehl calls the ‘crud factor’ in psychology. This describes the fact that all our measures tend to be correlated in a way that is uninformative, on top of the high levels of measurement error. With a subset of such imprecise correlated measures one could imagine getting some pretty wacky results when ‘controlling’ for variables.

Yet another issue that I’ve wondered about: Suppose we have 3 variables, age and two cognitive measures. One of our cognitive measures has fewer possible values than the other. For example, one is a score out of 10 whereas the other is an error score recorded to, say, two decimal places. Now suppose we want to look at the interaction between age and each of our cognitive measures. I wonder whether the fact that one of our variables varies in a more fine-grained way might mess up our interaction terms. For the score out of 10, the variability in our interaction variable (which we might calculate to put in as a predictor) will be driven by age, whereas with our more precise measure the interaction variable will involve a more balanced mix of the two. I guess this is really just another example of the problems that come from imprecise measures.

An underlying principle is that statistics is no substitute for good study design. Both are essential but neither is sufficient. To take Miller & Chapman’s toy example, if you are interested in the effect of sex on something, and want to control for age, you are damned if you collect a sample of old men and young women, but damned at the design level before you even reach any statistical analysis.

Good understanding of study design needs to accompany good understanding of stats methods and their limitations.

Great post, Alex.

This is a big issue in developmental disorders (e.g., autism) research, where ANCOVA is commonly used to “statistically match” groups. Often it doesn’t make a difference – the group difference is either there or not regardless of whether the covariate has been added. But sometimes it can make a big difference.

Usually, as you note, the covariate takes away a group difference by sucking up variance that was associated with Group membership in the ANOVA. But occasionally it can create a group difference that wasn’t there originally. I didn’t really believe this when I first read it, but I had a play around with some old data and sure enough there are (arguably quite contrived) situations where it does happen.

We wrote this up as a methods paper using real data to illustrate some of the weird things that can happen in common group matching designs.

Brock, J., Jarrold, C., Farran, E. K., Laws, G., & Riby, D.M. (2007). Do children with Williams syndrome really have good vocabulary knowledge? Methods for comparing cognitive and linguistic abilities in developmental disorders. Clinical Linguistics and Phonetics, 21, 673-688.

The paper should be downloadable here:

https://sites.google.com/site/drjonbrock/publications/do-children-with-williams-syndrome-really-have-good-vocabulary-knowledge-methods-for-comparing-cognitive-and-linguistic-abilities-in-developmental-disorders/Brock2007Clinicallinguistics%26phonetics.pdf?attredirects=0

I guess my rule of thumb would be that, if you are going to report an ANCOVA because of concerns about confounds between groups, then you should always report the ANOVA as well. If the ANOVA and ANCOVA lead you to the same conclusions then there may not be too much to worry about. But if they lead to different conclusions (i.e., one says there’s a group effect and one says there isn’t) then you need to dig a little deeper and figure out what’s going on.
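As a rough sketch of that rule of thumb, the snippet below fits the same simulated data with and without the covariate and reports both group estimates side by side. The data are deliberately contrived and entirely made up (they are not from Brock et al., 2007): the groups differ on the covariate, and the covariate and group membership push the outcome in opposite directions, so the unadjusted means coincide while the adjusted estimate does not. The helper name `group_effect` is hypothetical.

```python
import numpy as np


def group_effect(y, group, covariate=None):
    """Group coefficient from y ~ group (ANOVA-style),
    or from y ~ group + covariate (ANCOVA-style) if a covariate is given."""
    columns = [np.ones_like(y), group]
    if covariate is not None:
        columns.append(covariate)
    coefs, *_ = np.linalg.lstsq(np.column_stack(columns), y, rcond=None)
    return coefs[1]


rng = np.random.default_rng(42)
n = 2000  # participants per group (arbitrary)
group = np.repeat([0.0, 1.0], n)

# Contrived setup: group 1 scores higher on the covariate, but group
# membership itself lowers the outcome by the same amount.
covariate = group + rng.normal(size=2 * n)
outcome = covariate - group + rng.normal(size=2 * n)

unadjusted = group_effect(outcome, group)
adjusted = group_effect(outcome, group, covariate)

print(f"group effect without covariate: {unadjusted:+.2f}")
print(f"group effect with covariate:    {adjusted:+.2f}")
```

Here the two analyses disagree, which on the rule of thumb is the cue to dig deeper. Which estimate is the meaningful one depends on what the covariate represents, not on the output of either model.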

That’s all good to know, thanks Jon. There are usually other unwanted differences between the groups too, of course, but hopefully those (quality of schools attended, hours of TV watched?) aren’t the reason for many of the effects reported in the literature you work in.

The problem of covariates seems much more grave in areas like nutritional epidemiology, where the effects can be quite small and there are countless differences between those who eat, say, nuts, and those who don’t. And the leads from those studies with heavy “controlling for X” in the stats almost always turn out to be dead ends when subjected to an RCT. Interestingly, in that previous link they blame p-hacking, not problems inherent to “controlling for X” statistically. But I wonder… Hope to find time to do a follow-up post on lurking variables, Simpson’s Paradox, and inferring causality a la Judea Pearl.