Heteroskedasticity. Hard to say, and often forgotten, but something that should at least be considered by data scientists and anyone who works with data. Data is neither good, bad, nor ugly, but it can take on those qualities when we analyze it. Newly minted PhDs can get their first R01 grants (good), scientists can waste a new grant (bad), or scientists can gain funding fraudulently (ugly). How we think about and interpret our data is therefore an important step in determining the data's fate. Heteroskedasticity is one of the concepts that can determine whether your model for your data is good or bad. Before we explore heteroskedasticity, though, we need to look at the statistical modeling, study design, and analysis principles that precede it.
From the beginning, one imperative that seems to elude most scientists working with data these days is Hyman's Categorical Imperative. Never met Hyman or read his work? Well, it's just a fancy rule stating what most data skeptics try to follow. Hyman's Categorical Imperative (a maxim coined by Ray Hyman, Professor Emeritus of Psychology at the University of Oregon) states that before you choose to investigate whether a phenomenon is true, you should first determine whether the phenomenon is real. Violating this imperative is the scientific equivalent of flippantly adding a trend line to data based on how the data appears to behave. In scientific studies, this usually shows up as a linear trend line, often from least squares regression analysis, and can be seen in the following examples (with captions).
Figure 1, from the AJP Endo journal: change in prolactin secretion vs. change in leptin secretion over 24 hours. The fitted least squares linear regression suggests that as leptin decreases or increases over 24 hours, prolactin secretion moves with it. However, a quick scan of the data shows the points clustering mostly in the -20 to 0 range on both axes, and the model consistently under-predicts positive and some negative changes in the hormones.
Figure 2, prepared by data scientists Ng and Blanchard: deaths per state as a linear function of obesity rate. The trend line predicts a linear relationship between obesity rates and deaths per state, but the model under-predicts some states with high death rates and low obesity rates. The plot appears to pair an explanatory variable with no firm correlation to the dependent variable (more on this later in the post).
Figure 3, from the data scientists at the TIKD app: tickets per capita as a negative linear function of income. The trend line predicts that as income goes up, ticket issuance goes down. However, the model under-predicts all high incomes (above roughly $82,000) and even under-predicts low incomes (below $40,000), showing the influence that data point density can have on linear least squares regression models.
As we have discussed in class, least squares can be a good model for some data, but not all data. The problem with least squares regression analysis is that it is unusually influenced by where large numbers of data points accumulate and by where unusual points scatter in Cartesian space, i.e., by variation in the variance of the data points. The figures above are all good examples of this.
Generally, when data scientists want to see how well a model works, they look at the residuals: the differences between the observations and the model's predictions (data point value minus model prediction value). If a pattern appears in the residuals once the model is applied, then the model probably isn't the best fit for the data. In other words, the model may be consistently over-predicting or under-predicting past or future data points.
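The residual check above can be sketched in a few lines. This is a toy example with simulated data (the numbers and variable names are my own, not from any of the figures): fit a least squares line, then subtract the predictions from the observations.

```python
# Toy residual check on simulated data: fit a line by least squares,
# then compute residuals = observed value - predicted value.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=x.size)  # linear trend + noise

# np.polyfit with degree 1 returns the least squares slope and intercept
slope, intercept = np.polyfit(x, y, 1)
predictions = slope * x + intercept
residuals = y - predictions

# For a well-specified least squares fit with an intercept, the residuals
# average out to zero; a visible trend in them would signal a poor model.
print(abs(residuals.mean()) < 1e-8)
```

Plotting `residuals` against `x` (e.g., with matplotlib) is the usual next step: a shapeless cloud around zero is what you want to see, while a funnel or curve is a warning sign.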
So what exactly is this unusually long word, and what does it have to do with all of this? Well, as described by Stats Make Me Cry, heteroskedasticity "refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it." Or, more simply, it's when the variance of the data is conditioned on some other variable not shown. For the figures above, that could be another hormone (Figure 1), an environmental source negatively affecting health in a state (Figure 2), the uneven policing of certain communities (Figure 3), or the cautiousness of older drivers who happen to have higher incomes (Figure 3).
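To make the definition concrete, here is a minimal simulation of my own (not taken from the figures) where the spread of the response grows with the predictor, which is the classic funnel-shaped heteroskedastic pattern:

```python
# Simulated heteroskedastic data: the noise scale grows with x,
# so the variance of y is not constant across the range of x.
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(1, 10, 500)
y = 3.0 * x + rng.normal(0, x)  # noise standard deviation proportional to x

# Fit a least squares line and compare residual spread in the
# low-x half vs. the high-x half of the data.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
low_spread = resid[:250].std()
high_spread = resid[250:].std()
print(low_spread < high_spread)  # the spread is not constant across x
```

A scatter plot of `resid` against `x` here would fan out to the right, exactly the kind of residual pattern the previous paragraph warned about.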
Why is it that heteroskedasticity gets no love in the classroom, then? Well, it's something scientists often ignore when analyzing data. That's because the presence of heteroskedasticity doesn't actually bias the coefficient estimates from least squares regression, but it does lead to problems when analyzing variance, as in ANOVA tests: you can't assume homogeneity of variance without first checking the data for heteroskedasticity. Finally, heteroskedasticity might not get a lot of love from the life sciences, but it is a huge deal in the social sciences, especially in economics and its subfield econometrics, the branch of economics that uses mathematics and statistics to study the behavior of economic systems. Economists rely on homoskedasticity to produce reliable confidence intervals and hypothesis tests. Maybe we life scientists should start looking for it, too. So remember: the next time you assume homogeneity of variance, consider a logarithmic transformation to account for heteroskedasticity, or try the other tricks found on page 6 of Richard Williams's lecture notes from Notre Dame.
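As a quick sketch of the logarithmic transformation mentioned above, consider simulated data (again my own toy example, assuming multiplicative noise) where the noise scales with the response. On the raw scale the residual spread grows across the range of x; after taking logs the noise becomes roughly constant:

```python
# Sketch of the log-transform fix for heteroskedasticity, assuming
# multiplicative noise: y = 5*exp(0.3*x) times a lognormal error term.
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(1, 10, 500)
y = 5.0 * np.exp(0.3 * x) * rng.lognormal(0, 0.2, size=x.size)

# Raw scale: residual spread grows with x (heteroskedastic)
s_raw, b_raw = np.polyfit(x, y, 1)
raw_resid = y - (s_raw * x + b_raw)

# Log scale: log(y) = log(5) + 0.3*x + noise with constant variance
s_log, b_log = np.polyfit(x, np.log(y), 1)
log_resid = np.log(y) - (s_log * x + b_log)

# Ratio of residual spread (high-x half over low-x half); a ratio
# near 1 means the variance has been stabilized.
raw_ratio = raw_resid[250:].std() / raw_resid[:250].std()
log_ratio = log_resid[250:].std() / log_resid[:250].std()
print(raw_ratio > log_ratio)
```

The transform only helps when the noise really is multiplicative; for other patterns of heteroskedasticity, the alternatives in Williams's notes (such as weighted least squares or robust standard errors) are the better fit.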