Heteroskedasticity. Hard to say,
often forgotten, but something that should at least be considered by
data scientists and anyone else who works with data. Data is neither good, bad, nor
ugly, but it can take on those qualities when we analyze it. Newly
minted PhDs can land their first R01 grants (good), scientists can waste a new
grant (bad), or scientists
can gain funding fraudulently (ugly). How we think about and
interpret our data is therefore an important step in determining that data’s fate.
Heteroskedasticity is one of the concepts that can determine whether your model
for your data is good or bad. However, before we explore heteroskedasticity, we
need to look at statistical modeling, study design and analysis principles that
precede it.
From the beginning, one imperative
that seems to elude many scientists working with data these days is Hyman’s
Categorical Imperative. Never met or read any of Hyman’s stuff? Well, it’s just
a fancy name for what most data skeptics already try to follow. Hyman’s
Categorical Imperative, a maxim coined by Ray Hyman, Professor Emeritus of Psychology at
the University of Oregon, states that before you try to explain a
phenomenon, you should first determine whether the
phenomenon is real. Violating this imperative is the scientific
equivalent of flippantly adding a trend line to data based on how the data
appears to behave. In scientific studies this usually shows up as a
linear trend line, often from least squares regression analysis, as can be
seen in the following examples (with captions).
Figure 1 from the AJP Endo journal. Change in prolactin
secretion vs. change in leptin secretion over 24 hours. The fitted
least squares linear regression implies that as leptin secretion decreases or increases over
24 hours, prolactin secretion does the same. However, a quick scan of the plot
shows the points populating mostly the -20 to 0 range on both axes, and the model
consistently under-predicts positive and some negative changes in the hormones.
Figure 2 prepared by data scientists Ng and Blanchard. Deaths per state
as a linear function of obesity rate. The trend line predicts a linear
relationship between obesity rates and deaths per state. However, the model
under-predicts some states with high death rates and low obesity rates. The
plot appears to pair an explanatory variable with a dependent variable it has
no firm correlation with (more on this in the blog post).
Figure 3 from the data scientists at the TIKD app. Tickets per
capita as a negative linear function of income. The trend line predicts that as
income goes up, ticket issuance goes down. However, the model under-predicts
all high incomes (above roughly $82,000) and even under-predicts low incomes
(below $40,000), showing the influence that data point density can have on linear
least squares regression models.
As we have discussed in class,
least squares can be a good model for some data, but not all data (see the
figures above). The problem with least squares regression analysis is that it
is strongly influenced by where large numbers of data points accumulate and by
where unusual points scatter in Cartesian space, i.e., by the variation in the
variance of the data points. The figures above are all good examples of this,
and the quick simulation below shows the same effect.
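Here is a minimal sketch of that effect in Python; the data is simulated (nothing here comes from the figures above), and all names and numbers are made up purely for illustration.

```python
# A made-up demonstration: a few unusually placed points can pull a
# least squares line away from the bulk of the data.
import numpy as np

rng = np.random.default_rng(0)

# A dense cluster of points following an essentially flat relationship
x_cluster = rng.uniform(-20, 0, size=100)
y_cluster = 5 + rng.normal(scale=2, size=100)

# A handful of scattered points far out in x with a large spread in y
x_extra = rng.uniform(40, 60, size=5)
y_extra = 30 + rng.normal(scale=15, size=5)

# Fit a line with and without the scattered points
slope_cluster, _ = np.polyfit(x_cluster, y_cluster, deg=1)
slope_all, _ = np.polyfit(np.concatenate([x_cluster, x_extra]),
                          np.concatenate([y_cluster, y_extra]), deg=1)

print(f"slope using only the dense cluster:    {slope_cluster:.3f}")  # near zero
print(f"slope after adding 5 scattered points: {slope_all:.3f}")      # pulled upward
```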
Generally, when data scientists
want to see how well a model works, they look at the residuals between the
observations and the model’s predictions (observed value minus predicted
value). If a pattern appears in the residuals, then
the model probably isn’t the best fit for the data. In other words, the model
could be consistently over-predicting or under-predicting past or future data
points.
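To make that check concrete, here is a minimal sketch in Python; the data is simulated (with noise that deliberately grows with the predictor), and the variable names are purely illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(1, 100, size=200)                  # illustrative predictor
y = 2.0 * x + rng.normal(scale=0.5 * x, size=200)  # noise grows with x on purpose

# Fit ordinary least squares: y ~ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept

# Residuals: observed value minus the model's prediction
residuals = y - predicted

# If the model fits well, the residuals scatter randomly around zero.
# A funnel shape or other systematic pattern means the model is
# consistently over- or under-predicting in part of the range.
plt.scatter(predicted, residuals)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("Predicted value")
plt.ylabel("Residual (observed - predicted)")
plt.show()
```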
So what exactly is this unusually
long word, and what does it have to do with what we just mentioned? Well, as
described by Stats Make Me Cry, heteroskedasticity “refers to the circumstance
in which the variability of a variable is unequal across the range of values of
a second variable that predicts it.” Or, more simply, it is when the variance of the data
depends on some other variable, often one not shown in the plot. For the figures above, that
other variable could be another hormone (Figure 1), an environmental
source negatively affecting health in a state (Figure 2), the uneven policing
of certain communities (Figure 3), or the cautiousness of older drivers
who just so happen to have higher incomes (Figure 3).
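Here is a small numerical illustration of that “unequal variability” idea, again with simulated data that has nothing to do with the figures; the spread of y is constructed to grow with x.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 100, size=1000)
y = 3.0 * x + rng.normal(scale=0.5 * x, size=1000)  # noise scales with x by design

# Spread of y around the underlying line, within slices of x: far from constant
noise = y - 3.0 * x
for lo in range(0, 100, 25):
    mask = (x >= lo) & (x < lo + 25)
    print(f"x in [{lo:3d}, {lo + 25:3d}): spread of y around the line = {noise[mask].std():6.1f}")
```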
Why is it that heteroskedasticity
gets no love in the classroom, then? Well, it’s something that can typically be
ignored when scientists analyze data. That’s because the presence of
heteroskedasticity doesn’t truly bias the coefficient estimates from least
squares regression, but it does bias their standard errors, which leads to
problems when analyzing variance, like in ANOVA tests, and when building
confidence intervals. You can’t assume homogeneity of variance without first
checking the data for heteroskedasticity. Finally, heteroskedasticity might not get a lot
of love from the life sciences, but it is a huge deal in the social sciences,
especially in economics and its subfield of econometrics, the branch of
economics that uses math to study the behavior of economic systems. Economists rely
on homoskedasticity (constant variance) for reliable confidence intervals and hypothesis
tests. Maybe we life scientists should start looking for it, too. So
remember, the next time you’re about to assume homogeneity of variance, check for
heteroskedasticity first; a logarithmic transformation can often account for it, or
you can use the other cool tricks found on page 6
of Richard Williams’s lecture from Notre Dame.
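As a closing sketch, here is one way you might check for heteroskedasticity before trusting your confidence intervals, using Python’s statsmodels. The Breusch-Pagan test and the robust standard errors shown here are common options but are my choices, not necessarily the ones on page 6 of Williams’s lecture, and the income/ticket data is simulated purely to mimic the flavor of Figure 3.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated data whose spread grows with income (heteroskedastic by construction)
rng = np.random.default_rng(3)
income = rng.uniform(20_000, 120_000, size=300)
tickets = 40 - 0.0002 * income + rng.normal(scale=0.0001 * income, size=300)

X = sm.add_constant(income)   # add an intercept column
fit = sm.OLS(tickets, X).fit()

# Breusch-Pagan test: a small p-value suggests the residual variance
# depends on the predictors, i.e., heteroskedasticity is present
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")

# One simple remedy: refit with heteroskedasticity-robust ("HC3") standard errors
robust_fit = sm.OLS(tickets, X).fit(cov_type="HC3")
print(robust_fit.bse)  # standard errors adjusted for the unequal variance
```

(If your response variable is strictly positive, the logarithmic transformation mentioned above is another common route.)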