Thursday, March 31, 2016

Problems with R-squared and Overfitting Models

When fitting a model to your data, it is important to remember that a good R2 value (~1.0) doesn’t mean that your model is the best at estimating your actual population.

Looking at similar data sets and studies can help you determine whether your model and R2 are consistent with what has been done in your field. Jim Frost discusses how physical processes tend to be predictable, with low variation and a high R2 value. However, if you are a psychologist studying human behavior, which is highly variable, an unusually high R2 value could indicate that your model fits your particular sample rather than the population.

Frost continues by addressing a few possibilities that may explain a high R2 value. For example, an R2 value is already biased upward as an estimate of the population value, because it is calculated from your sample data. Another reason that your R2 may be too high is that you’re trying too many models on your data just to find the perfect fit. As we’ve discussed in class, it is very important to pick your statistical model before you perform any experiments.
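To make the bias point concrete, here is a minimal sketch in Python (NumPy only). It is not from Frost's article; the sample size, the random seed, and the helper names `r_squared` and `adjusted_r_squared` are my own assumptions for illustration. The response is pure noise, so the population R2 is zero, yet the sample R2 climbs as unrelated predictors are added. The usual adjusted R2, 1 - (1 - R2)(n - 1)/(n - k - 1), penalizes the extra predictors.

```python
import numpy as np

def r_squared(X, y):
    """Sample R^2 from an ordinary least squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])      # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least squares solution
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(0)        # assumed seed, for repeatability only
n = 20                                # assumed small sample size
y = rng.normal(size=n)                # response: pure noise
noise = rng.normal(size=(n, 10))      # predictors: unrelated to y

results = []
for k in (1, 3, 5, 10):
    r2 = r_squared(noise[:, :k], y)   # use the first k noise predictors
    adj = adjusted_r_squared(r2, n, k)
    results.append((k, r2, adj))
    print(f"{k:>2} predictors: R2 = {r2:.3f}, adjusted R2 = {adj:.3f}")
```

Because each model contains the previous one, the sample R2 can only stay the same or go up as predictors are added, even though none of them has anything to do with the response.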

A very important problem encountered with fitting a model to data is the issue of “overfitting”. This means that your model is too complicated for your data set. If you think about it, you can probably force an equation to perfectly fit your data, as long as you give it enough terms. But remember, your data are collected from a sample that is generally meant to represent a population. An overly complicated model may not accurately predict future data points. Overfitting can cause your statistical values to be misleading. Having a large sample size can help overcome this problem and allow for better modeling of complex parameters. Remember, as Frost said in an accompanying article:
“The more you want to learn, the larger your sample size must be.”
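Overfitting is easy to see in a small simulation. The sketch below is my own illustration, not something from Frost's articles: the straight-line "population", the noise level, the seed, and the helper names `r_squared` and `simulate` are all assumptions. It fits a straight line and a 9th-degree polynomial to the same 10 points, then scores both fits on fresh data drawn from the same process.

```python
import numpy as np
from numpy.polynomial import Polynomial

def r_squared(y, y_hat):
    """R^2 of predictions y_hat against observed values y."""
    ss_res = float(((y - y_hat) ** 2).sum())
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def simulate(x, rng):
    """Hypothetical population: a straight line plus random noise."""
    return 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.shape)

rng = np.random.default_rng(1)                # assumed seed, for repeatability only
x_train = np.linspace(0.0, 10.0, 10)          # the small sample we fit to
y_train = simulate(x_train, rng)
x_new = rng.uniform(0.0, 10.0, size=200)      # future data from the same process
y_new = simulate(x_new, rng)

train_r2, new_r2 = {}, {}
for degree in (1, 9):
    model = Polynomial.fit(x_train, y_train, degree)   # least squares polynomial fit
    train_r2[degree] = r_squared(y_train, model(x_train))
    new_r2[degree] = r_squared(y_new, model(x_new))
    print(f"degree {degree}: sample R2 = {train_r2[degree]:.3f}, "
          f"new-data R2 = {new_r2[degree]:.3f}")
```

With 10 points and 10 coefficients, the 9th-degree polynomial passes through every sample point, so its sample R2 is essentially 1. On the new data its R2 drops, typically well below that of the straight line, because the extra terms were fitting the noise in the sample rather than the underlying relationship.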

I haven’t performed any experiments in my laboratory that require fitting a model to data, so I would be interested in knowing if any of you have come across these issues. Do you come across R2 values that are uncommon for your field? Is overfitting a common issue in your lab?

1 comment:

  1. Good stuff.

    Most of the regression I've done is to fit models to data sets with the purpose of deriving estimates of the model parameters. The model parameters have biological meaning. The models I choose are driven by my understanding of the biological system. The parameters then become a random variable I can track while I manipulate the system to explore how it functions.

    As you say, more perfect models can always be written to fit the data better and better. But what do you have in the end? Well, you have a beautiful equation, and a high r^2, but does it have any real utility in your work?

    There isn't any statistic that tells you that you've over-written a model. Sometimes good, simple, parsimonious models yield high r^2. Sometimes over-written models do the same.

    The way to know this is by looking at the model parameters and realizing you can't make any sense of one or more of them, biologically.

    For example, y = f(a, b, c, d, e, x), where x is your explanatory variable. In other words, 5 distinct modifiers of x are responsible for your effect level. If any of a, b, c, d, and e doesn't correspond to some specific biological attribute in your system (e.g., a rate, an affinity, or some other biological relationship), then as beautiful as it looks, you don't have a particularly useful model.

    The excess parameters are fudge factors you need to get a more perfect fit of the model to the data, but they don't really tell you anything about the system.

    When we first start working with regression models, we think the analysis and output are going to reveal hidden treasures of insight into our data and tell us what it all means.

    In fact, that is a data exploration mindset. That is fine in the early stages of a project, where the idea is to poke around and see how the system functions, but that mindset is sailing on a sea of bias.

    At some point, the explorer needs to shift their thinking to a more hypothesis-driven frame of mind. The regression model they choose is driven by their understanding of how the system operates. They're using the model fit, the value of the model parameters, to test a variety of other interesting questions. The model, in effect, becomes the bread and butter assay result.

    Shorter: Only our expertise with the system can tell us which regression parameters are fudge factors, and which represent relevant biological functionality.