When fitting a model to your data, it is important to remember that
a good R² value (~1.0) does not by itself mean that your model is the
best at describing your actual population.
Looking at similar data sets and studies can help you
determine whether your model and R² are consistent with what has been
done in your field. Jim Frost discusses how physical processes tend to be
predictable, with low variation and therefore high R² values. However, if
you are a psychologist studying human behavior, which is highly variable, a
high R² value could indicate that your model fits your particular
sample rather than the population it is meant to represent.
Frost continues by addressing a few possibilities
that may explain an R² value that is too high. For example, R² is a
biased estimate: because it is calculated from your sample data, it tends to
overstate how well the model fits the population. Another reason
that your R² may be too high is that you are trying too many
models on your data just to find the perfect fit. As we’ve discussed in class,
it is very important to pick your statistical model before you perform any
experiments.
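One common way to see how much the sample-based bias matters is the adjusted R², which shrinks R² according to the number of predictors relative to the sample size. The sketch below is only an illustration: the function name and the example numbers are made up, and the post itself does not say which correction, if any, Frost recommends.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Standard adjusted R^2: penalizes R^2 for each predictor in the model.

    Assumes n_samples > n_predictors + 1 so the denominator is positive.
    """
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical numbers: the same raw R^2 of 0.90 with 5 predictors,
# once from a small sample and once from a larger one.
small_sample = adjusted_r_squared(0.90, n_samples=10, n_predictors=5)
large_sample = adjusted_r_squared(0.90, n_samples=100, n_predictors=5)

print(f"n = 10:  adjusted R^2 = {small_sample:.3f}")
print(f"n = 100: adjusted R^2 = {large_sample:.3f}")
```

With the small sample the adjustment is substantial, while with the larger sample the adjusted value stays close to the raw R².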
A very important problem
encountered when fitting a model to data is “overfitting”. This
means that your model is too complicated for your data set. If you think about it,
you can probably force some equation to fit your data perfectly. But remember,
your data are collected from a sample that is generally meant to represent a
population. An overly complicated model ends up describing the random noise in
that sample, so it may not accurately predict future data points, and its
statistics, including R², can be misleading. Having
a large sample size can help overcome this problem and allows for better
modeling of complex relationships. Remember, as Frost said in an accompanying
article:
“The more you want to learn, the larger your sample size must be.”
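To make the overfitting idea concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: six made-up measurements scattered around the line y = 2x + 1, and a second made-up batch that repeats the same measurements with different noise. It compares a straight line with a polynomial forced through every point of the first batch.

```python
# Hypothetical data: the true trend is y = 2x + 1 plus small made-up noise.
x = [0, 1, 2, 3, 4, 5]
y_sample = [1.5, 2.3, 5.3, 6.6, 9.6, 10.7]   # the sample used for fitting
y_repeat = [0.6, 3.5, 4.4, 7.3, 8.8, 11.4]   # a repeat of the same measurements


def r_squared(y_observed, y_predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_observed) / len(y_observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_observed, y_predicted))
    ss_tot = sum((o - mean_y) ** 2 for o in y_observed)
    return 1 - ss_res / ss_tot


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda t: intercept + slope * t


def fit_through_every_point(xs, ys):
    """Polynomial of degree n-1 (Lagrange form) that passes through all n points."""
    def predict(t):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (t - xj) / (xi - xj)
            total += term
        return total
    return predict


models = {
    "straight line": fit_line(x, y_sample),
    "degree-5 polynomial": fit_through_every_point(x, y_sample),
}

results = {}
for name, model in models.items():
    predictions = [model(t) for t in x]
    results[name] = (r_squared(y_sample, predictions),
                     r_squared(y_repeat, predictions))
    print(f"{name}: R^2 on sample = {results[name][0]:.3f}, "
          f"on repeat = {results[name][1]:.3f}")
```

On these made-up numbers the polynomial reaches an R² of 1 on the sample it was fitted to, yet it describes the repeated measurements worse than the straight line does (roughly 0.94 versus 0.99). The complicated model has memorized the noise in one sample.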
I haven’t performed any
experiments in my laboratory that require fitting a model to data, so I would
be interested in knowing if any of you have come across these issues. Do you
come across R² values that are uncommon for your field? Is
overfitting a common issue in your lab?