When fitting a model to your data, it is important to remember that a good R2 value (~1.0) doesn’t necessarily mean that your model is the best at describing your actual population.
Looking at similar data sets and studies can help you determine if your model and R2 are consistent with what has been done in your field. Jim Frost discusses how physical processes tend to be predictable and have low variation, so they produce high R2 values. However, if you are a psychologist and study human behavior, which is highly variable, an unusually high R2 value could indicate that your model is tailored to your particular sample and does not represent the population well.
Frost continues by addressing a few possibilities that may explain an R2 value that is too high. For example, R2 is a biased estimate: because it is calculated from your sample data, it tends to overstate the value for the population. Another reason that your R2 may be too high is that you’re trying too many models on your data just to find the perfect fit. As we’ve discussed in class, it is very important to pick your statistical model before you perform any experiments.
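To make the sample-based bias a little more concrete, here is a minimal sketch of adjusted R2, one common correction that shrinks R2 according to the sample size and the number of predictors. The function name and the numbers are my own illustrative assumptions, not something taken from Frost's article.

```python
def adjusted_r2(r2: float, n_samples: int, n_predictors: int) -> float:
    """Adjusted R2: penalizes R2 for small samples and many predictors.

    Hypothetical helper for illustration; assumes an ordinary linear
    regression with an intercept.
    """
    dof = n_samples - n_predictors - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Same raw R2 and same number of predictors, different sample sizes.
# The values below are made up purely to show the direction of the effect.
small_sample = adjusted_r2(0.90, n_samples=10, n_predictors=5)
large_sample = adjusted_r2(0.90, n_samples=100, n_predictors=5)

print(f"n = 10:  adjusted R2 = {small_sample:.3f}")
print(f"n = 100: adjusted R2 = {large_sample:.3f}")
```

With only 10 observations the correction is large, while with 100 observations the adjusted value stays close to the raw R2.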
A very important problem encountered when fitting a model to data is “overfitting”. This means that your model is too complicated for your data set. If you think about it, you can force an equation to fit your data perfectly if you give it enough terms. But remember, your data are collected from a sample that is generally meant to represent a population. An overly complicated model ends up fitting the random noise in your sample, so it may not accurately predict future data points. Overfitting can cause your statistical values, R2 included, to be misleading. Having a large sample size can help overcome this problem and allows you to model more complex relationships. Remember, as Frost said in an accompanying article:
“The more you want to learn, the larger your sample size must be.”
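Here is a rough sketch of what overfitting can look like in practice. It assumes simulated data (a straight line plus random noise, with a made-up slope, intercept, and noise level) and compares a simple linear fit against a 9th-degree polynomial fitted to only 10 points. The setup is purely illustrative and is not from Frost's article.

```python
import numpy as np
from numpy.polynomial import Polynomial


def r2_score(y_true, y_pred):
    """Plain R2: 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def make_sample(n):
    """Draw n points from an assumed 'population': y = 2x + 1 plus noise."""
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, n)
    return x, y


x_train, y_train = make_sample(10)   # the small sample we fit to
x_new, y_new = make_sample(200)      # "future" data from the same population

results = {}
for degree in (1, 9):
    # Least-squares polynomial fit of the given degree to the small sample
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (
        r2_score(y_train, model(x_train)),  # R2 on the data used for fitting
        r2_score(y_new, model(x_new)),      # R2 on data the model never saw
    )
    print(f"degree {degree}: R2 (sample) = {results[degree][0]:.3f}, "
          f"R2 (new data) = {results[degree][1]:.3f}")
```

The thing to look at is the gap between the two R2 values for each model. The 9th-degree polynomial has enough terms to pass through nearly every sample point, so its sample R2 looks excellent, but that says little about how well it predicts new data from the same population.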
I haven’t performed any experiments in my laboratory that require fitting a model to data, so I would be interested in knowing if any of you have come across these issues. Do you come across R2 values that are uncommon for your field? Is overfitting a common issue in your lab?