Wednesday, April 6, 2016

Finding the Right Fit

The idea behind fitting a model is to find the best-fit values of the parameters that define it. According to Motulsky, a common mistake in statistical modeling is overfitting: trying to find a “perfect model.” The goal of a model is not to describe the data perfectly, since a model that does may have too many variables and parameters to be useful. Nonlinear regression can fit any model that defines Y as a function of X and, as the name suggests, the relationship between Y and X can be curved.

In the example above, census data showing the U.S. population is on the left, and on the right we can see the different model fits, ranging from exponential all the way to a sixth-degree polynomial. Most of the models seem to fit the data. However, if we extrapolate the fits to predict future population values, the behavior of the sixth-degree polynomial beyond the data range makes it a poor choice for extrapolation, and this fit can be rejected without any need to calculate goodness of fit (unless we are headed for complete and total annihilation).
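To make the extrapolation problem concrete, here is a minimal sketch in Python with NumPy. It does not use the census data from the figure: the "population" series, the growth rate, and the noise level are all made-up assumptions, and the helper name fit_and_score is only for illustration. It fits a second-degree and a sixth-degree polynomial to the same points, compares how well each matches the data it was fitted to, and then asks both for predictions past the last data point.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic, made-up series. This is NOT the census data from the post.
rng = np.random.default_rng(0)
years = np.arange(1900, 2001, 10)                      # 11 hypothetical census years
smooth_growth = 80.0 * np.exp(0.012 * (years - 1900))  # assumed underlying trend
population = smooth_growth + rng.normal(0.0, 2.0, size=years.size)


def fit_and_score(degree):
    """Least-squares polynomial fit; returns the fit and its in-sample
    residual sum of squares (RSS)."""
    poly = Polynomial.fit(years, population, degree)
    residuals = population - poly(years)
    return poly, float(np.sum(residuals ** 2))


quadratic, rss_quadratic = fit_and_score(2)
sextic, rss_sextic = fit_and_score(6)

# Inside the data range the sixth-degree fit cannot do worse than the
# second-degree one, because it contains it as a special case.
print(f"in-sample RSS, degree 2: {rss_quadratic:.2f}")
print(f"in-sample RSS, degree 6: {rss_sextic:.2f}")

# Beyond the data range nothing constrains the extra terms.
for future_year in (2010, 2030, 2050):
    print(
        future_year,
        f"degree 2: {float(quadratic(future_year)):.1f}",
        f"degree 6: {float(sextic(future_year)):.1f}",
    )
```

The in-sample comparison will always favor the sixth-degree fit, which is exactly why goodness of fit alone is a poor guide here. The extrapolated values depend on the random seed and the assumed trend, so no specific numbers are quoted, but the high-degree predictions are the ones to be suspicious of.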

In this last example, researchers set out to model the growth of the native Mexican turkey and estimate the period of maximum instantaneous growth, so that the birds can be marketed and sold at optimal weight. Previous literature had shown that the Richards model (in red) was the most appropriate function for estimating growth curves in poultry; however, this study suggests that in this particular case a fourth-order polynomial (in blue) is better at estimating the maximum growth period.

Which model is right, and which one is wrong? This goes back to the issue of overfitting. A model with too few parameters won't fit the data well; a model with too many will fit, but the confidence intervals on its parameters will be wide. If the goal of the model is to predict future values, a model with too many parameters won't do it well, and if the goal is to interpret the parameter values scientifically, the CIs will be too wide to be informative. And finally, something that is only tangentially related to fitting models:

Sources: Census data polynomial curve fitting, Mexican Turkeys, Doge, and Motulsky.


  1. In response to your question, "which model is right and which model is wrong," I would direct you to one of my favorite quotes by George E. P. Box which says, "all models are wrong, but some are useful." In fitting models to data, I don't believe the goal is to find a "right" model, but one that is most useful to you.

    As is evident in the graphs you included on census data, the sixth-order polynomial is likely not the model that would be most useful if your goal were to predict future trends in US population change. However, if your goal were to accurately describe and analyze past trends in US population change, the sixth-order polynomial may be the best model to use.

    Overall I think which model you choose to use in a given situation deserves a lot of thought. You need to recognize what goal you wish to achieve with your analysis. I think in the end, however, the mantra that "all models are wrong, but some are useful" should be kept in mind.

  2. I thought that your post showed the importance of choosing the right model for your data. With your US population example, based on which model you choose, you can theoretically find a model that will give you exactly the result you want to see. Maybe the researchers wanted to show that the US population is essentially doomed in the future; as a result, the sixth degree polynomial fit is the perfect model.

    I think this post also shows that it is important to plan out which statistical tests or models you want to use before actually performing your experiment. Otherwise, after getting your data, you can just play around with different models until you get the results you like best, which introduces bias into your study.