Every discipline seems to have its own version of the
chicken or egg debate. For statistics, the debate could be: Which came first,
the model or the data?
The answer would initially appear to be quite obvious. The
data must come first, of course. As Harvey Motulsky states in Intuitive Biostatistics, “Regression
does not fit data to a model…rather, the model is fit to the data.” In other
words, the data are used to calculate the parameters of the model. That basic
premise is quite clear.
Where it gets tricky, however, is in selecting which type of
model to use. There are linear and logarithmic models, dose-response curves,
binding curves, higher-order polynomials, etc. To navigate these many options,
Motulsky again has sage advice: “Choosing a model should be a scientific
decision based on chemistry (or physiology, or genetics, etc.).” So if your
data are measurements of radioactive decay, you know that you should use an
exponential decay model. Or if your data are measurements of the effect of a
drug, you know that you should use a logarithmic dose-response model. Even if
the R² value for a different model were higher, it would be
inappropriate to try to fit your data to a model that you know does not make
sense biologically, chemically, etc.
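
To make the radioactive decay case concrete, here is a minimal sketch of fitting a model that was chosen in advance. The measurements are hypothetical values invented for illustration, and the approach shown (taking logs and fitting a straight line by ordinary least squares) is just one simple way to estimate the parameters. Nonlinear regression on the untransformed counts is a common alternative and weights the points differently.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical decay measurements (illustrative values, not real data).
times_h = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
counts = [1000.0, 820.0, 670.0, 549.0, 449.0, 368.0, 301.0]

# The model is chosen before looking at the data: N(t) = N0 * exp(-k * t).
# Taking logs gives ln N = ln N0 - k * t, which is a straight line in t.
slope, intercept = fit_line(times_h, [math.log(c) for c in counts])

k = -slope                    # decay constant, per hour
n0 = math.exp(intercept)      # estimated initial count
half_life = math.log(2) / k   # hours

print(f"N0 = {n0:.1f}, k = {k:.3f} per hour, half-life = {half_life:.2f} hours")
```

The data only determine the values of N0 and k. The decision to use an exponential form was made beforehand, on physical grounds.
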
But that is where the chicken or the egg question comes into
play. How does the initial model for a particular biological or chemical
system get established? Surely someone, at some point, had to try several
models and find the one that fit best for that type of data. If no one has ever
established a model for a particular system, you cannot follow the statistical
best practice of selecting a priori that
you will fit a particular type of model to your data (as in (A) in the figure
below). Instead you have to try several types of models. Rather than using the
data to determine the best-fit values for parameters of a model, you are now
using the data to determine which parameters even need to be calculated in the
first place. Certainly this process will still be driven by the data, and
rigorous statistical tests can be applied to determine which model fits the
data better. Nevertheless, as the figure below demonstrates, this situation (B)
fundamentally alters the workflow needed to test your hypothesis. Once the appropriate model for a system is
established, the workflow can return to that outlined in (A), but the initial
need to establish a model flips the data-model relationship somewhat.
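
When no model has been established, the comparison step might look something like the sketch below. It fits two candidate models to the same hypothetical measurements and ranks them with Akaike's information criterion (AIC), which is only one of several possible comparison methods. The data, the variable names, and the choice of candidates are all assumptions made for illustration. Both candidates are fit on the original response scale so that their residual sums of squares are directly comparable.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def aic_least_squares(xs, ys):
    """AIC (up to an additive constant) for a straight-line least-squares fit."""
    slope, intercept = fit_line(xs, ys)
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    n = len(ys)
    k_params = 3  # slope, intercept, and the error variance
    return n * math.log(rss / n) + 2 * k_params

# Hypothetical measurements from a system with no established model.
dose = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
response = [2.1, 4.0, 6.2, 8.2, 10.4, 12.3]

# Candidate models: response linear in dose, or linear in ln(dose).
candidates = {
    "linear": dose,
    "logarithmic": [math.log(x) for x in dose],
}

aic = {name: aic_least_squares(xs, response) for name, xs in candidates.items()}
best = min(aic, key=aic.get)  # lower AIC indicates the better-supported model

for name, value in aic.items():
    print(f"{name}: AIC = {value:.2f}")
print("preferred model:", best)
```

With so few points, a small-sample correction such as AICc would usually be preferable. The sketch omits it to keep the comparison easy to follow.
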
So although the data do always inform the model (and a model
is always fit to the data, not vice versa!), there are situations where the
data come first and situations where the model comes first. As with most
chicken or egg debates, perhaps this one does not have a definitive answer
either.