Every discipline seems to have its own version of the
chicken or egg debate. For statistics, the debate could be: Which came first,
the model or the data?
The answer would initially appear to be quite obvious. The
data must come first, of course. As Harvey Motulsky states in Intuitive Biostatistics, “Regression
does not fit data to a model…rather, the model is fit to the data.” In other
words, the data are used to calculate the parameters of the model. That basic
premise is quite clear.
Where it gets tricky, however, is in the selection of which
type of model to use. There are linear and logarithmic models, dose-response curves,
binding curves, higher order polynomials, etc. To navigate these many options,
Motulsky again has sage advice: “Choosing a model should be a scientific
decision based on chemistry (or physiology, or genetics, etc.).” So if your
data are measurements of radioactive decay, you know that you should use an
exponential decay model. Or if your data are measurements of the effect of a
drug, you know that you should use a logarithmic dose-response model. Even if
the R² value for a different model were higher, it would be
inappropriate to try to fit your data to a model that you know does not make
sense biologically, chemically, etc.
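To make the "model first" case concrete, here is a minimal sketch of fitting an exponential decay model that was chosen in advance. It assumes made-up, noise-free counts, and it estimates the parameters by ordinary least squares on the log-transformed counts, which is a simplification of the nonlinear regression that a program like GraphPad would run. The function name `fit_exponential_decay` and all of the numbers are hypothetical.

```python
import math


def fit_exponential_decay(times, counts):
    """Estimate A and k in counts = A * exp(-k * t).

    Simplification: fits a straight line to ln(counts) versus t,
    so every count must be positive.
    """
    logs = [math.log(c) for c in counts]
    n = len(times)
    mean_t = sum(times) / n
    mean_log = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_log) for t, v in zip(times, logs))
    slope = sxy / sxx
    intercept = mean_log - slope * mean_t
    # ln(counts) = ln(A) - k * t, so A = exp(intercept) and k = -slope
    return math.exp(intercept), -slope


# Hypothetical data generated from A = 100 and k = 0.3, with no noise added
times = [0, 1, 2, 3, 4, 5]
counts = [100.0 * math.exp(-0.3 * t) for t in times]

amplitude, rate = fit_exponential_decay(times, counts)
half_life = math.log(2) / rate
print(f"A = {amplitude:.2f}, k = {rate:.3f}, half-life = {half_life:.2f}")
```

The type of model is fixed before any data are seen. The data only supply the values of A and k, which is the sense in which the model is fit to the data.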
But that is where the chicken or the egg question comes
into play. How does the initial model for a particular biological or chemical
system get established? Surely someone, at some point, had to try several
models and find the one that fit best for that type of data. If no one has ever
established a model for a particular system, you cannot follow the statistical
best practice of selecting a priori that
you will fit a particular type of model to your data (as in (A) in the figure
below). Instead you have to try several types of models. Rather than using the
data to determine the best-fit values for parameters of a model, you are now
using the data to determine which parameters even need to be calculated in the
first place. Certainly this process will still be driven by the data, and
rigorous statistical tests can be applied to determine which model fits the
data better. Nevertheless, as the figure below demonstrates, this situation (B)
fundamentally alters the workflow needed to test your hypothesis. Once the appropriate model for a system is
established, the workflow can return to that outlined in (A), but the initial
need to establish a model flips the data-model relationship somewhat.
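As a rough illustration of the "data first" situation, the sketch below fits two candidate models to the same made-up data and compares them with Akaike's information criterion (AIC), one of several tests that could be used here. It assumes Gaussian errors with constant variance, uses the least-squares form AIC = n ln(SSE/n) + 2K, and fits the exponential model on the log scale as a shortcut, so its residuals are slightly larger than a true nonlinear fit would give. The function names and the data are hypothetical.

```python
import math


def ols_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def aic(sse, n, n_params):
    """Least-squares AIC; K counts the fitted parameters plus the error variance."""
    return n * math.log(sse / n) + 2 * (n_params + 1)


def compare_models(times, values):
    """Fit a straight line and an exponential decay, return each model's AIC."""
    n = len(times)
    # Candidate 1: y = a + b * t
    a, b = ols_line(times, values)
    sse_line = sum((y - (a + b * t)) ** 2 for t, y in zip(times, values))
    # Candidate 2: y = A * exp(-k * t), fit as a line through ln(y)
    ln_a, neg_k = ols_line(times, [math.log(y) for y in values])
    sse_exp = sum(
        (y - math.exp(ln_a + neg_k * t)) ** 2 for t, y in zip(times, values)
    )
    return {"linear": aic(sse_line, n, 2), "exponential": aic(sse_exp, n, 2)}


# Hypothetical measurements: exponential decay plus a small alternating error
times = list(range(8))
values = [100.0 * math.exp(-0.3 * t) + (1 if t % 2 == 0 else -1) for t in times]

scores = compare_models(times, values)
best = min(scores, key=scores.get)  # the lower AIC is the better-supported model
print(best, scores)
```

Here the data choose between the model types, and only afterward do the parameters of the winning model carry any meaning. Once a model has been established this way, later experiments can return to fitting that single model chosen a priori.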
So although the data do always inform the model (and a model
is always fit to the data, not vice versa!), there are situations where the
data come first and situations where the model comes first. As with most
chicken or egg debates, perhaps this one does not have a definitive answer
either.
I think this is a very interesting question to address! As we have been learning about regression in class, I keep questioning "but how was this model established?!" Though it is very convenient for us to have numerous options in GraphPad for the types of models we can apply to our data, I cannot help but think about the origin of those models. At some point in history, someone had to use their best judgement in designing the model. As you point out, it is easy to know which model to use for some experimental approaches like radioactive decay, but when you are working on a novel approach, it may be hard to know how to analyze your data.
Since I do not have a full grasp on selecting models for regression, I feel uncomfortable relying on it entirely. Luckily, with my experiments, I can primarily utilize ANOVA. However, I think it is important to address this potential for bias and confusion in models. By bringing attention to this "chicken or the egg" issue, some may be able to think more about their approach to using models and avoid bias.
When we were in high school and college taking geometry and calculus, equation parameters were pretty abstract and mostly discussed in the context of how the parameter changed the shape of the curve.
In regression applied to biological systems, we have to think of those parameters as an index of some biological process.
For the "neurite sparks" assignment, that B2 parameter is an index for a process that gradstudin controls that defines the shape of the response.
The math can't tell you what that process is in your model, but it does tell you it is important.
This is a really interesting take on what I have conceptually struggled with in class. We are taught that the data fit the model, but also that the model is made in order to fit the data that we have. The model is "perfect", but the data can be influenced by sampling and, perhaps but hopefully not, by bias. There is a strange catch-22 here that you capture succinctly with your analogy to the chicken and the egg debate. Because most of us have not been trained formally in statistics besides this course and perhaps 1 or 2 others, we have to take what previous statisticians and mathematicians have worked out as what is correct. However, whenever there is a problem or a choice that we have to make between two similar models, we can't ask Gosset or Fisher for help. In these situations, we need to have at least enough of a working knowledge of WHY we are making the statistical choices that we are making so that we can interpret the parameters of the model. As with most techniques, we need to take what other people have worked out and mold it to our own system and what fits our experiments.