Sunday, April 3, 2016

Chicken or Egg?

Every discipline seems to have its own version of the chicken or egg debate. For statistics, the debate could be: Which came first, the model or the data?

The answer would initially appear to be quite obvious. The data must come first, of course. As Harvey Motulsky states in Intuitive Biostatistics, “Regression does not fit data to a model…rather, the model is fit to the data.” In other words, the data are used to calculate the parameters of the model. That basic premise is quite clear.

Where it gets tricky, however, is in selecting which type of model to use. There are linear models, logarithmic models, dose-response curves, binding curves, higher-order polynomials, and so on. To navigate these many options, Motulsky again has sage advice: “Choosing a model should be a scientific decision based on chemistry (or physiology, or genetics, etc.).” So if your data are measurements of radioactive decay, you know that you should use an exponential decay model. Or if your data are measurements of the effect of a drug, you know that you should use a logarithmic dose-response model. Even if the R² value for a different model were higher, it would be inappropriate to fit a model to your data that you know does not make sense biologically, chemically, etc.
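The radioactive decay case makes this concrete: the form of the model, N(t) = N₀·e^(−kt), is fixed a priori by the physics, and regression only estimates N₀ and k. Here is a minimal sketch with made-up numbers (a real analysis would use nonlinear least squares, as GraphPad does, rather than this log-linearization shortcut):

```python
import math

# Illustrative decay measurements generated from N(t) = 100 * exp(-0.2 t);
# made-up numbers, not real data.
t = [0.0, 1.0, 2.0, 4.0, 8.0]
y = [100.0, 81.87, 67.03, 44.93, 20.19]

# The model form is fixed a priori by the physics: N(t) = N0 * exp(-k t).
# Taking logs linearizes it (ln N = ln N0 - k t), so ordinary least squares
# on (t, ln y) recovers estimates of the two parameters.
ln_y = [math.log(v) for v in y]
n = len(t)
t_bar = sum(t) / n
l_bar = sum(ln_y) / n
slope = (sum((ti - t_bar) * (li - l_bar) for ti, li in zip(t, ln_y))
         / sum((ti - t_bar) ** 2 for ti in t))
k_hat = -slope                              # decay constant
n0_hat = math.exp(l_bar - slope * t_bar)    # initial amount
print(f"N0 ≈ {n0_hat:.1f}, k ≈ {k_hat:.3f}")
# prints: N0 ≈ 100.0, k ≈ 0.200
```

Notice that at no point did the data get a vote on the *shape* of the model; they only determined the best-fit values of its parameters.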

But that is where the chicken or the egg question comes into play. How does the initial model for a particular biological or chemical system get established? Surely someone, at some point, had to try several models and find the one that fit best for that type of data. If no one has ever established a model for a particular system, you cannot follow the statistical best practice of selecting a priori the type of model you will fit to your data (as in (A) in the figure below). Instead you have to try several types of models. Rather than using the data to determine the best-fit values for the parameters of a model, you are now using the data to determine which parameters even need to be calculated in the first place. Certainly this process is still driven by the data, and rigorous statistical tests can be applied to determine which model fits the data better. Nevertheless, as the figure below demonstrates, this situation (B) fundamentally alters the workflow needed to test your hypothesis. Once the appropriate model for a system is established, the workflow can return to that outlined in (A), but the initial need to establish a model flips the data-model relationship somewhat.
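To make "try several models and test which fits better" concrete, here is a rough sketch of comparing two candidate models with the Akaike information criterion (AIC), where lower is better. The data and both candidate models are invented for illustration; GraphPad, for example, implements this kind of comparison with AICc or the extra sum-of-squares F test.

```python
import math

# Invented data standing in for a system with no established model
# (secretly generated from an exponential decay); not real measurements.
t = [0.0, 1.0, 2.0, 4.0, 8.0, 12.0, 16.0]
y = [100.0, 81.9, 67.0, 44.9, 20.2, 9.1, 4.1]

def ols(xs, ys):
    """Ordinary least squares for ys = a + b * xs; returns (a, b)."""
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    b = (sum((x - xb) * (v - yb) for x, v in zip(xs, ys))
         / sum((x - xb) ** 2 for x in xs))
    return yb - b * xb, b

def rss(pred):
    """Residual sum of squares of a set of predictions against y."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred))

def aic(rss_val, n_params):
    # AIC for least-squares fits: n * ln(RSS/n) + 2k (lower is better)
    n = len(y)
    return n * math.log(rss_val / n) + 2 * n_params

# Candidate 1: straight line, y = a + b*t (2 parameters)
a, b = ols(t, y)
rss_lin = rss([a + b * ti for ti in t])

# Candidate 2: single exponential, fit on the log scale (2 parameters)
la, lb = ols(t, [math.log(v) for v in y])
rss_exp = rss([math.exp(la + lb * ti) for ti in t])

print("AIC, linear:     ", round(aic(rss_lin, 2), 1))
print("AIC, exponential:", round(aic(rss_exp, 2), 1))
```

Here the exponential wins decisively, but the point stands: in workflow (B) the data are choosing among model forms, not just filling in parameter values, and that is exactly the flipped relationship described above.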

So although the data do always inform the model (and a model is always fit to the data, not vice versa!), there are situations where the data come first and situations where the model comes first. As with most chicken or egg debates, perhaps this one does not have a definitive answer either.  


  1. I think this is a very interesting question to address! As we have been learning about regression in class, I keep questioning, "But how was this model established?!" Though it is very convenient for us to have numerous options in GraphPad for the types of models we can apply to our data, I cannot help but think about the origin of those models. At some point in history, someone had to use their best judgement in designing the model. As you point out, it is easy to know which model to use for some experimental approaches, like radioactive decay, but when you are working on a novel approach, it may be hard to know how to analyze your data.
    Since I do not have a full grasp on selecting models for regression, I feel uncomfortable relying on it entirely. Luckily, with my experiments, I can primarily utilize ANOVA. However, I think it is important to address this potential for bias and confusion in models. By bringing attention to this "chicken or the egg" issue, some may be able to think more about their approach to using models and avoid bias.

    1. When we were taking geometry and calculus in high school and college, equation parameters were pretty abstract and mostly discussed in the context of how the parameter changed the shape of the curve.

      In regression applied to biological systems, we have to think of those parameters as an index of some biological process.

      For the "neurite sparks" assignment, that B2 parameter is an index of a process, controlled by gradstudin, that defines the shape of the response.

      The math can't tell you what that process is in your model, but it does tell you it is important.

  2. This is a really interesting take on something I have conceptually struggled with in class. We are taught that the data fit the model, but also that the model is made in order to fit the data that we have. The model is "perfect", but the data can be influenced by sampling and, perhaps but hopefully not, by bias. There is a strange catch-22 here that you capture succinctly with your analogy to the chicken and the egg debate. Because most of us have not been trained formally in statistics beyond this course and perhaps one or two others, we have to take what previous statisticians and mathematicians have worked out as what is correct. However, whenever there is a problem or a choice that we have to make between two similar models, we can't ask Gosset or Fisher for help. In these situations, we need to have at least enough of a working knowledge of WHY we are making the statistical choices that we are making so that we can interpret the parameters of the model. As with most techniques, we need to take what other people have worked out and mold it to our own system and what fits our experiments.