Every discipline seems to have its own version of the chicken or egg debate. For statistics, the debate could be: Which came first, the model or the data?
The answer would initially appear to be quite obvious. The data must come first, of course. As Harvey Motulsky states in Intuitive Biostatistics, “Regression does not fit data to a model…rather, the model is fit to the data.” In other words, the data are used to calculate the parameters of the model. That basic premise is quite clear.
Where it gets tricky, however, is in the selection of which type of model to use. There are linear models, logarithmic models, dose-response curves, binding curves, higher-order polynomials, etc. To navigate these many options, Motulsky again has sage advice: “Choosing a model should be a scientific decision based on chemistry (or physiology, or genetics, etc.).” So if your data are measurements of radioactive decay, you know that you should use an exponential decay model. Or if your data are measurements of the effect of a drug, you know that you should use a logarithmic dose-response model. Even if the R² value for a different model were higher, it would be inappropriate to fit a model that you know does not make sense biologically, chemically, etc.
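To make the "model first" case concrete, here is a minimal Python sketch in which the exponential decay model N(t) = N0 * exp(-k * t) is chosen in advance and the data are used only to estimate its two parameters. The measurements are hypothetical and noise-free, and the function name is invented for illustration. For brevity the sketch fits a straight line to the logarithm of the counts; that is a simplification, and a regression package would normally fit the untransformed data by nonlinear regression instead.

```python
import math

def fit_exponential_decay(times, counts):
    """Estimate N0 and k in N(t) = N0 * exp(-k * t).

    Simplification: fits a straight line to ln(counts) versus time by
    ordinary least squares rather than running nonlinear regression.
    """
    logs = [math.log(c) for c in counts]
    n = len(times)
    mean_t = sum(times) / n
    mean_log = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_log) for t, y in zip(times, logs))
    slope = sxy / sxx
    intercept = mean_log - slope * mean_t
    # ln N = ln N0 - k * t, so N0 = exp(intercept) and k = -slope
    return math.exp(intercept), -slope

# Hypothetical, noise-free data generated from N0 = 1000 and k = 0.3
times = [0, 1, 2, 3, 4, 5]
counts = [1000 * math.exp(-0.3 * t) for t in times]

n0, k = fit_exponential_decay(times, counts)
half_life = math.log(2) / k
print(f"N0 = {n0:.1f}, k = {k:.3f}, half-life = {half_life:.2f}")
```

The form of the equation never changes here; only the values of N0 and k depend on the data.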
But that is where the chicken or the egg question comes into play. How does the initial model for a particular biological or chemical system get established? Surely someone, at some point, had to try several models and find the one that fit best for that type of data. If no one has ever established a model for a particular system, you cannot follow the statistical best practice of selecting a priori that you will fit a particular type of model to your data (as in (A) in the figure below). Instead, you have to try several types of models. Rather than using the data to determine the best-fit values for parameters of a model, you are now using the data to determine which parameters even need to be calculated in the first place. Certainly this process will still be driven by the data, and rigorous statistical tests can be applied to determine which model fits the data better. Nevertheless, as the figure below demonstrates, this situation (B) fundamentally alters the workflow needed to test your hypothesis. Once the appropriate model for a system is established, the workflow can return to that outlined in (A), but the initial need to establish a model flips the data-model relationship somewhat.
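The "data first" situation can be sketched the same way. The Python example below fits two candidate models, a straight line and a quadratic, to the same hypothetical data set and compares them with the corrected Akaike information criterion (AICc). AICc is one possible choice of comparison, picked here as an assumption; the data, the noise values, and the function names are all invented for illustration.

```python
import math

def fit_polynomial(xs, ys, degree):
    """Ordinary least squares for y = c0 + c1*x + ... via the normal equations."""
    m = degree + 1
    # Augmented matrix [X^T X | X^T y]
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][m] / a[i][i] for i in range(m)]

def sum_of_squares(xs, ys, coeffs):
    """Residual sum of squares of the fitted polynomial."""
    return sum((y - sum(c * x ** i for i, c in enumerate(coeffs))) ** 2
               for x, y in zip(xs, ys))

def aicc(ss, n, n_coeffs):
    """Corrected AIC for a least-squares fit; lower is better."""
    k = n_coeffs + 1  # fitted coefficients plus the residual variance
    return n * math.log(ss / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

# Hypothetical data: a quadratic trend plus small fixed perturbations
xs = list(range(10))
noise = [0.1, -0.2, 0.15, -0.05, 0.2, -0.1, 0.05, -0.15, 0.1, -0.1]
ys = [2 + 0.5 * x + 0.3 * x ** 2 + e for x, e in zip(xs, noise)]

results = {}
for name, degree in [("line", 1), ("quadratic", 2)]:
    coeffs = fit_polynomial(xs, ys, degree)
    ss = sum_of_squares(xs, ys, coeffs)
    results[name] = aicc(ss, len(xs), len(coeffs))
    print(f"{name}: SS = {ss:.3f}, AICc = {results[name]:.2f}")

best = min(results, key=results.get)
print("Lower AICc:", best)
```

A lower AICc only says which candidate the data favor. As argued above, a model that makes no sense biologically or chemically should not be adopted simply because it scores better.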
So although the data do always inform the model (and a model is always fit to the data, not vice versa!), there are situations where the data come first and situations where the model comes first. As with most chicken or egg debates, perhaps this one does not have a definitive answer either.