When it comes to fitting models to data, we need to be careful to avoid the mistakes that come with fancy methods. Regression functions can get pretty complicated, which suits our wish to make the data explain more. At that point, we can over-interpret the data and forget that the models could be replaced with other analyses that make more sense in our research context.
I just finished my honors thesis, and as an undergraduate student I had to talk myself out of playing around with everything I have learned in this advanced statistics course. The research I worked on asks whether a plant, GBL, can inhibit the growth of a bacterium, ATCC6919. We are interested in this bioactivity because it could be an alternative cure for infections caused by this bacterium.
Extracts were made from three different parts of the plant (leaves, branches, and seeds), and two different extraction solvents were used: ethanol and water. The question that my data analysis needs to answer is not only whether the GBL extracts are active against ATCC6919 growth (% inhibition > 50%), but also whether the tree part and the extraction solvent contribute to the effectiveness of the extracts.
The ATCC6919 culture was treated with GBL extracts at a range of concentrations, so the antibacterial investigation generates many dose response curves, like the one shown below. Since the plot shows a clear trend, with higher percentage inhibition at higher extract doses, there is a pretty good chance that we can find a regression model which fits most of the data well. I could build a "dose response regression model" for the extracts. However, I recalled that we were interested in finding the extracts that were active (% inhibition > 50%). A regression model could therefore be a statistically perfect fit and still be scientific nonsense, because it would not answer that question. I had to discard the idea of a nonlinear regression curve model.
Then I wondered whether I could compare the inhibition results from the two extraction solvents by comparing how well two models fit the data. The best-fit slope of the regression line should be the difference between the two group means. Thus, I defined a variable X for the extraction method, arbitrarily assigning X=1 to aqueous extracts and X=2 to ethanolic extracts. The Y axis was the percentage inhibition of the extracts at the same concentration. It would look like the linear regression graph shown below. However, doing so would neglect the other factor, the tree part, which can also contribute to the difference in inhibition. I would run into the opposite problem of over-fitting the data: over-simplifying it.
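To see why the slope equals the difference between the group means, here is a minimal sketch in plain Python. The inhibition values are invented for illustration and are not from my thesis; the variable names are hypothetical too. It fits an ordinary least-squares line to the X=1 / X=2 coding and compares the slope with the difference in means.

```python
# Hypothetical % inhibition values at one concentration (invented, not real data).
aqueous = [32.0, 41.0, 38.0, 45.0]      # coded X = 1
ethanolic = [55.0, 62.0, 58.0, 65.0]    # coded X = 2

x = [1.0] * len(aqueous) + [2.0] * len(ethanolic)
y = aqueous + ethanolic


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Ordinary least-squares slope and intercept for y = intercept + slope * x.
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# With the two groups coded one unit apart, the slope is the mean difference.
mean_difference = mean(ethanolic) - mean(aqueous)
print(f"slope = {slope:.2f}, difference in means = {mean_difference:.2f}")
```

The two numbers agree only because the groups are coded one unit apart; with a different coding the slope would be rescaled. Either way, the line knows nothing about which tree part each extract came from.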
If we replace the regression models with a two-way ANOVA, it is easier to see whether the plant part, the extraction method, or the interaction between them makes the inhibition of the extracts differ. If you want something more basic, multiple Student's t-tests would work too, although running several of them raises the chance of a false positive unless you correct for multiple comparisons.
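As a rough sketch of what the two-way ANOVA computes, the plain Python below partitions the variation into plant part, solvent, interaction, and error. It assumes a balanced design (the same number of replicates in every cell), and every number in it is invented for illustration rather than taken from my results.

```python
from itertools import product

parts = ["leaf", "branch", "seed"]
solvents = ["water", "ethanol"]

# Hypothetical % inhibition, 3 replicates per cell (invented, not real data).
data = {
    ("leaf", "water"): [40.0, 44.0, 42.0],
    ("leaf", "ethanol"): [70.0, 74.0, 72.0],
    ("branch", "water"): [30.0, 34.0, 32.0],
    ("branch", "ethanol"): [50.0, 54.0, 52.0],
    ("seed", "water"): [20.0, 24.0, 22.0],
    ("seed", "ethanol"): [28.0, 32.0, 30.0],
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


n = len(data[("leaf", "water")])  # replicates per cell (balanced design assumed)
a, b = len(parts), len(solvents)

all_values = [v for cell in data.values() for v in cell]
grand = mean(all_values)
part_mean = {p: mean([v for s in solvents for v in data[(p, s)]]) for p in parts}
solvent_mean = {s: mean([v for p in parts for v in data[(p, s)]]) for s in solvents}
cell_mean = {key: mean(vals) for key, vals in data.items()}

# Sums of squares for each source of variation.
ss_part = b * n * sum((part_mean[p] - grand) ** 2 for p in parts)
ss_solvent = a * n * sum((solvent_mean[s] - grand) ** 2 for s in solvents)
ss_inter = n * sum(
    (cell_mean[(p, s)] - part_mean[p] - solvent_mean[s] + grand) ** 2
    for p, s in product(parts, solvents)
)
ss_error = sum((v - cell_mean[key]) ** 2 for key, vals in data.items() for v in vals)
ss_total = sum((v - grand) ** 2 for v in all_values)

# Degrees of freedom and F statistics (each effect tested against the error term).
df_part, df_solvent = a - 1, b - 1
df_inter = df_part * df_solvent
df_error = a * b * (n - 1)
ms_error = ss_error / df_error
f_part = (ss_part / df_part) / ms_error
f_solvent = (ss_solvent / df_solvent) / ms_error
f_inter = (ss_inter / df_inter) / ms_error

print(f"F(part) = {f_part:.2f}, F(solvent) = {f_solvent:.2f}, "
      f"F(interaction) = {f_inter:.2f}")
```

Turning the F statistics into p-values needs the F distribution, which a statistics package would supply; in practice I would let such a package run the whole analysis rather than compute it by hand.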
To sum up, when we try to fit models to data, before thinking about which type of regression model fits better, check whether another (and simpler) method suits the question more.