## Tuesday, April 12, 2016

### Fitting the right model to data

When fitting models to data, we need to be careful to avoid fancy mistakes. Regression functions can get quite complicated, which suits our wish to let the data explain more. At that point we can over-interpret the data and overlook that other analyses may make more sense in our research context.

I just finished my honors thesis, and I had to persuade myself not to play around with everything I, as an undergraduate student, had learned in this advanced statistics course. My research asks whether a plant, GBL, can inhibit the growth of a bacterium, ATCC6919. We are interested in this bioactivity because it could be an alternative treatment for infections caused by this bacterium.
Extracts were made from three different parts of the plant (leaves, branches, and seeds) using two different extraction solvents: ethanol and water. The question my data analysis needs to answer is not only whether the GBL extracts are active against ATCC6919 growth (% inhibition > 50%), but also whether the tree part and the extraction solvent contribute to the effectiveness of the extracts.

The ATCC6919 culture was treated with GBL extracts over a range of concentrations, so the antibacterial investigation generates many dose-response curves, like the one shown below. Since the plot shows a clear trend, with higher percentage inhibition at higher extract doses, there is a good chance that we can find a regression model that fits most of the data well. I could build a “dose response regression model” for the extracts. However, I recalled that we were interested in finding the extracts that were active (% inhibition > 50%). A regression model could therefore be a statistically perfect fit and still be scientific nonsense for that question. I had to discard the idea of a nonlinear regression curve model.
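
For readers unfamiliar with this kind of model, here is a minimal sketch of one common dose-response form, the four-parameter logistic (Hill) curve. The post does not say which curve was considered, so the choice of function and every parameter value below are assumptions for illustration, not fitted results.

```python
def four_param_logistic(dose, bottom, top, ec50, hill_slope):
    """Predicted % inhibition at `dose` (same units as `ec50`, dose > 0).

    bottom / top are the lower and upper plateaus of the curve,
    ec50 is the dose giving a response halfway between them.
    """
    return bottom + (top - bottom) / (1.0 + (ec50 / dose) ** hill_slope)


def is_active(dose, bottom, top, ec50, hill_slope, threshold=50.0):
    """True if the predicted inhibition at `dose` exceeds the activity cutoff."""
    return four_param_logistic(dose, bottom, top, ec50, hill_slope) > threshold


# Hypothetical parameters, NOT estimates from the thesis data.
params = dict(bottom=0.0, top=90.0, ec50=2.0, hill_slope=1.5)

at_ec50 = four_param_logistic(2.0, **params)   # halfway between bottom and top
active_low = is_active(2.0, **params)
active_high = is_active(8.0, **params)
```

Note that in this form the dose giving half of the curve's own maximum (`ec50`) and the dose at which inhibition crosses a fixed 50% cutoff are different quantities whenever `top` is not 100.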

Then I wondered whether I could compare the inhibition results from the two extraction solvents by comparing the fit of the data to two models. The best-fit slope of the regression line should be the difference between the two group means. So I defined a variable X for the extraction method, arbitrarily assigning X=1 to aqueous extracts and X=2 to ethanolic extracts. The Y axis was the percentage inhibition of the extracts at the same concentration. It would look like the linear regression graph shown below. In doing so, however, I would neglect the other factor, the tree part, which can also contribute to the difference in inhibition. I would run into the opposite problem of over-fitting the data: over-simplifying it.
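
The claim that the best-fit slope equals the difference between the two group means can be checked with a short sketch. The inhibition values are made up for illustration; only the X=1 / X=2 coding comes from the setup described above.

```python
from statistics import mean


def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical % inhibition at one concentration (not the thesis data).
aqueous = [22.0, 30.0, 26.0]     # coded X = 1
ethanolic = [55.0, 61.0, 64.0]   # coded X = 2

xs = [1] * len(aqueous) + [2] * len(ethanolic)
ys = aqueous + ethanolic

slope = ols_slope(xs, ys)
mean_difference = mean(ethanolic) - mean(aqueous)
```

Because the two codes are one unit apart, `slope` and `mean_difference` agree. This is the same comparison a two-sample t-test makes, and it has no place for the tree part.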

If we replace the regression models with a two-way ANOVA, it is easier to see whether the plant part, the extraction method, or their interaction makes the inhibition of the extracts differ. If you want something more basic, multiple Student's t-tests would work, too.
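
As a rough sketch of what the two-way ANOVA computes, the code below partitions the variation in % inhibition into plant part, solvent, interaction, and error for a balanced design. The replicate values are hypothetical, the function name `two_way_anova` is my own, and equal replicates per cell are assumed.

```python
from statistics import mean

# Hypothetical % inhibition replicates at one concentration (not the thesis data).
data = {
    ("leaf", "water"): [20.0, 24.0, 22.0],
    ("leaf", "ethanol"): [60.0, 64.0, 62.0],
    ("branch", "water"): [15.0, 17.0, 16.0],
    ("branch", "ethanol"): [40.0, 44.0, 42.0],
    ("seed", "water"): [30.0, 34.0, 32.0],
    ("seed", "ethanol"): [50.0, 54.0, 52.0],
}


def two_way_anova(data):
    """Sums of squares and F statistics for a balanced two-factor design."""
    parts = sorted({p for p, _ in data})
    solvents = sorted({s for _, s in data})
    a, b = len(parts), len(solvents)
    n = len(next(iter(data.values())))  # replicates per cell (assumed equal)

    grand = mean(v for cell in data.values() for v in cell)
    part_mean = {p: mean(v for s in solvents for v in data[(p, s)]) for p in parts}
    solv_mean = {s: mean(v for p in parts for v in data[(p, s)]) for s in solvents}
    cell_mean = {key: mean(cell) for key, cell in data.items()}

    ss = {
        "part": b * n * sum((part_mean[p] - grand) ** 2 for p in parts),
        "solvent": a * n * sum((solv_mean[s] - grand) ** 2 for s in solvents),
        "interaction": n * sum(
            (cell_mean[(p, s)] - part_mean[p] - solv_mean[s] + grand) ** 2
            for p in parts for s in solvents
        ),
        "error": sum((v - cell_mean[k]) ** 2 for k, cell in data.items() for v in cell),
    }
    ss_total = sum((v - grand) ** 2 for cell in data.values() for v in cell)

    df = {"part": a - 1, "solvent": b - 1,
          "interaction": (a - 1) * (b - 1), "error": a * b * (n - 1)}
    ms_error = ss["error"] / df["error"]
    # F = (mean square of the effect) / (mean square error)
    f_stat = {k: (ss[k] / df[k]) / ms_error for k in ("part", "solvent", "interaction")}
    return ss, ss_total, df, f_stat


ss, ss_total, df, f_stat = two_way_anova(data)
```

Turning the F statistics into p-values requires the F distribution, for example `scipy.stats.f.sf(F, df_effect, df_error)` if SciPy is available. A statistics package would normally handle all of this, including unbalanced designs.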

To sum up: when fitting models to data, before deciding which type of regression model fits better, check whether another (and simpler) method suits the question better.

1. It is important to determine what statistical test you are going to run before you begin the experiment. This is related to the experimental set-up that you plan to use. I think you're confusing "fitting a model to data" and picking a statistical model.

I'm not sure, but I agree that using an ANOVA will help you determine differences across groups (e.g., seed to leaf, [conc]1 to [conc]2, etc.).

(Also, ethanol will inhibit growth on its own, so make sure you're running the right controls, i.e., ethanol only.)

2. Huang, that isn't scientific nonsense at all. Your extracts have widely different potencies, which is yuuuuge! Pharmacologists like me are big fans of different potencies.

Transforming data to a common ceiling (100% inhibition) washed away your ability to compare the maximal effects, which apparently interest you more. Go back to the original data...don't do the %max transformation (you must have missed the class where I warned against that....)

1. It is good to know that the different potencies could be interesting.

But I don't think I transformed my data here, since I made the graph directly by plotting extract concentration vs. percentage inhibition.

Because we used the OD readings to estimate the turbidity of the bacterial culture before and after treatment, the percentage inhibition can be 100% if there is no change in the OD reading. Please correct me if I missed where I transformed my data.
