Comments on Unbiased Research: Problems with R-squared and Overfitting Models
TJ Murphy (http://www.blogger.com/profile/17292359594683490598), 2016-04-01T06:36:31.725-04:00
Good stuff.

Most of the regression I've done is to fit models to data sets with the purpose of deriving estimates of the model parameters. The model parameters have biological meaning. The models I choose are driven by my understanding of the biological system. Each parameter then becomes a random variable I can track while I manipulate the system to explore how it functions.

As you say, more perfect models can always be written to fit the data better and better. But what do you have in the end? Well, you have a beautiful equation and a high r^2, but does it have any real utility in your work?

There isn't any statistic that tells you you've over-written a model. Sometimes good, simple, parsimonious models yield a high r^2. Sometimes over-written models do the same.

The way to know is by looking at the model parameters and realizing you can't make any sense of one or more of them, biologically.

For example, y = f(a, b, c, d, e, x), where x is your explanatory variable. In other words, five distinct modifiers of x are responsible for your effect level. If any of a, b, c, d, and e doesn't correspond to some specific biological attribute in your system (e.g., a rate, an affinity, or some other biological relationship), then as beautiful as it looks, you don't have a particularly useful model.

The excess parameters are fudge factors you need to get a more perfect fit of the model to the data, but they don't really tell you anything about the system.

When we first start working with regression models, we think the analysis and output are going to reveal hidden treasures of insight into our data and tell us what it all means.

In fact, that is a data exploration mindset, which is fine in the early stages of a project, where the idea is to poke around and see how the system functions, but that mindset is sailing on a sea of bias.

At some point, the explorer needs to shift their thinking to a more hypothesis-driven frame of mind. The regression model they choose is driven by their understanding of how the system operates. They're using the model fit, that is, the values of the model parameters, to test a variety of other interesting questions. The model, in effect, becomes the bread-and-butter assay result.

Shorter: Only our expertise with the system can tell us which regression parameters are fudge factors and which represent relevant biological functionality.
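To make the fudge-factor point concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the data are made up (a straight line plus small fixed "noise" values), the five-parameter model is taken to be a degree-4 polynomial a + b*x + c*x^2 + d*x^3 + e*x^4, and the helper names (`solve`, `fit_polynomial`, `r_squared`) are hypothetical. It fits both the two-parameter line and the five-parameter polynomial to the same data and compares r^2.

```python
# Illustrative sketch only: synthetic data and hypothetical helper names.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest order first (normal equations)."""
    k = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve(A, b)


def predict(coefs, x):
    return sum(c * x ** i for i, c in enumerate(coefs))


def r_squared(xs, ys, coefs):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict(coefs, x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up data: the "true" system is a line with intercept 1 and slope 3,
# plus small fixed perturbations standing in for measurement noise.
noise = [0.05, -0.08, 0.11, -0.03, 0.07, -0.12,
         0.02, 0.09, -0.06, 0.04, -0.10, 0.08]
xs = [i / 11 for i in range(12)]
ys = [1.0 + 3.0 * x + e for x, e in zip(xs, noise)]

simple = fit_polynomial(xs, ys, 1)  # two parameters: intercept and slope
full = fit_polynomial(xs, ys, 4)    # five parameters: a, b, c, d, e

r2_simple = r_squared(xs, ys, simple)
r2_full = r_squared(xs, ys, full)

print("2-parameter fit:", [round(c, 3) for c in simple], "r^2 =", round(r2_simple, 4))
print("5-parameter fit:", [round(c, 3) for c in full], "r^2 =", round(r2_full, 4))
```

Because the line is nested inside the polynomial, the five-parameter fit can never have a lower r^2 on the same data, yet both values come out high here. The r^2 alone does not flag the extra terms; what gives them away is that c, d, and e have no counterpart in the (assumed) linear system that generated the data.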