Tuesday, April 12, 2016

Optimizing the BOT

Prior to taking this class, I had a lengthy conversation with my PI about statistics.
We debated over what statistical methods were appropriate to use for our experiments. She opted for the classic t test and I opted for anything but that.
During this debate she would often throw out statements like,
“We don’t base our conclusions solely on whether or not something is significant.”
“We should be able to tell if a result is significant or not just by looking at the data.”
“We can’t publish without stats.”
“Even if a result is significant, it doesn’t matter if it doesn’t have any biological relevance.”
Looking back on this debate now, I realize my PI was/is a follower of BOT.
BOT stands for the “Bloody Obvious Test,” coined back in 1987 by Ian Kitchen. Kitchen noted that journals pressured authors to use statistics and that p-hacking was already a problem:

“…but it does seem that too often we labour over their (statistics) use unnecessarily and indeed on other occasions we manipulate them to prove a very thin point.” –Ian Kitchen

Because of these issues, Kitchen proposed the use of the “Bloody Obvious Test.”
The protocol for the BOT is as follows:
                Question #1: “Is it bloody obvious that the values are different?”

                  Answer: Yes.  The test is positive; proceed to “Go” and collect $200.
                  Answer: No.   Proceed to Question #2.

                Question #2: “Am I making a mountain out of a molehill?”

Kitchen really wanted to drive home the point that statistics were being abused to appease “the gods of statistics” who happened to frequently sit on journal review boards. He wanted to remind scientists that sometimes the easiest and most obvious answer is the right answer. Lastly, he wanted scientists to recognize that statistical significance doesn’t always equal scientific significance.

Sadly, Kitchen’s plea didn’t stop these issues from persisting in science today. Scientists are still appeasing “the gods of statistics” because to be successful in science, you have to publish.

As the reality of science publishing seems unlikely to change and the pressure to include stats continues, I propose we optimize the BOT with confidence intervals.

A confidence interval is a statistical estimate that provides a range in which the true population value may lie. Traditionally, we set confidence intervals at 95%. Strictly speaking, a 95% confidence interval means that if we repeated the experiment many times, about 95% of the intervals constructed this way would contain the true population parameter of interest.
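As a minimal sketch of how that works in practice (assuming Python with NumPy and SciPy; the measurements are made up for illustration, not data from any experiment discussed here), a t-based 95% confidence interval for a sample mean can be computed like this:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements, e.g. responses from 12 animals
rng = np.random.default_rng(42)
sample = rng.normal(loc=10.0, scale=2.0, size=12)

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

# 95% CI from the t distribution with n - 1 degrees of freedom
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)

print(f"mean = {mean:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The interval is centered on the sample mean and widens as the sample gets smaller or noisier, which is exactly the visual cue the BOT relies on.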

The addition of CIs would lend statistical robustness to the BOT that might appease “the gods of statistics.” Nor would confidence intervals detract from the initial step of the BOT: we could still ask Question #1 without a pesky p-value getting in the way of our conclusion. Instead, confidence intervals would serve the BOT “as a drunk uses a lamp-post; for support rather than illumination.”
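A sketch of the optimized BOT, assuming Python with NumPy and SciPy and entirely made-up control/treated values: eyeball the group means first (Question #1), then report the intervals for support.

```python
import numpy as np
from scipy import stats

def mean_ci(sample, confidence=0.95):
    """Return (mean, low, high) for a t-based confidence interval."""
    m = np.mean(sample)
    sem = stats.sem(sample)
    low, high = stats.t.interval(confidence, df=len(sample) - 1,
                                 loc=m, scale=sem)
    return m, low, high

# Hypothetical control vs. treated measurements (illustrative only)
control = [9.8, 10.1, 10.4, 9.9, 10.2, 10.0]
treated = [12.9, 13.4, 13.1, 12.7, 13.3, 13.0]

for name, group in [("control", control), ("treated", treated)]:
    m, lo, hi = mean_ci(group)
    print(f"{name}: mean {m:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

Here the two intervals don’t come close to overlapping, so the difference passes Question #1 on sight, and the printed CIs supply the lamp-post for reviewers who want numbers.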


  1. Part of what drives publishers away from allowing the bloody obvious test seems to be the same factor that leads to seeing the same names in the top journals again and again. Humans like patterns and categories that save them thought, and once someone has their breakout finding (often meeting the bloody obvious test with flying colors), their later findings are seen with less skepticism. This lowered bar of skepticism seems to lead to increasingly thin points based on increasingly tortured p-hacking to make small biological effects "publishable". I honestly wonder if we would see less aggregation of labs at the top, and start seeing better, more biologically sound papers, if the review process were truly blinded. Though obviously this brings up its own practical concerns: when you know the lab that does "x" in the field and you see a paper on "x" come up, you can generally guess who it is from.

  2. I like the BOT, and I think it reflects how I actually read scientific papers. When I look at figures, I tend not to look for asterisks or statistical significance. Instead, I look for interesting effect sizes without error bar overlap. I know it probably isn't the most systematic or unbiased approach, but I don't think p-values are that unbiased either. I'm also surprised how often authors will ignore interesting looking differences just because they are not statistically significant. You miss so much data by ignoring non-statistically significant data sets.

  3. I think we have to deal with the fundamental issue that numbers and decimals cannot 100% effectively determine or describe complex questions which we are grappling with. I often think about the role of intuition or expertise. If we can really just set up an optimal experiment and randomize everything perfectly and ignore chaos theory and then properly designate a statistical test, then why have grad students not been mechanized yet? We learn about stats and we read about them in papers, but really stats seem like a standardized and imperfect way of communicating something about the nature of our data. However, in the end, we should realize that randomization, the proper statistical tests, and all that jazz can only deal with what data we give them, or within the experimental parameters we set. As scientists our expertise and intuition are what will inevitably save us from costly or unethical mistakes.