Thursday, March 31, 2016

How to think about Statistics and Confidence Intervals (for a p-value-centric scientist)

Introducing Statistics and Confidence Intervals

Statistics is, to me, man’s way of recognizing that we are imperfect and doing our best to control for it. We try to reduce bias at every level of experimentation, from study design to statistical analysis, but because this is a man-made technique for reducing man’s impact on the work that we do as scientists, it is only as effective as we are. It is the same with a computer: a computer is only as powerful and smart as the person running it. As such, we need to make ourselves as unbiased and as well educated as possible in order to trust the conclusions that we draw. It is easy (and only human) to overlook the many variables and situations that can make our data look a certain way yet have nothing to do with the experimental treatment we wish to test (or, many times, the one we think we are successfully testing!).

The problem with statistics is that many times, we think we know more than we do. We are overconfident in our hypotheses and in our conclusions, and we yell on top of the data (with asterisks) instead of letting the data speak for itself. It is not enough to execute a well-designed experiment. It must also be interpreted correctly in order to make inferences about the world around us, which is the ultimate goal of experimentation.

For example, the p value is touted as the “end-all-be-all” of scientific (statistical) significance. If p<0.05, then we conclude that our treatment is working and we should get a Nature paper. However, in many cases, these small p values still raise the question, WHO CARES? If something is statistically significant, it does not mean that it is clinically relevant. Additionally, the scientific community receives (or should receive) a lot of flak for the weight it gives to p values, when in fact what we should be reporting most of the time is a confidence interval. The confidence interval is intimately related to the p value, but it gives far more information and is a more accurate and informative description of the data. People do not understand p values, and many times they do not stop to think closely enough about confidence intervals either. Below are two graphs I have selected from a biostatistics lecture by Patrick Breheny illustrating the differences that result from your choice of confidence level, and how intuitively simple those differences are if one takes the time to think about them.

Now, one of these graphs shows a 95% confidence interval, and the other shows an 80% confidence interval. If you think about just the values, you would (wrongly) assume that an 80% confidence interval is “worse” than a 95% confidence interval because 80 is less than 95. However, a confidence interval is built so that, across repeated experiments, X% of the intervals constructed this way contain the true population mean. So, in order to be more sure that your interval captures the true population value, you must widen the interval. Therefore, a 95% confidence interval is actually wider than an 80% confidence interval computed from the same data, but you can be more confident that it contains the true population mean. Understanding this somewhat simple but very important concept is essential to generating and interpreting scientific data. This course has illustrated this concept and the importance of statistics very well, and I will make sure to keep it in the back of my mind throughout my career.
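
For readers who want to check this for themselves, here is a minimal sketch in Python (standard library only). It is not taken from the lecture. The sample values, the helper names `mean_ci` and `coverage`, and the simulated population (mean 10, SD 2) are all made up for illustration, and the interval uses a normal approximation rather than the t distribution, which is only rough for small samples.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(sample, level):
    """Normal-approximation confidence interval for the mean (hypothetical helper)."""
    n = len(sample)
    center = mean(sample)
    sem = stdev(sample) / sqrt(n)              # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    return center - z * sem, center + z * sem


def coverage(level, n=30, trials=2000, seed=0):
    """Fraction of simulated experiments whose interval contains the true mean."""
    rng = random.Random(seed)
    true_mean, true_sd = 10.0, 2.0             # assumed population, for illustration
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        low, high = mean_ci(sample, level)
        if low <= true_mean <= high:
            hits += 1
    return hits / trials


# Made-up measurements: same data, two confidence levels.
sample = [4.1, 5.3, 4.8, 5.9, 5.1, 4.4, 5.6, 5.0, 4.7, 5.2]
lo80, hi80 = mean_ci(sample, 0.80)
lo95, hi95 = mean_ci(sample, 0.95)
print(f"80% CI: ({lo80:.2f}, {hi80:.2f})  width {hi80 - lo80:.2f}")
print(f"95% CI: ({lo95:.2f}, {hi95:.2f})  width {hi95 - lo95:.2f}")

# Repeated sampling: the wider interval captures the true mean more often.
cov80 = coverage(0.80)
cov95 = coverage(0.95)
print(f"80% intervals containing the true mean: {cov80:.1%}")
print(f"95% intervals containing the true mean: {cov95:.1%}")
```

The first two lines of output show the 95% interval wrapping around the 80% interval for the same data. The last two show roughly what "confidence" buys you: the fraction of repeated experiments in which the interval contains the true mean.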


  1. I definitely agree that significant and clinically relevant are two things that are often related. However, most experiments are not done in the context of the environment to which we would expect the results to translate (e.g., tissue culture vs. human model, growing bacteria in nutrient-rich broth vs. infection, etc.).

    Additionally, the confidence interval is another great way to represent data, but again, I don't think we can rely on just one single statistical measure. It greatly depends on the experiment you choose to do and the data you expect to obtain. A lot of papers fail to even identify what error bars we are looking at (SD, SEM, CI?) and we, as scientists, fail to draw conclusions about the data on our own.

    I agree, I think I would be much more comfortable interpreting a confidence interval over a p-value any day.

  2. I think your post highlights how vocabulary and connotation are extremely important whenever anyone discusses statistics. You say that an 80% CI is thought of as 'worse', and how that can cause confusion. Of course an 80% CI looks better, but it makes any inference less sure, increasing Type I error.

    What 'science' says is worse and what scienTISTS say is worse is often conflicting. I have had many senior scientists tell me that all their bar graphs display S.E.M instead of SD because S.E.M is smaller, end of story. We of course now know how wrong that is, but it's all about optics.
    Let's be honest, who actually looks closely at the figure legend? Of those who do, who will intuitively integrate whether they are seeing error bars reflecting CI, SEM or SD?

    To the authors, what's worse is whatever is biggest, but that leads to some bad stats, especially when the error bars are not specified (as Ashley mentioned).

  3. I am intrigued by the concept of shifting the focus to CIs instead of the p-value. As Nathan brought up, not many people will 1) focus on the figure legend or 2) understand when it is appropriate to use SEM vs. SD vs. CI. If we do shift to a mindset where we focus on CIs, is there an effective way to get the whole scientific community on board? I worry that with the confusion over the "meaning" of CIs (as illustrated by your example), using CIs over p-values may lead to the same type of overreaching conclusions that are present in the current literature.
    As an MD/PhD, I very much agree with asking how a statistically significant result actually translates to the clinic. Unfortunately, I cannot think of a way to standardize clinical significance to compare to statistical significance. As Ashley pointed out, there are a lot of variables that can affect how translatable a result is. For example, let's say a rodent study found a drug that substantially reduces inflammation in the brain following head trauma. However, in human patients, there is the issue of how to actually get the drug across the blood-brain barrier. Though the initial finding seems extremely clinically relevant, there is a whole other biological problem to address to make it clinically appropriate. I would be intrigued to see what suggestions others would have regarding standardizing the clinical applicability of basic science.

  4. I appreciate you bringing up the importance of CIs and starting this discussion. I want to share another perspective on the relationship between p-values and CIs, and particularly on the utility of CIs. Often 95% is chosen as the confidence level because it is supposedly analogous to an alpha of 0.05. Hence, if there is a 95% chance that this interval contains the population mean, there is a 5% chance that we made a type 1 error. Now, ideally we want to minimize the chance of a type 1 error, so we would think choosing a 99% CI may be better. After all, that would mean we only have a 1% chance of making a type 1 error. However, even though we are more confident that our CI contains the mean, what are we able to learn from this data? A 99% CI will often be much wider than our 95% CI, and if our goal is to use the CI as a descriptive statistic, so as to get an estimate of where the population mean truly lies, then we eventually get diminishing returns from increasing the confidence level. To bring it to an extreme, a 100% CI, while we would have a 0% chance of making a type 1 error, should contain the entire set of all possible values, which provides no utility. The strength of increasing the confidence level, I believe, comes in when using CIs to compare either two sample means or a sample mean to a known population mean. Here, raising the confidence level makes us more confident that a difference is real when the CIs of the two samples do not overlap, or when the known population mean is not contained within the CI of the sample.

    As an additional point in response to comparing the use of CIs to error bars, as you mention, decreasing the confidence level to 80% gives a smaller interval, but I would argue it is worse than a 95% CI, as we have increased the chance of type 1 error, as Nathan mentions.
    It is important to remember that a smaller range generated by an 80% CI is not analogous to a smaller range for a 95% CI. Rather, it is more similar to choosing error bars of less than 1 SD rather than the traditional 1 SD.
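
Since several comments turn on the difference between SD, SEM, and CI error bars, here is a small Python sketch that computes all three for one sample. The measurements and the helper name `error_bars` are hypothetical, and the CI half-widths again rely on a normal approximation, so treat the numbers as illustrative rather than as a recommended analysis.

```python
from math import sqrt
from statistics import NormalDist, stdev


def error_bars(sample, level=0.95):
    """Half-lengths of three common error bars for one sample (hypothetical helper)."""
    n = len(sample)
    sd = stdev(sample)                         # spread of the individual observations
    sem = sd / sqrt(n)                         # uncertainty in the sample mean
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%, about 1.28 for 80%
    return {"SD": sd, "SEM": sem, "CI": z * sem}


# Made-up measurements for illustration only.
sample = [12.1, 9.8, 11.4, 10.6, 13.0, 9.5, 10.9, 11.7, 10.2, 12.4, 11.1, 10.4]

bars95 = error_bars(sample, 0.95)
bars80 = error_bars(sample, 0.80)

for name in ("SD", "SEM"):
    print(f"{name:>3}: +/- {bars95[name]:.2f}")
print(f"80% CI: +/- {bars80['CI']:.2f}")
print(f"95% CI: +/- {bars95['CI']:.2f}")
```

Under these assumptions the SEM bar is always the shortest of the three for any sample with more than one observation, which is exactly why an unlabeled error bar tells the reader very little. An 80% CI bar is roughly 1.28 SEM long and a 95% CI bar roughly 1.96 SEM.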