Friday, April 8, 2016

Interpret P-values and Statistical Significance Properly

The p-value is probably the most commonly used reference for decision-making in the biological and biomedical sciences: a result with a p-value less than 0.05 is valuable, while a result with a p-value greater than 0.05 is worthless. Clear-cut, right? Unfortunately, no. A p-value cannot tell us whether an experimental result is worthwhile. It cannot tell us whether the original research question has been addressed. It can even mislead us when different tests yield some p-values greater than 0.05 and others less than 0.05. To interpret the p-value properly, we may start by asking: ‘What is the p-value designed for?’

Let’s say we want to know whether a coin is fair, i.e., whether the probabilities of getting a head (H) and getting a tail (T) both equal 0.5. We flip the coin 10 times and get the result

H, H, T, T, H, T, T, H, H, H.

Most people would say this coin is fairly fair, although the number of heads (6) is not exactly 5. However, suppose we instead get the result

H, H, H, H, T, H, T, H, H, H.

Now some people become indecisive: does this result show that the coin is not fair, or did this seemingly extreme case occur simply by chance? An intuitive aid to the decision is to calculate the chance of observing this result, or a more extreme one, given that the coin is fair, and to compare that probability with a pre-defined threshold.

If the threshold we employ is 0.1, then we reject the hypothesis of a fair coin, because the chance of a result at least this extreme (counting only outcomes with eight or more heads in ten flips, about 0.055) is lower than the threshold. Rejecting the hypothesis is a decision rule, not a proof that the coin is unfair. This is the exact p-value: a p-value calculated directly from the distribution of the random variable of interest. Most tests, such as the Z-test and Pearson’s chi-square test, calculate an approximate p-value, which is based on the same idea but uses another distribution to approximate the distribution of the test statistic.
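
The coin-flip calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's own code: the helper name binomial_tail_at_least is hypothetical, and because the post does not say whether the test is one-sided or two-sided, both versions are computed.

```python
from math import comb


def binomial_tail_at_least(k, n, p=0.5):
    """Return P(X >= k) for X ~ Binomial(n, p) (hypothetical helper name)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_flips = 10
n_heads = 8  # the second sequence in the post has 8 heads and 2 tails

# One-sided: "more extreme" means 8 or more heads.
p_one_sided = binomial_tail_at_least(n_heads, n_flips)

# Two-sided (assumption): also count 2 or fewer heads. The fair-coin
# distribution is symmetric, so this doubles the one-sided tail.
p_two_sided = min(1.0, 2 * p_one_sided)

print(f"one-sided p-value: {p_one_sided:.4f}")  # 56/1024
print(f"two-sided p-value: {p_two_sided:.4f}")  # 112/1024
```

With a threshold of 0.1, the one-sided value (about 0.055) leads to rejecting the fair-coin hypothesis, while the two-sided value (about 0.109) does not. The same data can fall on either side of the threshold depending on how the alternative hypothesis is stated.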

Whether the p-value is exact or approximate, the concept is the same: the p-value is the probability of observing data at least as extreme as what was actually observed, assuming the null hypothesis is true. This tells us two things. First, a complete hypothesis test has four parts: the null hypothesis, the alternative hypothesis, the threshold (i.e., the alpha), and the test statistic with its corresponding probability distribution. All four parts affect the p-value and the statistical inference. Second, a small p-value does not imply that the scientific hypothesis is true. A small p-value can still occur by chance, especially when multiple comparisons are conducted. Besides, the statistical hypotheses may depart from the scientific hypothesis through the use of a surrogate marker (e.g., using blood low-density lipoprotein level as a surrogate for risk of cardiovascular disease), measurement error, or data transformation (e.g., transforming a continuous response to a binary one).
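
To see why multiple comparisons make chance findings more likely, consider a simplified sketch. It assumes that every null hypothesis is true and that the tests are independent, which real studies rarely satisfy exactly; the function name is hypothetical and chosen only for illustration.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n_tests independent tests gives p < alpha
    when every null hypothesis is true (simplifying assumptions)."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 10, 20):
    chance = prob_at_least_one_false_positive(n_tests)
    print(f"{n_tests:2d} tests: {chance:.3f}")
```

Under these assumptions, the chance of at least one "significant" result rises from 0.05 for a single test to roughly 0.40 for ten tests, even though no real effect exists.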

The p-value is widely used, and sometimes it is abused. The best way to interpret the statistical significance reported in research papers is to ask: ‘What hypothesis are they testing? Why did they choose this procedure? Are there other applicable tests? How far is the testing procedure from their research question?’ Just stay cautious: p-values can fool you.

1 comment:

  1. Great post, Tiger. You have a good way of breaking things down and explaining them well so that people who want to understand statistics but aren't the best at it can follow along with you.

    When reading papers, I usually assume the writers know what they're doing in terms of methods and analysis, so I've always just trusted that what they find significant is correct. It's becoming clearer that skepticism is a better way to approach others' work, including questioning whether they appropriately tested their hypothesis