Nowadays, it is nearly impossible to find a published scientific article that does not report at least one p-value. There is nothing inherently wrong with the p-value statistic; the problem lies in how it is used. Within the scientific community, there is a fundamental misunderstanding of p-values and what conclusions can be drawn from them. The statistically illiterate scientist thinks that a p-value is the probability that the null hypothesis is true or, even worse, the probability that their experimental result is true. In fact, the p-value is the probability of obtaining the experimental result (or something more extreme) given that the null hypothesis is true. Unfortunately, many researchers and journal editors also incorrectly equate low p-values with solid experimental design, and this misinterpretation biases the findings from experiments, making them look grander and more inflated than they are.

Worst of all, some experimenters mistakenly use p-values as a measure of effect size. For some unknown reason, these scientists equate lower p-values with bigger effect sizes. For example, suppose a clinical trial for an anti-cancer drug has a p-value of 0.001 on a two-sample t-test, but the drug only increases the probability of cancer remission by 3%. Another anti-cancer drug has a p-value of 0.04 on a two-sample t-test, yet it increases the remission probability by 35%. The statistically challenged scientist would report the drug with the lower p-value as the better drug, even though it had only a 3% effect compared to placebo. Using this logic, it's not hard to imagine that statistical illiteracy and the demand for 'pretty' p-values from journal editors have likely led to unnecessary human suffering and possibly even death.
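To make this concrete, here is a quick simulation in Python (with NumPy and SciPy). The group sizes and mean shifts below are made up purely for illustration, loosely echoing the 3% and 35% figures above rather than coming from any real trial. It shows how a tiny effect measured in a huge trial can produce a far smaller p-value than a large effect measured in a small trial:

```python
# Hypothetical numbers only: a tiny effect in a huge trial vs. a large effect
# in a small trial, compared with two-sample t-tests.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# "Drug A": small effect (mean shift of 0.03) but 20,000 patients per arm
placebo_a = rng.normal(loc=0.00, scale=0.5, size=20_000)
drug_a = rng.normal(loc=0.03, scale=0.5, size=20_000)

# "Drug B": large effect (mean shift of 0.35) but only 30 patients per arm
placebo_b = rng.normal(loc=0.00, scale=0.5, size=30)
drug_b = rng.normal(loc=0.35, scale=0.5, size=30)

for name, placebo, drug in [("A", placebo_a, drug_a), ("B", placebo_b, drug_b)]:
    t_stat, p_val = stats.ttest_ind(drug, placebo)
    effect = drug.mean() - placebo.mean()
    print(f"Drug {name}: observed effect = {effect:+.3f}, p = {p_val:.2g}")

# Drug A will typically show a p-value orders of magnitude smaller than Drug B's,
# even though its effect is roughly a tenth of the size.
```

The point is simply that a p-value tracks sample size just as much as it tracks the effect itself, which is exactly why it cannot stand in for effect size.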
This p-value epidemic is a growing problem. In a landmark study, a team led by Dr. John Ioannidis at Stanford University scanned millions of PubMed abstracts and found that p-value reporting in abstracts increased from 7.3% in 1990 to 15.6% in 2015. They also found that nearly 55% of abstracts about randomized clinical trials contained at least one p-value. These are staggering and alarming numbers considering the scientific community's tenuous grasp of p-values. Perhaps the most shocking finding was that 96% of abstracts that reported a p-value reported a 'statistically significant' p-value of less than 0.05. In other words, of the millions and millions of hypotheses tested, 96% came out significant. That seems impossible to me.
In order to avoid bias and report sound science, Dr. Ioannidis suggests using Bayes factors or false-discovery rate statistics. In my opinion, these two statistics are too foreign to most researchers and will likely be misused out of confusion. Instead, I think the easiest way out of the p-value crisis is to start reporting confidence intervals (CIs). A CI conveys the uncertainty about the magnitude of the effect. Essentially, if the 95% CIs for treatment and control do not overlap, the difference between the groups is convincing, and its size is right there for the reader to judge, so the results of the experiment look promising. By reporting CIs, crappy science can be weeded out from good science very quickly, hopefully restoring some validity to our work.
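For what it's worth, reporting CIs takes only a few extra lines of analysis. Below is a minimal sketch (Python with NumPy and SciPy; the data, group sizes, and effect size are invented for illustration) of what CI-first reporting could look like, including the interval for the difference between groups:

```python
# A minimal sketch of CI-first reporting on invented data.
import numpy as np
from scipy import stats

def mean_ci(x, confidence=0.95):
    """Return (mean, lower, upper) for a t-based confidence interval on the mean."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    half = stats.sem(x) * stats.t.ppf((1 + confidence) / 2, df=len(x) - 1)
    return m, m - half, m + half

rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, scale=1.0, size=40)    # made-up control group
treatment = rng.normal(loc=0.8, scale=1.0, size=40)  # made-up treatment group

for label, group in [("control", control), ("treatment", treatment)]:
    m, lo, hi = mean_ci(group)
    print(f"{label:9s}: mean = {m:+.2f}, 95% CI [{lo:+.2f}, {hi:+.2f}]")

# Even better: report the CI for the *difference* between groups, which puts the
# magnitude of the effect (and its uncertainty) front and center.
diff = treatment.mean() - control.mean()
se_diff = np.sqrt(stats.sem(treatment) ** 2 + stats.sem(control) ** 2)
half = se_diff * stats.t.ppf(0.975, df=len(treatment) + len(control) - 2)
print(f"difference: {diff:+.2f}, 95% CI [{diff - half:+.2f}, {diff + half:+.2f}]")
```

A reader looking at those two lines of output immediately sees how big the effect is and how uncertain we are about it, which is far more than a lone p-value ever tells them.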