iii. P values and statistical significance
(Note: names have been changed)
I glared at the data glowing back at me on the computer screen. The postdoctoral fellow who was mentoring me had mentioned finding a p-value for all the numbers I had.
“Sorry, Florence, how did you want me to analyze this data again?”
“Just calculate the SEM for each group, stick the numbers in PRISM, and then look for the p-value and see if the difference is significant.” Then she walked away. Something about a motor neuron prep and her mouse embryonic spinal cords sitting on ice for too long.
I had no idea how to do what she wanted me to do. What is an SEM? Why aren’t we calculating standard deviation instead? Google searches turned up a myriad of statistics websites that explained how to derive the equation for the SEM, but with little context as to why you would want an SEM in the first place. I tried to recall the statistics class I took sophomore year, but only came up with “the p-value indicates significant results.” …Right? I sat there questioning my own competence before continuing to toil over the calculations on the numbers from my qPCR. When I finally got the numbers and graphs, I was dismayed to find no significance. When Florence came around again, I showed her the results.
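(For anyone as lost as I was: the SEM is just the standard deviation divided by the square root of n. The SD describes how spread out the measurements themselves are; the SEM describes how precisely the sample mean estimates the true mean, which is why it shrinks as n grows. A minimal sketch of the calculation I was fumbling with; the expression values here are made up for illustration, not my actual qPCR numbers:

```python
import numpy as np

# Hypothetical normalized expression values for one group (illustrative only)
group = np.array([1.02, 0.87, 1.15, 0.95])

sd = group.std(ddof=1)           # standard deviation: spread of the data points
sem = sd / np.sqrt(group.size)   # SEM: uncertainty in the estimate of the mean

print(f"SD  = {sd:.3f}  (how variable the measurements are)")
print(f"SEM = {sem:.3f}  (how uncertain the group mean is; shrinks as n grows)")
```

Plotting SEM error bars makes group means look tighter than SD bars would, which is part of why it gets requested so often without explanation.)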
“Oh. Well, that sucks. But there seems to be a trend. And we’re only at n=2, so I think if we just increase our n, it’ll probably be significant.”
How many of us have been put in a similar situation, or have heard of one like it?
Without a strong background in statistics, I blindly trusted Florence’s logic and choice of statistical analyses – she was a postdoctoral fellow, after all. She had probably run dozens of these analyses on the data that earned her her Ph.D. She must know what she’s doing, I reasoned. But that is the danger of scientists who are improperly or inadequately trained to conduct statistical analyses: in hindsight, I realized that 1) few people (even scientists, myself included) actually understand what “significance” really means, and 2) as Motulsky puts it, “once some people hear the word significant, they often stop thinking about what the data actually show.” The scenario I recounted is what Simmons, Nelson, and Simonsohn (2012) termed “P-hacking”: attempts by investigators to lower the P value by trying various analyses or by analyzing subsets of the data. Motulsky outlines two ways investigators do this: 1) by trying different analyses (if one didn’t give a P value less than 0.05, they tried another) and/or 2) by changing the sample size post hoc (stopping data collection if the P value is less than 0.05, but collecting more data when the P value is above 0.05).
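To see why that second trick inflates false positives, consider a quick simulation; this is a sketch under assumed parameters, not anything from Motulsky’s paper. Both groups are drawn from the same distribution, so there is no real effect, yet the experimenter checks the P value after every added pair of observations and stops as soon as it dips below 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def optional_stopping_trial(start_n=2, max_n=20):
    """Add one observation per group at a time, re-testing after each addition."""
    a = list(rng.normal(size=start_n))
    b = list(rng.normal(size=start_n))
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < 0.05:
            return True    # stopped early and declared "significance" (false positive)
        if len(a) >= max_n:
            return False   # gave up; honestly non-significant
        a.append(rng.normal())
        b.append(rng.normal())

trials = 5000
false_positives = sum(optional_stopping_trial() for _ in range(trials))
print(f"False-positive rate with optional stopping: {false_positives / trials:.1%}")
```

A fixed-n test would hold the false-positive rate at the nominal 5%; peeking after every new data point pushes it far above that, which is exactly what “we’re only at n=2, so let’s just increase our n” buys you.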
One study by Gøtzsche (2006) compared the number of publications reporting a P value between 0.04 and 0.05 with the number reporting one between 0.05 and 0.06, hypothesizing that if results were published honestly, the two counts should be similar. Instead, Gøtzsche found five times as many papers reporting P values between 0.04 and 0.05 as papers reporting P values between 0.05 and 0.06. Equating statistical “significance” with scientific significance ends up skewing which results and data get published, and creates publication bias. I really wonder whether some scientists p-hack inadvertently, believing it is a legitimate way to conduct statistical analyses, or whether some do it knowing full well that it is an improper way to generate “significant” results.
Perhaps the fix here is for journals to require authors to submit a short cover note justifying the statistics they used, to corroborate that they understood why and how those statistical tools were chosen and applied. This could push scientists not only to conduct reliable, properly designed experiments, but also to think more carefully about the interpretation of their results, rather than trying to force or find significance that might not be there.