Thursday, April 7, 2016

P-hacking and publication bias

iii. P values and statistical significance

(Note: names have been changed)
I glared at the data glowing back at me on the computer screen – the postdoctoral fellow who was mentoring me had mentioned finding a p-value for all the numbers I had. 
“Sorry, Florence, how did you want me to analyze this data again?” 
“Just calculate the SEM for each group, stick the numbers in PRISM, and then look for the p-value and see if the difference is significant.” Then she walked away. Something about a motor neuron prep and her mouse embryonic spinal cords sitting on ice for too long. 
I had no idea how to do what she wanted me to do. What is an SEM? Why aren’t we calculating standard deviation instead? Google searches brought up a myriad of statistics websites that attempted to explain how to derive the equation used to calculate an SEM, but with little context as to why you would even want an SEM in the first place. I tried to recall the statistics class I took sophomore year, but only came up with “the p-value indicates significant results.” …Right? I sat there questioning my own competence before continuing to toil over how to do the calculations on the numbers from my qPCR. When I finally got the numbers and graphs, I was dismayed that there was no significance. When Florence came around again, I showed her the results. 
“Oh. Well, that sucks. But there seems to be a trend. And we’re only at n=2, so I think if we just increase our n, it’ll probably be significant.”
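
In hindsight, the answer to my first question was a one-liner. The SEM is just the standard deviation divided by the square root of n: the SD describes the scatter of the individual measurements, while the SEM describes how precisely the mean has been estimated (so it shrinks as n grows). A minimal sketch in Python — the measurements are made up for illustration:

```python
import math

def sd(xs):
    """Sample standard deviation (n - 1 in the denominator)."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def sem(xs):
    """Standard error of the mean: SD / sqrt(n)."""
    return sd(xs) / math.sqrt(len(xs))

group = [2.1, 2.5, 1.9, 2.3]   # invented qPCR-style values
print(sd(group))   # spread of the individual measurements
print(sem(group))  # uncertainty in the estimate of the mean
```

This is also why SEM bars are always tidier-looking than SD bars on the same data — a point the table below turns into a rule of thumb.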

Type of error bar | Conclusion if they overlap | Conclusion if they don't overlap
SD                | No conclusion              | No conclusion
SEM               | P > 0.05                   | No conclusion
95% CI            | No conclusion              | P < 0.05 (assuming no multiple comparisons)

Rule of thumb provided by GraphPad's FAQ

How many of us have been put in a similar situation or have heard of a situation like this?

Without a strong background in statistics, I blindly trusted Florence’s logic and her choice of statistical analyses – she was a postdoctoral fellow, after all. She had probably run more than two dozen analyses like this on the data that earned her her Ph.D. She must know what she’s doing, I reasoned. But that is the danger of scientists who are improperly or inadequately trained in statistical analysis: in hindsight, I realized that 1) few people (even scientists, myself included) actually understand what “significance” really means, and 2) as Motulsky puts it, “once some people hear the word significant, they often stop thinking about what the data actually show.” The scenario I recounted is what Simmons, Nelson, and Simonsohn (2012) termed “P-hacking”: attempts by investigators to lower the P value by trying various analyses or by analyzing subsets of the data. Motulsky describes two ways investigators do this: 1) by tweaking the analysis (if one test didn’t give a P value less than 0.05, trying a different one) and/or 2) by changing the sample size post hoc (stopping data collection once the P value is less than 0.05, but collecting more data when the P value is above 0.05).
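Florence’s “just increase our n until it’s significant” is exactly that second kind of p-hacking, and it’s easy to see why it inflates false positives. Here’s a quick simulation sketch (using a z-test with known variance for simplicity; the batch sizes are arbitrary assumptions): two groups drawn from the very same distribution should come out “significant” only 5% of the time, but testing after every added batch and stopping at the first p < 0.05 pushes the rate far higher.

```python
import math
import random

def z_test_p(a, b):
    """Two-sided z-test for a difference in means, assuming unit variance."""
    na, nb = len(a), len(b)
    z = (sum(a) / na - sum(b) / nb) / math.sqrt(1 / na + 1 / nb)
    return math.erfc(abs(z) / math.sqrt(2))

def peeking_trial(rng, start=10, step=10, max_n=50):
    """Test after every batch; stop as soon as p < 0.05 ('significant')."""
    a = [rng.gauss(0, 1) for _ in range(start)]
    b = [rng.gauss(0, 1) for _ in range(start)]
    while True:
        if z_test_p(a, b) < 0.05:
            return True   # a false positive: both groups came from N(0, 1)
        if len(a) >= max_n:
            return False
        a += [rng.gauss(0, 1) for _ in range(step)]
        b += [rng.gauss(0, 1) for _ in range(step)]

rng = random.Random(0)
trials = 2000
rate = sum(peeking_trial(rng) for _ in range(trials)) / trials
print(f"false-positive rate with peeking: {rate:.3f}")  # well above 0.05
```

With five looks at the data instead of one, the nominal 5% error rate roughly doubles or triples — which is the whole problem with “we’re only at n=2, let’s just increase our n.”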

One study by Gøtzsche (2006) compared the number of publications reporting a P value between 0.04 and 0.05 with the number reporting a P value between 0.05 and 0.06, hypothesizing that if results were published honestly, the two counts should be similar. Instead, Gøtzsche found five times as many papers reporting P values between 0.04 and 0.05 as between 0.05 and 0.06. Equating statistical “significance” with scientific significance skews which results and data get published and creates publication bias. I really wonder whether some scientists p-hack inadvertently, believing it to be a legitimate way to conduct statistical analyses, or whether some do it knowing full well that it is an improper way to generate "significant" results.
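Gøtzsche’s hypothesis is easy to sanity-check in simulation: in an honestly reported literature, the distribution of p-values is smooth through 0.05, so two narrow bins on either side of the threshold should hold similar counts. A sketch using one-sample z-tests on invented data (the mix of effect sizes and the sample size are my own arbitrary assumptions):

```python
import math
import random

rng = random.Random(1)

def one_sample_p(n=20, effect=0.0):
    """Two-sided z-test of 'mean = 0' with unit variance and sample size n."""
    xs = [rng.gauss(effect, 1) for _ in range(n)]
    z = (sum(xs) / n) * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))

# An honest mix of null results and modest true effects
ps = [one_sample_p(effect=0.0) for _ in range(20000)] + \
     [one_sample_p(effect=0.3) for _ in range(20000)]

just_below = sum(0.04 <= p < 0.05 for p in ps)
just_above = sum(0.05 <= p < 0.06 for p in ps)
print(just_below, just_above)  # similar counts: no cliff at P = 0.05
```

A five-fold excess just below 0.05, like the one Gøtzsche observed in the literature, simply cannot come from honest reporting of a smooth distribution like this one.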

Perhaps the fix here is for journals to require authors to submit a short cover note justifying the statistics they used, to demonstrate that they understand why and how those statistical tools were chosen and applied. This could force scientists not only to conduct reliable, properly designed experiments, but also to think more carefully about the interpretation of their results, rather than just trying to force or find significance that might not be there.


  1. I really like the conclusion you draw that scientists should have to submit a cover letter describing their statistical analysis when publishing data. That would likely eliminate a lot of the flawed statistical analyses that we see in scientific journals.

    In addition to the suggestion that you proposed, I think scientists should be held to a standard where they submit a scientific and statistical analysis plan prior to an experiment and are ONLY allowed to perform that analysis on their data upon conclusion of the study. I feel that this would eliminate a lot of the "searches for significance" that scientists perform on experiments that don't work out how they intend. If you find a significant result that was not part of your initially outlined experiment, you should have to perform another experiment with that result as your focus. I feel that overall this would eliminate a lot of the "p-hacking" that you described in your post.

  2. I really enjoyed this post, and in my own post I touch on the same subject of "p-hacking," because many individuals in my field use incorrect statistical tests to reduce the sample size and still obtain a "significant" result. Like you, though, I question how many of them actually understand what a "significant" result is, and sometimes whether they even understood what they were testing in the first place.

    Your scenario reminds me of a recent time when I was attending my program's research-in-progress presentations. There was a student giving an update on her research, and to be quite frank, she was rather unenthusiastic about the lack of "statistically significant" results. Finally, she reached a slide where she had data that appeared to "trend towards being significant". Now at this point, I think that she could fairly say that she had preliminary data, and she should now perform an appropriate power calculation to really test if the results are statistically significant. To my dismay, the next thing she said was "I'm going to keep doing more samples until this becomes significant". I cringed in my seat and was happy that TJ wasn't present because I'm sure that would have sent him over the edge! :-) This made it clear to me that most biomedical scientists do need more training in statistics, and I agree that journals should ask some more questions about how one obtained a statistically significant result and the approach used. If the answer is like the one I experienced, the paper's conclusions relying on a statistically significant result should be questioned.
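The power calculation mentioned above really does take only a few lines. A sketch using the normal approximation for a two-group comparison of means — the effect size, alpha, and power here are conventional defaults, not anything from the student's actual data:

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Sample size per group to detect a standardized difference in means
    (Cohen's d) with a two-sided test, via the normal approximation."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_power = nd.inv_cdf(power)           # 0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

print(n_per_group(0.5))  # a "medium" effect of d = 0.5 needs 63 per group
```

Fixing n this way before collecting data — rather than "doing more samples until this becomes significant" — is exactly what separates a planned study from the optional stopping described in the post.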

    1. I think it's even more dismaying because we don't know what we don't know when it comes to statistics, too. I definitely would have also cringed in my seat if I were listening to that same talk.

  3. Your post is insightful, and I think it captures a problem that many of us face - trusting that people with more education know what they're doing and can manage their data responsibly. I do this with journal articles as well, where I take them at their word that some result was obtained and had a certain significance. This is partially because I see them as an authority (they got something published, so they must know what they're doing?) and partially because I'm not confident enough in my own statistics to critique others'. Both of these issues could be resolved with better statistical education at various levels of school, so that authors might actually know what they're doing and I might feel comfortable picking them apart. Your suggestion of including a note with each article is also great, as it would force people to justify their decisions and give readers the chance to decide whether they made the correct choice.

  4. I guess there is a need for education on common statistical mistakes for scientists, and p-hacking is definitely a large one to include. After I encountered this term in this class, I started to realize that people in my lab are doing it because they are not aware that it is a problem.

    It is hard for people to ask, "could my significant result from these multiple comparisons have been generated by random chance?" when the result makes them happy.

    And I agree with you about the short cover page idea - it would help peer review go a lot more smoothly, without interference from the research content; sometimes statistical mistakes can be hard to catch under the cover of the science.