Thursday, January 18, 2018

Preventing Population and Gating Bias

One of the primary tools used to generate scientific data in the field of immunology is flow cytometry. The resulting data are typically analyzed in a program called FlowJo, which requires user input to examine the specific populations generated by an experiment. That input comes through a method called “gating,” in which the analyst draws boundaries to isolate a specific population and then compares it against other experimental parameters, each of which requires further gating. Because every gate is a manual decision, each one is an opportunity for bias.

These biases can stem from thoughts along the lines of, “Oh, this is where I should end this gate to include [x] amount of the population,” or, “Gating around this section of this population will make my data significant enough to be published.” In addition to this gating bias, those analyzing flow data may also introduce bias in choosing what to analyze. For example, a scientist may want to investigate the relationship between variables x and y, but decide not to look at x and z, even though z is on their panel, for the sake of simpler gating. This form of bias may prevent the analysis of data that could be significant and play a key role in what is being investigated. For both types of bias, methods have been and are being developed to prevent them from occurring.

There are scripts in R that can generate gates automatically based on control samples, which would prevent gating bias (unless the controls were biased, but that’s just bad science). There are also programs like CITRUS that take your data and use an algorithm to compare each variable against every other variable, reporting which comparisons are significant. These types of methods can help prevent bias generated in the previously mentioned ways, and as scientists we should continue investigating methods that allow us to analyze our data in ways that prevent us from overlooking valuable information or skewing data in our favor.
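To make the idea of control-derived gating concrete, here is a minimal sketch in Python (the scripts mentioned above are in R, and real packages fit gates after compensation and transformation; all numbers, the function names, and the quantile cutoff below are illustrative assumptions, not any real tool’s API):

```python
import numpy as np

def gate_from_control(control, quantile=0.995):
    """Derive a positivity threshold from an unstained/FMO control.

    Instead of a human eyeballing where the gate should end, the cutoff
    is fixed by the control: events brighter than nearly all control
    events are called positive. (Illustrative sketch only.)
    """
    return np.quantile(control, quantile)

def percent_positive(sample, threshold):
    """Fraction of sample events above the control-derived gate."""
    return np.mean(sample > threshold)

rng = np.random.default_rng(0)
# Unstained control: autofluorescence only.
control = rng.normal(loc=100, scale=15, size=10_000)
# Stained sample: 70% negative events plus a 30% bright population.
sample = np.concatenate([
    rng.normal(loc=100, scale=15, size=7_000),
    rng.normal(loc=300, scale=30, size=3_000),
])

threshold = gate_from_control(control)
print(f"threshold = {threshold:.1f}, "
      f"% positive = {100 * percent_positive(sample, threshold):.1f}")
```

The point of the sketch is that the gate boundary is computed once from the control, so the analyst never gets to nudge it toward a more publishable number.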

Flow gating example for you non-flow users:

[Figure: example of flow cytometry gating; image not preserved]

What it means to be scientific

The word “scientific” is one that I have been thinking about a lot recently. It comes up in casual conversation all the time, as in “that’s not very scientific” to mean that something is uncertain, or “she has a very scientific mind” to describe someone who is organized and meticulous. I cringe internally when I hear this phrase, because, I think, science is not very scientific. The public perception of science is that it is methodical, linear, organized; that we as scientists think of a question and that science gives an answer. It is no surprise that they think this, because it’s how we present our work both in academic journals and to the public. We create a story when writing a paper or presenting our work, making it sound like we knew what was going to happen all along and that the conclusions we drew were inevitable. But we know that this is not how science works. In reality, we come up with a question and a possible way to find the answer; our idea doesn’t work, or gives us an answer we weren’t expecting, or the project is more difficult than we anticipated – science is a messy, nonlinear, unplanned process.
This disconnect between the reality of conducting science and the story we tell to the public creates two problems. First, it elevates scientists to a level in our society where we are almost mythical beings, solving the unanswered questions of the universe. Second, it creates public distrust when we start to talk about the problems of bias and irreproducibility in science, or when we have to walk back claims we’ve made. How could such a clear process produce results that can’t be obtained twice? How could such pure people, working to explain the world to everyone else, be biased? While we as scientists have to grapple with how to make our results replicable and what irreproducibility means for our fields, I think that being more honest about how science works would help to prevent the sense of disillusionment the public feels when we are inevitably revealed to be imperfect.
For instance, scientists, doctors, and journalists should stop overstating the impact and significance of our work. As explained in this Vox article, we often describe our findings as miraculous or curative when in fact they have simply moved the field forward one small step. A quick Google search revealed that we’re going to cure HIV in the next three years, when in reality human CRISPR trials are much further off. Instead of telling the public that we’ve cured HIV or any particular disease, we can be more honest and tell them that we’ve unlocked another small piece of the 10,000-piece jigsaw puzzle – that doesn’t make it less exciting that we’re curing HIV in mice, it just makes it less confusing for the public when HIV isn’t cured in 2020. As graduate students, we can be honest with our friends and family about the realities of daily life as a scientist rather than painting a rosier picture of the scientific process. We can also help them to interpret findings in the media and explain when results are truly outstanding versus being an interesting novel finding.

Finally, we can also be more honest with ourselves about the role our discoveries play in the larger scientific community. My favorite analogy for science is that “solving” any question is like pushing a boulder up a mountain. No individual is going to get the boulder to the top, but we will each push the boulder a bit further. This does not mean that our work is not important. This does not mean, for those of us who study problems relevant to human health, that our work will not help real people someday. But if we can be honest with ourselves about the relative insignificance of our individual discoveries, I think it will help to keep science closer to the pure ideal of the pursuit of unbiased knowledge.

Wednesday, January 17, 2018

The fight for publication

In our world of science there are always two main goals: get papers, then get grants. For those to happen, you have to start with a base of good experiments, including new ideas, great experimental design, and significant results. But there is always a battle to get into the best journal, the one with prestige, the one that will get your name out into the science world. To do that we must appease the reviewers, those anonymous peers whose job is to judge and critique your work. The system is made to prevent fraud and produce the best possible scientific research, but is the best science always published in the best journals?

There is always an inherent bias in the publication of papers. The more data you have, the more famous your lab is, the fancier the techniques, the more likely your work is to be published in a higher-tier journal. I am not saying that any of that science is not worthy, because it is truly amazing, but the competition to get into better journals is a part of life in this career. Oftentimes there are beautiful experiments and amazing results in smaller journals – how can science be pure and fair if we are fighting for the fame of a high-tier journal? Shouldn’t the goal be to produce results that are accurate and advance your field?

The new website PubPeer wants to change the world of science publication by starting an anonymous conversation – a kind of worldwide journal club. Even though this discussion still occurs after publication, imagine being able to discuss the world’s best papers with talented scientists all over the world. While this doesn’t remove the inherent bias of the publication process, it may add a new dimension by allowing a more global discussion, and may lead to less fraud in the long term. Personally, hearing other researchers’ opinions of papers at journal club has always helped me to look at both others’ and my own work more critically. The PubPeer discussion, available for the world to see, could add an extra level of review, this time chosen not by a journal but by the actual scientific community – the very same people who are reading the papers we are all trying to publish.

Fighting hypothesis myopia

Irreproducibility downright scares me. When I decided to embark on this journey as a scientist, I thought I knew what I was getting into, but having come across the need to reproduce results in science, I realize that there are sides to this I never considered. I’ve always been afraid of failure, and whether that comes in the form of my work not reproducing expected results or my work not being reproducible by others because of mistakes I’ve made, I know that I will encounter “failure” frequently in this line of work. Much of this fright comes from not being able to step away from the notion that a failed replication does not equal scientific failure. Jeff Leek’s article relaxed me, mainly because it reminded me that success in science comes in many flavors, not just successful replication. He summarizes it well when he states that failure to replicate could stem from an “unusual event” or other “unmodeled confounders”. I know that I am human, and I will make mistakes, but all I can do is focus on my skills and ensure that my procedures are rigorous and my reports are authentic. I should listen to Claude Lévi-Strauss, who once said, “A scientist is not a person who gives the right answers, but one who asks the right questions.”

The biggest trap that could lead to irreproducible studies is this so-called “hypothesis myopia”, of which, I will admit, I have been and possibly still am guilty. The basis of this cognitive fallacy is fixating on and falling in love with a single hypothesis while failing to try to disprove it. Any finding coming as a result of hypothesis myopia could completely distort the way in which we see the world, by making us focus on what we wish the data to be instead of what the data truly show. The solution, as Regina Nuzzo so eloquently phrased it in her Nature article, is to counter biases, which act like an accelerator in the world of science, by pumping the brakes and slowing down to be more skeptical of findings. We must fight hypothesis myopia with further testing, and remember the words of Adam Savage at the San Francisco March for Science last year: “Bias is the enemy of science, but science is also the enemy of bias.”

Positively Negative

Some may say that science is overly positive. Researchers are constantly on the hunt for positive results. There is a certain allure to discovering a new process; it’s a lot more satisfying to say that “x leads to y” instead of “x has no effect on y.” This positivity bias leads to many experiments with negative results being pushed to the side when they do not fit the narrative that the scientist is crafting for publication. As a result, the published literature lacks a full picture of what is (and is not) happening in nature.

To combat this, some journals have been created solely to publish negative data. For example, the Journal of Negative Results – Ecology and Evolutionary Biology aims to publish studies that have scientific rigor but yielded negative results, in an effort to “expand the capacity for formulating generalizations.” These types of journals do seem to be shifting the tide, if only just a little, toward the publication of negative results. The Journal of Negative Results in BioMedicine, which launched in 2002, is now defunct as of September 2017, stating that since its launch, other journals have begun to publish negative or null results alongside reports of positive results. While this is encouraging, many of the high-impact journals place little to no emphasis on the publication of negative results. If the journals with the most visibility do not value null results, then many of the scientists who want to publish in those journals will be biased against prioritizing the publication of their negative data. Matosin et al. (2014) argue that the publication of negative results is not merely making a story out of nothing, and that all data should be published, whether positive or negative, along with a hypothesis to explain the data.
Just as scientists can think of and discuss reasons why something is happening, they should be able to think of and discuss reasons why something is not happening.

Restore Public Faith in Science - Fix Bad Statistics

Stepping out of the ivory tower of academia, we come face-to-face with a public that has a growing distrust of science. When bad science makes its way into the media, it feeds that distrust by handing skeptics leverage – if scientists say it, it must be true, right? For example, we can point to study after study showing that vaccines do not cause autism, but one heavily flawed paper from 1998 has now convinced thousands of new parents that vaccinating their children is unnecessary and even dangerous, despite the eventual retraction of that paper (thanks, Andrew Wakefield). Biomedical science is probably the most publicly discussed field, because the bench results will theoretically make their way to humans as treatments and cures.

The “publish or perish” mentality has driven scientists to value results over process. And who could blame us, when our ability to do research is funded based on our ability to produce novel results, and our skill in gaining this funding is what keeps us employed? With the pressure on to achieve the desired results in the shortest possible time, we arbitrarily decide a sample size of three is enough to detect an effect, if one exists. Results in hand, we open the door for our intrinsic biases to sneak in and permeate our data analysis, hoping to achieve a value of p < 0.05. Further, we encourage this behavior throughout the tiers of the lab – it starts with the PI, who is thankful the data will fit nicely into the grant renewal, and trickles down to the relieved graduate students, who can write the data up into a manuscript to check a box off their graduation requirements.
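The “sample size of three” problem can be made concrete with a quick simulation. The numbers below are illustrative assumptions (a fairly large true effect of one standard deviation, a two-sample t-test), not taken from any particular study:

```python
import numpy as np
from scipy import stats

def power_at_n(n, effect=1.0, trials=2000, alpha=0.05, seed=0):
    """Monte Carlo estimate of two-sample t-test power at group size n.

    `effect` is the true difference in group means, in units of the
    standard deviation. We repeatedly draw two groups, run the test,
    and count how often p < alpha when a real effect exists.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect, 1.0, n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / trials

print(f"n=3 per group:  power ~ {power_at_n(3):.2f}")
print(f"n=17 per group: power ~ {power_at_n(17):.2f}")
```

With three samples per group, even a one-standard-deviation effect is detected only a small fraction of the time; the textbook rule of thumb of roughly 17 per group is what it takes to reach about 80% power for that effect size. An underpowered design both misses real effects and inflates the ones that do cross p < 0.05.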

This culture results in inadequate training in the statistical design and analysis of experiments, generates science whose results cannot be replicated, and perpetuates a cycle of scientists untrained in recognizing an inherently flawed study. If the people who are considered well-informed are unable to identify these issues, it is almost certain that a layperson would not be able to distinguish between statistically sound and corner-cutting science.

We owe it to the public, and to ourselves, to do better.

The experiment is not working

Science. The word itself has so many different connotations. For myself, science is definitive – or at least it should be. Ideas can change, and the experiments may not explain the full story, but science should be definitive. This is where biases enter the arena. In the pursuit of definitive science, researchers are cherry-picking the data that fit their particular narrative. Researchers think they are speaking the truth and are publishing the truth. However, the other experiments that do not fit the narrative are slid into supplemental figure 8 or left unpublished.

In The Honest Truth about Dishonesty, Dan Ariely describes an experiment with ten-minute conversations between strangers. Afterwards, each stranger would state that they had not lied during the conversation; upon further scrutiny, it would be revealed that they had lied two to three times. In my opinion, this is what happens with the observational biases that occur every day in scientific research. The pressure to succeed, excel, and continually publish has created an environment where the successful experiments are taken to the PI and the experiments that do not fit into the story are not mentioned. PIs are beginning to write papers before all of the research is performed. We are placing huge biases on the science that gets reported.

During my time in research, there have been several instances where experiments from our lab or from different labs were not believed to be correct because they did not fit the narrative the PI was trying to tell. Beyond this, I heard one PI state that an experiment was “not working” because it did not fit their narrative. The experiment was working, but it was not giving the expected result needed to fit within their already published story. They proceeded to repeat the experiment until it gave them the result they wanted three times in a row…

Scientists believe we are communicating truth, but we are slipping in lies.