When we think of experimental power, we're usually in the mindset of setting things up to have enough power to detect a treatment effect. In fact, we should also think about estimating the effect size accurately. It turns out that underpowered experiments not only make it less likely to detect effects, they are also more likely to yield exaggerated effect sizes.
Here's a simulation to illustrate this point.
We simulate a run of unpaired two-sided t-tests that compare a drug effect to a placebo effect at each of two different power levels. Because we define the parameters of the sampled population, we know for a fact that the response level to the drug is 350 response units (SD = 120). The response level to the placebo is 200 response units (SD = 120 also). The variables are assumed to be normally distributed.
Thus, in testing over the long run, the drug would be expected to have an average effect of 150 response units over that of placebo. How accurately do experiments conducted at low and high power estimate this effect size between the drug and placebo?
The experiment is simulated at low power (~50%, sample size per group = 6) and at high power (~90%, sample size per group = 15). An unpaired two-sided t-test is done for each of 1000 random samples under each of the two power conditions. At 50% power about half of the t-tests have a p-value of 0.05 or less, whereas 90% of the tests at higher power are positive.
Now let's look at the averages of the mean group response values only in those tests that correctly detect a "significant" difference between drug and placebo treatment.
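The original code isn't reproduced here, but a minimal R sketch of the simulation described above might look like the following (the function and variable names are mine, and the exact output will vary with the random seed):

# A sketch of the simulation: population parameters as stated above
set.seed(1234)                    # for reproducibility
mu_drug    <- 350                 # true drug response
mu_placebo <- 200                 # true placebo response
sigma      <- 120                 # common SD
n_sims     <- 1000                # simulated experiments per power level

run_sims <- function(n_per_group) {
  out <- replicate(n_sims, {
    drug    <- rnorm(n_per_group, mu_drug, sigma)
    placebo <- rnorm(n_per_group, mu_placebo, sigma)
    c(p         = t.test(drug, placebo)$p.value,   # unpaired, two-sided by default
      m_drug    = mean(drug),
      m_placebo = mean(placebo))
  })
  out  <- as.data.frame(t(out))
  hits <- out[out$p <= 0.05, ]                     # keep only the "significant" tests
  c(power        = nrow(hits) / n_sims,
    mean_drug    = mean(hits$m_drug),
    mean_placebo = mean(hits$m_placebo))
}

run_sims(n_per_group = 6)    # ~50% power
run_sims(n_per_group = 15)   # ~90% power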
In the low power simulation, the average drug effect is 375 response units and that for placebo is 176, for a difference of 199 response units. Thus, in the long run, the average effect size estimate from the low power experiments is inaccurate by 49 response units.
In contrast, the high power simulation yields response unit estimates of 353 and 197 for the drug and placebo groups, respectively, for a difference of 156 response units, within 6 units of the true effect we coded into the simulation.
When using p-value-based cutoff criteria to decide whether an experiment worked or not, we are at risk of making inaccurate estimates of a treatment's effect size. The high powered tests not only detect more "positive" results, they are also much more accurate!
Some in the biomedical world have called for increasing the stringency of type 1 error tolerance, setting the "significance" threshold at p < 0.01 rather than the customary p < 0.05. What effect do you think this will have on effect size estimation in underpowered experiments?
Friday, February 23, 2018
Simulating correlated variables
Measures collected in paired/related measure experimental designs are...correlated!
There is a lot of utility in simulating experimental outcomes...for example, when doing a priori power analysis by Monte Carlo, when generating bootstrap distributions, etc.
To simulate paired observations appropriately, it's important to account for their paired correlations.
Here's a simple R script that accomplishes that, including a check for accuracy, illustrated for a case when 80% correlation is expected.
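The author's script isn't reproduced here; below is a minimal sketch along the same lines, using MASS::mvrnorm (my choice of function, with illustrative means and SDs), including a check that the sampled correlation comes out near the expected 0.80.

# simulate paired (correlated) measures from a bivariate normal distribution
library(MASS)

set.seed(5678)
n    <- 10000                  # number of simulated pairs
mu   <- c(200, 350)            # means of the two related measures (illustrative)
sd1  <- 120; sd2 <- 120        # SDs of the two measures (illustrative)
rho  <- 0.80                   # expected correlation between the paired measures

# build the covariance matrix from the SDs and the correlation
Sigma <- matrix(c(sd1^2,       rho*sd1*sd2,
                  rho*sd1*sd2, sd2^2), nrow = 2)

pairs <- mvrnorm(n, mu = mu, Sigma = Sigma)

# accuracy check: the sampled correlation should be close to 0.80
cor(pairs[, 1], pairs[, 2])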
Monday, February 12, 2018
Accuracy v Precision
This week we're going to touch a bit on the concepts of accuracy and precision.
In biomedical science, we're usually measuring outcomes that nobody has ever measured before in quite the same way (and probably never will measure again!)
As a consequence, we rarely have any sense of how accurate our results are. To know accuracy, we'd have to know the "true" value of the outcome variable in the population we've sampled. That truth is usually elusive.
So most of the time the best we can do is estimate precision. As we move into the statistics of measured data, the standard error of the mean (sem) is the statistic that we use for assessing precision. The lower the sem, the greater the precision.
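For reference, the sem is just the sample standard deviation divided by the square root of the sample size; a quick R illustration on a made-up sample:

x   <- rnorm(15, mean = 350, sd = 120)   # an illustrative random sample
sem <- sd(x) / sqrt(length(x))           # standard error of the mean
sem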
I'm omitting from the slide deck this year one of my favorite pics that illustrates wonderfully how the battle for greater accuracy and improved precision plays out over time.
This graph is from the Particle Data Group at the Lawrence Berkeley Lab. The PDF it comes from has 11 other graphs much like it. This one shows the evolution of the estimated mass of the eta particle over time.
For over 20 years the mass was one value. Since then, the mass value has bounced around quite a bit. The most current estimate of the mass appears very precise.
But is it accurate?
The true mass of the eta has remained the same, while the accuracy and precision with which it is measured continue to fluctuate.
Thursday, February 8, 2018
"Hype" in Scientific Research
There are many problems with the current state of published research, many of which I have noticed in my four short years being involved in this world, and many of which appear not to be recent phenomena but rather problems that have plagued science since its inception. Of all these issues, the one that I think is often most harmful to the progress of scientific research and the proper application of the scientific method is the tendency for researchers, research institutions and the press to overhype and often completely misrepresent the significance of their findings.
As the October 2015 Vox article states, “the media has a penchant for hailing new medical ‘breakthroughs’ and ‘miracles’ – even when there’s no evidence to back up those claims.” But the media are not solely to blame. As the author points out, the researchers, doctors, institutions and companies involved in the production of these “miraculous” studies often overstate the significance of their research in the first place. One example that comes to mind is a recent study showing the effect of retooled diabetes drugs on the progression of Alzheimer’s in transgenic mice. I was surprised to find headlines in national news sources, such as “Report: Scientists Find Alzheimer’s Treatment While Trying To Cure Diabetes”, that fail to acknowledge that the study was done only in mice and is light-years away from the clinical trial data required to make such a claim. Even to call it a valuable mouse study would require careful review of potential biases and of the statistical and experimental methods.
I think overhyping and overstating the significance of research often leads to cutting corners in the proper application of the scientific method. Hypotheses quickly become regarded as theory, and a lot of time and money is invested in ideas that are flawed to begin with. I think increasing criticism and analysis of publications via comment sections and forums such as PubPeer is an important step in reducing scientific hype. Most importantly, I agree with the Economist article, which suggests that graduate student education in statistical methods, as well as encouraging a skeptical outlook on scientific research, are key to combating this issue.
Monday, February 5, 2018
Prospective Solutions to the Lack of Peer Review Rigor and Reproducibility in Science
The Vox article about PubPeer discussed how the platform provides a simple, yet interesting, solution to the lack of rigor in peer review. We as media consumers observe every single day how the court of public opinion makes for an important swing vote when judging the actions of the most elite within society. PubPeer capitalizes on this idea, though its court is filled with a much more niche audience of scientists. I wholeheartedly support this platform, as bias and unreliable data are so easily missed during the peer review process. PubPeer provides additional support to ensure that articles that slipped through the cracks of a broken peer review process are retracted. I also see PubPeer as a mechanism to provide a forum for collaboration. Its electronic platform could be a nidus for the formation of collaborations between research groups. In my opinion, these types of collaborative research projects would foment independent replication of experiments and ultimately promote scientific rigor.
However, I believe the systemic problem with peer review that cannot be addressed by PubPeer is the issue surrounding publication bias and the fact that unpublished articles are enriched with null results. Changing this requires a paradigm shift in which scientists and editors embrace the knowledge that a null result provides whilst simultaneously railing against the mentality that “impact factors” reign supreme. As was mentioned in the Economist article about unreliable research, “researchers ought to be judged on the basis of the quality, not the quantity, of their work.” However, the scientific community needs to begin to realize that quality science does not always lead to hypothesis confirmation.
The Economist article about unreliable science also sparked thoughts on additional ways in which the peer review process could be improved on the front end. For example, better training should be provided to editors. In addition, editor performance should be incentivized. Lastly, journals should foster a closer working relationship with statisticians so that raw data of prospective journal articles can be cross-checked by a journal-affiliated statistician.
Overall, I believe that PubPeer and replication initiatives by PLoS ONE are good starting points when attempting to fix the peer review and replication problems that run rampant within the scientific community. However, as mentioned above, the most difficult change to make relates to the pervasive mentality about science that propagates these issues upstream of where our current initiatives are working.
Tuesday, January 23, 2018
Hidden Biases: Looking for Bias in Unexpected Places
Assuming that we, and all of our colleagues, are striving to conduct unbiased science, I often forget to look for bias beyond the obvious places. Sure, there’s the scientist trying to make his or her research appear more important, the prototypical bias we have come to expect, but what about the biases we can’t control?
This article from NPR points out that bias can be virtually impossible to avoid. To make a long story short, a researcher found that the results of his experiment had been skewed by natural hormone differences between men and women. Depending on the gender of the person administering the experiment, the results were different.
Of course this doesn’t mean that one gender gets inherently better results than the other, but it does illustrate an important idea. Not only do we have to be proactive in controlling our own personal biases, but we also have to look for external factors that may affect or otherwise disrupt results.
I wouldn’t have expected the gender of the experimenter to affect the results of an experiment. This could be the tip of the iceberg. Does the color of the experimenter’s eyes matter? What about the perfume the experimenter was wearing? Of course some external factors are inevitable, and the point of the article wasn’t to condemn scientists of either gender. The article was a simple plea that scientists report these kinds of external factors.
To me, this requires a twofold effort from scientists: First, we need guidelines on the myriad external factors that can affect our science. Second, we need to be vigilant, reporting every factor that could possibly affect results, no matter how seemingly inconsequential. On the surface, it sounds like a pain. I understand that. It isn’t easy to keep copious and detailed notes. Reporting like this will, almost certainly, be tedious. Still, if there is any chance that a factor could affect reproducibility, it should be recorded.
More than anything, I think the real call to action is to think about hidden ways that bias could exist in our work. I think our experiments and our credibility will be all that much better for it.