Tuesday, July 23, 2019

How p-values of underpowered studies select for exaggerated effect size

When we think of experimental power, we're usually in the mindset of setting things up to have enough power to detect a treatment effect. In fact, we should also think about estimating the effect size accurately. It turns out that underpowered experiments are not only less likely to detect effects; when they do detect one, they are also more likely to report an exaggerated effect size.

Here's a simulation to illustrate this point.

We simulate a run of unpaired two-sided t-tests that compare the response to a drug with the response to a placebo at each of two different power levels. Because we define the parameters of the sampled populations, we know for a fact that the true mean response to the drug is 350 response units (SD = 120). The true mean response to the placebo is 200 response units (SD = 120 also). Both variables are assumed to be normally distributed.

Thus, in testing over the long run, the drug would be expected to have an average effect of 150 response units over that of placebo. How accurately do experiments conducted at low and high power estimate this effect size between the drug and placebo?

The experiment is simulated at low power (~50%, sample size per group = 6) and at high power (~90%, sample size per group = 15). An unpaired two-sided t-test is done for each of 1000 random samples under each of the two power conditions. At 50% power about half of the t-tests have a p-value of 0.05 or less, whereas at 90% power about 90% of them do.
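For readers who want to try this themselves, here is a minimal sketch of such a simulation in Python. It is not the code behind the numbers in this post. It uses only the standard library, and it assumes a pooled-variance (Student's) t-test with the two-sided critical t values at alpha = 0.05 hardcoded from a standard t table instead of computing p-values. The function names and the seed are made up for illustration.

```python
import random
import statistics

# Two-sided critical t values at alpha = 0.05, copied from a standard t table
# (an assumption made to avoid needing a stats library).
# Keyed by sample size per group; degrees of freedom = 2n - 2.
T_CRIT = {6: 2.228, 15: 2.048}

def t_statistic(a, b):
    """Unpaired t statistic for two equal-sized groups, pooled variance."""
    n = len(a)
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
    return (statistics.mean(a) - statistics.mean(b)) / (2 * pooled_var / n) ** 0.5

def fraction_significant(n, reps=1000, seed=1):
    """Fraction of simulated experiments with |t| at or beyond the critical value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        drug = [rng.gauss(350, 120) for _ in range(n)]     # true mean 350, SD 120
        placebo = [rng.gauss(200, 120) for _ in range(n)]  # true mean 200, SD 120
        if abs(t_statistic(drug, placebo)) >= T_CRIT[n]:
            hits += 1
    return hits / reps

for n in (6, 15):
    print(f"n per group = {n}: fraction significant = {fraction_significant(n):.2f}")
```

The printed fractions are empirical estimates of power, so they should fall near 0.5 and 0.9, with some wobble from one seed to the next.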

Now let's look at the averages of the mean group response values only in those tests that correctly detect a "significant" difference between drug and placebo treatment.

In the low power simulation, the average drug response is 375 response units and that for placebo is 176, for a difference of 199 response units. Thus, in the long run, the "significant" low power experiments overestimate the true effect size of 150 by 49 response units.

In contrast, the high power simulation yields response unit estimates of 353 and 197 for the drug and placebo groups, respectively, for a difference of 156 response units. That is within 6 response units of the effect size we coded in for the groups.
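The selection step can be sketched in a few lines of Python. Again, this is an illustration under stated assumptions and not the original analysis: it uses a pooled-variance t-test, critical t values hardcoded from a standard t table, and a made-up function name and seed. It keeps the group means only from experiments that cross the significance threshold and then averages them.

```python
import random
import statistics

# Two-sided critical t values at alpha = 0.05 from a standard t table
# (assumed, to keep the sketch dependency-free). Keyed by n per group.
T_CRIT = {6: 2.228, 15: 2.048}

def significant_estimates(n, reps=1000, seed=1):
    """Average drug and placebo group means over the significant tests only."""
    rng = random.Random(seed)
    drug_means, placebo_means = [], []
    for _ in range(reps):
        drug = [rng.gauss(350, 120) for _ in range(n)]
        placebo = [rng.gauss(200, 120) for _ in range(n)]
        m_drug, m_placebo = statistics.mean(drug), statistics.mean(placebo)
        pooled_var = (statistics.variance(drug) + statistics.variance(placebo)) / 2
        t = (m_drug - m_placebo) / (2 * pooled_var / n) ** 0.5
        if abs(t) >= T_CRIT[n]:  # keep only the "significant" experiments
            drug_means.append(m_drug)
            placebo_means.append(m_placebo)
    return statistics.mean(drug_means), statistics.mean(placebo_means)

for n in (6, 15):
    drug, placebo = significant_estimates(n)
    print(f"n per group = {n}: drug {drug:.0f}, placebo {placebo:.0f}, "
          f"difference {drug - placebo:.0f} (true difference is 150)")
```

Exact values depend on the seed, but the pattern should hold: the estimated difference at n = 6 lands well above 150, while the one at n = 15 sits much closer to it.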

When using p-value based cutoff criteria to decide whether an experiment worked or not, we are at risk of making inaccurate estimates of the effect size of a treatment. The high powered tests not only detect more "positive" results, but their effect size estimates are also much more accurate!

Some in the biomedical world have called for increasing the stringency of type 1 error tolerance, setting the "significance" threshold (alpha) at p < 0.01 rather than p < 0.05. What effect do you think this will have on effect size estimation in underpowered experiments?