Tuesday, April 19, 2016

P-hacking, and proud of it

I did not pick up this paper hoping to use it as a BadStats example, which makes the improper statistics all the more disappointing. The behavioral output used by Gaskin et al. (2010) was an investigation ratio (IR) comparing exploration times of repeated (familiar) and novel objects. That metric made their abuse of statistical design personal, because I use the exact same behavioral output in my own experiments. The premise of the study was sound, and was developed in response to an earlier, poorly designed one: Albasser et al. (2009) claimed to find a relationship between the amount of time a rat explores an object and its later memory for that object. Their results were pooled from subjects in different experimental conditions, however, so the correlation they ran encompassed multiple independent variables. Gaskin et al. (2010) set out to test the relationship between exploration duration and memory strength experimentally.
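For readers who have not worked with the measure, here is a minimal sketch of how an investigation ratio is typically computed. The exact formula is not reproduced from the paper; the definition below (novel-object exploration over total exploration, so chance is 0.5) and the example numbers are my assumptions.

```python
import numpy as np

def investigation_ratio(t_novel, t_familiar):
    """Assumed definition: time spent exploring the novel object divided by
    total exploration time.  Chance performance is therefore 0.5."""
    t_novel = np.asarray(t_novel, dtype=float)
    t_familiar = np.asarray(t_familiar, dtype=float)
    return t_novel / (t_novel + t_familiar)

# Hypothetical exploration times (in seconds) for three rats in one test session
ir = investigation_ratio([12.4, 9.1, 15.0], [6.2, 8.7, 5.5])
print(ir)  # values above 0.5 indicate a preference for the novel object
```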

The experimental setup was fairly sound. In total, the team performed four experiments to assess any correlation between exploration and memory, and then manipulated exploration time to determine whether it played a causal role. The problems arose most notably in experiments 3 and 4, in which the authors compared the memory of control and dorsal hippocampus-lesioned rats on a novel object preference task. Importantly, the only difference between experiments 3 and 4 was the retention interval between the study session (in which rats viewed two identical copies of an object) and the multiple test sessions (in which rats viewed a copy of the object from the study phase alongside a novel object). Experiment 3 used a 2-hour delay for the first test and then 24-hour intervals for the next four days, whereas experiment 4 separated all tests by 35 seconds. Not surprisingly, the data from the two experiments were presented in identical fashion:

[Figures omitted: IR results for experiment 3 (plotted by day) and experiment 4 (plotted by test).]

Really the only way to tell which is which is the labelling of Days (experiment 3) versus Tests (experiment 4). Even with the identical design, their analyses smack of p-hacking. To their credit, they are fully transparent about it.

In experiment 3, analyses of the IRs included (a sketch of this battery in code follows the list):
1) One-sample t-tests for both groups at every time point, to determine whether IRs were above chance (0.5) – found significant differences
2) Independent-samples t-tests comparing IRs between the control and lesioned groups at every time point – found significant differences
3) A one-way repeated-measures ANOVA (test day as the repeated measure), run separately for the control and lesion groups – they did report a significant F-value in the control group, but claim it “was only due to a significant difference between the IRs obtained during Days 2 and 3”, and therefore dismiss it
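To make the sheer number of comparisons concrete, here is what that battery looks like in code. The long-format layout and column names ('rat', 'group', 'day', 'IR') are hypothetical, and scipy/statsmodels stand in for whatever software the authors actually used.

```python
from scipy import stats
from statsmodels.stats.anova import AnovaRM

def experiment3_battery(df):
    """df: long-format pandas DataFrame with hypothetical columns
    'rat', 'group' ('control'/'lesion'), 'day', and 'IR'."""
    results = {}
    for day, day_data in df.groupby('day'):
        ctrl = day_data.loc[day_data.group == 'control', 'IR']
        les = day_data.loc[day_data.group == 'lesion', 'IR']
        # (1) one-sample t-tests against chance (0.5) for each group
        results[('chance', 'control', day)] = stats.ttest_1samp(ctrl, 0.5)
        results[('chance', 'lesion', day)] = stats.ttest_1samp(les, 0.5)
        # (2) independent-samples t-test between groups at this time point
        results[('between-groups', day)] = stats.ttest_ind(ctrl, les)
    # (3) one-way repeated-measures ANOVA (day as the repeated factor),
    #     run separately within each group
    for group, group_data in df.groupby('group'):
        results[('rm_anova', group)] = AnovaRM(
            group_data, depvar='IR', subject='rat', within=['day']).fit()
    return results
```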
Alright. Got that? Here were the analyses for experiment 4 (the additions beyond experiment 3 are sketched in code after the list):
1) One-sample t-tests for both groups at every time point, to determine whether IRs were above chance (0.5) – found significant differences
2) Independent-samples t-tests comparing IRs between the control and lesioned groups at the first time point – found significant differences
3) A one-way repeated-measures ANOVA on each group – showing no effect of test session for either group
4) A two-way mixed ANOVA with group and test session as factors – found a trend toward a group effect and a significant effect of test session
   a. Post hoc comparisons of IRs between groups – found significant differences
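Analyses 1–3 mirror the experiment-3 sketch above; the new pieces are 4 and 4a. Again a sketch under the same assumed column names, with pingouin's mixed_anova standing in for the authors' software and plain between-group t-tests standing in for their unspecified post hoc procedure:

```python
import pingouin as pg
from scipy import stats

def experiment4_additions(df):
    """df: long-format DataFrame with hypothetical columns
    'rat', 'group', 'test', and 'IR'."""
    # (4) two-way mixed ANOVA: group (between-subjects) x test session (within)
    aov = pg.mixed_anova(data=df, dv='IR', within='test',
                         subject='rat', between='group')
    # (4a) post hoc between-group comparisons at each test session
    posthoc = {
        test: stats.ttest_ind(
            d.loc[d.group == 'control', 'IR'],
            d.loc[d.group == 'lesion', 'IR'])
        for test, d in df.groupby('test')
    }
    return aov, posthoc
```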

By the end, I was confused.  Why run a two-way ANOVA in experiment 4 and not 3?  Why also run a one-way ANOVA in experiment 4?  Why run separate one-way ANOVAs on the two groups that end up being compared by t-tests? Why run independent t-tests and an ANOVA for the same data in experiment 3?

The authors set up the paper by arguing that differences in IR between groups are not behaviorally relevant – only differences from chance are. If that is the case, the independent-samples t-tests are improper. A two-way mixed-factor ANOVA does address both changes over time and differences between groups, so their use of one in experiment 4 was appropriate. Tacking one-way ANOVAs onto the same data, however, was not.

The raw data are not reported, so I cannot reevaluate their results. The shifting, redundant analyses (of varying appropriateness), however, make me feel as though I were sitting in on a graduate student running every test they could think of to drive that p-value down. I commend the authors for reporting them all, but transparency does not turn biased, exploratory analyses into good practice.
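The worry about redundant testing is not abstract. A quick simulation of my own (not anything from the paper) shows how often a battery like this produces at least one spurious "significant" result even when nothing is going on:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_rats, n_sessions = 5000, 12, 5
false_positives = 0

for _ in range(n_sims):
    # Null data: two groups whose true IRs sit at chance (0.5) in every session
    control = rng.normal(0.5, 0.1, size=(n_rats, n_sessions))
    lesion = rng.normal(0.5, 0.1, size=(n_rats, n_sessions))
    pvals = []
    for s in range(n_sessions):
        # the same per-session battery: two chance tests and one group test
        pvals.append(stats.ttest_1samp(control[:, s], 0.5).pvalue)
        pvals.append(stats.ttest_1samp(lesion[:, s], 0.5).pvalue)
        pvals.append(stats.ttest_ind(control[:, s], lesion[:, s]).pvalue)
    if min(pvals) < 0.05:
        false_positives += 1

# With 15 uncorrected tests per 'experiment', far more than 5% of null
# datasets yield at least one p < 0.05.
print(false_positives / n_sims)
```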

