Wednesday, April 13, 2016

Repetitive T-Tests Instead of an ANOVA

Goal of the Paper: The goal of this paper was to determine the change in number and activation status of peripheral blood T-cell subsets during two blood-stage infection models of malaria. One model involved two short-course infections, while the other used a long-course infection. The investigators wanted to know whether T-cell subsets changed within each group over time and whether these changes differed between the infection models at the same time point.

Experimental Design: Three animals were assigned to each group for a total of six animals in the cohort. An initial pre-infection sample was collected from each animal and was used as the baseline value for that macaque. Samples were then taken at specific time points after inoculation for analysis by CBC and flow cytometry. Comparisons were then made between baseline and later points in the infection within each infection model, and between the infection models at specific time points.

Critiques of the Paper:
1.  Repetitive t-tests were used to compare within and between groups, when the experimental design calls for a two-way ANOVA with an appropriate post-hoc analysis.

The most egregious error in this paper is the use of t-tests for both the between-group and within-group comparisons given this experimental design. Figure 3 is pasted above for reference and confirms this was the approach used by the authors. According to the methods, a Student’s t-test was used to assess whether there were differences between groups at different time points, and a paired t-test was used to determine whether there were significant changes within a group compared to the baseline value. Conceptually, the authors knew that between-group analyses did not need a paired analysis and that within-group analyses did. However, the approach that was used was incorrect. By performing repetitive t-tests, the authors inflated their type I error rate well above the established threshold of 0.05, and given the sample size of the study, most of the statistically significant results are likely invalid.
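To make the inflation concrete, here is a minimal sketch of how the chance of at least one false positive grows with the number of tests. It assumes the tests are independent, which repeated comparisons against a shared baseline are not, so the values are only a rough guide. The test counts are hypothetical and are not taken from the paper.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests tests.

    Assumes every null hypothesis is true and the tests are independent
    (a simplifying assumption; repeated measures on the same animals are
    correlated, so this is only an approximation).
    """
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    return 1.0 - (1.0 - alpha) ** n_tests


# Hypothetical numbers of t-tests, chosen only for illustration.
for n_tests in (1, 5, 10, 20):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests:2d} tests at alpha = 0.05 -> familywise error rate {rate:.3f}")
```

Under these assumptions, ten tests at 0.05 each already give roughly a 40% chance of at least one spurious "significant" result.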
The experimental design calls for a two-way, repeated measures ANOVA with an appropriate post-hoc analysis to address the objective. The authors should have first performed a two-way, repeated measures ANOVA with group (i.e. infection model) and time point as the two factors. The results of this analysis would indicate whether there were significant effects of subject, group, and time, and whether there was an interaction effect. If so, the next step would have been a post-hoc analysis. Because this experiment appears to be underpowered with only 3 animals per group, only specific planned comparisons should be performed to conserve alpha. An unplanned comparison approach would be unwise because it would likely be too underpowered to identify any significant differences, especially if a pairwise analysis were performed for every possible combination.
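As a rough illustration of why planned comparisons conserve alpha, the sketch below counts comparisons under the two strategies and applies a Bonferroni correction. Bonferroni is my choice here as one simple, conservative option, not something the paper used. The number of time points (6) is a hypothetical value, and the planned set (each later time point vs. baseline within a group, plus the groups compared at each time point) is my reading of the stated objectives.

```python
from math import comb


def all_pairwise_count(n_groups: int, n_timepoints: int) -> int:
    """Comparisons if every group-by-time cell is compared with every other cell."""
    return comb(n_groups * n_timepoints, 2)


def planned_count(n_groups: int, n_timepoints: int) -> int:
    """Comparisons limited to the two questions the study set out to answer."""
    # Each post-baseline time point vs. baseline, within each group.
    within = n_groups * (n_timepoints - 1)
    # Each pair of groups compared at every time point (baseline included).
    between = comb(n_groups, 2) * n_timepoints
    return within + between


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_comparisons


# Hypothetical design: 2 infection models, 6 sampled time points.
n_groups, n_timepoints = 2, 6
for label, count in (
    ("all pairwise", all_pairwise_count(n_groups, n_timepoints)),
    ("planned only", planned_count(n_groups, n_timepoints)),
):
    print(f"{label}: {count} comparisons, per-test alpha = {bonferroni_alpha(0.05, count):.5f}")
```

With these assumed numbers, restricting the analysis to planned comparisons leaves a per-test threshold about four times less strict than testing every possible pair, which matters a great deal with 3 animals per group.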

2.  Figures are poorly designed and do not clearly indicate the relevant information, and the captions are confusing.

The figures in this analysis clearly indicate that individuals are being followed over time (see Figure 3 above). This is an appropriate representation of the data, but unfortunately the other aspects of the figure are lacking. For instance, the arrows on the figure indicate inoculation and drug treatment. One group had a different inoculation and drug treatment regimen than the other, so displaying the data on the same graph is poor data presentation. Further, the repetitive t-tests led the authors to use a strange convention for denoting statistical significance between a time point and the baseline value. An appropriate two-way, repeated measures ANOVA would have avoided this. Overall, I would suggest that the data be graphed separately by group and that a table be generated to show significant differences in outcome variables between groups, to make the results clearer for the reviewer/reader.

3.  Biological conclusions should be questioned as treatment could be considered a confounding/third variable.

One of the goals of the study was to determine whether there were differences in T-cell responses between groups. Indeed, the two-way ANOVA that I mentioned above would answer this question in the most appropriate statistical manner given the experimental design. However, that approach assumes that drug intervention and re-inoculation have no effect on the T-cell values. Based on my experience with these drugs and this model, I would say that this is a fair assumption. Still, it is worth recognizing this aspect of the design and understanding the appropriate statistics should that assumption not be made. If drug intervention were added as a factor to the two-way ANOVA, this would form a three-way ANOVA. As we have learned in the course, it is virtually impossible to interpret the results of a three-way ANOVA because of the complexity of the experimental design and, thus, of the null hypothesis. Therefore, if treatment were going to be a factor, a linear or nonlinear model would likely be needed to determine the effect between groups, the effect within group over time, the effect of treatment, and whether there were three-way interactions between the different factors.

Overall Conclusion: This paper does not use appropriate statistics, has poor data representation, figure captions, and graphs, and the overall assumptions made should be questioned. Additionally, because the study is likely severely underpowered, the conclusions drawn are likely erroneous and could largely be false positives given the specific statistical approach that was taken. Finally, I suspect the authors were “p-hacking” to achieve significance, and that is why they went with the t-tests and not the ANOVA that the experiment calls for.


  1. This comment has been removed by the author.

  2. I agree that the t-tests should not have been used and that a two-way, repeated measures ANOVA would have been more appropriate for the experiment. However, as you mentioned, an n=3 per group would not have resulted in statistical significance with this approach, even if the graph made it look as if there were differences. In this situation it would have been necessary to repeat the experiment with larger sample sizes before considering whether to publish or not, but I work with mice. As was brought up in class, this may not be a feasible approach when working with non-human primates. I am curious whether using t-tests instead of an ANOVA is more "tolerated" among scientists working with non-human primates than among those analyzing/publishing results from rodent models, since rodent colonies are less expensive to maintain. Mice also have much shorter lifespans, so replications can be produced in a shorter time. I wonder if there is some kind of open database that could be made available to scientists working with non-human primates so that they could compare across studies to see whether the trends they observe with small sample sizes may be worth further replication. It seems like this kind of resource would help avoid the issue of false positives that you bring up. Obviously, the best way to avoid this would be replicating with larger samples, as mentioned above. However, I wonder if this is really financially feasible when working on studies with non-human primates.