Tuesday, October 11, 2016

Does Misuse of Statistical Significance Give Us a Higher Likelihood of Immortality?

This past week, researchers in the Department of Genetics at Albert Einstein College of Medicine in New York published a letter in Nature. The letter’s title was pithy and shocking: “Evidence for a limit to human lifespan.” GASP! “But this can’t be true,” some may say, “with advances in medicine, we’re only continuing to lengthen the human lifespan!” First off, that strikes me as a bit of a teleological argument: it assumes that just because humans have a certain intelligence about the world, our purpose is to extend life or live forever. There’s a moral purpose implied in that answer. Anyways, that’s not the point, but it does set up an interesting debate.

Maybe we humans aren’t supposed to live forever. We forget that we are but specks on the lithograph of time; we’re babies! Homo sapiens has existed for only about 200,000 years on a 4.5-billion-year-old planet. Many things have lived before us, and many things will live after us. Perhaps this sounds a bit nihilistic, but I was happy to hear these reports from these geneticists. Humans are single-handedly taking the gauntlet to ourselves and our planet, making sure life, in any form, is going to have a hard time existing on a planet whose temperature has already risen almost 2 degrees Fahrenheit over the past century. So yes, I was overjoyed, ecstatic, and relieved to hear that our stay on planet Earth may not be extended indefinitely -- until I heard these geneticists had used statistics to reach their results. Oh great, I thought, another potential mishap with statistics by people not formally trained in statistics.

The methods the study employed were actually pretty simple; most people who have taken a basic stats class could probably follow everything up until the cubic smoothing splines. The authors plotted the maximum reported age at death (MRAD) for 534 people over the age of 110, or what they call “supercentenarians,” gathered from the International Database on Longevity. These supercentenarians came from Japan, France, the US, and the UK. Two linear regressions were fit to subsets of the data, one for 1968-1994 and one for 1995-2006. The scientists found an increase in MRAD of 0.15 years per year before 1995 (with r=0.68 and p=0.0007) and a decrease in MRAD of 0.28 years per year from 1995 to 2006, with r=-0.35 and p=0.27 for the decreasing segment. The scientists didn’t expand on the statistical shortcomings of that decreasing segment. They ran the same procedure on other data series and found a similar overall trend -- that is, a significant increase until the breakpoint and an insignificant decrease after it -- but did not discuss the weak correlation or lack of significance.
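For what it’s worth, the basic setup is easy to reproduce. Here is a minimal sketch (not the authors’ code, and with made-up numbers standing in for the IDL records) of the two-segment fit: regress MRAD on calendar year separately on either side of the 1995 breakpoint and read off the slope, r, and p for each segment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
years = np.arange(1968, 2007)  # 1968 through 2006

# Hypothetical MRAD series: rising ~0.15 yr/yr until 1994, roughly flat after
mrad = np.where(years <= 1994,
                110.0 + 0.15 * (years - 1968),
                114.0) + rng.normal(0.0, 1.0, size=years.size)

# Fit each segment separately, as in Figure 2a of the letter
for label, mask in [("1968-1994", years <= 1994), ("1995-2006", years >= 1995)]:
    fit = stats.linregress(years[mask], mrad[mask])
    print(f"{label}: slope = {fit.slope:+.2f} yr/yr, "
          f"r = {fit.rvalue:+.2f}, p = {fit.pvalue:.4f}")
```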

The initial criticisms of the work have been predictable, to say the least. Why didn’t the scientists discuss their dismal p-values? Moreover, some say that seeing a significant increase in MRAD and then observing a decrease, even an insignificant one, is still something. That, in my opinion, is an amateur cop-out. There are many more reasons to be critical of this work than the terrible p-values, and it starts with the study design.

The main figure in Dong et al. and the bane of my existence these last four days...

First, it’s really not clear why the scientists chose the breakpoint between 1994 and 1995. Clarifying this would be helpful, if not crucial, since the entire crux of the paper’s argument leans on this breakpoint and the analysis built on it (Figure 2a). It looks as though the breakpoint was chosen to support their initial claim that humans have reached a plateau in lifespan, and the declining linear regression in MRAD is used rhetorically rather than as a strict null-hypothesis significance test (NHST). Is NHST even the right tool here? I’d argue no. The hypothesis they actually care about is that the slope (or correlation) is essentially zero, and failing to reject a null is not evidence for it; detecting, or ruling out, effect sizes close to zero requires a much larger sample size, and with data that exist on the margins of collectability, that’s hard to come by.
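To put some numbers behind that: a back-of-the-envelope power calculation using the standard Fisher z approximation (my own sketch, not anything in the paper) shows roughly how many yearly observations you’d need to detect a correlation of a given size with 80% power at alpha = 0.05.

```python
import numpy as np
from scipy import stats

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate n needed to detect correlation r with a two-sided test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    fisher_z = np.arctanh(r)      # Fisher z-transform; var(z) ~ 1 / (n - 3)
    return int(np.ceil(((z_alpha + z_power) / fisher_z) ** 2 + 3))

for r in (0.10, 0.20, 0.35, 0.50):
    print(f"true r = {r:.2f}  ->  n needed ~ {n_for_correlation(r)}")

# The post-1995 segment rests on roughly a dozen yearly points, nowhere near
# enough to tell a weak downward trend apart from no trend at all.
```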

The authors do make an implied comment on the power of their statistical tests, noting that they probably don’t have a large enough sample size to make the case for a more robust statistical model. To alleviate this, they expanded the sample post hoc to include not just the MRAD but also the second- through fifth-highest reported ages at death (RADs); to me this looks more or less like a sneaky way of doing away with outliers. They concluded that the average RADs had not increased since 1968 and that “all series showed the same pattern as the MRAD.” But they never report an ANOVA (or point to any supplemental material) testing whether those average RADs actually differ across periods; it appears they just eyeballed it. Another option would be to compare the slopes of the “plateau” regions with some sort of t-test, as sketched below.
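Here is one way that slope comparison could look, again with purely hypothetical data: fit each “plateau” segment separately, then test the difference between the two slopes against their combined standard error (a simple unpooled approximation, not anything the authors did).

```python
import numpy as np
from scipy import stats

def compare_slopes(x1, y1, x2, y2):
    """t-test for the difference between two independent regression slopes."""
    f1 = stats.linregress(x1, y1)
    f2 = stats.linregress(x2, y2)
    se_diff = np.hypot(f1.stderr, f2.stderr)   # sqrt(SE1^2 + SE2^2)
    t = (f1.slope - f2.slope) / se_diff
    df = len(x1) + len(x2) - 4                 # two slopes, two intercepts
    p = 2 * stats.t.sf(abs(t), df)
    return t, p

# Example with made-up "plateau" segments from two hypothetical series
rng = np.random.default_rng(1)
years = np.arange(1995, 2007)
series_a = 114 + 0.02 * (years - 1995) + rng.normal(0, 1.0, years.size)
series_b = 113 - 0.05 * (years - 1995) + rng.normal(0, 1.0, years.size)
t, p = compare_slopes(years, series_a, years, series_b)
print(f"t = {t:.2f}, p = {p:.3f}")
```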


In the end, the statistical procedures taken to prove their point would have any data scientist’s head spinning, probably enough that most people just wouldn’t look at them too hard for too long. This is dangerous, but we can’t say Nature hasn’t committed the crime before: the journal has published funky stats many times simply because a study’s title was provocative (as demonstrated by our class this semester). Nonetheless, you have to tip your cap to scientists who wanted to make a headline, because they surely did, and then perhaps bonk them on the head with that cap and tell them to do better stats before making such grand claims.
