Saturday, October 15, 2016

Take that Jenny McCarthy! Statistical Tests Show Improvement in Vaccination Completeness

It hasn’t been too long since celebrity Jenny McCarthy let it be known that she is vehemently opposed to our current vaccines. She was one of the first celebrities to support the pseudoscientific view that vaccinations cause autism, and recently she has made it clear that she thinks anyone carrying a virus is deathly “sick,” in her comments made about former co-star Charlie Sheen, who has HIV. Uh, that’s not exactly how the fields of virology and immunology have deduced the process, Jenny.

Nonetheless, public health officials will soldier on because they recognize the benefits of vaccinating people, especially children, and that such vaccination prevents sickness even in the case of contracting the virus. One question I’ve always had as a bench scientist is how is it that public health officials know they’re doing their job efficiently? I see many of my friends going to public health school wanting to help with the education arm of public health issues. How do we know if the methods they use are effective? What is a quantifiable measure for us to obtain a level of effectiveness?

Figure 1. An example of paired participant studies. In the case of the SUNY Upstate article I looked at, the comparison group would be a group of similar age and income in an adjacent community, compared with a group of interest receiving the intervention.
Separately, I’ve enjoyed spending the semester reading about statistical tests we haven’t gone over in class. One of those tests is McNemar’s test. A quick online definition says McNemar’s test is a “statistical test on paired nominal data,” which basically means comparing a binary (yes/no) outcome across paired data (see Figure 1 for a paired data example). When I first read about this, I thought of vaccines. A good signal to public health educators that their programs are working is whether populations are vaccinated or not, specifically in communities that face traditional barriers to quality healthcare.

In a community health paper published in 2013, public health professionals at SUNY Upstate tested their proposed vaccination intervention, which involved partnering with community organizations such as the Salvation Army, allowing patients a Q&A session prior to vaccination, and connecting patients to vaccination specialists through community liaisons. The authors of the study paired their subjects based on age and household income across 10 different community sites, separating them by intervention-positive or intervention-negative status, and measuring proof of influenza vaccination in the presence or absence of the intervention. They wanted to test whether the intervention had successfully raised vaccination levels across age cohorts and overall. The group then used McNemar’s test to construct their 95% confidence intervals to illustrate the nearly 17% increase (95% CI 15.5-19.5) in influenza vaccination levels (see Figure 2) compared to state and county level alternative interventions. Impressive! Although the authors don’t report a p-value, with the right null hypothesis, McNemar can calculate one for you. It’s so handy.
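To make the p-value remark concrete, here is a minimal sketch of McNemar's test in Python using only the standard library. The counts are hypothetical and are not taken from the SUNY Upstate paper. The names `b` and `c` are assumed labels for the two discordant cells of the 2x2 table (pairs where only one member was vaccinated), and the formula shown is the large-sample chi-square version without a continuity correction.

```python
import math

def mcnemar_test(b, c):
    """Large-sample McNemar test from the two discordant-pair counts.

    b: pairs where only the intervention member was vaccinated (assumed label)
    c: pairs where only the comparison member was vaccinated (assumed label)
    Returns (chi-square statistic, two-sided p-value).
    """
    if b + c == 0:
        raise ValueError("need at least one discordant pair")
    # Only the discordant pairs carry information about a change
    chi2 = (b - c) ** 2 / (b + c)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical counts, for illustration only
chi2, p_value = mcnemar_test(b=40, c=20)
print(f"chi-square = {chi2:.3f}, p = {p_value:.4f}")
```

The null hypothesis here is that the two kinds of discordant pairs are equally likely, which is why the concordant pairs never enter the calculation.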

Figure 2. The contingency table used to calculate McNemar's test. As shown in Figure 3, McNemar's test relies on reporting those who are enrolled but did not receive vaccinations (not explicitly stated in the chart).

One limitation to McNemar’s test is that it’s meant for large groups. However, based on the population scope of public health data, this doesn’t seem to be an issue – in fact it is an advantage for novice public health professionals to know this fact, especially if they’ve never done statistical analysis.   
Figure 3. A screen grab of the McNemar test calculator found on GraphPad. Motulsky recommends readers use this for calculation of confidence intervals and p-values in his book. 
All this time, I thought public health professionals went off magnitudes of numbers alone, perhaps testing averages across populations in an ANOVA test. As it turns out, they use statistical tests built for paired data, specifically McNemar’s test, when employing tried-and-true case-control designs.

Pffft...Luke...I Am Your (Updating) Factor!: A Short Guide to Bayesian Statistics and Bayesian Experimental Design

Bayes Theorem – remember that mentioned way, way back in Lecture 2? No, it isn’t some new age way of predicting who you’ll be romantically involved with this winter, but there is a field of inference that comes from this theorem that plays into Bayesian statistics, and that is a subfield of statistics many scientists should be paying attention to.

Up until this point, we’ve basically been learning more frequentist statistics than Bayesian statistics (i.e., heavy on the linear regression, chi-squares, correlations, less so on multiple comparisons, etc.). This is evident from our HistoryStats projects: we’ve been looking at the lives and work of some of the founders of the frequentist school of thought like Neyman, Pearson, and Wald. How do we best describe these frequentist statisticians? Well, let’s take a simple, intuitive analogy described on this StackExchange forum. According to “user28,” having a frequentist frame of mind is like hearing your phone go off and relying on a model that helps you identify the area of your home the sound is coming from, in order to infer where the phone is. Having a Bayesian frame of mind means you may have that model in mind, but you also take into account places where you’ve mistakenly left the phone in the past. Simply, frequentists believe that data are a repeatable random sample, while Bayesians believe that data are observed from a real sample. Furthermore, frequentists believe that parameters are fixed, whereas Bayesians believe the parameters are unknown but can be described by probabilities. (So…that would make Fisher’s maximum likelihoods a closet Bayesian statistic, wouldn’t it?)

TJ gave us some great examples of Bayes Theorem applied to real life, like the probabilities in clinical trials with cancer treatments. However, we never really got to see how Bayesian inference affects the experiment’s statistics and experimental design.

To understand the experimental design, we need to understand exactly how beliefs are updated by Bayes Theorem, generally. Let’s say you are going to flip a coin 10 times and you suspect a probability distribution to describe these coin flips. Here, h represents the probability of heads, and p(h) represents the distribution settled on prior to any coin flips. Then the coin is flipped and way more heads come up than usual, say 8 heads. Using Bayesian inference, we need to update our prior belief about the coin: it now looks unfair. So our new beliefs may be modeled as p(h|f), where f is the number of heads observed in those 10 flips. This abstraction is read as “what is the probability distribution of h given the number of heads resulting from 10 tosses [in this case 8]?” This seems like a reasonable update as we pare down our hypotheses to fit our experimental data. Mathematically, the update looks like p(h|f) = u(h, f) × p(h), where u(h, f) is an updating factor written out as u(h, f) = l(f|h) / l(f). Here l(f|h) is the likelihood function, or the probability of observing 8 heads given a particular value of h. The denominator of the updating factor is the likelihood of the data without conditioning on any particular value of h. Because Bayesian statistics treats parameters as uncertain rather than fixed, we can average over them, so the likelihood of the data can be written as an integral, l(f) = ∫ l(f|h) p(h) dh (this is similar to a general expectation value). The denominator turns out to be a weighted average of the likelihood across all possible parameter values. Put simply, the updating factor is a ratio that tells you which parameter values the data favor relative to that average.
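The coin-flip update can be sketched numerically. This is a minimal illustration in Python (standard library only) that approximates the integral on a grid of h values. The flat prior, the grid size, and the function name `posterior_on_grid` are all assumptions made for the example, not anything specified in the lecture.

```python
from math import comb

def posterior_on_grid(heads, flips, prior, grid):
    """Return the posterior density p(h|f) at each grid value of h."""
    width = grid[1] - grid[0]
    # l(f|h): binomial likelihood of the observed number of heads at each h
    likelihood = [comb(flips, heads) * h**heads * (1 - h)**(flips - heads)
                  for h in grid]
    # l(f) = integral of l(f|h) p(h) dh, approximated by a midpoint sum
    evidence = sum(l * p for l, p in zip(likelihood, prior)) * width
    # p(h|f) = u(h, f) * p(h), with updating factor u(h, f) = l(f|h) / l(f)
    return [l / evidence * p for l, p in zip(likelihood, prior)]

n_points = 1000
width = 1.0 / n_points
grid = [(i + 0.5) * width for i in range(n_points)]  # midpoints of [0, 1]
prior = [1.0] * n_points  # assumed flat prior: every h equally plausible

posterior = posterior_on_grid(heads=8, flips=10, prior=prior, grid=grid)
posterior_mean = sum(h * p for h, p in zip(grid, posterior)) * width
print(f"posterior mean of h: {posterior_mean:.3f}")
```

Swapping in a prior that is concentrated near h = 0.5 would pull the posterior back toward a fair coin, which is the sense in which the prior and the data both get a vote.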

Darth Vader: crafty with a lightsaber and some conditional probabilities.
How does this play out in the lab? Let’s take a hypothetical animal trial where dose concentrations of many drugs are tested on large numbers of animals to measure their potencies. The lab wants to apply regression analyses to the different drugs based on the specimen they inject the drug into. For one drug, the experimental design included six equally spaced doses given to ten mice each; so, 60 animals to test a range of concentrations for one drug. The investigators measured the number of surviving mice one week after drug administration. About 90% of mice died at high concentrations of the drug, while 10-20% died at low concentrations. After each of the experiments, maximum likelihood estimation was used to estimate an LD50 value (the dose at which the probability of a mouse dying is 50%). The investigators then used results from the first few sets of experiments to predict a distribution for the following experiments, in anticipation of constructing an updating factor, as described above. In total, if 50 drugs are tested with a similar experimental design, the investigators can use these 50 LD50 values as a sample from a distribution of LD50 values.
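As a rough sketch of the maximum likelihood step for a single drug, the Python below fits a two-parameter logistic dose-response curve by Newton's method and reads off the LD50. Everything specific here is an assumption for illustration: the dose units, the death counts (chosen only to match the "10-20% at low, about 90% at high" pattern), the logistic model itself, and the function name `fit_logistic`.

```python
import math

# Hypothetical data: six equally spaced doses, ten mice per dose
doses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
deaths = [1, 2, 4, 6, 8, 9]
mice_per_dose = 10

def fit_logistic(doses, deaths, n, iterations=50, tol=1e-10):
    """Maximum likelihood fit of P(death) = 1 / (1 + exp(-(a + b * dose)))."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = 0.0            # gradient of the log-likelihood
        i_aa = i_ab = i_bb = 0.0   # Fisher information matrix entries
        for x, d in zip(doses, deaths):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = n * p * (1.0 - p)
            g_a += d - n * p
            g_b += x * (d - n * p)
            i_aa += w
            i_ab += w * x
            i_bb += w * x * x
        # Newton step: solve the 2x2 system (information) * step = gradient
        det = i_aa * i_bb - i_ab * i_ab
        step_a = (i_bb * g_a - i_ab * g_b) / det
        step_b = (i_aa * g_b - i_ab * g_a) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return a, b

a_hat, b_hat = fit_logistic(doses, deaths, mice_per_dose)
# The LD50 is the dose where a + b * dose = 0, i.e. P(death) = 0.5
ld50 = -a_hat / b_hat
print(f"estimated LD50: {ld50:.2f} (arbitrary dose units)")
```

Repeating this for each drug would give the collection of LD50 estimates that the investigators treat as draws from a distribution, which is what a prior for the next drug could be built from.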

Overall, these Bayesian inferences and the statistics are mathematically rooted in Bayes Theorem. This theorem relies on conditional probability. These conditional probabilities make the system easy to update and a noteworthy design for scientists to consider -- because writing grant proposals on frequentist assumptions can be dangerous when we try to predict a model for data without any prior knowledge of the system.  

Thursday, October 13, 2016

Considering Publishing Ethics as Research Ethics: On Recent NCATS Clinical Studies

Five years ago, Francis Collins, Director of the National Institutes of Health (NIH), was met with much criticism (mostly from big pharma execs) when he proposed a publicly funded translational medicine institute at the NIH. The National Center for Advancing Translational Sciences (NCATS) has since been up and running at the NIH, and has been under the scrutinizing eye of many in the translational research community.

A slide produced by Vtesse about VTS-270
In its short history, NCATS has had some promising breakthroughs: finding more than 50 chemical compounds that block the Ebola virus from entering cells, for instance. However, much money has been spent on VTS-270, a mixture of 2-hydroxypropyl-β-cyclodextrins, which has been shown to be a potential treatment for the deadly childhood disease Niemann-Pick Type C-1. The process of developing the compound has been sped up thanks to collaboration between NCATS and a Maryland-based company called Vtesse.

Vtesse has recently been running late-stage clinical trials of the drug by injecting the large sugars into the spinal fluid in the lumbar region of patients with the disease. Prior to the lumbar puncture, the team also tried implanting reservoirs in the brain’s ventricles, similar to those used to deliver chemotherapeutic agents to the brain. However, reservoirs in two of the three children in the study became infected.

A diagram of the reservoir system used to inject drugs into patients' brains
The mother of the twin children who had infected reservoirs is now angry that the recently published article in Current Topics in Medicinal Chemistry made no mention of the failed direct-brain administration of the compound. Although the initial research article was submitted only 9 days after the conclusion of the clinical trial in April 2013, the study was not published until 2014, almost a year later. Chris Hempel, the mother of the children, brought this information to the attention of the journal’s editor. A correction has since been published, illustrating the shift from direct brain injection to the lumbar puncture technique. The clinical trial was put on hold soon after.

The case of NCATS and translational medicine brings up an interesting perspective on intent versus perception in research, as well as the ethics of research and how they parallel publication ethics. When asked about the correction, Chris Austin, the director of NCATS said, “The theme of the particular issue of the journal in which the article was published was collaborative science, and therefore the article was focused on the process and collaborative environment contributing to the development of the drug. The information regarding the clinical trial is currently being written for submission to a research journal.” It shows that this particular group of scientists believe there is a clear correlation between intent, theme, and release of certain information. However, this correlation seems to have caught the scientists in a conundrum.

One can only assume that because the scientists felt that they were publishing an article with a specific theme, they believed it was OK to omit certain information. However, the scientists completely ignore the fact that they fall under the NIH’s Ethical Guidelines for Clinical Research – specifically respect for enrolled participants by informing them of “new information…that might change the assessment of the risks and benefits of participating.” In addition, by not immediately releasing the results of the infected reservoirs to the public, they put potential participants at risk. What if another company has a similar idea and further patients become infected because they hadn’t heard of previous studies? Making both other researchers and participants ignorant of infection risk is not justifiable for thematic harmony of one article submission.

In addition, it appears that the authors specifically violated certain publishing ethics set out by both European and American societies on publishing and by the scientific community as well. In publishing, there are “Seven Sins” of documented ethical breaches: carelessness, redundant publishing, unfair authorship, undeclared conflict of interest, human/animal subjects violation, plagiarism, and other fraud. And it’s that tricky last category, other fraud, that gets these authors. “Cooking” is defined as the selective reporting of one’s data. Let’s take a step back and think about this problem as a philosophical, even economic, one. Taxpayers made the grants and funding possible for this research to be done; under a tacit social contract and the not-so-tacit oath of the scientists, the scientists agree to serve the public and not involve themselves in “evasion.” Sure, it seems nit-picky, but responsibility, integrity, and honor play an important part in institutional trust and, macroscopically, public trust of science.

Tuesday, October 11, 2016

Does Misuse of Statistical Significance Give Us a Higher Likelihood of Immortality?

This past week, a letter in Nature was published by the Department of Genetics at Albert Einstein College of Medicine in New York. The letter’s title was pithy and shocking: “Evidence for a limit to human lifespan.” GASP! “But this can’t be true,” some may say, “with advances in medicine, we’re only continuing to lengthen the human lifespan!” First off, that’s what I think is a bit of a teleological argument: assuming that just because humans have certain intelligences about the world, our purpose is to expand life or live forever. There’s some sort of moral purpose that is implied in that answer. Anyways, that’s not the point, but it sets off an interesting point of debate.

Maybe we humans aren’t supposed to live forever. We forget that we are but specks on the lithograph of time; we’re babies! The species Homo sapiens has existed for only 200,000 years, on a 4.5-billion-year-old planet. Many things have lived before us, and many things will live after us. Perhaps this sounds a bit nihilistic, but I was happy to hear these reports from these geneticists. Humans are single-handedly taking the gauntlet to ourselves and our planet, making sure life, in any form, is going to have a hard time existing on a planet where the temperature has risen almost 2 degrees Fahrenheit since the late 1800s. So yes, I was overjoyed, ecstatic, and relieved to hear we may be shortening the execution of our planet Earth -- until I heard these geneticists used statistics to come to their results. Oh, great, another potential mishap with statistics by those not formally trained in statistics, I thought.

The methods that the study employed were actually pretty simple. Most people who took a basic stats class could probably understand everything up until the cubic smoothing splines. The authors plotted the maximum reported age at death (MRAD) for 534 people over the age of 110, or what they call “supercentenarians,” gathered from the International Database on Longevity. These supercentenarians came from Japan, France, the US and the UK. Two linear regressions were done for subsets of data between 1968-1994 and 1995-2006. The scientists found an increase in MRAD of 0.15 years per year before 1995 (with r=0.68 and p=0.0007) and a decrease in MRAD of 0.28 years per year from 1995-2006. The decreasing MRAD had r=-0.35 and p=0.27. The scientists didn’t expand on the statistical shortcomings of their decreasing MRAD points. They conducted the same procedure with other data points and found a similar overall trend (that is, a significant increase until the breakpoint and an insignificant decrease after the breakpoint) but did not discuss the weak correlation or significance.
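For readers who want to see how an r value turns into a p-value, here is a minimal Python sketch (standard library only) of the usual t-test for a Pearson correlation. The sample size is an assumption: the paper's n is not quoted above, so the example supposes one MRAD point per year for 1995-2006, i.e. n = 12. The function names are also made up for the illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def correlation_p_value(r, n, steps=10000):
    """Two-sided p-value for H0: no correlation, given Pearson r and sample size n."""
    df = n - 2
    # Test statistic: t = |r| * sqrt((n - 2) / (1 - r^2))
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule for the area under the t density between 0 and t
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    # Both tails together: 2 * (0.5 - area between 0 and t)
    return t, 2 * (0.5 - central_area)

# r is the reported value; n = 12 is an assumption (one point per year)
t_stat, p_value = correlation_p_value(r=-0.35, n=12)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

Under that assumed n the result should land roughly in line with the reported p-value, and the same function shows how quickly a correlation of this size stops being distinguishable from zero when there are only a dozen points.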

The initial criticisms of the work have been predictable, to say the least. Why didn’t the scientists discuss their dismal p-values? Moreover, some say that seeing a significant increase in MRAD then observing a decrease, even if it was insignificant, is still something. This is, however, an amateur cop out, in my opinion. There are so many more reasons to be critical of the work than just the terrible p-values, and it starts with study design.

The main figure in Dong et al. and the bane of my existence these last four days...

First, it’s really not clear why the scientists decided to use the arbitrary breakpoint between 1994 and 1995. Clarification of this would be helpful, if not crucial, as the entire crux of the paper’s argument leans on this breakpoint and its subsequent data analysis (Figure 2a). It appears that the breakpoint was chosen arbitrarily to support their initial claim that humans have reached a plateau in age advancement, and the linear regression showing a decrease in MRAD is used in a rhetorical sense, rather than in the strict statistical sense of an ad-hoc null-hypothesis significance test (NHST). Is it even right to use NHST here? I’d argue no. If their alternative hypothesis was that r=0, it’s increasingly hard to detect effect sizes close to zero: you probably need a larger sample size, and with this data, data that exists on the margins of collectability, that’s hard to do.

The authors do make an implied comment on the power of their statistical tests, noting that they probably don’t have a large enough sample size to make a case for a more robust statistical model. To alleviate this, they applied a post-hoc sample expansion to include several RADs as the highest point, including the MRAD and the second through fifth RAD; to me, this appears to be more or less a sneaky way of doing away with outliers. They concluded that the average RADs had not increased since 1968, and that “all series showed the same pattern as the MRAD.” But they did not apply an ANOVA to test their hypothesis about the average RADs, and they never reference any supplemental material. It appears that they just eyeballed it. Another suggestion would be to compare the slopes of the “plateau” regions by doing some sort of t-test.

In the end, the statistical procedures taken to prove their own point would have any data scientist’s head spinning; probably enough so that people just wouldn’t take a hard look at it for too long. This is dangerous, but we can’t say Nature hasn’t committed the crime before: many times, the journal has published funky stats simply because the title of the study was provocative (as demonstrated by our class this semester). Nonetheless, you have to tip your cap to the scientists who wanted to make a headline, because they surely did it. Then perhaps bonk them on the head with your cap and tell them to do better stats before they make such grand claims.