Monday, April 4, 2016

The cause of feelings of hopelessness and failure in graduate school: P-Values and Statistical Significance

Scientists, especially graduate students, have become too focused on, and driven by, whether results are statistically significant. We play statistical significance up to be all-important in science; most of our experiments and projects focus on finding some difference. If we don’t get results that are “statistically significant,” we feel like failures and assume something went wrong. Maybe I am generalizing too much from my own experience in graduate school, but bear with me. Graduate school is notoriously viewed (well, at least by me) as “soul-sucking.” I believe that many of these feelings of hopelessness and failure originate from the moment you press “Analyze” on Prism and see “ns.” Imagine how different graduate school would be if that feeling of failure were eliminated…how would things be if we took every negative result and no longer viewed it as a dead end or a reflection of our abilities as scientists? What if, when we saw “ns,” we could feel joy and not distress? I feel like our success in graduate school is defined by statistical significance; without a p<0.05, our hard work means nothing. When was the last time that any of us went to a thesis defense that focused on non-significant results? Why has a statistically significant result become necessary to earn our doctorate? Would our education be at a disadvantage if we were not required to present statistically significant data?
In a way, statistical significance helps to remove bias by allowing results to be quantified and compared when looking for a difference. Statistics and calculating a P value are what allow western blots to be informative and unbiased. Without a P value, there would likely be disagreement over what “looks” like a difference between two groups. Science needs statistical significance. However, statistical significance has also created bias in the way that we approach problems. The need for statistical significance prevents us from exploring concepts and hypotheses that may turn out to be of no significance. The need for statistical significance may also lead a researcher (without proper statistical training) to keep increasing the n of an experiment to the point where a p value of <0.05 is inevitable. It has become unacceptable to simply report no significance; we force our P value to mean something, even if it’s just “trending” towards significance. Statistical significance and P values eliminate bias as well as create it.
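To make that last point concrete, here is a minimal simulation sketch (my own toy example, not from the post) using NumPy and SciPy. The difference between the two groups is held at a scientifically trivial size, yet simply increasing n eventually produces p < 0.05.

```python
# Toy simulation (assumption: a tiny, fixed true difference of 0.05 SD):
# with a large enough n, even a trivial difference becomes "significant."
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_difference = 0.05  # 5% of one standard deviation, scientifically trivial

for n in (20, 200, 2000, 20000):
    group_a = rng.normal(loc=0.0, scale=1.0, size=n)
    group_b = rng.normal(loc=true_difference, scale=1.0, size=n)
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"n per group = {n:6d}   p = {p_value:.4f}")
```

With small n the comparison is "ns"; with tens of thousands of observations per group, the same trivial effect reliably crosses the p < 0.05 line, which is exactly why a "significant" P value by itself says nothing about scientific importance.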

I feel that people don’t actually think about what “statistically significant” means; all a P value can tell us is the probability of seeing a result at least as large as the one we observed if the null hypothesis were true. It cannot tell us how likely it is that the alternative hypothesis is true. Thus, we need to stop defining the importance of our work by the P value. Motulsky points out that colloquial language may contribute to this problem of focusing on statistical significance and P values. We associate the term significance with importance, which is incorrect when interpreting statistics. In order to interpret statistics, one must understand the theory and the definitions of the terms used. Then, and only then, can we understand that statistics does not judge the importance of our experimental results; it only allows us to reject, or fail to reject, the null hypothesis.  We can no longer define our work and goals by “statistical significance”; instead we should be seeking scientific importance.
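As a rough illustration of what a P value does and does not tell us, here is a small simulation of my own (arbitrary numbers, not from the book): when two groups are drawn from exactly the same distribution, about 5% of comparisons still come out “significant” at p < 0.05, and none of those P values says anything about the alternative hypothesis being true.

```python
# Sketch: when the null hypothesis is true, ~5% of experiments still give
# p < 0.05. The P value describes the data under the null; it is not the
# probability that the alternative hypothesis is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_per_group = 10_000, 30

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(size=n_per_group)  # both groups drawn from the
    b = rng.normal(size=n_per_group)  # same distribution: null is true
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"'Significant' results under a true null: "
      f"{false_positives / n_experiments:.1%}")
```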

Coincidences and Bias

In the "Introducing Statistics" section of Intuitive Biostatistics, the author explains how probability and statistical thinking are not, in fact, intuitive.  As humans, our brains are hardwired to look for patterns, even in data that was actually randomly generated.  As an example of this, the author points out that coincidences are actually much more common than we realize because "it is almost certain that some seemingly astonishing set of unspecified events will happen often, since we notice so many things each day" (p. 5). 

I have often noticed that soon after I learn a new word, I will repeatedly hear that word used over the next several days, even though I could never remember hearing it before in my life.  This happened to me a lot when we learned SAT vocab in high school English; I remember repeatedly hearing and reading the word "gregarious" outside of class after first learning what it meant.  Apparently this is a common enough phenomenon that it has a name: the "Frequency Illusion" (or the "Baader-Meinhof Phenomenon").  Basically, learning a new word primes the brain to pay more attention when that word is heard again.  I probably heard "gregarious" no more often during my sophomore year of high school than in the previous fifteen years of my life, but because my brain was more aware of this word right after I learned it, it seemed like I was hearing it all the time.

This example illustrates how pattern-seeking and unconscious biases can influence our perceptions, even in the most mundane of situations.  Because scientists are also human, our interpretation of our findings can be affected by these same cognitive biases.  Statistics and rigorous experimental design are imperative to keep our biases from clouding our scientific judgement and causing us to see patterns where none actually exist.

Sunday, April 3, 2016

Bad Stats

Here's my contribution to the bad stats genre. 

Yes, bad stats can include bad data visualizations.


Chicken or Egg?



Every discipline seems to have its own version of the chicken or egg debate. For statistics, the debate could be: Which came first, the model or the data?

The answer would initially appear to be quite obvious. The data must come first, of course. As Harvey Motulsky states in Intuitive Biostatistics, “Regression does not fit data to a model…rather, the model is fit to the data.” In other words, the data are used to calculate the parameters of the model. That basic premise is quite clear.

Where it gets tricky, however, is in the selection of which type of model to use. There are linear and logarithmic models, dose-response curves, binding curves, higher-order polynomials, and so on. To navigate these many options, Motulsky again has sage advice: “Choosing a model should be a scientific decision based on chemistry (or physiology, or genetics, etc.).” So if your data are measurements of radioactive decay, you know that you should use an exponential decay model. Or if your data are measurements of the effect of a drug, you know that you should use a logarithmic dose-response model. Even if the R² value for a different model were higher, it would be inappropriate to try to fit your data to a model that you know does not make sense biologically, chemically, etc.
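As a concrete (and hedged) illustration of this idea, here is a short Python sketch with simulated decay measurements; the exponential form is assumed up front for scientific reasons, and the data are used only to estimate its parameters. The numbers and function names are mine, not from the book.

```python
# Sketch: fit a scientifically chosen model (one-phase exponential decay)
# to simulated "radioactive decay" measurements. The model form is fixed
# a priori; the data determine only the parameter values.
import numpy as np
from scipy.optimize import curve_fit

def exponential_decay(t, span, k, plateau):
    """One-phase exponential decay: Y = span * exp(-k * t) + plateau."""
    return span * np.exp(-k * t) + plateau

rng = np.random.default_rng(2)
t = np.linspace(0, 10, 25)
y = 100 * np.exp(-0.5 * t) + 5 + rng.normal(scale=2, size=t.size)  # fake data

params, covariance = curve_fit(exponential_decay, t, y, p0=[100, 0.5, 5])
span, k, plateau = params
print(f"span = {span:.1f}, rate constant k = {k:.2f}, plateau = {plateau:.1f}")
```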

But that is where the chicken or the egg question comes into play. How does the initial model for a particular biological or chemical system get established? Surely someone, at some point, had to try several models and find the one that fit best for that type of data. If no one has ever established a model for a particular system, you cannot follow the statistical best practice of deciding a priori that you will fit a particular type of model to your data (as in (A) in the figure below). Instead you have to try several types of models. Rather than using the data to determine the best-fit values for the parameters of a model, you are now using the data to determine which parameters even need to be calculated in the first place. Certainly this process will still be driven by the data, and rigorous statistical tests can be applied to determine which model fits the data better. Nevertheless, as the figure below demonstrates, this situation (B) fundamentally alters the workflow needed to test your hypothesis.  Once the appropriate model for a system is established, the workflow can return to that outlined in (A), but the initial need to establish a model flips the data-model relationship somewhat.
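Here is one way situation (B) might look in practice. This is a sketch of my own (Motulsky discusses model comparison more formally, e.g. with AICc or an F test); it uses a simple AIC computed from the residual sum of squares to compare two candidate models fit to the same simulated data.

```python
# Sketch: no established model exists, so fit two candidate models to the
# same data and compare them with an information criterion (lower AIC wins).
import numpy as np
from scipy.optimize import curve_fit

def straight_line(t, slope, intercept):
    return slope * t + intercept

def exponential_decay(t, span, k, plateau):
    return span * np.exp(-k * t) + plateau

def aic_from_fit(model, params, t, y):
    """AIC computed from the residual sum of squares of a least-squares fit."""
    residuals = y - model(t, *params)
    ss = np.sum(residuals ** 2)
    n, n_params = y.size, len(params)
    return n * np.log(ss / n) + 2 * n_params

rng = np.random.default_rng(3)
t = np.linspace(0, 10, 25)
y = 100 * np.exp(-0.5 * t) + 5 + rng.normal(scale=2, size=t.size)  # fake data

line_params, _ = curve_fit(straight_line, t, y)
decay_params, _ = curve_fit(exponential_decay, t, y, p0=[100, 0.5, 5])

print("AIC, straight line:     ", round(aic_from_fit(straight_line, line_params, t, y), 1))
print("AIC, exponential decay: ", round(aic_from_fit(exponential_decay, decay_params, t, y), 1))
```

Once this kind of comparison settles on a model for the system, later experiments can return to workflow (A) and simply fit that model to new data.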




So although the data do always inform the model (and a model is always fit to the data, not vice versa!), there are situations where the data come first and situations where the model comes first. As with most chicken or egg debates, perhaps this one does not have a definitive answer either.  

Friday, April 1, 2016

Big Data and Big Bias

Contemplating big data?

Here's a good thread worth thinking through. h/t @ereinerts

My advice in terms of how to begin thinking about this: Complex data coupled with sophisticated analytics isn't some magic woo-woo soap that mysteriously washes away our biases.

Click here for a relevant kerfuffle (warning, website slow to load).