Tuesday, April 5, 2016

Simpson's Paradox

To many, statistics is a non-intuitive mathematical jungle, fought through only by hours of rigorous study and contemplation. Even to those who are well versed, some aspects of statistics seem to defy logic. One phenomenon that illustrates this well is Simpson's paradox, which occurs when a trend observed in each of several individual groups disappears or reverses when the groups are combined. This paradox has been observed in data collected from baseball, clinical research, and sociology studies.

For example, let us consider a group of five individuals: Josh, Michael, Jessica, Faith, and Marcus. After watching the Nathan's Famous hot dog eating competition, each of them decided that their new passion in life was to become a competitive eater. Over a ten-year period, each individual competed in five competitions and enlisted the help of a statistician to analyze their progress. As you can see in the graphs below, through rigorous training each of them steadily increased the number of hot dogs they could consume in a ten-minute period over the years they were followed.
From these data, one might conclude that, in general, individuals tend to increase the number of hot dogs they can consume in a ten-minute period as they age. This seems logical, considering there is a strong upward trend in each of the graphs above. When all of the data are grouped together, however, the opposite trend appears. As shown in the graph below, the grouped data from the five individuals clearly show that the overall trend between hot dogs eaten in ten minutes and age is negative.
 
This is Simpson’s paradox in action. The trend observed in each of the individual data sets reversed when these data sets were combined.
Personally, I believe this example, like many data sets in which Simpson's paradox is observed, should raise some questions. Do these data belong in a grouped analysis? Are there confounding variables that can explain the trend in the grouped data relative to the individual data? Were the proper experimental and statistical protocols followed?
In this example, one could argue that the data do not belong in a grouped analysis since each individual began competitive eating at a different age. If the data were plotted with the x-axis as years after beginning competitive eating, the overall trend matches the individual data as seen below.
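To make the reversal concrete, here is a minimal simulation of this scenario (all numbers and the improvement model are made up for illustration, not taken from real competition data). Each simulated eater improves with every year of practice, yet the pooled trend against age comes out negative, and re-plotting against years after beginning competitive eating restores the positive trend:

```python
# A minimal sketch of the hot-dog example (all numbers are hypothetical).
import numpy as np

rng = np.random.default_rng(0)

start_ages = [20, 30, 40, 50, 60]      # each eater begins competing at a different age
years = np.linspace(0, 10, 5)          # five competitions spread over ten years

ages, experience, hotdogs = [], [], []
per_eater_slopes = []
for start in start_ages:
    age = start + years
    # Hypothetical model: later starters begin from a lower baseline,
    # but everyone improves steadily with practice.
    count = 60 - 1.0 * start + 2.0 * years + rng.normal(0, 1, years.size)
    per_eater_slopes.append(np.polyfit(age, count, 1)[0])
    ages.extend(age); experience.extend(years); hotdogs.extend(count)

ages, experience, hotdogs = map(np.array, (ages, experience, hotdogs))

print("slope vs age, per eater:      ", np.round(per_eater_slopes, 2))                    # all positive
print("slope vs age, pooled:         ", round(np.polyfit(ages, hotdogs, 1)[0], 2))        # negative
print("slope vs years of experience: ", round(np.polyfit(experience, hotdogs, 1)[0], 2))  # positive again
```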
Overall, if the data you are analyzing display Simpson's paradox, it may be wise to step back and examine how the data were collected and analyzed. The data may simply display a paradoxical trend, but it is quite possible that the trend arose from flawed collection or analysis.

Variables in the Publix Marathon



In class this semester, we learned about two different types of variables: variables may be continuous or counted. Counted variables are, perhaps obviously, things we can count; they take only whole-number values. Continuous variables, by contrast, can take any value within a range and are limited only by how precisely we can measure them. Let's look at these two types of variables in a real-life example: the Publix marathon, which took place in Atlanta a few weeks ago. There were 1,378 participants, 909 of whom were male. The number of people is a counted variable. What continuous variables are there in the Publix marathon? Time is a continuous variable.
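As a small illustration, here is how the two kinds of variables might look in code; the participant counts come from the race described above, while the finish times are hypothetical placeholders:

```python
# Counted vs. continuous variables, using the marathon example.
# The participant counts are from the post; the finish times below are hypothetical.

total_participants = 1378   # counted variable: whole numbers only
male_participants = 909     # you cannot have 909.5 runners

# Continuous variables can take any value in a range, limited only by how
# precisely we measure them (hypothetical finish times, in seconds, roughly
# consistent with a 6:46-per-mile pace):
jacob_finish = 2 * 3600 + 57 * 60 + 24.0   # 2:57:24
jeffery_finish = jacob_finish + 1.0        # one second later

gap = jeffery_finish - jacob_finish
print(f"Gap between the brothers: {gap:.3f} s")  # could just as easily be 0.4 s or 2.5 ms
```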

You might think time isn't a continuous variable because you can count it. You can tell me what time it is: "It's 12:30." Looking at the difference between the second- and third-place overall finishers (which can be found here), minutes are specific enough to separate them: 13 minutes separate these two runners. Sometimes, though, time needs to be more specific than hours and minutes. The 9th- and 10th-place finishers overall, Jacob and Jeffery Law, are separated by 1 second. These two, who appear to be brothers, ran the whole race together at the same pace of 6:46 per mile, but in the end one of them had to finish before the other. Since time is continuous, Jacob, the older brother, gets to brag about beating his brother in a marathon.

Still, we all learned in kindergarten that a second is "one Mississippi." To find things measured in less than a second, we can turn away from everyday examples and back toward science. In my lab, we use a rapid flow quench machine to determine enzyme kinetics.

This machine allows for a minimal reaction time of 2.5 milliseconds! An example of the data generated from this machine can be seen below, where one student in our lab collected five data points within 0.5 seconds.


Whether you are trying to win a marathon or determine the kinetics of your favorite enzyme, the fact that time is a continuous variable is good for you.

Statistical Tests: Don't forget to use your brain

Statistics do not make you smart. Common sense is not to be scoffed at. Canonical designs are often outdated and bad. Statistical tests are not sufficient to determine if your data is acceptable.

Our world changes, our science changes, and we get smarter. Why do we allow ourselves to be crippled by the ignorance of the past? I am a structural biologist; I have worked with X-ray crystallography for several years. Crystallography relies heavily on a set of equations and statistical parameters that determine whether or not our data are "good." We learn these rules as absolutes when we get started, but every year I learn again and again why these rules are idiotic.

The first rule: when do we throw out data? Anything with a signal:noise ratio below two, of course! Except... Our data collection method involves shooting X-rays at an ordered crystal lattice of our protein and observing the scattering of those X-rays as they interact with the electron density clouds around the protein. We use the repeated nature of the crystal lattice, in combination with the scattering pattern, to work backwards and build the electron density. The more reflections we measure, the more we know. Some reflections are weak; some are rare and not often repeated. Throwing out data whose signal:noise falls below two is still throwing out signal. Why would we throw out our signal?
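To see how much real signal such a cutoff can discard, here is a made-up simulation; the exponential intensity distribution and the constant measurement error are simplifying assumptions, not real diffraction data:

```python
# A numerical sketch: every simulated reflection carries real signal,
# yet a hard signal:noise > 2 filter still discards a sizable share of it.
import numpy as np

rng = np.random.default_rng(1)

n = 100_000
true_I = rng.exponential(scale=5.0, size=n)      # many reflections are intrinsically weak
sigma = 2.0                                      # assume a constant measurement error
measured_I = true_I + rng.normal(0, sigma, n)    # noisy observed intensities

keep = measured_I / sigma > 2.0                  # the "rule": discard I/sigma <= 2

print(f"reflections discarded: {100 * (~keep).mean():.1f}%")
print(f"true signal discarded: {100 * true_I[~keep].sum() / true_I.sum():.1f}%")
```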

We have more rules for when to throw out data. There is a test that measures variance within the data set. As you add more data, the variance grows. As our methods improve and we can collect more data, our statistics actually get worse. We get punished for having a stronger crystal that can handle more exposure. We get punished for having a better detector that picks up more signal.
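The post does not name the statistic, but R_merge is a common example of a merging statistic that behaves this way: it creeps upward as each reflection is measured more times, even though the merged intensity you actually use keeps getting more accurate. A toy simulation under that assumption:

```python
# Toy simulation: R_merge rises with multiplicity while the merged
# intensity estimate becomes more precise (all values are made up).
import numpy as np

rng = np.random.default_rng(2)

n_reflections, true_I, sigma = 20_000, 100.0, 10.0

for multiplicity in (2, 4, 8, 16):
    # each reflection measured `multiplicity` times with Gaussian noise
    obs = true_I + rng.normal(0, sigma, size=(n_reflections, multiplicity))
    mean = obs.mean(axis=1, keepdims=True)

    r_merge = np.abs(obs - mean).sum() / obs.sum()
    err_of_merged = (obs.mean(axis=1) - true_I).std()  # how far the merged value is from the truth

    print(f"multiplicity {multiplicity:2d}: R_merge = {r_merge:.4f}, "
          f"error of merged intensity = {err_of_merged:.2f}")
```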

On top of all of this, we have a series of modelling steps that check us as we model. Are we biasing our system? Does the original data still fit? Except this method of checking is itself biased.

Why do we still use these tests? It is because they are written in all the books, they are hammered into us constantly, and we simply do not use our brains. These tests are presented as our sacred way of doing things - but sacred ways tend to be outdated, inappropriate, and written for a different time.

I urge you to use your brain rather than blindly trusting the statistics. I would bet they are done wrong the majority of the time, and there is no reason to allow yourself to be idiotic and trust in them blindly. Papers use multiple t-tests to compare several groups when an ANOVA is called for. We can run a variety of outlier tests on data that look "wrong" until one of them finally declares an outlier, but were we using the right test? Statistical tests are a tool, not a rule. They are not the science, and they do not determine everything.
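On the multiple-t-tests point, a quick simulation makes the problem concrete: draw several groups from the same population, and a battery of pairwise t-tests flags a "significant" difference far more often than the nominal 5%, while a single one-way ANOVA stays close to 5%. The group sizes and number of simulations below are arbitrary choices for illustration:

```python
# Simulated comparison of pairwise t-tests vs. one-way ANOVA when the
# null hypothesis is true (all groups come from the same population).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sim, n_groups, n_per_group = 2000, 5, 10

false_pos_ttests = false_pos_anova = 0
for _ in range(n_sim):
    groups = [rng.normal(0, 1, n_per_group) for _ in range(n_groups)]

    # any pairwise t-test below 0.05 counts as a (false) discovery
    pairwise_p = [stats.ttest_ind(groups[i], groups[j]).pvalue
                  for i in range(n_groups) for j in range(i + 1, n_groups)]
    false_pos_ttests += min(pairwise_p) < 0.05

    false_pos_anova += stats.f_oneway(*groups).pvalue < 0.05

print(f"false-positive rate, pairwise t-tests: {false_pos_ttests / n_sim:.2%}")
print(f"false-positive rate, one-way ANOVA:    {false_pos_anova / n_sim:.2%}")
```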

Monday, April 4, 2016

What's in a correlation?

What is correlation? Or perhaps a better question is, what is a good correlation? The answer isn’t very straightforward. I made up the following data to see if there was a correlation between a person’s midi-chlorian count and the number of soft drinks they consumed during a year.



The correlation statistics are as follows:
p = 0.0002
r = 0.3483
r² = 0.1213
n = 111

The p-value is very small, so you might conclude that there is a strong relationship between midi-chlorian count and soft drink consumption: if there really were no relationship between the two, the chance of obtaining a correlation this strong would be very small. However, the r² value is also very small, suggesting that only about 12% of the variation in midi-chlorian count is explained by variation in yearly soft drink intake.


So, which measure do you look at to judge the correlation? The p-value is really small, which suggests that the correlation is unlikely to have occurred by coincidence. However, it's important to remember that the p-value is highly dependent on sample size. This study sampled 111 individuals, and with a sample size that large, very small effect sizes can become statistically significant. The effect size here is small, since r² = 0.1213, so we are left with the question: is an effect size of roughly 12% scientifically important? This is a difficult question to answer, and it's probably best left to the judgment of the scientist or the reader. I think this problem raises an interesting question about the strength of the correlations reported in the media. The news is full of correlation data between various categories, but how are the strengths of these correlations being judged? Do journalists and scientists look at low p-values and decide that a correlation is strong, or do they look at a high r² value and effect size? An alternative is for journalists to publish the actual data and let readers decide whether the correlation is strong enough to warrant action or consideration.
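For the curious, the p-value reported above follows directly from r and n via the usual t-statistic for a Pearson correlation, which is also the easiest way to see how strongly p depends on sample size; the n = 20 case below is hypothetical, added only for comparison:

```python
# Recovering the reported p-value from r and n, and showing how the
# same correlation stops being "significant" at a smaller sample size.
import numpy as np
from scipy import stats

def correlation_p(r, n):
    """Two-sided p-value for a Pearson correlation r with n samples."""
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

r = 0.3483
print(f"n = 111: p = {correlation_p(r, 111):.4f}")  # ~0.0002, as reported
print(f"n =  20: p = {correlation_p(r, 20):.4f}")   # same r, no longer below 0.05
print(f"r^2 = {r**2:.4f}")                          # the effect size does not change with n
```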