Tuesday, April 12, 2016

Considering Outliers Across Tests of the Same Parameter (i.e. Mouse Lab Problems)

One of the recurring themes of this course is that we should never throw away or disregard data, so long as it was obtained under sound circumstances. (If your definition of data for RT-PCR is making up numbers for Ct values, those should probably go out the window.) We fought hard to get that data, and it tells an important story. While there is always a degree of expected variance in the data, sometimes it goes above and beyond, and we have perceived statistical outliers.


While my project focuses on the use of human and mouse cell culture, my lab as a whole utilizes mice as a model system to study voltage-gated sodium channels in various forms of epilepsy. We typically put transgenic mice through a pipeline of sorts, starting with behavioral paradigms and culminating in seizure paradigms. By using seizure paradigms like 6 Hz (i.e. electroshocking mice to induce a seizure) or flurothyl (chemically inducing a seizure), we can easily determine whether a mouse is seizure resistant or susceptible. The tricky bit is at the beginning, when we do behavioral paradigms to see if the mice are hyperactive, anxious, or have poor learning and memory compared to a control animal (maybe one that does not carry a sodium channel mutation). Mice are inherently variable in terms of behavior; some are more active than others, some are more lethargic. This inherent variability makes it really difficult to determine any differences in behavior, because in a single population you have one mouse that is bouncing off the walls whereas others just stay in a corner of a box the entire time. We often put mice through multiple paradigms testing the same trait (let's say anxiety, in this case) to make sure any findings are consistent, and then run the data through Grubbs' test in GraphPad to pinpoint any outliers.

That being said, what would be the best thing to do if we find an outlier mouse for one anxiety paradigm, but not for the others? Let's say, in each of the following paradigms, Mouse A is:

Paradigm 1: Not an outlier
Paradigm 2: Outlier
Paradigm 3: Not an outlier
Paradigm 4: Not an outlier

There are two options that I can see. The first is to denote Mouse A as an outlier in Paradigm 2 alone, since that's what Grubbs' test told us. We would not remove it from our published data sets or from statistical analyses, but we would label it as an outlier on graphs. The second is to mark Mouse A as an outlier in each of the four paradigms, since they all measure anxiety, and it was anxious in one of them. Is that the right thing to do? Is it even right to keep the mouse in these data sets at all if Grubbs' test labeled it as an outlier (well, yes, but it's still debated)? One could argue that since these paradigms test anxiety in different settings, the results from one are self-contained, meaning that Mouse A may just not respond well to a particular test. The other camp also has a valid argument: it's fishy if Mouse A is "abnormal" in Paradigm 2 but we still consider it as within the sample population for the other paradigms.
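To make the first option concrete, here is a minimal sketch of running Grubbs' test separately on each paradigm. Everything in it is an assumption for illustration: the scores are invented, Mouse A is arbitrarily placed first in each list, and the critical values are the commonly tabulated two-sided values at alpha = 0.05, rounded to two decimals. This is not GraphPad's implementation. Grubbs' test also assumes roughly normal data and checks only the single most extreme value.

```python
from statistics import mean, stdev

# Two-sided Grubbs critical values at alpha = 0.05, keyed by sample size n.
# Rounded tabulated values; only n = 3..10 are included in this sketch.
GRUBBS_CRIT_05 = {3: 1.15, 4: 1.48, 5: 1.71, 6: 1.89,
                  7: 2.02, 8: 2.13, 9: 2.21, 10: 2.29}

def grubbs_outlier(values):
    """Return (index of flagged value or None, G statistic).

    G is the largest absolute deviation from the mean divided by the
    sample standard deviation. Only the most extreme value is tested.
    """
    n = len(values)
    m, s = mean(values), stdev(values)
    idx = max(range(n), key=lambda i: abs(values[i] - m))
    g = abs(values[idx] - m) / s
    return (idx if g > GRUBBS_CRIT_05[n] else None), g

# Hypothetical anxiety scores for 8 mice; Mouse A is index 0 in every list.
MOUSE_A = 0
scores = {
    "Paradigm 1": [30, 25, 35, 28, 32, 27, 33, 30],
    "Paradigm 2": [95, 40, 42, 38, 41, 39, 43, 37],
    "Paradigm 3": [12, 10, 14, 11, 13, 9, 15, 12],
    "Paradigm 4": [55, 50, 60, 52, 58, 54, 56, 55],
}

flags = {}
for name, values in scores.items():
    idx, g = grubbs_outlier(values)
    flags[name] = (idx == MOUSE_A)
    print(f"{name}: G = {g:.2f}, Mouse A flagged: {flags[name]}")
```

With these made-up numbers, Mouse A is flagged only in Paradigm 2, which reproduces the scenario above. The test treats each paradigm as its own sample, so nothing in the statistics carries a flag from one paradigm over to the others; that step would be a judgment call by the experimenter.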

Perhaps it's best to just follow the literature and precedent from other mouse papers, but just because something is precedent does not mean it is necessarily correct.


  1. I think this question raises a consideration we all must take time to settle prior to experiments. For example, how much information can we take from preliminary data that may affect our design? This information may not be directly tied to the outcome of experiments, but it is crucial for an experimenter deciding on methods, numbers, and evaluation. Again, as you mentioned, setting a standard based on previous work may be helpful when considering the standards for publication, but are those the same criteria that can be used for determining study design? In this case I wonder if running each paradigm in replicate per mouse would be helpful in determining anxiety. Also, is there a known standard for non-mutant/control mice that have also exhibited signs of anxiety? Is the question more about whether the anxiety of the mouse falls within the normal distribution that is seen in the population? Is it OK for a mouse to show anxiety in 1 of 4 tests, or do we set a rigorous level of no anxiety (is that self-selecting for a population of animals that are stoic and unaffected)? For this reason I think anxiety measurements may require more testing replicates to see if the one anxious result is truly an outlier in that animal/population or if the animal repeatedly shows anxiety. Although this may require more rigor in preliminary testing, the idea is to help eliminate any question further down the experimental design about whether your conclusions reflect chance or true differences.
    Additional testing and population statistics can show whether these control vs. mutant mice have baseline differences in the population that can be adjusted for, or whether the mice can be treated to the same level.
    The question becomes even more complicated when one asks what level of anxiety-inducing stress each researcher puts their animal models through. Can the stress test itself be a variable that may give a marked difference between animals of different testing groups? It has been shown previously that animals in laboratory settings reacted more negatively to male researchers than to female researchers. The lines continue to blur in regards to how to properly evaluate a behavior that can be so variable.

  2. I haven't done behavior testing myself, but my lab has a great deal of experience and expertise in it, and I’ve collaborated closely on the data analysis. Behavioral data is highly variable; that’s just the nature of it. To account for that variability, you have to have a large sample size. It’s not uncommon to see published behavior studies using 50+ mice. You have to have that kind of n in order to get a sufficiently robust data set that will allow the identification of patterns and significant effects. If you have a well-powered behavior study, then you can usually have a few overly anxious mice in one test, not exclude their data, and they won’t throw off the conclusions. I think this is the best solution because it preserves the natural variability of the system but still allows for the detection of phenotypic effects.
    The outlier problem is definitely real, though. Especially when you can clearly see a pattern and one mouse deviates from it dramatically, it’s hard to know what to do with the data and what most accurately reflects the reality of the system. Personally, I rarely exclude outliers on the basis of a statistical test. If there’s not a valid biological reason to discount an animal’s data, I leave it in. That being said, you don’t want to include data that aren’t really accurate for what you’re trying to measure. This isn’t an extensive list, but here are some things I normally consider:
    1) Are the values that I’m getting impossible or extremely unrealistic biologically, indicating a technical error? If you’re measuring LPS in blood and get a value that’s incredibly high in a perfectly healthy mouse, then it’s quite possible there was a technical issue.
    2) Is there some reason to think that this animal might react differently than all others in its group? An animal that is significantly smaller than the others might fare worse than a normal mouse in an infection model, for instance.
    3) Do all data for a particular subject deviate from the average? If I see a potential outlier, I check all other outcome measures for that sample. If they all deviate substantially from the norm in the same direction, maybe there’s a problem with that sample. I’ve run into this with normalizations, for instance. If you get an inaccurate measure of total protein and then normalize levels of particular proteins to that, they’re all going to be off in the same way.
    4) Are there several “outliers” in a group? If so, perhaps there are a few distinct responses to a stimulus, and it could be that the variable of interest shifts the frequency of mice from one population to another. For instance, in a recent experiment I did, most mice showed neuron loss in response to a neurotoxin, but several were resistant, and I did check to see if there were more non-responders in one experimental group vs. another.
    In the end, if an experiment is sufficiently powered, and you’re measuring a physiologically relevant effect, hopefully your differences will be robust enough to support a few divergent values. If you have to put data through an outlier exclusion process every time you run an experiment, I think that might be a red flag that some aspects of experimental design could be reconsidered.
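Check (3) above, looking at whether every outcome measure for one sample deviates in the same direction, could be sketched roughly as follows. The sample IDs, protein names, values, and the 1.5 SD threshold are all hypothetical choices for illustration, not a standard from the literature.

```python
from statistics import mean, stdev

def z_scores_for_subject(measures, subject):
    """z-score of one subject on each measure, relative to the whole group.

    measures: {measure_name: {subject_id: value}}
    """
    out = {}
    for name, by_subject in measures.items():
        vals = list(by_subject.values())
        out[name] = (by_subject[subject] - mean(vals)) / stdev(vals)
    return out

def deviates_consistently(zs, threshold=1.5):
    """True if every z-score has the same sign and is at least `threshold` SDs out."""
    same_direction = len({z > 0 for z in zs.values()}) == 1
    return same_direction and all(abs(z) >= threshold for z in zs.values())

# Hypothetical normalized protein levels for six samples. Sample "s3" is high
# on every measure, as would happen if its total protein was under-measured.
measures = {
    "protein_A": {"s1": 1.0, "s2": 1.1, "s3": 2.0, "s4": 0.9, "s5": 1.0, "s6": 1.0},
    "protein_B": {"s1": 2.0, "s2": 2.2, "s3": 4.0, "s4": 1.8, "s5": 2.0, "s6": 2.0},
    "protein_C": {"s1": 0.5, "s2": 0.55, "s3": 1.0, "s4": 0.45, "s5": 0.5, "s6": 0.5},
}

for sample in ("s1", "s3"):
    zs = z_scores_for_subject(measures, sample)
    print(sample, {k: round(z, 2) for k, z in zs.items()},
          "consistent deviation:", deviates_consistently(zs))
```

A sample that is off in the same direction on everything points toward a shared technical cause such as the normalization, whereas a sample that is extreme on only one measure is more likely to reflect real biological variability.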