Monday, May 2, 2016

Reading Boxplots Used to Deal with Skewed Data

Recently I came across some data gathered from human patient samples.  At first glance, it was difficult for me to understand how to interpret the data, and why they were presented this way rather than with, for example, a t-test or ANOVA (see the figure below).  It was also confusing to read the bold black line, the whiskers, and the dots.  I talked with the scientist who sent me the data and, combined with what we have covered this semester, got a general idea of how to read it.

When we deal with patient expression data, they are often not normally distributed but "skewed", as shown by the example below.  The mean of the data, represented by u, is at zero.  However, because the data are skewed, the mean u (dotted middle line) is not as informative as the median (the line).  Also of note, when there are a few outliers in the collected data, they cannot simply be excluded when the statistical analysis is performed, because we don't know for sure whether they are caused by measurement error or are the true responses of certain patients.  Since the outliers are kept, they can pull the mean of the sample quite far from the bulk of the data.  This makes the median preferable to the mean for human patient data.
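To make the mean-versus-median point concrete, here is a minimal sketch using Python's standard library. The expression values are made up for illustration (they are not the patient data discussed in this post): most values sit between 1 and 3, and two hypothetical patients respond far more strongly.

```python
from statistics import mean, median

# Hypothetical expression values (invented for illustration only):
# eight patients cluster near 1-3, two respond much more strongly.
values = [1.0, 1.2, 1.5, 1.8, 2.0, 2.1, 2.4, 3.0, 25.0, 40.0]

full_mean = mean(values)
full_median = median(values)

# Drop the two extreme patients to see how much they moved each summary.
trimmed = values[:-2]
trimmed_mean = mean(trimmed)
trimmed_median = median(trimmed)

print("mean with extremes:     ", full_mean)
print("median with extremes:   ", full_median)
print("mean without extremes:  ", trimmed_mean)
print("median without extremes:", trimmed_median)
```

In this toy example the mean shifts by several units when the two extreme values are included, while the median barely moves, which is the behavior that makes the median the safer summary for skewed data.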

So now we know these data are quite different from data drawn from a normally distributed population.  With that in mind, it is much easier to understand the boxplot.  As shown by the following image, the box represents the range of values between the first quartile (25th percentile) and the third quartile (75th percentile).  The line inside the box represents the median, which for skewed data is usually not in the exact middle of the box.  The whiskers extend either to the minimum and maximum, or to the most extreme values lying within 1.5 IQRs of the box, where the IQR (interquartile range) is the distance between the first and third quartiles.  In the latter case, the values beyond the whiskers are the outliers, drawn as asterisks or hollow dots.
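The pieces of a boxplot described above can be computed by hand. The following is a rough sketch using only the standard library; the function name `boxplot_summary`, the sample values, and the choice of the "inclusive" quartile method are my own assumptions. Different software packages compute quartiles slightly differently, so the exact numbers may not match a given plotting tool.

```python
from statistics import quantiles

def boxplot_summary(data, k=1.5):
    """Return the numbers a boxplot with 1.5-IQR whiskers would draw.

    `k` is the whisker multiplier (1.5 by convention). This helper is
    illustrative, not the routine used by any particular plotting library.
    """
    xs = sorted(data)
    # Quartiles: 25th, 50th (median), and 75th percentiles.
    q1, med, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers stop at the most extreme data points inside the fences.
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    # Anything beyond the fences is drawn as an individual outlier point.
    outliers = [x for x in xs if x < lower_fence or x > upper_fence]
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Hypothetical skewed sample (invented for illustration only).
sample = [1.0, 1.2, 1.5, 1.8, 2.0, 2.1, 2.4, 3.0, 25.0, 40.0]
summary = boxplot_summary(sample)
for name, value in summary.items():
    print(name, value)
```

For this sample the upper whisker stops at 3.0 and the two large values are reported as outliers, so the box and whiskers describe the bulk of the data while the extreme patients remain visible as separate points.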

Now the first figure no longer seems difficult to understand.  But how do we know whether there is a significant difference when we compare one group to the control?  A boxplot by itself is not as rigorous as a t-test or ANOVA; it only describes the data.  Statistical software can generate p-values.  A p-value below the type I error threshold we set (commonly 0.05) is considered statistically significant.  However, even if the p-value is larger than that threshold, we cannot conclude that there is no difference; we have only failed to detect one.

There is more to boxplots than this.  For example, when the variation in the data is too large or the assumptions of parametric tests do not hold for the original dataset (the first figure is an example of this), scientists can use non-parametric tests such as the Wilcoxon rank-sum or signed-rank tests.  There is always more to learn in statistical methods, and that is why good statistics needs a scientist's intuition.
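As an illustration of what such a non-parametric test does, here is a simplified sketch of a two-sided Wilcoxon rank-sum test (also known as the Mann-Whitney U test) for two independent groups, written with the standard library only. It uses the normal approximation without tie or continuity corrections, so it is only a rough guide for small samples; the function names and the data are hypothetical, and in practice one would rely on a statistics package.

```python
from math import sqrt
from statistics import NormalDist

def midranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(a, b):
    """Return (U statistic for group a, two-sided p-value).

    Normal approximation, with no tie or continuity correction (a
    simplifying assumption of this sketch). Both groups must be non-empty.
    """
    n1, n2 = len(a), len(b)
    ranks = midranks(list(a) + list(b))
    r1 = sum(ranks[:n1])                 # rank sum of group a
    u1 = r1 - n1 * (n1 + 1) / 2          # Mann-Whitney U for group a
    mu = n1 * n2 / 2                     # expected U if groups do not differ
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u1, p

# Hypothetical expression values (invented for illustration only).
control = [1.1, 1.4, 1.6, 1.9, 2.2, 2.5, 2.7, 3.0]
treated = [2.8, 3.4, 3.9, 4.5, 5.1, 6.0, 18.0, 35.0]
u, p = rank_sum_test(control, treated)
print("U =", u, "p =", p)
```

Because the test works on ranks rather than raw values, the two very large values in the hypothetical treated group count only as "the two largest" and cannot dominate the result the way they would dominate a mean.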


1 comment:

  1. Box plots always used to confuse me as well. Thank you for the detailed breakdown. It's clear to me now why one might want to use a boxplot instead of the mean when the data are very skewed. The inclusion of the median and the visual representation of the concentration of data are helpful with boxplots. One thing: I'm a little confused now by the substantial number of outliers shown above the box plots in the first graph. While I have no information about the graph, and thus don't know the sample sizes, it seems to me that a large number of outliers are concentrated in regions far above the other sets of data. This got me thinking about a few questions. How many outliers are appropriate to have on a graph? At what point does a group of outliers constitute its own separate, valid cluster of responses? Is there a limit to the number of outliers one can include in data representation, or can the number be (theoretically) infinite?