Tuesday, April 12, 2016

To vary or not to vary (the bin width)?

There are several ways to graph continuous data. One of the most popular graphical representations for this type of data is the frequency distribution histogram. Because continuous data can take infinitely many values within the range it covers, it cannot be graphed using discrete values or categories on the X axis. Instead, the range of the data needs to be divided into smaller, adjoining ranges called “bins”.

It is general practice to use the same width for every bin on the X axis. Why has this become the standard in the presentation of data? Nicholas J. Cox from Durham University presents some arguments for and against the standardization of bin widths on a Stata Software webpage. Cox notes that dividing a continuous variable into bins is already somewhat arbitrary, and that varying the bin widths on top of this would add unnecessary complexity in most cases. I believe that, in addition, this added complexity could be used to mislead others. Say, for example, you are plotting a frequency distribution of the change in blood pressure in a group of patients before and after a drug regimen. You measure and record every value yourself, so no data comes from an outside source. You choose the same width for every bin and plot your data, only to find that an outlier creates a bar to the right of several empty bins. The appearance of this outlier bothers you, so you combine its bin with all of the empty bins to create one wide bin. Now the outlier looks like a right skew in the data instead of the single value that it truly is.
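To make the blood pressure scenario concrete, here is a minimal Python sketch. The values in `changes` are invented purely for illustration (they are not real patient data), and the `histogram` helper is a hypothetical function written for this example. It counts the same values twice: once with equal 5 mmHg bins, and once with the outlier's bin merged with the empty bins beside it.

```python
def histogram(values, edges):
    """Count values in half-open bins [edges[i], edges[i+1]).

    The last bin also includes its right edge.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


# Hypothetical changes in blood pressure (mmHg); 38 is the lone outlier.
changes = [-12, -9, -8, -7, -6, -5, -5, -4, -3, -2, -1, 0, 1, 2, 38]

# Equal bin widths of 5 from -15 to 40: the outlier sits alone past empty bins.
equal_edges = list(range(-15, 45, 5))
equal_counts = histogram(changes, equal_edges)

# Outlier bin merged with the empty bins: one wide bin from 5 to 40.
merged_edges = [-15, -10, -5, 0, 5, 40]
merged_counts = histogram(changes, merged_edges)

print(equal_counts)   # [1, 4, 6, 3, 0, 0, 0, 0, 0, 0, 1]
print(merged_counts)  # [1, 4, 6, 3, 1]
```

With equal widths, the run of zeros makes the isolated value obvious. After merging, the zeros disappear and the final bar reads like the tail of a skewed distribution, even though it still holds a single observation.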

Cox explains, however, that we aren’t always lucky enough to generate our own original data; sometimes we receive data from another source that has already been grouped into bins of varying sizes. The bin sizes can’t be changed without knowledge of each individual value. A non-biological example is illustrated below, though a similar situation could easily be found within the biological sciences. This table, with data on travel time to work (from the 2000 U.S. census), shows bins with widths of 5 minutes from 0 to 45 minutes. The next bin is 45-60 minutes, followed by 60-90 minutes and 90-150 minutes:

A frequency distribution of this data would require bins of varying widths because we have no other information about how the data could be arranged. Alternatively, we could create a frequency density distribution, that is, a plot with frequency divided by bin width on the Y axis instead of simply frequency. This can be seen below:



This sort of graph is more informative when one must include varying bin widths, because the area of each bar, rather than its height, equals the number of individuals represented by the bar.
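The frequency density calculation can be sketched in a few lines of Python. The bin edges follow the layout described above (5-minute bins up to 45 minutes, then 45-60, 60-90, and 90-150), but the counts are hypothetical placeholders, not the census figures, and `frequency_density` is a helper name made up for this sketch.

```python
# Bin edges in minutes: 0, 5, ..., 45, then 60, 90, 150 (12 bins in total).
edges = list(range(0, 50, 5)) + [60, 90, 150]

# Hypothetical counts per bin (NOT the 2000 census values).
counts = [40, 120, 180, 160, 150, 60, 110, 30, 50, 60, 30, 10]


def frequency_density(edges, counts):
    """Return count / bin width for each bin."""
    if len(edges) != len(counts) + 1:
        raise ValueError("need exactly one more edge than counts")
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    return [c / w for c, w in zip(counts, widths)]


densities = frequency_density(edges, counts)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

# The area of each bar (density * width) recovers the original count.
areas = [d * w for d, w in zip(densities, widths)]

for lo, hi, c, d in zip(edges, edges[1:], counts, densities):
    print(f"{lo:>3}-{hi:<3} min  count={c:<4} density={d:.2f} per minute")
```

In this made-up example the 45-60 bin has the same raw count as the 40-45 bin, yet its density is a third as large because those individuals are spread over 15 minutes rather than 5. A plain frequency plot would draw the two bars at the same height.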

In short, changing the bin widths in a frequency distribution has a large impact on how the data appear to the viewer. Unless continuous data are received already grouped into bins of varying sizes, it is more straightforward to present them using a standardized bin width.



1 comment:

  1. Great points! I completely agree. Since, in theory, it is our statistics' responsibility to convey the "significance" of our data, we don't really have to have graphs. They're there to accurately represent our data so people can best conceptualize the effect size or, in the cases above, the distribution. I think that scientists are often just as guilty of manipulating their graphs to contrive effects as they are of p-hacking. It's both easy and tempting to tweak axes to exaggerate an effect and make a point. Perhaps the graphical output should be planned before an experiment, just as the experimental design is.
