There are several ways to graph continuous data. One of the most popular
graphical representations for this type of data is the frequency distribution
histogram. Because continuous data can take infinitely many values within its
range, it cannot be graphed using discrete values or categories on the X axis.
Instead, the range of the data needs to be divided into smaller, adjacent
ranges called “bins”.

It is general practice to use the same width for every bin on the X axis. Why
has this become
the standard in the presentation of data? Nicholas J. Cox from Durham
University presents some arguments for and against the standardization of bin
widths on a Stata Software webpage. Cox notes that the division of continuous
variables into bins already has some arbitrariness, and adding varying bin
widths to this would be unnecessarily complex in most cases. I believe that, in
addition, this added complexity could be used to mislead others through data
representation. Say, for example, you are looking to plot a frequency
distribution for the change in blood pressure of a group of patients before and
after a drug regimen. You measure every value yourself, so none of the data
comes from an outside source. You choose the same bin width for every bin and
plot your data, only to find that you have an outlier that creates a bar to the
right of several unfilled bins. The appearance of this outlier bothers you, so
you combine its bin with all of the unfilled bins to create one wide bin. Now
the outlier looks like a right skew in the data, instead of the single value
that it truly is.
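
To make the scenario concrete, here is a minimal Python sketch of the merge described above. The blood pressure changes are invented for illustration (they are not real measurements), and `bin_counts` is a hypothetical helper written for this example, not a function from any library.

```python
def bin_counts(values, edges):
    """Count values per bin; bins are [lo, hi), and the last bin also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts


# Hypothetical changes in blood pressure (mmHg); 38 is the lone outlier.
changes = [-12, -9, -8, -7, -6, -6, -5, -4, -3, -1, 2, 38]

# Equal-width bins (10 mmHg each): the outlier sits alone past three empty bins.
equal_edges = list(range(-15, 50, 10))  # -15, -5, 5, 15, 25, 35, 45
equal_counts = bin_counts(changes, equal_edges)

# Merging the empty bins with the outlier's bin yields one 40 mmHg wide bin.
merged_edges = [-15, -5, 5, 45]
merged_counts = bin_counts(changes, merged_edges)

print(equal_counts)   # counts for the six equal-width bins
print(merged_counts)  # counts after merging the last four bins
```

With equal widths, the gap of empty bins makes the outlier obvious. After the merge, the same single value fills a bar four times wider than its neighbors, which a viewer can easily read as a tail.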

Cox explains, however, that sometimes we aren’t lucky enough to generate our
own data and instead receive data from another source that has already been
grouped into bins of varying widths. The bin sizes can’t be changed without
knowledge of each individual value. A non-biological example is illustrated
below, though a similar situation could easily be found within the biological
sciences. This table, with data on travel time to work (from the 2000 U.S.
census), shows bins 5 minutes wide from 0 to 45 minutes. The next bin is 45-60
minutes, followed by 60-90 minutes and 90-150 minutes:

A frequency distribution of this data would require bins of varying widths
because we have no other information about how the data could be arranged.
Alternatively, we could create a frequency density distribution, that is, a
plot with frequency divided by bin width on the Y axis instead of simply
frequency. This can be seen below:

This sort of graph is more informative when one must include varying bin
widths, as the area of each bar, rather than its height, equals the number of
individuals represented by the bar.
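
The frequency density calculation can be sketched in a few lines of Python. The bin edges follow the travel-time bins described above, but the counts are placeholders made up for this example; they are not the census figures. `frequency_density` is a hypothetical helper name, not part of any library.

```python
def frequency_density(edges, counts):
    """Return frequency divided by bin width for each bin."""
    if len(edges) != len(counts) + 1:
        raise ValueError("need exactly one more edge than counts")
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    if any(w <= 0 for w in widths):
        raise ValueError("edges must be strictly increasing")
    return [c / w for c, w in zip(counts, widths)]


# Bin edges in minutes: 5-minute bins up to 45, then 45-60, 60-90, 90-150.
edges = list(range(0, 50, 5)) + [60, 90, 150]

# Placeholder counts (assumed for illustration, NOT the 2000 census data).
counts = [30, 100, 150, 140, 160, 60, 130, 30, 40, 75, 60, 30]

densities = frequency_density(edges, counts)

# Bar area = density * width, which recovers the original count for each bin.
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
areas = [d * w for d, w in zip(densities, widths)]

for lo, hi, d in zip(edges, edges[1:], densities):
    print(f"{lo:>3}-{hi:<3} min: density {d:.2f} per minute")
```

Plotting `densities` as bar heights over the unequal bins keeps wide bins from looking inflated: a 60-minute bin holding the same count as a 5-minute bin gets a bar one twelfth as tall.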

In short, changing the bin widths in a frequency distribution has a large
impact on how the data appear to the viewer. Unless continuous data are
received already grouped into bins of varying widths, it is more
straightforward to present them using a standardized bin width.

Great points! I completely agree. Since, in theory, it is our statistics' responsibility to convey the "significance" of our data, we don't strictly need graphs for that. They're there to accurately represent our data so people can best conceptualize the effect size or, in the cases above, the distribution. I think that scientists are often just as guilty of manipulating their graphs to contrive effects as they are of p-hacking. It's both easy and tempting to tweak axes to exaggerate an effect and make a point. Perhaps the graphical output should be planned before an experiment, just as the experimental design is.
