Whenever
I think about the issue of bias and irreproducibility in science, there are two
quotes that come to mind.
“73.6% of all statistics are made up.” – Mark Suster
The second quote was popularized
by Mark Twain, who attributed it to Benjamin Disraeli:
“There are three kinds of lies: lies, damned lies, and
statistics.”
How do these quotes relate to the
issues of reproducibility and bias in science? The irony of the first quote is
that it is itself a made-up statistic, meant to demonstrate that people will parrot
figures without first validating their veracity. The second quote highlights
that statistics can be deceitful if misrepresented. Combine misleading
statistics with the repetition of false information, and you have a crisis in
the validity and reproducibility of scientific data. You do not have to go far
to find proof of this phenomenon: This article discusses the source of hype around new cancer
drugs, which stems from both journalists and scientists repeating statistics without
understanding the full context of the situation. Yet, it is not just scientists
who do this. How many times have you or a Facebook friend read a statistic and
then repeated it, without understanding where that number came from? Misleading
facts combined with repetition without confirmation means it is very easy to
fool ourselves into thinking there is something in the numbers when in
actuality, there is nothing.
To demonstrate how easy it is to
fool ourselves, take a look at the graph below:
These graphs look related, right?
An r value of 0.666 is not terrible. Let’s add some labels.
Do you believe this graph? It
seems pretty reasonable, right? But let’s look at what the graph actually
represents.
Surprise! This is a spurious graph where
the two variables have nothing to do with each other, yet look related because
of the way the data is represented.
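If you want to see how easily this happens, here is a minimal Python sketch. It is not the data behind the graph above; the two series are made up for illustration, and the names `trending_series`, `series_a`, and `series_b`, along with the slopes and noise levels, are my own assumptions. Each series simply drifts upward over time with its own independent random noise, so neither one influences the other.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def trending_series(rng, start, slope, noise_sd, n):
    """Hypothetical series: a straight-line trend plus independent Gaussian noise."""
    return [start + slope * t + rng.gauss(0, noise_sd) for t in range(n)]


def differences(xs):
    """Step-to-step changes, which remove the shared upward trend."""
    return [b - a for a, b in zip(xs, xs[1:])]


rng = random.Random(0)  # fixed seed so the run is repeatable
N = 200  # number of time steps (an arbitrary choice for illustration)

# Two unrelated quantities that both happen to rise over time.
series_a = trending_series(rng, start=100.0, slope=2.0, noise_sd=5.0, n=N)
series_b = trending_series(rng, start=50.0, slope=0.5, noise_sd=2.0, n=N)

r_levels = pearson_r(series_a, series_b)
r_changes = pearson_r(differences(series_a), differences(series_b))

print(f"r between the raw series:       {r_levels:.3f}")
print(f"r between step-to-step changes: {r_changes:.3f}")
```

The raw series should correlate strongly simply because both trend upward, while the step-to-step changes, where the trend has been removed, should show little relationship. Looking at changes rather than levels is only one rough sanity check, not a full test for a real connection.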
The point is, statistics is
tricky. It is easy to ignore facts and justify what we want to see, especially
when it benefits us. I think this may be a big reason why science is currently
in a data crisis; it is not necessarily out of intentional malice, but rather
because human beings are inherently biased, and we inherently make connections,
false or not, between data sets. Of course, there are those who intentionally
falsify data or manipulate data to fit their theories, but that’s a whole other
topic.
This post humorously highlights the human tendency to see what we want to see. The number of people drowning in pools and the number of movies Nicolas Cage has appeared in are separate quantities that presumably have no impact on one another. However, anyone with no knowledge of either could conclude that they are related. This also brings to light the issue of "correlation vs. causation," a major flaw in reasoning that leads us to conclude that data sets are related because they look like each other. This would be a more extreme example of the reproducibility crisis, as the data illustrated not only lack significance but would be considered "alternative facts."