Whenever I think about the issue of bias and irreproducibility in science, there are two quotes that come to mind.
“73.6% of all statistics are made up.” – Mark Suster
The second quote was popularized by Mark Twain, who attributed it to Benjamin Disraeli:
“There are three kinds of lies: lies, damned lies, and statistics.”
How do these quotes relate to the issues of reproducibility and bias in science? The irony of the first quote is that it is itself a made-up statistic, meant to demonstrate that people will parrot figures without first checking their veracity. The second quote highlights that statistics can be deceitful when misrepresented. Combine misleading statistics with the repetition of false information, and you have a crisis in the validity and reproducibility of scientific data. You do not have to go far to find proof of this phenomenon: This article discusses the source of hype around new cancer drugs, which stems from both journalists and scientists repeating statistics without understanding the full context of the situation. Yet it is not just journalists and scientists who do this. How many times have you or a Facebook friend read a statistic and then repeated it without understanding where that number came from? Misleading facts combined with repetition without confirmation make it very easy to fool ourselves into thinking there is something in the numbers when, in actuality, there is nothing.
To demonstrate how easy it is to fool ourselves, take a look at the graph below:
These graphs look related, right? An r value of 0.666 is not terrible. Let’s add some labels.
Do you believe this graph? It seems pretty reasonable, right? But let’s look at what the graph actually represents.
Surprise! This is a spurious graph in which the two variables have nothing to do with each other, yet they look related because of the way the data is represented.
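If you want to see how little it takes to produce a correlation like this, here is a minimal sketch in Python. To be clear, it is not a reconstruction of the graph above. It illustrates one common way spurious correlations arise, which is my assumption for the sake of the example: two series that are generated independently but both happen to drift upward over time. The function names, slopes, noise level, and seed are all made up for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def differences(values):
    """Step-to-step changes, which removes a steady linear trend."""
    return [b - a for a, b in zip(values, values[1:])]


def simulate(n_steps=400, seed=0):
    """Build two unrelated trending series and correlate them two ways.

    Hypothetical setup: each series is a straight-line trend plus its own
    independent random noise, so neither one influences the other.
    """
    rng = random.Random(seed)
    series_a = [1.0 * t + rng.gauss(0, 3) for t in range(n_steps)]
    series_b = [2.5 * t + rng.gauss(0, 3) for t in range(n_steps)]

    # Correlation of the raw values is dominated by the shared upward drift.
    raw_r = pearson_r(series_a, series_b)
    # Correlation of the step-to-step changes ignores the drift.
    diff_r = pearson_r(differences(series_a), differences(series_b))
    return raw_r, diff_r


if __name__ == "__main__":
    raw_r, diff_r = simulate()
    print(f"r on raw values:          {raw_r:.3f}")
    print(f"r on step-to-step changes: {diff_r:.3f}")
```

The raw values should correlate very strongly simply because both lines go up, while the step-to-step changes, which strip out the shared trend, should show a correlation close to zero. Nothing connects the two series except the passage of time.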
The point is, statistics is tricky. It is easy to ignore facts and justify what we want to see, especially when it benefits us. I think this may be a big reason why science is currently in a data crisis: not necessarily because of intentional malice, but because human beings are inherently biased, and we instinctively make connections, false or not, between data sets. Of course, there are those who intentionally falsify data or manipulate data to fit their theories, but that’s a whole other topic.