Friday, April 28, 2017

Visualizing Data

I would like to take this blogging opportunity to spend a moment talking about one of my data visualization heroes, Edward Tufte.



Edward Tufte takes what we tend to think of as the most left-brained fields, quantitative analysis and statistics, and melds them wonderfully with the most right-brained field, art. He has written many books on this subject and discusses everything from politics to how an ineffective presentation style can have awful ramifications, such as sending a person to death row, or how a clearer presentation might have prevented the Columbia shuttle disaster.

He is a Professor Emeritus at Yale University in 3 departments: Political Science, Statistics, and Computer Science. He lives on a large plot of land in Connecticut, where he displays his abstract art sculptures.

Tufte has long been a critic of PowerPoint. In the essay "The Cognitive Style of PowerPoint," he lays out many reasons why he believes that PowerPoint is a waste of everyone's time and doesn't adequately relay important information to the audience.

His main point is that a PowerPoint outline forces the audience to think sequentially rather than cyclically, which makes it more difficult to compare things side by side and understand relationships between concepts. Not all projects should or need to be presented sequentially, particularly when multiple topics are being discussed in one presentation. He says this is especially true of statistics, in which the basic idea is a comparison between sets of information; our brains cannot grasp those comparisons one slide at a time. An alternative? Tufte argues that the best way to get information across is to write a short handout for people to read in the first 5-10 minutes of the presentation, then put up one figure or data image and have a discussion. Too much information and the audience becomes lost: an audience cannot hold all of the slides in its head at once, and being overwhelmed with information leads to not retaining any of it.

Image by Edward Tufte

Click here for an article written by Tufte for Wired Magazine:
https://www.wired.com/wp-content/uploads/archive/wired/archive/11.09/images/FT_pp2_1.jpg

Only 46%!


In a 2013 article published in PeerJ entitled "On the reproducibility of science: unique identification of research resources in the biomedical literature," Nicole A. Vasilevsky and colleagues found that only 46% of studies in biomedical journals were transparent enough to provide basic information about a number of critical resources, including the strain of model organism, antibodies used, knockdown reagents, constructs, and cell lines. Furthermore, Vasilevsky et al. looked to see whether the data from the publications had been deposited in a data repository, and found that most had not. They looked across a wide variety of journal metrics, including impact factor, subject matter, and reporting requirements, and found no correlation between any of these factors and whether or not an author published adequate specifics of the experiment.

On a personal level, I have tried to reproduce a number of published experiments in my own work, and in every case I have been unable to replicate the results. (And sometimes, even when things are clearly laid out, it is still not possible to replicate them!) The most baffling aspect to me is that so much work goes into planning, executing, analyzing, and submitting an experiment that it seems ridiculous not to be clearer about it. Furthermore, all current work is modeled on previous work, so for so much time and effort to rest on flimsy assumptions about experimental design is a disservice to the entire scientific community. Scientists with a vested interest in patents, corporations, or other conflicting arrangements may not wish to publish the specifics of their methods. However, as a scientist uninfluenced by such factors, I tend to distrust any author who does NOT publish their full data set, or who at the very least does not offer to share the data set with readers who ask (because not all data types have online repositories, though most do).

What is there to gain by not specifying resources? The fear of being usurped by another scientific group racing toward the same question far outweighs the actual probability of getting scooped. I understand that there are exceptions to these rules because of possible adverse health consequences (for example, identifying certain virulent strains of Ebola or other superbugs). However, the chances of that applying to the majority of studies that underreport this information are considerably low. More than likely, the real scenario is a lack of rigor in the editing and peer review process combined with a conscious or slightly subconscious assumption that readers will not need or want every piece of data. But if we all took a step back and reorganized how we look at publication and review, there would be no reason not to be as transparent as possible.

Below is a figure from Vasilevsky et al. 2013. It identifies each type of resource and the fraction of papers in each field that adequately identifies the specifics of that resource.

Avoiding Cognitive Bias in Science

Cognitive bias is a part of our everyday lives.  Social influences, emotions, data-processing errors, flawed memories, and the brain's inability to fully process information all contribute to cognitive bias.  From the celebrity bandwagon effect to the courtroom, we all experience cognitive bias most of the time without even realizing it.

There is no question that cognitive bias has been a long-standing issue in scientific research.  The pressure to publish in high-impact journals with a quick turnaround between publications is higher than ever.  Techniques and methods are becoming so complicated that researchers often don't know how to correctly analyze the data, or they lose sight of the principal question.  Presentations of data are rushed: presenters pile on data trying to convince audience members of their conclusions, and the talks move so quickly that audience members cannot fully grasp everything being thrown at them.  In addition, when statistics are reported and readers see a p value of 0.05 or less, they assume statistical significance without ever seeing the raw data or the full statistical analysis.  Chances are that much of the data presented rest on flawed statistical tests.  If the raw data were presented, others could conduct their own statistical analysis, which would more than likely result in different findings.
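
As a concrete illustration of why the raw data matter, here is a minimal sketch, in Python, of what a reader could do if a paper shared its two groups' raw measurements instead of only a p value. The numbers and group names below are made up for illustration.

from scipy import stats

# Hypothetical raw measurements shared alongside a paper.
control   = [4.1, 3.8, 4.4, 4.0, 3.9, 4.2, 4.3, 4.1]
treatment = [4.6, 4.2, 4.9, 4.4, 4.5, 4.8, 4.3, 4.7]

# Welch's t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# With the raw data in hand, a reader can also check assumptions or
# run a different test entirely, e.g. a nonparametric Mann-Whitney U.
u_stat, p_mw = stats.mannwhitneyu(treatment, control, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_mw:.4f}")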


To fix the issues associated with cognitive bias, the field must first recognize that there is an issue.  By becoming more transparent with methods, techniques, raw data, analyses, and conclusions in an open-science environment, some of these issues can be avoided.  By pre-planning and pre-registering methods and analyses, a researcher is tied to their analysis plan, which eliminates some of the cognitive bias that creeps in when scientists analyze data they have already collected: rather than finding a test to fit the collected data, the data are analyzed with a pre-chosen test.  In addition, blind data analysis should be used to limit bias.  Rather than collecting data until the expected result appears, analysis should be conducted blind and confirmed by multiple people.  These are simple steps that can be pursued to avoid cognitive bias in the field; they are not the only solutions, nor necessarily the correct ones, but they are simple practices that can help reduce bias in science, something all researchers need to keep in mind when conducting their work.
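
One low-tech way to set up the blind analysis described above is to have someone other than the analyst recode the group labels before anyone looks at the data; the key is revealed only after the analysis is locked in. Here is a minimal sketch in Python, assuming a hypothetical samples.csv with sample_id, group, and measurement columns (the file and column names are invented).

import random
import pandas as pd

# A colleague (not the analyst) runs this step and keeps the key private.
df = pd.read_csv("samples.csv")              # hypothetical file and columns
groups = df["group"].unique().tolist()
codes = [f"condition_{i + 1}" for i in range(len(groups))]
random.shuffle(codes)
key = dict(zip(groups, codes))               # e.g. {"control": "condition_2", ...}

blinded = df.copy()
blinded["group"] = blinded["group"].map(key)
blinded.to_csv("samples_blinded.csv", index=False)

# The analyst works only with samples_blinded.csv, applies the
# pre-chosen test, and only then is the key revealed.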

Sunday, April 23, 2017

Statistically speaking, are you going to snap me back?

Snapchat is a social media platform that lets people send pictures that are viewable only for a designated period of time. It is swiftly rising to the top of the social media food chain. The easy-to-use interface is equipped with filters that allow users to jazz up their selfies with backgrounds or costumes.

There are currently 158 million people on Snapchat. There are 80 million American users and 100 million Americans between the ages of 12 and 34, suggesting that there is an 80% chance that if you, as a millennial, send a Snapchat, you will get a snap back. These numbers are used to generate probabilities. In probability theory, we consider randomness and uncertainty in models, but in statistics we make observations to figure out a process. To generate statistics on whether someone will send you a Snapchat in return, you would need to establish a relationship: keep track of past events and establish patterns. This raises the question: how can we use randomness to determine statistics?
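
To make the distinction concrete, here is a toy sketch in Python: the 80% figure comes from population counts (probability), while a statistical answer would come from your own observed history of snaps sent and returned. The history below is invented.

# Probability from population counts: the "80% chance" above.
snapchat_users_usa = 80_000_000
americans_12_to_34 = 100_000_000
print(snapchat_users_usa / americans_12_to_34)   # 0.8

# Statistics from observation: estimate your own snap-back rate.
# 1 = the friend snapped back, 0 = they did not (made-up history).
history = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
print(sum(history) / len(history))               # 0.7 for this invented record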

Chevalier de Méré was a gambling man interested in seeing if the "house" truly always wins. In his dice game, a player rolls four dice and wins if no six appears on any die. An early ancestor of the Monte Carlo method comes from Georges-Louis Leclerc, Comte de Buffon, whose needle experiment can be simulated at: http://www.metablake.com/pi.swf. He showed that the probability of a randomly thrown needle crossing one of the lines is 2/π (when the needle is as long as the spacing between the lines). The essence is repeating a simple random experiment, like a coin flip, many times. The Monte Carlo method has since been refined many times, but it requires three things: modeling a system with probability density functions, repeatedly sampling from those functions, and computing the statistics of interest. All in all, simulating data and performing statistical analysis on it is the intersection of statistics and probability.
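
Those three ingredients show up even in a toy Monte Carlo simulation of de Méré's dice game described above. A sketch in Python (the 100,000-trial count is arbitrary):

import random

# Model: four fair six-sided dice; the player wins if no six appears.
trials = 100_000
wins = 0
for _ in range(trials):
    rolls = [random.randint(1, 6) for _ in range(4)]   # repeatedly sample from the model
    if 6 not in rolls:
        wins += 1

print(wins / trials)      # statistic of interest: ~0.482, so the house edges out the player
print((5 / 6) ** 4)       # exact probability for comparison: ~0.4823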


Unfortunately, millennials, until a great statistician like you models the probability density function for Snapchats returned, we are stuck with our sad return probability of 80%. Hopefully you become BFFs with Kylie Jenner so she can increase your probability. Until then… Happy modeling!

Reality check on Reproducibility

In 2016, the editorial “Reality check on reproducibility” was published in Nature as a response to the 2016 article “1,500 scientists lift the lid on reproducibility.”  That survey first asks whether there is a reproducibility crisis in science, to which roughly half of the respondents answered, “Yes, a significant crisis.”  Surprisingly, a large percentage of respondents believe the crisis is only slight or does not exist at all.  As a graduate student, it is quite interesting to read scientists' comments stating that “failing to reproduce results is a rite of passage.”  While this may seem like a valuable learning process for students and post-docs, if they are entirely unable to reproduce results, it feeds into the larger reproducibility problem.

To fix reproducibility, we must first collectively define it using the same language.  Any working definition must consider the empirical, conditional, and statistical aspects of experiments, and even then the criteria for reproducibility are subjective between researchers.  To help fix the crisis, we must all define the term and its criteria in the same way.  In addition, students should be taught the basics of reproducibility in an experimental design class to provide the necessary foundation for a basic understanding.

The editorial points out that “Senior scientists will not expect each tumour sample they examine under a microscope to look exactly like the images presented in a scientific publication; less experienced scientists might worry that such a result shows lack of reproducibility.”  As a graduate student who looks to publications to compare my work against, it is very hard to reproduce certain findings because of the expectation that my results should look exactly like the published images.  A well-written publication that correctly reports its design and conclusions should be reproducible in its empirical, conditional, and statistical aspects.  If it is not, communication with the authors of the publication is essential, and if the results still cannot be reproduced after that, the publication opens itself up for retraction.  To avoid this humiliating experience, it is important for labs to take a step back from novel questions and focus on reproducibility within their own labs, and for members of a lab to be the harshest critics of each other's work.  This is just one step among many that will help limit the reproducibility crisis in research.