I picked up my first pipette in 2009 in a virology lab that focused on a single domain of a single protein of a single virus. The majority of our assays were in vitro, and even cell culture was used sparingly. We essentially studied a single cog of a machine outside the context of that machine, and this was standard practice in the field.
Today I am sifting through massive piles of RNA sequencing, proteomics, and metabolomics data in an effort to see the machine more clearly. As our high-throughput and analytical methods have progressed, scientific research has gained not only the ability to investigate the intricacies of biology, but also the opportunity to examine the bigger picture. With that opportunity, however, come significant challenges. We can spend months chasing down artifacts or trying to piece together seemingly contradictory data with little success. Statistics is our only guide in this process, and many of us have only rudimentary training in the nuances of analysis necessary to navigate these mountains of data.
The days of “your favorite protein” may be coming to an end, replaced by a more complex and holistic approach at the bench. Amin et al. describe this new approach in an excellent article about systems biology and the rise of big data. They summarize the “omics” approach in the figure below, and argue that research must move away from focusing on any one aspect of biology and begin to integrate all approaches into the investigation of a scientific question.
With this more complex approach comes a need for more advanced statistics in graduate and even undergraduate education. Big data has become an integral part of our scientific lives, and increasingly an important aspect of our personal lives as well. In the midst of this contested presidential election, popular interest in big data is incredibly high, and interested onlookers would do well to familiarize themselves with the art and science of statistics.