Tuesday, April 12, 2016

Big Data, Big Challenges

Did you know that every day, we create 2.5 quintillion bytes of data? It comes from everywhere one could possibly imagine: social media, retail industry forecasts, healthcare software, lottery analysis, performance efficiency, business intelligence, surveillance, product generation, and so much more. This explosion of available information is termed big data, and the trend of incessant data production and storage is likely to continue. Big data lies at the center of a four-circle Venn diagram: high volume, high velocity, large variety, and poor veracity. However, these inherent characteristics pose colossal challenges for conventional statistical approaches.

First, in this era of scaling up, statistical analyses often fall short. Many popular algorithms that give statistically sound results for a small sample size n do not scale to big data and run terribly slowly on terabyte-scale data sets. These older statistical tools are simply no match for the massive data they must process. For example, even a basic operation like computing the median becomes far more complex at scale: an exact median requires sorting or selecting over every observation, which is costly when the data do not fit in memory or are spread across many machines. Ultimately, the underlying philosophy of statistics in this context must change from getting the best solution to getting a good answer quickly. To play an effective role in this high-speed, technologically charged world, statistical methods must master a trade-off between speed and accuracy.
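To make the speed-versus-accuracy trade-off concrete, here is a minimal sketch of one way to get a "good answer quickly" for the median: keep a fixed-size uniform random sample of the stream (reservoir sampling) and take the median of that sample. The function name `approximate_median` and the parameters `sample_size` and `seed` are my own illustrative choices, not something taken from the paper cited at the end of this post.

```python
import random
import statistics


def approximate_median(stream, sample_size=1000, seed=0):
    """Estimate the median of a stream in one pass.

    Keeps a uniform random sample of at most `sample_size` values
    (reservoir sampling) and returns the median of that sample.
    Memory use is bounded by `sample_size`, not by the stream length.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, value in enumerate(stream):
        if i < sample_size:
            # Fill the reservoir with the first `sample_size` values.
            reservoir.append(value)
        else:
            # Keep the new value with probability sample_size / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < sample_size:
                reservoir[j] = value
    if not reservoir:
        raise ValueError("cannot take the median of an empty stream")
    return statistics.median(reservoir)


if __name__ == "__main__":
    # Hypothetical usage: a generator stands in for data too large to hold in memory.
    gen = random.Random(1)
    stream = (gen.gauss(100.0, 15.0) for _ in range(200_000))
    print(approximate_median(stream, sample_size=2000))
```

The answer is only an estimate, and its error shrinks as `sample_size` grows, which is exactly the trade-off described above: a bigger reservoir costs more memory and time but gives a more accurate median.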

Second, with big data comes large data volume but no random sampling, as the data are often collected for a specific purpose (e.g. Macy's assessing customer preference on the Women's Shoes section of their website). Traditional statistical models, whether frequentist or Bayesian, typically assume a random sample drawn from a population. In the big data context, however, there is an urgent need for statistical methods that can either incorporate non-random samples without producing inaccurate results, or adjust for sample selection with built-in mechanisms.
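As a small illustration of what "adjusting for sample selection" can look like, here is a sketch of post-stratification: if we know each group's true share of the population, we can reweight the group averages from a lopsided sample. This is one classical adjustment, not the only one, and the group names, values, and population shares below are invented for the example.

```python
from collections import defaultdict


def post_stratified_mean(records, population_shares):
    """Reweight a non-random sample to known population shares.

    records: iterable of (group, value) pairs from the sample.
    population_shares: dict mapping each group to its share of the
        target population (shares are assumed to sum to 1).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, value in records:
        sums[group] += value
        counts[group] += 1

    missing = set(population_shares) - set(counts)
    if missing:
        raise ValueError(f"no sampled records for groups: {sorted(missing)}")

    # Weight each group's sample mean by its population share.
    return sum(
        share * sums[group] / counts[group]
        for group, share in population_shares.items()
    )


if __name__ == "__main__":
    # Hypothetical sample that over-represents mobile visitors (80% of the
    # sample) even though they are assumed to be 50% of all customers.
    sample = [("mobile", 1.0)] * 80 + [("desktop", 3.0)] * 20
    shares = {"mobile": 0.5, "desktop": 0.5}

    naive = sum(value for _, value in sample) / len(sample)
    print(naive)                                 # 1.4
    print(post_stratified_mean(sample, shares))  # 2.0
```

The naive average is pulled toward the over-sampled group, while the reweighted one is not. The catch is that this only works when the selection mechanism can be described by groups whose population shares are actually known.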

Third, big data sets are heterogeneous, meaning they are created by combining several data sources from diverse subpopulations (e.g. a database of SNP sequences). A huge challenge lies in the fact that most current statistical methods are designed to draw inferences about a single population. In fact, many statistical methods that work well with low-dimensional data lose their validity in the high-dimensional setting of big data, leading to spurious correlations, noise accumulation, and incidental endogeneity. Statisticians need to develop more flexible algorithms so that the individual populations within a mixture data set can be identified and separate inferences can be drawn for each.
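The spurious correlation problem is easy to see in a quick simulation. The sketch below generates a response and a pile of features that are all pure, independent noise, then reports the largest absolute correlation it can find. The function names, sample size, and feature counts are my own choices for illustration.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_spurious_correlation(n_samples, n_features, seed=0):
    """Largest |correlation| between a noise response and noise features.

    Every feature is drawn independently of the response, so any
    correlation found here is spurious by construction.
    """
    rng = random.Random(seed)
    response = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, response)))
    return best


if __name__ == "__main__":
    # Same 50 observations, more and more unrelated features to search through.
    for p in (5, 50, 500, 2000):
        print(p, round(max_spurious_correlation(50, p), 3))
```

With only 50 observations, the best-looking correlation keeps climbing as the number of unrelated features grows, even though nothing in the data is actually related to anything else. That is why a method that simply picks the strongest-looking variables can go badly wrong in high dimensions.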

In revolutionizing the world we live in, big data has forced statisticians to grapple with the reality that "concise insights but faster data" may be more valuable than "deeper insights but slow data." For further reading on this topic, I highly recommend this phenomenal paper, which not only discusses the pitfalls of specific statistical algorithms but also proposes ways to reform them.

Fan, Jianqing, et al. "Challenges of Big Data Analysis." National Science Review, vol. 1 (2014): 293-314.
