Monday, April 11, 2016

Know Your Data

Defining the type of data you will produce is a key step in designing an experiment. While categorizing data as discrete or continuous is one of the most straightforward concepts in statistics, things are not always as simple as they seem. Giving thought to typical data produced in my lab revealed that discrete data is not always discrete and continuous data is not always continuous. It is therefore important to understand your data in the form used for statistical testing and draw appropriate conclusions.

Discrete data may include whole numbers (5 pups in a litter) or categorical assignments (pregnant vs not pregnant), while continuous data is described as being scalar and able to take on any value within a range when measured (e.g. height, weight, time, etc.). Data may start out as discrete or continuous, but categorization or manipulation of the data can quickly change its properties. For instance, binning continuous data, such as the weight of mice, into underweight, normal, and overweight produces discrete data from the same measures. A major focus of my lab is the study of conditions and factors which influence gene reassortment of influenza viruses. To measure reassortment levels, we employ a two-virus system in which the original virus, designated “wild-type” (Wt), and a “variant” (Var) virus differ only by engineered silent mutations. These markers allow differentiation of segments by high-resolution melt analysis, with the mutations conferring different melt temperatures on segments of Wt or Var origin. Genotypes which contain all segments from either the Wt or the Var virus are categorized as “parental,” while a virus isolate containing any combination of Wt and Var gene segments is “reassortant.”
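Both transformations described above, binning a continuous measurement and labeling a genotype, can be sketched in a few lines of Python. This is only an illustration: the weight cutoffs, the function names, and the representation of a genotype as a list of "Wt"/"Var" strings are assumptions made for the example, not the lab's actual pipeline.

```python
def bin_weight(grams, low=18.0, high=30.0):
    """Turn a continuous weight into a discrete category.

    The cutoffs `low` and `high` are hypothetical placeholders.
    """
    if grams < low:
        return "underweight"
    if grams > high:
        return "overweight"
    return "normal"


def classify_genotype(segments):
    """Label an isolate from the origin ('Wt' or 'Var') of each segment."""
    origins = set(segments)
    if not origins or not origins <= {"Wt", "Var"}:
        raise ValueError("each segment must be 'Wt' or 'Var'")
    # All segments from one parent -> parental; any mixture -> reassortant.
    return "parental" if len(origins) == 1 else "reassortant"


if __name__ == "__main__":
    print(bin_weight(24.3))                          # a continuous value becomes a category
    print(classify_genotype(["Wt"] * 8))             # all segments from one parent
    print(classify_genotype(["Wt"] * 7 + ["Var"]))   # a single Var segment is enough
```

In both cases the output is a category, whatever the precision of the measurement that went in.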

The reassortment level arising from a coinfection is determined by calculating percent reassortment as follows:
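Given the definitions of parental and reassortant genotypes above, the calculation presumably takes the usual form of a proportion. As a sketch, assuming every genotyped isolate is scored as either parental or reassortant:

```latex
\[
\%\ \text{reassortment} =
\frac{\text{number of reassortant isolates}}
     {\text{total number of isolates genotyped}} \times 100
\]
```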

Much like the transformation of continuous data by binning into groups, discrete genotypes are transformed into a value for % reassortment and then treated as continuous data. Although the percent value is not truly continuous data in the sense discussed here, it is routinely treated as such for meaningful statistical comparisons between groups. Though the underlying data is discrete, it intuitively makes sense to treat the values this way, as we are describing the features of a population instead of individual data points. For instance, calculating the percent of your mice which are pregnant, say 75%, does not mean that the mice are 75% pregnant; it means that 75% of your sample population is pregnant. Being able to accurately define the type of data you are using for analysis enables you to choose appropriate statistical tests and describe results accurately.
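To make the point concrete, the sketch below computes a percentage from discrete labels and lists the values that percentage can take for a given number of isolates. The function names and the label strings are assumptions chosen for this illustration.

```python
def percent_reassortment(labels):
    """Percent of isolates labeled 'reassortant' in a list of genotype labels."""
    n = len(labels)
    if n == 0:
        raise ValueError("at least one isolate is required")
    k = sum(1 for label in labels if label == "reassortant")
    return 100.0 * k / n


def attainable_percentages(n):
    """Every percentage that a sample of n isolates can produce."""
    return [100.0 * k / n for k in range(n + 1)]


if __name__ == "__main__":
    sample = ["reassortant"] * 3 + ["parental"]
    print(percent_reassortment(sample))   # 75.0
    print(attainable_percentages(4))      # [0.0, 25.0, 50.0, 75.0, 100.0]
```

With n isolates, the percentage can only land on one of n + 1 evenly spaced values, so it behaves more like a continuous measure as the number of isolates grows.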

1 comment:

  1. However, while saying 75% of 4 mice are pregnant holds meaning, 75% of 2 mice does not. Really cool research to read about, I'm wondering, though, are there any other similar restrictions/constraints on your 'continuous' value of % reassortment? And how does that affect the statistical analyses you perform?