Tuesday, March 15, 2016

Regression and Correlation in R

As we saw in class today, many variables can appear to be highly correlated, but inferring causation from correlation alone is highly misleading. Here, I have made up some data relating the time I wake up to the amount of food I ate the night before, and the two appear to be highly correlated. Does this imply causation?


Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -37.84     123.63  -0.306     0.76    
time          129.07      12.10  10.668   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 141.9 on 98 degrees of freedom
Multiple R-squared:  0.5373, Adjusted R-squared:  0.5326 
F-statistic: 113.8 on 1 and 98 DF,  p-value: < 2.2e-16
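
A summary like the one above could come from a fit along these lines. This is only a sketch: the post's actual data are not shown, so the values below are simulated (the sample size of 100 is chosen to match the 98 residual degrees of freedom), and the variable name `food`, the ranges, and the noise level are assumptions. The numbers it prints will not match the table above.

```r
# Hypothetical data: simulated stand-ins for the made-up data in the post.
set.seed(1)
time <- runif(100, min = 6, max = 14)                # wake-up time in hours (assumed range)
food <- 130 * time + rnorm(100, mean = 0, sd = 140)  # food eaten, arbitrary units (assumed)

# Regress food on time, matching the "time" row in the coefficient table.
fit <- lm(food ~ time)

# Prints coefficients, residual standard error, R-squared and the F-statistic.
print(summary(fit))
```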

I have also plotted the time I wake up against the contamination in my LPS. As you can see from the plot below, there is no correlation between these two variables.

Residuals:
    Min      1Q  Median      3Q     Max 
-50.503 -27.703   3.398  23.272  47.583 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  52.9787    26.0499   2.034   0.0447 *
time         -0.1027     2.5492  -0.040   0.9680  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 29.89 on 98 degrees of freedom
Multiple R-squared:  1.655e-05, Adjusted R-squared:  -0.01019 
F-statistic: 0.001622 on 1 and 98 DF,  p-value: 0.968

This plot and the fitted slope (-0.10, p = 0.968, with an R-squared near zero) show no evidence of a linear relationship between the amount of contamination I have and the time that I wake up.
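
Since regression and correlation are two views of the same fit here, a quick check may help. In a simple regression with one predictor, the p-value for the slope is the same as the p-value from a Pearson correlation test, and R-squared is the squared correlation. The sketch below uses simulated, independent data (the name `contamination` and the ranges are assumptions, not the post's data).

```r
# Hypothetical data: two variables that are independent by construction.
set.seed(2)
time <- runif(100, min = 6, max = 14)
contamination <- runif(100, min = 0, max = 100)

fit <- lm(contamination ~ time)
ct  <- cor.test(time, contamination)   # Pearson correlation test

slope_p <- summary(fit)$coefficients["time", "Pr(>|t|)"]
r2      <- summary(fit)$r.squared

# Both comparisons should report TRUE (up to floating-point tolerance).
print(all.equal(slope_p, ct$p.value))
print(all.equal(r2, unname(ct$estimate)^2))
```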

As a result, I have learned how to plot data and find the linear regression of that data using R!


1 comment:

  1. Honestly speaking, statistical methods show only association (or correlation). Causality, which is one kind of association, can be inferred when the study is designed for causal inference. One necessary element of causality is the temporal relationship: the cause MUST happen before the effect. Therefore, among observational studies, a cohort study is more convincing than a case-control study when researchers attempt to establish causality between a risk factor and a disease.

    For the first plot, 'the amount of food I ate last night' (food) clearly happened before 'the time I wake up' (time), so it would be better to plot time vs. food (i.e., put time on the Y-axis and food on the X-axis). This change does not affect the significance, but the slope does not simply invert. The new slope is b' = r^2 / b, where b is the original slope and r is the correlation coefficient, so b' equals 1/b only when the points lie exactly on a line. The new intercept is a' = mean(time) - b' * mean(food). With the values reported above (R-squared = 0.5373, b = 129.07), b' is about 0.0042.
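
    A small check of what happens when the two axes are swapped in a simple least-squares fit: the product of the two slopes equals the squared correlation, and the t-statistic for the slope is the same in both directions. The data are simulated and the names `fwd` and `bwd` are just illustrative.

    ```r
    # Hypothetical data with a noisy linear relationship.
    set.seed(3)
    x <- runif(100, min = 6, max = 14)
    y <- 130 * x + rnorm(100, mean = 0, sd = 140)

    fwd <- lm(y ~ x)   # original direction
    bwd <- lm(x ~ y)   # axes swapped

    b_fwd <- unname(coef(fwd)["x"])
    b_bwd <- unname(coef(bwd)["y"])
    r     <- cor(x, y)

    t_fwd <- summary(fwd)$coefficients["x", "t value"]
    t_bwd <- summary(bwd)$coefficients["y", "t value"]

    # Product of the slopes equals r^2, so b_bwd is r^2 / b_fwd, not 1 / b_fwd.
    print(all.equal(b_fwd * b_bwd, r^2))
    # The slope t-statistics (and hence the p-values) agree.
    print(all.equal(t_fwd, t_bwd))
    ```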

    Just some R code: use out <- lm(y ~ x) to run the simple regression, plot(x, y) to generate a scatter plot, and abline(out) to add the regression line to the plot. Here y is the time that I wake up and x is either the amount of food I ate last night or the percent contamination of LPS.
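
    Put together, those three calls might look like the sketch below. The data are simulated, and the axis labels and output file are assumptions; the plot is written to a PDF in a temporary directory so the script also runs non-interactively. In an interactive session the pdf() and dev.off() lines can be dropped.

    ```r
    # Hypothetical data: x is food eaten last night, y is wake-up time.
    set.seed(4)
    x <- runif(100, min = 0, max = 1000)
    y <- 6 + 0.004 * x + rnorm(100, mean = 0, sd = 1)

    out <- lm(y ~ x)   # simple linear regression of y on x

    out_file <- file.path(tempdir(), "regression.pdf")   # hypothetical file name
    pdf(out_file)
    plot(x, y, xlab = "Food eaten last night", ylab = "Wake-up time (h)")
    abline(out, col = "red")   # adds the fitted regression line
    dev.off()
    ```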
