The p-value is the observed significance level of a result. In the context of hypothesis testing, it is the smallest level of significance at which the null hypothesis would be rejected for a given test procedure and data set. This allows comparison with a chosen threshold for type 1 error, alpha, which is the probability of rejecting the null hypothesis when it is actually true. Thus a p-value is the probability of obtaining a test statistic value at least as contradictory to the null hypothesis as the value observed, assuming the null hypothesis is true. In terms of applicability, the p-value is used to determine whether a test statistic value is statistically significant, where the threshold of significance is the maximum rate of type 1 error one is willing to accept. In other words, the null hypothesis can be rejected (i.e., the result is significant) if the probability of obtaining a test statistic value at least as contradictory to the null hypothesis as the one observed, assuming the null hypothesis is true, falls below that threshold.
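To make the definition concrete, here is a minimal sketch of how a two-sided p-value is computed from a test statistic. It assumes the statistic follows a standard normal distribution under the null hypothesis (i.e., a z-test); the function name is my own for illustration.

```python
from math import erf, sqrt

def two_sided_p_value(z):
    """Two-sided p-value for a z statistic: P(|Z| >= |z|) under H0."""
    # Standard normal CDF via the error function
    phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    return 2 * (1 - phi(abs(z)))

# An observed z of 1.96 sits right at the conventional alpha = 0.05 boundary
p = two_sided_p_value(1.96)  # approximately 0.05
```

Note that the p-value measures how surprising the data are under the null hypothesis; it is not the probability that the null hypothesis is true.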

Now, the threshold for significance accepted by most in the scientific community is an alpha of 0.05. However, is that perhaps too large? If we are trying to determine whether a particular treatment is successful at treating a disease, is a 5% chance of a false positive finding acceptable? Much of this depends on the clinical context. If a patient has little to no alternative to this treatment, then 0.05 may be judged an appropriate threshold. Alternatively, if there are treatments that are equally effective, a stricter threshold (a lower alpha) may be warranted. Other factors besides the availability of alternative treatments could include cost, risk of complications, and contraindications for future treatment options, to name a few. It is trusted that clinical judgment is used when determining a threshold alpha, yet 5% is the standard so often that I suspect it is used more out of habit than out of sound clinical judgment.

Additionally, it is important to assess the goals of the study and whether a type 1 or a type 2 (false negative) error would be worse. Our significance testing focuses on type 1 error, but if type 2 error is equally or more critical, that must be accounted for in the study design. For example, the required power of the study, the complement of the type 2 error rate, should be determined a priori, which allows an appropriate sample size to be calculated so the study is not underpowered, thus reducing the chance of a type 2 error.
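The a priori sample-size calculation mentioned above can be sketched as follows. This assumes a two-sided, two-sample z-test with equal group sizes and uses the standard approximation n = 2((z_{1-α/2} + z_{power}) / d)², where d is the standardized effect size (Cohen's d); the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample z-test.

    effect_size is Cohen's d; power is the complement of the
    type 2 error rate (1 - beta).
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    n = 2 * ((z(1 - alpha / 2) + z(power)) / effect_size) ** 2
    return ceil(n)

# A "medium" effect (d = 0.5) at alpha = 0.05 and 80% power
n = sample_size_per_group(0.5)  # about 63 per group
```

A t-test-based calculation (as in dedicated power-analysis software) would give a slightly larger n, but the z approximation shows the key relationship: smaller effects or higher power demand larger samples.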

The significance threshold should also depend on the nature of the statistical analysis. For example, while for a single test an alpha of 0.05 may seem reasonable, if we perform multiple tests and comparisons the family-wise error rate increases over the collection of tests, and it becomes more likely that among our results a false positive has occurred. Thus, in the case of multiple comparisons, it is essential to select a lower per-test threshold to compensate.

As a final note, as has been alluded to throughout this blog post, the p-value, although the current foundation of statistical hypothesis testing, does not tell the whole story (e.g., it says nothing about power or effect size) and is perhaps used too frequently as a matter of habit. Researchers should determine what they actually want to show with their analysis and what the implications of their results will be, and then choose the appropriate statistical test and parameters (alpha, beta, n, etc.) accordingly.

The points about the often arbitrary nature of significance being set at 0.05 are well reasoned. I do wonder, however, whether it is really necessary to worry about false negatives beyond power analysis, as the cultural bias is so strongly geared in the opposite direction. Between p-hacking and publication bias, I think, if anything, we still primarily need to worry that we are not allowing for enough negative results.

I think your points are well made, particularly about the failure of p-values to tell us much about the power and effect size of the data.

I agree that when p-values are used, different thresholds are appropriate for different situations. Clinical trials are a circumstance where we should have very little tolerance for false positives, while something less directly relevant to human health can be a little (just a little!) more lenient.

I think that whenever possible, authors should show all of their data points in figures. For very large studies this is impractical, but when your N is 15 (for example), there is really no excuse for not showing the spread of the data.

