Multiple comparisons

In statistics, the multiple comparisons (or "multiple testing") problem occurs when one considers a set, or family, of statistical inferences simultaneously. Errors in inference, including confidence intervals that fail to include their corresponding population parameters, or hypothesis tests that incorrectly reject the null hypothesis, are more likely to occur when one considers the family as a whole. Several statistical techniques have been developed to prevent this from happening, allowing significance levels for single and multiple comparisons to be directly compared. These techniques generally require a stronger level of evidence to be observed in order for an individual comparison to be deemed "significant", so as to compensate for the number of inferences being made.

Practical examples of multiple comparisons

The term "comparisons" in multiple comparisons typically refers to comparisons of two groups, such as a treatment group and a control group. "Multiple comparisons" arise when a statistical analysis encompases a number of formal comparisons, with the presumption that attention will focus on the strongest differences among all comparisons that are made. Failure to compensate for multiple comparisons can have important real-world consequences, as illustrated by the following examples.

  • Suppose the treatment is a new way of teaching writing to students, and the control is the standard way of teaching writing. Students in the two groups can be compared in terms of grammar, spelling, organization, content, and so on. As more attributes are compared, it becomes more likely that the treatment and control groups will appear to differ on at least one attribute.
  • Suppose we consider the efficacy of a drug in terms of the reduction of any one of a number of disease symptoms. As more symptoms are considered, it becomes more likely that the drug will appear to be an improvement over existing drugs in terms of at least one symptom.
  • Suppose we consider the safety of a drug in terms of the occurrences of different types of side effects. As more types of side effects are considered, it becomes more likely that the new drug will appear to be less safe than existing drugs in terms of at least one side effect.

In all three examples, as the number of comparisons increases, it becomes more likely that the groups being compared will appear to differ in terms of at least one attribute. However, a difference between the groups is only meaningful if it generalizes to an independent sample of data (e.g. to an independent set of people treated with the same drug). Our confidence that a result will generalize to independent data should generally be weaker if it is observed as part of an analysis that involves multiple comparisons, rather than an analysis that involves only a single comparison.

Multiple comparisons for confidence intervals and hypothesis tests

The family of statistical inferences that occur in a multiple comparisons analysis can comprise confidence intervals, hypothesis tests, or both in combination.

To illustrate the issue in terms of confidence intervals, note that a single confidence interval at the 95% level will likely contain the population parameter it is meant to contain. However, if one considers 100 confidence intervals simultaneously, each with 95% coverage probability, it is highly likely that at least one will not contain its population parameter. The expected number of such intervals is 5, and if the intervals are independent, the probability that at least one interval does not contain the population parameter is 99.4%.
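
These figures follow directly from the numbers above; a minimal Python check (the 100 intervals and the 95% coverage are the values from this example):

    # 100 independent confidence intervals, each with 95% coverage
    n_intervals = 100
    coverage = 0.95

    expected_misses = n_intervals * (1 - coverage)      # 100 * 0.05 = 5
    p_at_least_one_miss = 1 - coverage ** n_intervals   # 1 - 0.95**100

    print(expected_misses)        # 5.0
    print(p_at_least_one_miss)    # ≈ 0.994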

If the inferences are hypothesis tests rather than confidence intervals, the same issue arises. With just one test performed at the 5% level, and a null hypothesis that is actually true, there is only a 5% chance of incorrectly rejecting the null hypothesis. However, with 100 tests where all null hypotheses are true, the expected number of incorrect rejections is 5. If the tests are independent, the probability of at least one incorrect rejection is 99.4%. The situation is analogous to the confidence interval case. These errors are called false positives. Many techniques have been developed to control the false positive error rate associated with making multiple statistical comparisons.
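
A short simulation makes the same point empirically. The sketch below is only an illustration (the group sizes and the number of repetitions are arbitrary choices): it repeatedly runs 100 two-sample t-tests in which every null hypothesis is true and records how often at least one test is rejected at the 5% level.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_tests, n_reps, alpha = 100, 1000, 0.05
    any_false_positive = 0

    for _ in range(n_reps):
        # Both groups are drawn from the same distribution, so every null is true
        a = rng.normal(size=(n_tests, 30))
        b = rng.normal(size=(n_tests, 30))
        p = stats.ttest_ind(a, b, axis=1).pvalue
        any_false_positive += (p < alpha).any()

    print(any_false_positive / n_reps)   # close to 1 - 0.95**100 ≈ 0.994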

Example: Flipping coins

For example, one might declare that a coin was biased if in 10 flips it landed heads at least 9 times. Indeed, if one assumes as a null hypothesis that the coin is fair, then the probability that a fair coin would come up heads at least 9 out of 10 times is (10 + 1) × (1/2)^10 = 0.0107. This is relatively unlikely, and under statistical criteria such as p-value < 0.05, one would reject the null hypothesis and declare the coin unfair.

A multiple comparisons problem arises if one wanted to use this test (which is appropriate for testing the fairness of a single coin) to test the fairness of many coins. Imagine testing 100 fair coins by this method. Given that the probability of a fair coin coming up heads 9 or 10 times in 10 flips is 0.0107, seeing a particular (i.e. pre-selected) coin come up heads 9 or 10 times would still be very unlikely, but seeing some coin, no matter which one, behave that way would be more likely than not. Precisely, the probability that all 100 fair coins are identified as fair by this criterion is (1 − 0.0107)^100 ≈ 0.34. Therefore the application of our single-test coin-fairness criterion to multiple comparisons would, more likely than not, falsely identify at least one fair coin as unfair.
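
Both probabilities quoted in this example can be reproduced in a few lines of Python using only the binomial counts:

    from math import comb

    # P(at least 9 heads in 10 flips of a fair coin) = (C(10,9) + C(10,10)) / 2**10
    p_single = (comb(10, 9) + comb(10, 10)) / 2**10
    print(p_single)                 # 0.0107421875  (= 11/1024)

    # P(no coin out of 100 fair coins shows 9 or 10 heads)
    p_all_pass = (1 - p_single) ** 100
    print(p_all_pass)               # ≈ 0.34
    print(1 - p_all_pass)           # ≈ 0.66: at least one false "unfair" verdict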

Formalism

For hypothesis testing, the problem of multiple comparisons (also known as the multiple testing problem) results from the increase in the probability of a type I error that occurs when statistical tests are used repeatedly. If n independent comparisons are performed, the experiment-wide significance level α, also termed FWER (familywise error rate), is given by

 \alpha = 1 - \left(1 - \alpha_\mathrm{per\ comparison}\right)^{n}.

Unless the tests are perfectly dependent, α increases as the number of comparisons increases. If we do not assume that the comparisons are independent, then we can still say:

 \alpha \le \alpha_\mathrm{per\ comparison} \times n,

which follows from Boole's inequality.
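
As a concrete illustration with invented numbers: ten independent comparisons at a per-comparison level of 0.05 give a familywise error rate of 1 − 0.95^10 ≈ 0.40, while Boole's inequality gives the bound 10 × 0.05 = 0.50.

    alpha_pc = 0.05   # per-comparison significance level (example value)
    n = 10            # number of independent comparisons (example value)

    fwer_independent = 1 - (1 - alpha_pc) ** n   # exact under independence
    fwer_bound = n * alpha_pc                    # Boole's inequality, any dependence

    print(fwer_independent)   # ≈ 0.401
    print(fwer_bound)         # 0.5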

Methods

In order to retain a prescribed familywise error rate α in an analysis involving more than one comparison, the error rate for each comparison must be more stringent than α. Boole's inequality implies that if each test is performed to have type I error rate α/n, the total error rate will not exceed α. This is called the Bonferroni correction, and is one of the most commonly used approaches for multiple comparisons.
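
In practice the correction amounts to comparing each p-value against α/n, or equivalently multiplying each p-value by n and capping the result at 1. A minimal sketch (the function name and the example p-values are purely illustrative):

    def bonferroni(p_values, alpha=0.05):
        """Bonferroni correction: returns adjusted p-values and rejection flags."""
        n = len(p_values)
        adjusted = [min(1.0, p * n) for p in p_values]
        reject = [p <= alpha / n for p in p_values]
        return adjusted, reject

    # Five illustrative raw p-values, familywise level 0.05
    adj, rej = bonferroni([0.001, 0.01, 0.02, 0.04, 0.3])
    print(adj)   # [0.005, 0.05, 0.1, 0.2, 1.0]
    print(rej)   # [True, True, False, False, False]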

In some situations, the Bonferroni correction is substantially conservative, i.e., the actual familywise error rate is much less than the prescribed level α. This occurs when the test statistics are highly dependent (in the extreme case where the tests are perfectly dependent, the familywise error rate with no multiple comparisons adjustment and the per-test error rates are identical). For example, in fMRI analysis, tests are done on over 100,000 voxels in the brain. The Bonferroni method would require p-values to be smaller than 0.05/100,000 to declare significance. Since adjacent voxels tend to be highly correlated, this threshold is generally too stringent.
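
The effect of dependence can be seen in a small simulation. This is only a sketch under an assumed equicorrelated-normal model (not an fMRI model): as the correlation between the test statistics grows, the realized familywise error rate of the Bonferroni procedure drops well below the nominal 5%.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    m, alpha, n_reps = 1000, 0.05, 2000
    z_crit = stats.norm.ppf(1 - alpha / (2 * m))   # Bonferroni two-sided z threshold

    for rho in (0.0, 0.5, 0.9):
        # Equicorrelated standard-normal test statistics, all null hypotheses true
        shared = rng.normal(size=(n_reps, 1))
        noise = rng.normal(size=(n_reps, m))
        z = np.sqrt(rho) * shared + np.sqrt(1 - rho) * noise
        fwer = np.mean(np.any(np.abs(z) > z_crit, axis=1))
        print(rho, fwer)   # ≈ 0.05 for rho = 0, noticeably smaller as rho grows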

Because simple techniques such as the Bonferroni method can be too conservative, there has been a great deal of attention paid to developing better techniques, such that the overall rate of false positives can be maintained without inflating the rate of false negatives unnecessarily. Such methods can be divided into general categories:

  • Methods where total alpha can be proved never to exceed 0.05 (or some other chosen value) under any conditions. These methods provide "strong" control of the Type I error rate, under all conditions including cases where only some of the null hypotheses are true.
  • Methods where total alpha can be proved not to exceed 0.05 except under certain defined conditions.
  • Methods which rely on an omnibus test before proceeding to multiple comparisons. Typically these methods require a significant ANOVA/Tukey range test before proceeding to multiple comparisons. These methods have "weak" control of Type I error.
  • Empirical methods, which control the proportion of Type I errors adaptively, utilizing correlation and distribution characteristics of the observed data.

The advent of computerized resampling methods, such as bootstrapping and Monte Carlo simulations, has given rise to many techniques in the latter category. In some cases where exhaustive permutation resampling is performed, these tests provide exact, strong control of Type I error rates; in other cases, such as bootstrap sampling, they provide only approximate control.
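
One hedged illustration of such a resampling technique is a max-statistic permutation procedure in the spirit of Westfall and Young (the data layout, function name, and Welch-style statistic below are choices made for this sketch): group labels are permuted, the maximum |t| over all features is recorded for each permutation, and adjusted p-values are read off the resulting null distribution.

    import numpy as np

    def max_t_adjusted_pvalues(x, y, n_perm=5000, rng=None):
        """Permutation (max-|t|) adjusted p-values for comparing two groups on many features.

        x and y have shape (n_x, m) and (n_y, m): rows are samples, columns are features.
        """
        if rng is None:
            rng = np.random.default_rng()
        data = np.vstack([x, y])
        n_x = x.shape[0]

        def abs_t(d):
            a, b = d[:n_x], d[n_x:]
            se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
            return np.abs(a.mean(axis=0) - b.mean(axis=0)) / se

        observed = abs_t(data)
        max_null = np.empty(n_perm)
        for i in range(n_perm):
            perm = rng.permutation(len(data))        # shuffle group labels
            max_null[i] = abs_t(data[perm]).max()    # max |t| over all features
        # Adjusted p-value: how often the permutation max |t| reaches the observed |t|
        return (1 + (max_null[:, None] >= observed[None, :]).sum(axis=0)) / (n_perm + 1)

    # Example: 50 features, both groups drawn from the same distribution (all nulls true)
    rng = np.random.default_rng(2)
    p_adj = max_t_adjusted_pvalues(rng.normal(size=(20, 50)), rng.normal(size=(25, 50)))
    print((p_adj < 0.05).sum())   # usually 0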

Post-hoc testing of ANOVAs

Multiple comparison procedures are commonly used after obtaining a significant omnibus test, like the ANOVA F-test. The significant ANOVA result suggests rejecting the global null hypothesis H0 that all the group means are equal. Multiple comparison procedures are then used to determine which means differ from which others.

Comparing K means involves K(K − 1)/2 pairwise comparisons.
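
A hedged sketch of this workflow for three hypothetical groups (K = 3, hence 3 pairwise comparisons), using scipy's one-way ANOVA followed by Bonferroni-corrected pairwise t-tests:

    from itertools import combinations
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    groups = {                              # three hypothetical samples
        "A": rng.normal(0.0, 1.0, 30),
        "B": rng.normal(0.0, 1.0, 30),
        "C": rng.normal(1.0, 1.0, 30),
    }
    pairs = list(combinations(groups, 2))   # K(K - 1)/2 = 3 pairs

    # Omnibus ANOVA F-test of the global null "all means are equal"
    f_stat, p_omnibus = stats.f_oneway(*groups.values())

    # Follow up with pairwise t-tests only if the omnibus test is significant,
    # Bonferroni-correcting over the number of pairs
    if p_omnibus < 0.05:
        for g1, g2 in pairs:
            t, p = stats.ttest_ind(groups[g1], groups[g2])
            print(g1, g2, min(1.0, p * len(pairs)))   # Bonferroni-adjusted p-value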

The Kruskal–Wallis test is the non-parametric alternative to ANOVA. Multiple comparisons can then be carried out using pairwise comparisons (for example Wilcoxon rank-sum tests) together with a correction to determine which of the post-hoc tests are significant (for example a Bonferroni correction).
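
The non-parametric route follows the same pattern; a sketch with hypothetical data, using scipy's kruskal and mannwhitneyu as the rank-based tests:

    from itertools import combinations
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    groups = {"A": rng.normal(0, 1, 30), "B": rng.normal(0, 1, 30), "C": rng.normal(1, 1, 30)}
    pairs = list(combinations(groups, 2))

    # Kruskal-Wallis omnibus test, then Bonferroni-corrected pairwise
    # Wilcoxon rank-sum (Mann-Whitney) tests
    h_stat, p_omnibus = stats.kruskal(*groups.values())
    if p_omnibus < 0.05:
        for g1, g2 in pairs:
            u, p = stats.mannwhitneyu(groups[g1], groups[g2])
            print(g1, g2, min(1.0, p * len(pairs)))   # Bonferroni-adjusted p-value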

Large-scale multiple testing

Traditionally, work on multiple comparisons focused on correcting for modest numbers of comparisons, often in an analysis of variance. More recently, focus has shifted to "large-scale multiple testing", in which thousands or even greater numbers of tests are performed. For example, in genomics, when using technologies such as microarrays, expression levels of tens of thousands of genes can be measured, and genotypes at millions of genetic markers can be measured. While methods to control the familywise error rate are used in these problems, one can alternatively control the false discovery rate (FDR), defined to be the expected proportion of false positives among all significant tests. One simple meta-test is to use a Poisson distribution whose mean is the expected number of significant tests, equal to α times the number of comparisons, to estimate the likelihood of finding any given number of significant tests.
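
The best-known FDR-controlling method is the Benjamini–Hochberg step-up procedure, which compares the i-th smallest p-value against iq/m for a chosen FDR level q. A minimal sketch (the input p-values are made up for illustration; the procedure's guarantee assumes independent or positively dependent tests):

    import numpy as np

    def benjamini_hochberg(p_values, q=0.05):
        """Benjamini-Hochberg step-up: boolean array of discoveries at FDR level q."""
        p = np.asarray(p_values, dtype=float)
        m = len(p)
        order = np.argsort(p)
        thresholds = q * np.arange(1, m + 1) / m       # i * q / m for i = 1..m
        below = p[order] <= thresholds
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = np.nonzero(below)[0].max()             # largest i with p_(i) <= i*q/m
            reject[order[:k + 1]] = True               # reject the k+1 smallest p-values
        return reject

    # Illustrative p-values only
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]))
    # -> [ True  True False False False False False False]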
