Multiple comparison problem
The multiple comparison problem
The multiple comparison problem potentially arises whenever you would like to test multiple hypotheses simultaneously. If you don't correct for the number of comparisons, then the more hypotheses you test, the higher the probability of obtaining at least one false positive. Since imaging statistics often involve many thousands of simultaneous tests (typically one in each voxel in the brain), it's important to correct for the number of comparisons properly.
Just to be a little more concrete, consider a brain with 10,000 voxels. If we used a statistical test with a univariate false positive rate of 0.05, then we would expect to see 500 false positives in a functional brain map containing no genuine effect. Usually, that rate of spurious activation is unacceptable. Even worse, spatial smoothing can make it appear that the spurious activation forms coherent, highly plausible blobs. You can verify this easily with resting BOLD data.
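To put numbers on this, here is a minimal sketch in Python. The function names are ours, chosen for illustration, and the second function assumes the tests are independent, which smoothed imaging data are not, so treat its result as a rough guide rather than an exact figure.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when no test has a genuine effect."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha):
    """Probability of one or more false positives, ASSUMING independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# The example from the text: 10,000 voxels at an uncorrected alpha of 0.05.
print(expected_false_positives(10_000, 0.05))

# Even 10 uncorrected tests give a sizable chance of a spurious result.
print(prob_at_least_one(10, 0.05))
```

With 10,000 voxels the expected count is about 500, as above, and with only 10 independent tests the chance of at least one false positive is already roughly 40%.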
The goal of correction, loosely speaking, is to meet community standards for statistical significance. At the moment, the prevailing standard is that reported results should come from a test with a family-wise error rate (FWER) of 0.05. The "family," in this case, is the entire set of tests in a single brain map, so this is sometimes called the map-wise false positive rate. Basically, this means that if it turns out that your manipulation has no effect whatsoever, anywhere in the brain (imagine the projector wasn't working), the probability of seeing so much as a single voxel above statistical threshold should be no more than 0.05. There is obviously an arbitrary element to this, and it's open to debate how quickly research would progress if this standard were adjusted. At least one of the approaches noted below (false discovery rate control, or FDR) suggests an alternative standard that does not meet this criterion, but could be viewed as preferable.
In the remainder of this page, we discuss different approaches to dealing with multiple comparisons in functional brain imaging.
Bonferroni correction
Bonferroni correction is the simplest and most conservative approach to correction, often used in behavioral and other types of non-imaging research. To correct, you simply re-calculate your threshold to correspond to your desired map-wise alpha criterion divided by your total number of comparisons (in this case voxels). So if you have 10 voxels and want a map-wise alpha of 0.05, you should count a voxel as statistically significant only if its own p value is below 0.05/10 = 0.005. For a given voxel, your probability of a spurious result is tiny (and your threshold is often painfully high). However, you're guaranteed that the probability of seeing so much as a single supra-threshold voxel in the absence of an experimental effect will be at most 0.05. By the prevailing standards of the community, you can report all of the voxels surviving this threshold as statistically significant.
Bonferroni correction offers provably adequate control over the false positive rate (given the validity of the univariate statistical test). But it is overly conservative in general, more so when the number of independent observations is much smaller than the nominal number of observations. This is often the case in imaging, where spatial smoothness (intrinsic and extrinsic) usually means many fewer observations than voxels.
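The Bonferroni rule is simple enough to sketch in a few lines of Python. The function names and the example p values below are made up for illustration; they are not part of any particular package.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p value threshold that keeps the family-wise error rate <= alpha."""
    if n_tests < 1:
        raise ValueError("need at least one test")
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p value that survives Bonferroni correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


# Ten hypothetical voxel-wise p values; the per-voxel threshold is 0.05 / 10.
p_values = [0.001, 0.004, 0.006, 0.02, 0.03, 0.2, 0.4, 0.5, 0.7, 0.9]
print(bonferroni_threshold(0.05, len(p_values)))
print(bonferroni_significant(p_values))
```

Only the first two voxels fall below the corrected threshold of 0.005, even though five of them would pass an uncorrected threshold of 0.05.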
Random Field Theory
The application of random field theory (RFT) to brain imaging is an approach developed by Keith Worsley and Karl Friston, and has been the dominant approach to correction. The approach relies on the insight, drawn from topology, that we can characterize the number of supra-threshold "blobs" in a functional map in terms of the "Euler Characteristic" (EC) of that map. Since there is a closed form expression for the expected EC of a Gaussian random field, we can use that expression to derive, within reasonable bounds, the expected number of blobs in our map, given no experimental effect and a given threshold. If we set the threshold to yield an expected EC of 0.05, then we would expect only about 5% of our maps with no effect to have any supra-threshold blobs.
Random field theory is terrific, but it only offers advantages over Bonferroni correction when its assumptions are reasonably well satisfied and the map is sufficiently smooth. Generally, this means a round-ish region, a FWHM smoothness above a certain level, and a certain number of degrees of freedom. Outside of these bounds, RFT can sometimes be even more conservative than Bonferroni correction, and therefore shouldn't be used. For this reason, imaging software that implements RFT-based thresholding will calculate both thresholds and report the more liberal criterion.
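As a rough illustration of how a threshold falls out of the expected EC, the sketch below uses the commonly cited approximation for a three-dimensional Gaussian (Z) field, E[EC] = R (4 ln 2)^(3/2) (2 pi)^(-2) (t^2 - 1) exp(-t^2 / 2), where R is the search volume in resels. This is an assumption on our part: it keeps only the highest-dimensional term, ignores the shape of the search region, and does not cover t or F fields, so real software will give somewhat different numbers. The function names are illustrative.

```python
import math


def expected_ec_3d(t, resels):
    """Approximate expected Euler characteristic of a 3D Gaussian field above t.

    Assumes the single-term approximation described in the lead-in.
    """
    return (resels * (4.0 * math.log(2.0)) ** 1.5 * (2.0 * math.pi) ** -2
            * (t * t - 1.0) * math.exp(-t * t / 2.0))


def rft_threshold(resels, target_ec=0.05, tol=1e-10):
    """Find the threshold t at which the expected EC equals target_ec.

    The approximation decreases monotonically for t > sqrt(3), so we bisect there.
    """
    lo, hi = math.sqrt(3.0), 20.0
    if expected_ec_3d(lo, resels) < target_ec:
        raise ValueError("search volume too small for this approximation")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_ec_3d(mid, resels) > target_ec:
            lo = mid  # still too many expected blobs: raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)


# A hypothetical search volume of 1,000 resels.
print(rft_threshold(1000))
```

Note that the threshold depends on the number of resels rather than the number of voxels, which is why a smoother map (fewer resels) gets a lower threshold.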
Permutation testing with the maximum statistic
Permutation testing is a non-parametric technique that offers asymptotically exact control over the false positive rate, that automatically accounts for spatial smoothness, that readily accommodates small volumes, that involves very few substantive assumptions, and that adapts easily to controlling the family-wise error rate. The main drawback to permutation testing is that it is computationally intensive, and even this disadvantage is becoming less and less meaningful as computers get more powerful. It is also difficult to use with single subject data in the general case, because autocorrelated observations are not exchangeable.
Because permutation testing is increasingly a central component of VoxBo, and deserves to be much more widely used, we have devoted a separate page to it. We realize that if you haven't used permutation testing before, you might not feel comfortable diving right in. But we strongly encourage you to consider it, as it offers dramatic benefits and has few meaningful drawbacks. Permutation testing is not a new technique. It is actually an old technique that has only recently become computationally tractable for large problems.
For current purposes, it should be enough to say that permutation testing using the maximum statistic (the largest statistical value across the entire brain, for each of some large number of random permutations; equivalently, the smallest p value) offers asymptotically exact control of the family-wise error rate, and should be the preferred method of FWER control in most cases.
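To make the idea concrete, here is a minimal sketch for one simple case: a group-level, one-sample test in which each subject contributes one contrast value per voxel. Under the null hypothesis the sign of each subject's map is arbitrary, so we flip signs at random and record the most extreme statistic in the map each time (here the largest t value, for a test of positive effects). The data layout, function names, and defaults are assumptions for illustration only, not VoxBo's implementation, and a careful analysis would also count the observed labeling among the permutations.

```python
import math
import random


def t_map(data):
    """One-sample t statistic per voxel; data[s][v] is subject s, voxel v."""
    n = len(data)
    stats = []
    for v in range(len(data[0])):
        column = [row[v] for row in data]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / (n - 1)
        stats.append(mean / math.sqrt(var / n))
    return stats


def max_stat_threshold(data, n_perm=1000, alpha=0.05, seed=0):
    """Approximate FWER threshold from the permutation distribution of the map maximum."""
    rng = random.Random(seed)
    null_max = []
    for _ in range(n_perm):
        # Flip the sign of each subject's whole map, preserving spatial structure.
        signs = [rng.choice((-1.0, 1.0)) for _ in data]
        flipped = [[s * x for x in row] for s, row in zip(signs, data)]
        null_max.append(max(t_map(flipped)))
    null_max.sort()
    # Index of (approximately) the 1 - alpha quantile of the null maxima.
    k = min(n_perm - 1, max(0, math.ceil((1.0 - alpha) * n_perm) - 1))
    return null_max[k]


# Hypothetical data: 12 subjects, 20 voxels, with a real effect in voxel 0 only.
rng = random.Random(1)
data = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(12)]
for row in data:
    row[0] += 5.0
threshold = max_stat_threshold(data, n_perm=500)
print(threshold, [t > threshold for t in t_map(data)][:3])
```

Because whole maps are flipped together, the null distribution of the maximum reflects whatever spatial smoothness the data have, with no need to estimate it.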
False Discovery Rate (FDR)
False Discovery Rate (FDR) control is a relatively new arrival to fMRI (as of late 2001). Genovese, Lazar, and Nichols have advocated FDR as a new standard for reporting activation at the map level. The details of the technique are fairly simple, and well explained in their article. In a nutshell, FDR correction provides a different kind of assurance than the techniques that control FWER. Instead of guaranteeing a map-wise false positive rate of alpha, FDR control bounds the expected proportion of false positives among reported results. That is, if you set your threshold for an FDR criterion (aka q) of 0.05, and you see 100 voxels above your FDR threshold, then you should expect no more than about 5 of them to be false positives.
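For readers who want to see the mechanics, the sketch below implements the Benjamini-Hochberg step-up procedure, the standard FDR procedure for independent or positively dependent tests. Sort the m p values, find the largest one satisfying p(i) <= (i/m) q, and report everything at or below it. The function names and example p values are ours, and we are assuming the simplest form of the procedure without any correction for arbitrary dependence.

```python
def bh_threshold(p_values, q=0.05):
    """Largest sorted p value p(i) with p(i) <= (i / m) * q, or None if there is none."""
    m = len(p_values)
    threshold = None
    for i, p in enumerate(sorted(p_values), start=1):
        if p <= q * i / m:
            threshold = p  # keep going: the step-up rule takes the LAST p value that passes
    return threshold


def bh_significant(p_values, q=0.05):
    """Flag each p value that survives FDR control at level q."""
    threshold = bh_threshold(p_values, q)
    if threshold is None:
        return [False] * len(p_values)
    return [p <= threshold for p in p_values]


# Ten hypothetical voxel-wise p values.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(bh_threshold(p_values))
print(bh_significant(p_values))
```

In this example the FDR threshold is 0.008, so two voxels are reported, whereas Bonferroni correction (threshold 0.005) would report only one. The FDR threshold adapts to the data: the more small p values there are, the more liberal it becomes.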
As of this writing, it is unclear whether or not the imaging community (i.e., journal editors and reviewers) will embrace FDR as an acceptable method for deciding what's reportable and what's not. Nor is it clear what policy is in the best interest of science. At the moment, controlling FWER at 0.05 has historical weight behind it, and is certainly more conservative. But functional imaging is in a position to make its own decisions about standards, so it's possible FDR control will gain traction. Your best bet is to understand what assurances are provided by both methods and use each when you feel it's appropriate.
ROI Analysis
Comparisons that deal with ROIs (however they're defined) have a much simpler multiple comparison problem to deal with. If your hypothesis concerns activation across the entire caudate, then you only have a single observation per statistical map. Even though that observation is an average of many voxel values, they're all going into a single statistical test. If you have five ROIs, the correction is still much less severe than it would be for 40,000 voxels. Since there may be theoretical advantages as well to studies with more specific hypotheses, comparisons within ROIs are especially attractive when they fit with your research program.
