Hi, welcome back. In this module, we're going to talk more about multiple comparisons, including some new procedures and some pitfalls. For this, we'll use an example simulation framework in which we have some signal, shown on the top row: you see the signal in white, and the area around it with no signal in black. That signal is mixed with noise in these simulations, and the noise here is just random noise. What we observe, as always with brain maps, is signal plus noise together. So in a perfect test we should see true positives, that is, positive findings inside the square and negative findings outside it: white inside, black outside.

Let's look at three different kinds of multiple comparisons correction. First, no correction, with an alpha level, an acceptable false positive rate, of 0.10. What I see here is that we get most of the signal inside the square, so that's good. But we have a lot of stuff happening outside the square. Those are false positives, so those are bad.

Now let's look at what happens when we control the familywise error rate at 10%. That would be p < 0.10, corrected for multiple comparisons across the image. When we do that, what we should see is that one in ten maps shows any false positive anywhere. And indeed, that's what we see: only one in ten maps, here it's map number seven, shows some significant activation outside of the activated area, circled in red. But what's happened is we've missed much of the true activation, so we have lots of misses, or false negatives: lots of black inside the square.

Finally, let's look at the bottom panel. This is an example of false discovery rate control, which is another technique. Here, what we expect is that, of the things we're reporting, only about 10% should be false positive findings. And indeed, that's about what we see. You can see that we've now captured most of the true positives inside the squares, so sensitivity is high, and we have some false positives, but not so many that we can't deal with them. So it's a balance. The numbers below those boxes are the actual false discovery rate per image: we're controlling the expected false discovery rate, but the actual proportion of false discoveries varies from image to image.

It's up to each of us to decide how we're going to correct for multiple comparisons and what standards we should adhere to. But for imaging, false discovery rate has emerged as a very popular alternative, because it's more sensitive than familywise error rate control, which is quite important with limited sample sizes, while still providing reasonable control of the false positive rate.

Another development is what's called cluster-level inference. This is available in a number of software packages now, and it's a two-step process. First, we set an arbitrary threshold, which is called the cluster-defining threshold; we'll call that u_clus. Then we keep all the clusters that are larger, in terms of contiguous voxels, than a cluster-size threshold k_alpha. So k_alpha is the number of contiguous voxels that we need; it's a cluster-size parameter. We can illustrate this with a little one-dimensional diagram, a nice, simple illustration from Tom Nichols. Thanks, Tom. Here, what we see is a one-dimensional brain that's stretched out.
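Before we walk through that diagram, here is a minimal Python sketch of the same two-step rule applied to a simulated one-dimensional statistic map. The cluster-defining threshold u_clus and the cluster-size cutoff k_alpha below are illustrative values chosen for the sketch; in a real analysis k_alpha would be derived from random field theory or a permutation test rather than set by hand, and this is not the exact setup behind the lecture's figure.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

# Simulated 1-D "brain": smooth noise plus two patches of signal,
# a narrow strong peak and a broad weaker one, as in the diagram.
n_vox = 200
noise = np.convolve(rng.standard_normal(n_vox + 20),
                    np.ones(8) / 8, mode="same")[10:-10]
stat = noise.copy()
stat[40:48] += 6.0      # narrow, strong signal (8 voxels)
stat[110:160] += 3.0    # broad, weaker signal (50 voxels)

u_clus = 2.0            # cluster-defining threshold (illustrative)
k_alpha = 15            # cluster-size cutoff; in practice from RFT or permutation

# Step 1: threshold the map. Step 2: keep contiguous clusters of size >= k_alpha.
labels, n_clusters = ndimage.label(stat > u_clus)
for c in range(1, n_clusters + 1):
    idx = np.flatnonzero(labels == c)
    size = idx.size
    verdict = "significant" if size >= k_alpha else "not significant"
    print(f"cluster at voxels {idx[0]}-{idx[-1]}: size {size:3d} -> {verdict}")
```

The point, as in the diagram, is that a small cluster of very high values can fail the size cutoff while a broad region of weaker values passes.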
The pink line shows the voxel statistics; they could be t-statistics or anything you like, plotted across the voxels. What we see is that there are two areas of interest here. One area, on the left, is a giant peak where the t-values are very high, so there's a lot of evidence that there's real stuff happening there, but the area is relatively small. Next to that is an area where the statistics are not as strong, but there are a lot of contiguous voxels with somewhat elevated values. We set the cluster-defining threshold, that's the black line, and then we look at whatever sticks up above it. On the left, only a few voxels are above the threshold, and that may not meet the cluster-size criterion, so that's not going to be a significant cluster. On the right, even though the statistic values are individually less strong, there's a very large area, and that turns out to be larger than our threshold k_alpha. So we're going to say that's a significant cluster.

Cluster-level inference has some advantages. There's typically better sensitivity, especially to weak, distributed signals, so it's easier to pick up on activation. But there's worse spatial specificity. A really important idea here is: if I say this cluster is bigger than I'd expect by chance, what's the null hypothesis? The null hypothesis is that there is no signal anywhere in the cluster. So when I reject it, all I can say is that the cluster is bigger than chance, meaning at least one or more voxels in it has true signal. But you don't know where those activated voxels are. So all you can say is, this is a big blob; you can't say where within it the activity is.

Another new development is what's called threshold-free cluster enhancement, or TFCE. This is one of several techniques for combining information about the intensity, how big the t-statistics are, and the spatial extent, how many voxels there are, into one integrated test. This is a very sensible thing to do. The TFCE algorithm takes an integral of the magnitude of the statistic values times their area above a threshold, evaluated across multiple thresholds. It's implemented now in FSL's randomise tool, and it's becoming increasingly widely used.

So, let's look at what people are actually doing. This is a survey, done by Choong-Wan Woo, of over 800 papers published in top journals. What most people are doing, 75%, is cluster-extent based correction; it's more sensitive. A few people report using uncorrected thresholds, although in practice probably many more people have uncorrected or improperly corrected thresholds than actually appears here. And 19% report using voxel-based familywise error rate or FDR correction.

So, let's talk about some of the pitfalls. We'll talk about pitfalls of using uncorrected thresholds and some important pitfalls of cluster-extent based thresholding. Uncorrected thresholds are very appealing because they provide greater sensitivity, so many studies use them; p < .001 is a very common uncorrected threshold. A likely reason is that sample sizes are usually small enough that if you try to control the familywise error rate at the voxel level, you get nothing. It's very, very difficult to find any significant signal; power is extremely low with that kind of correction. But this is problematic when we're interpreting conclusions from these studies, as there is a high false positive rate.
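To connect this back to the simulation framework from the start of the module, here is a hedged sketch comparing an uncorrected threshold, a simple familywise error correction, and Benjamini-Hochberg FDR control on simulated voxel p-values. The voxel counts, the effect size, and the use of Bonferroni as a stand-in for FWER control are illustrative assumptions, not the exact setup behind the lecture's figures.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# 10,000 "voxels", 500 of which carry true signal (effect size is illustrative).
n_vox, n_signal, alpha = 10_000, 500, 0.10
truth = np.zeros(n_vox, dtype=bool)
truth[:n_signal] = True
z = rng.standard_normal(n_vox) + np.where(truth, 3.5, 0.0)
p = stats.norm.sf(z)                      # one-sided p-values

def report(name, detected):
    fp = np.sum(detected & ~truth)        # false positives
    tp = np.sum(detected & truth)         # true positives
    fdp = fp / max(detected.sum(), 1)     # realized false discovery proportion
    print(f"{name:12s} hits {tp:4d}/{n_signal}  false positives {fp:4d}  FDP {fdp:.3f}")

# 1) No correction at alpha = .10
report("uncorrected", p < alpha)

# 2) Familywise error control via Bonferroni (one simple FWER procedure)
report("Bonferroni", p < alpha / n_vox)

# 3) Benjamini-Hochberg FDR control at q = .10 (step-up procedure)
order = np.argsort(p)
bh_line = alpha * np.arange(1, n_vox + 1) / n_vox
passed = p[order] <= bh_line
k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
fdr_detected = np.zeros(n_vox, dtype=bool)
fdr_detected[order[:k]] = True
report("BH FDR", fdr_detected)
```

With settings like these, you should see the same pattern as in the three panels described earlier: many false positives with no correction, very few detections under familywise control, and a middle ground under FDR control.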
Also, another issue is that null findings are hard to disseminate, and it's difficult to refute false positives. Once somebody says this area is activated at 0.001, others may come along and not find that, but nobody notices or pays attention.

One solution people often use is an arbitrary extent threshold. For example, I might threshold a map at 0.001 uncorrected, and then retain only clusters that have ten or more contiguous voxels and call those significant. At first, that seems very sensible: you think, okay, what are the chances of ten suprathreshold voxels occurring together? Unfortunately, the chance of ten voxels together by chance is pretty good. Why is that? I'll show you right now.

Here is a simulated activation map with spatially correlated noise, and that's very consistent with what we find in the actual brain: noise is spatially correlated. What you see here is that at alpha = 0.1 I see lots of false positives; at alpha = 0.01, fewer false positives; and at 0.001, I still see some false positives outside the square. But the thing to note is that the false positive blobs are not just single voxels; they're whole areas of contiguous voxels. Why? If one voxel sticks up above the threshold by chance, its neighbors are also likely to stick up above the threshold by chance, because the voxels are not independent. And that's why it's really not appropriate to impose arbitrary height and extent thresholds: it doesn't actually provide correction for multiple comparisons at the level we'd like. There might be other reasons for doing it, to balance false positives and false negatives, but it doesn't provide familywise error rate control. A short simulation sketch at the end of this module illustrates this point.

Let's talk now about pitfalls of cluster-extent based correction, and this also applies to threshold-free cluster enhancement. Of the people who are doing cluster-extent based thresholding, that's 75% of studies, let's look at which cluster-defining thresholds they're using. Remember, this is an arbitrary choice: I might threshold first at p < 0.01, or at p < 0.001, and then look at how big the clusters are above that threshold. What you see here, across the major software packages, is that the threshold people use is defined by the default in their package. The big yellow bar shows that people who use FSL usually set a primary threshold of p < 0.01, which is the default in that package. And if people are using SPM, the other most widely used package, the most common threshold is p < 0.001, because that's the default there. So what people are doing is fairly arbitrary.

Now, let's look at the impact of setting these primary thresholds at different levels. This is a pain map that we've generated in our lab. If I threshold at p < 0.01 and look at the size of the clusters sticking up, I get this map. And this map looks pretty good: it's got all these pain-related areas, lots of activation everywhere. But it turns out there are only two contiguous blobs; we color-code those blobs here, blue and orange. So what can I conclude from this? Well, it's very tempting to look at that map and say all these areas are activated. But that's not the right inference. That's wrong.
Really what I can say is that in each blob, there is at least one voxel that is significant. So I have to say, well, I've got activity in this blue blob, somewhere in the thalamus or the insula or the prefrontal cortex or the sensorimotor cortex, but I don't know where. That is not as appealing an inference, is it? And that creates a big problem, because it obviates many of the reasons we do functional imaging in the first place: we want to localize activity.

Now, these simulations, I'll just give you a few take-homes, illustrate some of the issues with cluster-extent based thresholding. On the left, we're looking at the size of significant clusters, and the idea is that if we threshold at p < 0.01, we often get non-specificity: the blobs we get are much bigger than useful anatomical areas. So the problem I illustrated on the previous slide occurs all the time. It creates appealing-looking maps, and we can't make the inferences from them that we'd like to make. That's one issue.

Secondly, if I do try to interpret each voxel in that map, if I say, look, all these areas are probably really active, then I will be wrong much of the time. In simulations similar to the case I showed you, across a range of cases, 45 to 70% of the activated voxels in the map are actually not truly active. So there's a very high false discovery rate.

A third and final problem I'll highlight here is inflated false positive rates. For technical reasons, the familywise error rate is not properly controlled when you use liberal primary thresholds, and that's due to the mechanics of random field theory and how it works. It turns out that if you ask for p < 0.05, familywise error rate corrected, you're not actually getting 0.05 corrected. What you can see here is that when the primary threshold is 0.01, so relatively liberal, the false positive rate is too high. At 0.001 or below, at least in these simulations, it's properly controlled. Where in that range, from 0.01 to 0.001, control breaks down will depend on the characteristics of your data, so it's not a hard and fast rule. But what it does mean is that thresholding at at least p < 0.001 is good practice. Additionally, it makes sense to set the threshold so you don't get giant blobs that span multiple anatomical regions, because those are not interpretable.

So, let's wrap up. We've talked about several multiple comparison methods: uncorrected thresholds, familywise error rate, false discovery rate, and cluster-extent based methods. And we talked about several important pitfalls of using uncorrected thresholds and of using cluster-extent based thresholds. That's the end of this module. At this point, we've covered many aspects of experimental design, physics, acquisition, and analysis, so you should have a good working model for how to do an fMRI experiment and interpret the results.
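Finally, here is the simulation sketch promised earlier, showing why an arbitrary rule like "ten or more contiguous voxels at p < .001" does not control familywise error when the noise is spatially smooth. The image size, the smoothing width, and the number of simulated null images are illustrative choices, not the parameters used in the lecture's simulations.

```python
import numpy as np
from scipy import ndimage, stats

rng = np.random.default_rng(2)

shape = (128, 128)                  # one simulated "slice" of pure noise
smooth_sigma = 3.0                  # spatial smoothing in voxels (illustrative)
z_thresh = stats.norm.isf(0.001)    # voxel threshold for one-sided p < .001

max_sizes = []
for _ in range(100):                # 100 null images: no true signal anywhere
    noise = rng.standard_normal(shape)
    field = ndimage.gaussian_filter(noise, smooth_sigma)
    field /= field.std()            # rescale so voxel values are roughly N(0, 1)
    labels, n = ndimage.label(field > z_thresh)
    sizes = np.bincount(labels.ravel())[1:] if n else np.array([0])
    max_sizes.append(int(sizes.max()))

max_sizes = np.array(max_sizes)
print("largest suprathreshold null cluster per image: median",
      int(np.median(max_sizes)), "max", int(max_sizes.max()))
print("fraction of pure-noise images with a cluster of 10+ voxels:",
      float(np.mean(max_sizes >= 10)))
```

With smoothing on this order, runs of ten or more contiguous suprathreshold voxels can easily occur in pure noise, which is the lecture's point: the extent cutoff has to reflect the smoothness of the data rather than being picked arbitrarily.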