This is also related to something that's been called the decline effect.
And this was the subject of a New Yorker article a few years ago that repopularized the idea,
which is a really wonderful article.
And the idea is that effects that appeared very large in initial papers
subsequently decrease in effect size when they are re-examined.
And the lead example here is drugs that seem to work initially, but
as they're tested over the years, the effect sizes get smaller and smaller.
And there are many reasons this can happen with drugs, of course, including
changing standards for how diseases are diagnosed and how drugs are applied.
But a big part of this decline effect is likely just simply regression to the mean.
So anytime a false positive finding happens to get published,
people start trying to replicate it.
If it's not a true finding, or if it was picked out of a larger
family of experiments because it happened to work the best, then when it's replicated
the effects will get smaller and smaller, and it will appear to decline.
So not so mysterious, but important.
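As an aside beyond the transcript, here's a minimal Python sketch of that regression-to-the-mean story; the true effect size, noise level, and number of experiments are made-up assumptions. If we publish only the largest of several noisy estimates of the same small effect, independent replications of that "winning" finding look smaller on average.

```python
# A minimal sketch (not from the lecture) of regression to the mean:
# publish the largest of several noisy estimates, then replicate it.
import numpy as np

rng = np.random.default_rng(0)
true_effect = 0.1      # assumed small true effect
noise_sd = 0.3         # assumed sampling noise of each study's estimate
n_families, n_experiments = 10000, 5

# Each row is a "family" of experiments; we publish the largest estimate.
initial = rng.normal(true_effect, noise_sd, size=(n_families, n_experiments))
published = initial.max(axis=1)

# Independent replications of the published finding are unbiased.
replication = rng.normal(true_effect, noise_sd, size=n_families)

print("mean published effect:   %.3f" % published.mean())    # inflated
print("mean replication effect: %.3f" % replication.mean())  # ~ true_effect
```

The published estimates are inflated precisely because they were selected for being large; the replications simply recover the true effect.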
And then we'll talk about circularity.
And this is a way of thinking about brain analyses.
Especially the kinds of analyses that lead to biased effects and
how we avoid them.
So circularity is also called double-dipping, a term that was popularized
by an article a few years ago by Nikolaus Kriegeskorte and colleagues.
And the idea of circularity is that you can select voxels to look at
based on one effect or test, and then test those voxels on something
that's not independent of that selection criterion.
So this is sort of pernicious, and we have
to really think through our results and analyses and be careful to avoid it.
But here's a worked-through example.
So what you see here is a panel of voxels and this is a study with four conditions,
A, B, C and D.
So now we've got the blue area here, which shows some true effects and
the true effect is A & B activate, C & D don't.
So there's the truth. Now we're going to select data based on this contrast, A versus D.
So we're selecting on A versus D and
we're picking up voxels that show that A versus D effect.
Now what's going to happen is,
any voxels I test later are going to tend to show noise that favors A versus D.
So then, if I tested independent data, I would get about the right answer.
A and B activate, the other ones don't.
But, if I use that same data where the noise is favoring the A versus D
hypothesis, then I'm going to get a biased effect.
A is going to be greater than B, on average.
Just by chance.
So, how can I select on A versus D and get an A versus B effect?
Well I'm conditioning on noise values that tend to be high for
A, and that's not true for B, so I'm creating a bias.
And that's essentially the circularity problem in a nutshell, and
it's one of the big dangers in terms of selecting ROIs and then testing them.
We have to make sure that the data are independent.
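As a hedged illustration, not from the lecture and with purely made-up noise data, here is a small Python simulation of that selection story: every voxel is pure noise, we select voxels on the A versus D contrast, and then the A versus B test looks positive in the same data but comes out near zero in independent data.

```python
# Minimal sketch of the circularity example (pure noise, no true effects):
# select voxels on A - D, then test A - B in the same vs. independent data.
import numpy as np

rng = np.random.default_rng(1)
n_voxels, n_subjects = 5000, 20

# Condition values per subject, voxel, and condition (A, B, C, D); all noise.
data1 = rng.normal(size=(n_subjects, n_voxels, 4))  # selection dataset
data2 = rng.normal(size=(n_subjects, n_voxels, 4))  # independent dataset

# Select voxels where the A - D contrast is large in dataset 1.
a_minus_d = data1[:, :, 0].mean(0) - data1[:, :, 3].mean(0)
selected = a_minus_d > np.percentile(a_minus_d, 95)

# Test A - B in the selected voxels.
biased = (data1[:, :, 0] - data1[:, :, 1]).mean(0)[selected].mean()
unbiased = (data2[:, :, 0] - data2[:, :, 1]).mean(0)[selected].mean()

print("A - B, same data:        %.3f" % biased)    # positive, by selection
print("A - B, independent data: %.3f" % unbiased)  # ~ 0
```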
And one popular way of selecting regions of interest is through contrasts that
are orthogonal to the tests of interest.
And so nominally you might think that that avoids the circularity problem.
So I might select on a main effect of A B versus C D and then test on A minus B.
And that seems on the surface, pretty okay.
It's safer than truly non-independent tests, but
there can still be bias if you test on the same data.
Why?
Because in the design matrix, the regressors for A,
B, C, and D can be correlated, and that can make the contrast estimates dependent, and
also the noise might be autocorrelated, so
the noise characteristics can create a selection bias as well.
So we do have to be careful even when we're applying orthogonal contrasts.
It's better to test on independent data.
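Here's a hedged sketch of that point in Python; the toy design is an assumption, not the lecture's data. The contrast vectors (A+B)-(C+D) and A-B are orthogonal as vectors, but if the A and D regressors overlap in time, the two contrast estimates become correlated across voxels, so selecting on the first biases the second even in pure noise.

```python
# Hedged sketch (assumed toy design): orthogonal contrast vectors do not
# guarantee independent contrast estimates when regressors are correlated.
import numpy as np

rng = np.random.default_rng(2)
n_time, n_voxels = 200, 20000

# Four regressors; make A and D correlated (e.g., overlapping events).
base = rng.normal(size=(n_time, 4))
X = base.copy()
X[:, 3] = 0.7 * base[:, 0] + 0.7 * base[:, 3]   # D shares variance with A

Y = rng.normal(size=(n_time, n_voxels))          # pure-noise "voxel" data
betas, *_ = np.linalg.lstsq(X, Y, rcond=None)    # GLM fit: betas is 4 x voxels

c_select = np.array([1, 1, -1, -1]) @ betas      # main effect (A+B)-(C+D)
c_test = np.array([1, -1, 0, 0]) @ betas         # A - B

selected = c_select > np.percentile(c_select, 95)
print("mean A-B in selected voxels: %.3f" % c_test[selected].mean())  # biased > 0
print("mean A-B in all voxels:      %.3f" % c_test.mean())            # ~ 0
```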
This basic circularity phenomenon is one of the other issues that was raised in
the voodoo correlations debate.
What's shown here are reported brain-behavior correlations from the literature.
And as you can see, they range from about .2 to correlations close to 1,
which seem like fairly large effect sizes.
And what the authors here did is they broke those down into correlations in which they
thought that the tests were independent of the voxel selection criteria and
those that were not independent.
And so you see independent in green and non-independent in red.
And that's one estimate of what the inflation of the apparent
effect size due to the circularity or non-independent testing might be.
So here are some solutions.
One solution that we're going to advocate
quite a bit later in the course is data splitting.
Hold out independent test data if you actually want to estimate effect sizes.
And we should want to estimate effect sizes.
So this means perhaps holding out a subset of participants for
a later exact test of the findings that you report.
And also maybe holding out runs
if you're interested in making inferences within an individual person.
Another solution is a single hypothesis test of the model.
If you develop a model or
a pattern across regions that you can integrate into a single test,
then doing just one test on new independent data
gives you an unbiased estimate of how big that effect is.
We'll look at an example of this later.
And this principle goes beyond voxel selection to encompass
all kinds of model-building.
Whether you're designing a fancy connectivity or dynamic causal model or
a predictive model, which includes multimodal data, or anything else really,
the same principles hold.
One really effective strategy, which we'll talk about later, is called cross-validation.
It's an efficient data-splitting strategy
in which model development is done on one subset of the data, the training data,
and testing is done on another subset of the data, systematically.
And we'll talk about that
when we talk about machine learning later in the course.
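To make that concrete, here's a minimal sketch with scikit-learn on synthetic data; the features, model, and fold count are illustrative assumptions, not the course's pipeline. It shows a single held-out split, and then 5-fold cross-validation, where every observation serves as test data exactly once.

```python
# Minimal sketch (synthetic data, scikit-learn) of data splitting and
# cross-validation: hold-out test set, then systematic 5-fold splitting.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 50))                        # e.g., 100 scans x 50 features
y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)  # outcome tied to feature 0

# 1) Simple hold-out: fit on training data, estimate accuracy on held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# 2) 5-fold cross-validation: the same splitting idea, applied systematically.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("cross-validated accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```

With real fMRI data you would typically split by run or by participant (for example, with GroupKFold) so that the held-out folds are truly independent of the training data.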
So let's talk a little bit more about selection bias in a more concrete way.
And let's look at how some forms of bias can combine with voxel
selection bias to multiplicatively increase false positives.
So for example, if I test two contrast maps and
I've got voxel selection bias in each contrast map,
I get twice the false positives and a corresponding increase in apparent effect size.
If I do two experiments, I get twice the false positives.
This doesn't mean that we should correct for
multiple comparisons across every test that we've done.
But what it means is we need to be mindful of this when we interpret the effects.
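As a back-of-the-envelope illustration, with a threshold and voxel count that are assumptions rather than figures from the lecture, the expected number of false-positive voxels simply scales with how many uncorrected maps or experiments you test:

```python
# Illustrative arithmetic (numbers assumed): expected false positives scale
# with the number of maps or experiments tested at the same threshold.
alpha = 0.001            # per-voxel threshold
n_null_voxels = 50000    # voxels with no true effect in each map

for n_maps in (1, 2, 4):
    expected_fp = n_maps * n_null_voxels * alpha
    print(f"{n_maps} map(s): ~{expected_fp:.0f} expected false-positive voxels")
```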
So let's look at our four levels of bias, publication, experiment,
model selection, and voxels and tests, and how they play out in neuroimaging.
So this is an illustration of the file drawer problem.
And it's not from neuroimaging, but I think it's quite illustrative.
And what we're looking at is studies of antidepressants that
have been submitted to the British Medical Council.
And across the five drugs, we're looking at the effect sizes:
the y-axis shows the effect size of the antidepressant.
And so those are pretty high when we look at only the published studies.
But the nice thing about these drug studies is that there's a national
registry where they have to have all the data submitted.
So they can go back to the unpublished data as well and
look at all the studies that have been submitted to the registry.
And we see those in red.
And what you see here is that, across the drugs, the effect sizes in
all of the studies are substantially lower than they are in the published studies alone.
And that's an example of the file drawer problem in action.
So here's the next section, which looks at
flexibility in experiments and in the model.
This flexibility in choosing which experiments, which models, and
which outcomes you want to look at after you've observed some effects,
trying to optimize the chances of getting nice-looking effects for
publication, has been referred to as p-hacking in the literature.
And there are tests for p-hacking now, that people are interested in doing.
So this is one influential paper where they discuss this.
And I think this is the paper where they coined the term.
So Simmons, Nelson, and Simonsohn.
And they point out that researchers have millions of decisions to make:
whether to collect more data, which outliers to exclude,
which measures to analyze, which covariates to use.
And in neuroimaging, which types of preprocessing, modeling, and correction to apply.
And the idea of p-hacking is
that you make the analysis decisions as the data are being analyzed.
And because you want to create findings that you can publish,
that creates a bias just like voxel selection bias.
Some of the red flags for p-hacking are the use of median splits in the data,
splitting into high and low responders when it's actually a continuous distribution.
Why not use the continuous scores?
Unconventional analysis choices or internally inconsistent analysis choices.
So maybe the researchers looked through lots of different possibilities and
they just picked the thing that worked the best.
P-values close to the threshold of 0.05
are another of the red flags that they pointed out.
Not everybody's P-value can be just under 0.05.
That's a flag for some effect that really wasn't significant and
then you're trying to get it to be more significant.
And then finally, unusual numbers of subjects without explanation.
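In the spirit of Simmons, Nelson, and Simonsohn, here's a hedged Python simulation; the specific analysis variants are made-up examples. On pure null data, a single pre-specified test is false-positive about 5% of the time, but reporting whichever of a few analysis variants "works" (two outcomes, their average, and optionally collecting more data) pushes the rate clearly above the nominal 5%.

```python
# Hedged sketch of how analytic flexibility inflates false positives: on null
# data, try several variants and count a "finding" if any gives p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_sims, n_per_group = 2000, 20
hits_fixed, hits_flexible = 0, 0

for _ in range(n_sims):
    g1 = rng.normal(size=(n_per_group, 2))   # two outcome measures, group 1
    g2 = rng.normal(size=(n_per_group, 2))   # group 2; no true difference

    pvals = []
    # Variants 1 and 2: test each outcome separately.
    for k in (0, 1):
        pvals.append(stats.ttest_ind(g1[:, k], g2[:, k]).pvalue)
    # Variant 3: test the average of the two outcomes.
    pvals.append(stats.ttest_ind(g1.mean(1), g2.mean(1)).pvalue)
    # Variant 4: "collect more data" only if nothing worked, then re-test.
    if min(pvals) >= 0.05:
        g1b = np.vstack([g1, rng.normal(size=(10, 2))])
        g2b = np.vstack([g2, rng.normal(size=(10, 2))])
        pvals.append(stats.ttest_ind(g1b[:, 0], g2b[:, 0]).pvalue)

    hits_fixed += pvals[0] < 0.05          # single pre-specified test
    hits_flexible += min(pvals) < 0.05     # report whatever "worked"

print("false positive rate, pre-specified test: %.3f" % (hits_fixed / n_sims))
print("false positive rate, flexible analysis:  %.3f" % (hits_flexible / n_sims))
```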
This is a view of the preprocessing pipeline.
And this is a paper by Josh Carp from a few years ago.
And what he did is he analysed the same data set many different ways.
In fact, he analysed them 34,000 different ways.
Just by picking different analysis steps and several strategies for
each analysis step.
So he ended up with almost 7,000 unique analysis pipelines,
each crossed with five multiple comparisons correction strategies, leading to about 34,000 maps.
And so this is the mean activation across all of the analysis maps,
and also the range.
So the point is that there's a lot of variability
according to these analysis pipeline choices.
Some of them are better than others.
But the solution here is to really be principled and
consistent in your analysis choices.
It's not bad to have a good pipeline or
to change things about your pipeline to make it better.
But we should really make those choices in advance of looking at the results
as much as possible.
So here finally are some do's and don'ts in terms of
what we should do to avoid selection bias problems.
So one don't is thoughtless analysis.
I don't want to give you the idea that you should just choose everything in
advance, run one analysis, and then be done and not look at your data.
It's really important to explore the data,
to examine the data, examine the assumptions.
The point is to get the right answer.
Not just to get the answer that we wanted in the first place.
And getting the right answer really requires looking at the data and
making some smart choices.
And sometimes we do have to change what we do based on what the data
actually looked like.
Don't do uncorrected exploratory analysis with strong conclusions.
So we can do those analyses, but don't go to town and sell the story, and
yourself, on a finding that comes from an uncorrected exploratory analysis.
Do really work in advance to choose principled a priori hypotheses.
We talked about using meta-analysis to do that, and that's in the next module.
And try to conduct adequately powered studies, which often requires a lot of
investment and resources.
And don't make informal reverse inferences; there are a lot of spurious inferences that can come from that.
But do learn the techniques to make quantitative reverse inferences and
really use those to understand what your brain maps are telling you.
So that's the end of this module, thanks for listening.