0:14
This is the difference between covered and non-covered people, or units.
It is easier to think about when we look at this graph.
Let's say this blue square is our target population.
Ideally, you have the frame population be exactly the same size, and
everything is fine: no coverage error.
But often the frame sits slightly off,
and the covered population is really just a part of it.
What can this be?
Let's say we are interested in all people in the United States above
0:48
age 18 in households.
It used to be the case that we could use lists of phone numbers, at least for
a little while, when phone books were good and had complete coverage of
a target population, or at least of all households with a phone.
But increasingly that is a problem:
the telephone frame does not necessarily cover the entire population.
There are ineligible units:
phone numbers that belong to different establishments or
are empty, or there may even be multiple cases.
And then there are many cases that are not covered
by that particular telephone frame.
And just think of cell phones.
They would not be, or often are not, part of listed telephone numbers.
But other settings would have similar issues here.
So the total survey population can be divided into those covered and
those not covered by the frame.
Again, if we look at this in equation notation,
we now have subscripts C and U for covered and undercovered.
Those are the two pieces, and each of them has a proportion:
the covered divided by the total N is the fraction of covered people, and
the undercovered divided by the total N is the fraction of undercovered people.
Each fraction is multiplied by the average value
that is of interest to us, for the covered and the undercovered respectively.
The sum of these two gives us the average value of Y for all of our N cases.
You can rewrite that, and you will see that there is an error appearing here:
a difference between Y bar subscript C and
Y bar subscript N. The error that you see here is the difference
between the covered and the uncovered on that particular Y variable,
2:43
multiplied by the fraction of undercovered over total N.
So, the proportion of undercovered people times that difference in values.
So we can think of this as an undercoverage rate
times the difference between the means for the covered and the uncovered cases.
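Written out, the decomposition just described might look like the following. The slide is not reproduced in this transcript, so the notation is an assumption: N is the number of units in the target population, C and U are the numbers of covered and undercovered units, and the Y bars are the corresponding means.

```latex
\begin{align*}
\bar{Y}_N &= \frac{C}{N}\,\bar{Y}_C + \frac{U}{N}\,\bar{Y}_U,
  \qquad C + U = N \\
\bar{Y}_C - \bar{Y}_N
  &= \bar{Y}_C - \frac{C}{N}\,\bar{Y}_C - \frac{U}{N}\,\bar{Y}_U
   = \frac{U}{N}\left(\bar{Y}_C - \bar{Y}_U\right)
\end{align*}
```

The last step uses 1 - C/N = U/N, which is why the bias is the undercoverage rate times the difference between the two means.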
3:02
So, how big that bias is depends on two elements,
or can be influenced by two elements, the rate and that difference.
If the difference is zero, then the rate doesn't matter.
If the difference is large, then even a small rate will cause a problem.
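A minimal Python sketch of how the rate and the difference interact. The function names and all numbers are hypothetical and only serve as illustration.

```python
def population_mean(n_covered, n_uncovered, mean_covered, mean_uncovered):
    """Mean of Y over the full target population (covered plus uncovered)."""
    n_total = n_covered + n_uncovered
    return (n_covered * mean_covered + n_uncovered * mean_uncovered) / n_total


def coverage_bias(n_covered, n_uncovered, mean_covered, mean_uncovered):
    """Undercoverage rate times the covered-minus-uncovered difference in means."""
    undercoverage_rate = n_uncovered / (n_covered + n_uncovered)
    return undercoverage_rate * (mean_covered - mean_uncovered)


# Hypothetical scenarios: (covered count, uncovered count, covered mean, uncovered mean)
scenarios = {
    "no difference, high rate": (500, 500, 50.0, 50.0),
    "large difference, small rate": (950, 50, 50.0, 10.0),
    "moderate difference and rate": (800, 200, 50.0, 40.0),
}

for label, args in scenarios.items():
    print(label, coverage_bias(*args))
```

In the first scenario the bias is zero even though half the population is missing from the frame; in the second, a 5 percent undercoverage rate still shifts the estimate because the two groups differ so much.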
3:24
It is interesting to think of sampling bias versus sampling variance; they are two different things.
Sampling variance is just the pure variation from one
realization of a sample to the next, due to slightly different cases
being in each of the samples that are drawn from the population.
This is what is most commonly measured in statistics.
In surveys, confidence intervals and
standard errors give us a quantification of that source.
Sampling bias, however, appears when I have a consistent failure
to estimate a portion of the population.
Right?
So this would be a portion of the population, like the military.
4:30
have in my survey and on average,
the mean of the means of each of these sampling distributions
will give me the correct mean value that I have in the population.
So there's variation by age, and you've probably
seen this when you covered the central limit theorem in your Stats 101 course.
We have a little segment just showing animations for this particular
piece, but if you already know all of that then, of course, there is no need to look at that.
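For those who prefer code to animations, here is a small simulation sketch of sampling variance. The population, the sample size, and the number of repetitions are all made up for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 values of some variable Y.
population = [rng.gauss(50, 10) for _ in range(10_000)]
true_mean = statistics.fmean(population)

# Draw many simple random samples (without replacement) and keep each mean.
sample_means = [
    statistics.fmean(rng.sample(population, 100)) for _ in range(2_000)
]

mean_of_means = statistics.fmean(sample_means)   # close to true_mean
spread_of_means = statistics.stdev(sample_means)  # empirical standard error

print("population mean:", round(true_mean, 2))
print("mean of sample means:", round(mean_of_means, 2))
print("spread of sample means:", round(spread_of_means, 2))
```

Each individual sample mean differs from the population mean, but the average over many realizations sits very close to it. The spread of the sample means is what a standard error tries to quantify.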
5:20
Sampling bias, however, comes from people who are on the frame but, for
some reason, have a zero probability of being selected.
A very important distinction here.
And that of course would create a sampling bias.
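The same kind of simulation can illustrate this point. In this sketch a hypothetical subgroup on the frame is never selected; the group sizes and values are assumptions chosen only to make the effect visible.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical frame: most units can be selected, but one subgroup
# has zero selection probability and differs on the variable Y.
selectable = [rng.gauss(50, 10) for _ in range(9_000)]
never_selected = [rng.gauss(70, 10) for _ in range(1_000)]
frame = selectable + never_selected
frame_mean = statistics.fmean(frame)

# Every sample is drawn only from the selectable part of the frame.
sample_means = [
    statistics.fmean(rng.sample(selectable, 100)) for _ in range(2_000)
]
mean_of_means = statistics.fmean(sample_means)

# The offset does not average away over repeated samples.
bias = mean_of_means - frame_mean
print("frame mean:", round(frame_mean, 2))
print("mean of sample means:", round(mean_of_means, 2))
print("bias:", round(bias, 2))
```

Unlike sampling variance, this offset stays in place no matter how many samples are drawn, because the excluded group never has a chance to enter any of them.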
5:38
Non-response error then arises in the step between sample and respondents.
The value of the statistic is computed based on the respondent data,
and that can differ from the value for the entire sample if we have missing data.
Missing data can come in two forms.
We can miss entire units: the non-respondents.
Think of this piece of the picture here as our entire set of sampled cases;
in each row you have the values for one individual on particular items.
There are some values on the frame data available for everybody, and
then we have interview data.
But those are only available for the respondents.
For the non-respondents, entire units are missing.
The interview data can have missing values as well.
Those we call item missing data.
And that can be a measurement problem.
So this arrow actually spans both of these graphs.
6:34
Just like we saw for coverage error, for non-response error
the total sample can be thought of as being divided into respondents and non-respondents.
And for each of them,
you have an average value of the variable that is of interest to you.
You can form a ratio of respondents over everybody,
and of the non-respondents, denoted here with m
6:56
with subscript s, which is the sample that we're dealing with.
And so the non-response rate, together with the difference between the mean
of the respondents and the mean of the non-respondents,
gives us a sense of non-response error.
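A small sketch of that relationship on made-up data. In a real survey the non-respondents' values are unknown; here they are invented so that the respondent mean can be compared with the full-sample mean.

```python
from statistics import fmean

# Hypothetical sample: (value of y, responded?). The non-respondents'
# values are made up purely for illustration.
sample = [
    (52, True), (48, True), (55, True), (50, True), (45, True), (60, True),
    (30, False), (35, False),
]

respondent_values = [y for y, responded in sample if responded]
nonrespondent_values = [y for y, responded in sample if not responded]

mean_respondents = fmean(respondent_values)
mean_nonrespondents = fmean(nonrespondent_values)
mean_full_sample = fmean([y for y, _ in sample])

nonresponse_rate = len(nonrespondent_values) / len(sample)

# Two ways to get the same quantity:
direct_error = mean_respondents - mean_full_sample
rate_times_difference = nonresponse_rate * (mean_respondents - mean_nonrespondents)

print("non-response rate:", nonresponse_rate)
print("respondent mean minus full-sample mean:", direct_error)
print("rate times difference in means:", rate_times_difference)
```

Both printed error values agree up to floating-point rounding, which mirrors the structure seen for coverage error: a rate multiplied by a difference in means.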
And finally, there's adjustment error.
Although post-stratification or
other forms of adjustment are supposed to correct any problem,
this correction can be done erroneously and therefore create error itself.
7:28
Key notions from this segment that you should be learning: variable errors and
systematic errors.
These are two concepts that should be clear in your mind, and we hope that the quizzes that we
have, and other material in the readings, will help you fully understand that issue.
7:53
Mind you, there are no good or bad surveys.
There's only a good or
bad survey statistic; the errors are a property of a statistic,
that is, of a particular estimate for a single variable or a model:
a mean, a proportion, or a regression coefficient, for example.
That means that you can have, for the same survey,
some statistics with a large error, and others with a very small error.