0:15

So the package I'm going to use is the R survey package, which has a lot of

capabilities.

This is the one that's written by Thomas Lumley at the University of Auckland

in New Zealand.

And there's a data set in there called academic performance index,

api, which I'll use.

So I require the survey package and

then I tell R that this is the data I want, by saying data(api).

0:45

And then, you define a design object.

So, with any of the software we're going to use to handle survey data,

you have to tell it what the design features are.

I'll talk more about this in course six,

the meaning of the survey design function and

how you do it in other packages, but I'll sketch it here.

The first thing you need to tell R is what the first stage units are.

So the parameter in R is id.

In this case, I'm saying that dnum

is the psu or cluster definition.

And in this case, dnum is short for district number.

2:32

And notice that the first stage unit is specified as a formula.

The weights are a formula too, and so is the fpc;

the data set itself is the only argument specified without a tilde.
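Putting those pieces together, the design call described here can be sketched as follows; the weights column pw and the fpc column are the ones that ship with the apiclus1 data in the survey package's api example, so check the column names in your own data.

```r
library(survey)  # Thomas Lumley's survey package
data(api)        # loads apiclus1 and related api data sets

# One-stage cluster design: districts (dnum) are the PSUs.
# pw is the sampling weight and fpc the finite population correction,
# both columns of apiclus1; note the tildes making each a formula.
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc,
                    data = apiclus1)
```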

Now, the next thing you have to do is specify the totals for

the population auxiliary variables that I'm going to use.

So what I've done here is I've created a data frame using this data.frame statement.

The first column is going to be school type.

So the labels for that are E, H, and M which stands for

elementary, high school, and middle.

And these are just different grade ranges that are used in the US.

And then, I get the count of schools, 4421, 755,

1018 in those three school types.
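As a sketch, the control-total data frame described above looks like this; postStratify expects the population counts in a column named Freq.

```r
# Population counts of schools by school type (stype):
# E = elementary, H = high school, M = middle
pop.types <- data.frame(stype = c("E", "H", "M"),
                        Freq  = c(4421, 755, 1018))
```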

Now, how do I post stratify?

I just invoke this command postStratify.

3:38

Notice here, that that's a capital S.

R is case-sensitive.

So if you used a lower case s there,

it would bark at you, saying it couldn't find the function.

So you've got to be careful to see exactly how your function name is spelled.

So I'm operating on this design object, dclus1.

I am post-stratifying by this index,

stype, and notice that's a formula again, tilde in the front.

And then I give it the control totals here, pop.types,

which I defined back in the previous line.

And that's all there is to it.

It goes through, creates these post-stratified weights,

and saves that information into this new

object called dclus1p, p for post-stratification.
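That step can be sketched as the following call, assuming the design object dclus1 and the totals pop.types from the earlier steps:

```r
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)
pop.types <- data.frame(stype = c("E", "H", "M"), Freq = c(4421, 755, 1018))

# Post-stratify on school type using the population counts;
# note the capital S in postStratify.
dclus1p <- postStratify(dclus1, strata = ~stype, population = pop.types)
```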

So just take a look at the weights that came out of this.

What I've done is I've rbind-ed a summary of the weights for

the non-post-stratified design object dclus1,

and then the weights for the post-stratified dclus1p object.

So this weights function right here is an extractor kind of function.

It'll pull the weights out of that design object and show them to you.
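That weight comparison can be sketched as:

```r
# Stack the two weight summaries: unadjusted (dclus1) on the first
# row, post-stratified (dclus1p) on the second.
rbind(summary(weights(dclus1)),
      summary(weights(dclus1p)))
```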

5:11

So what have I got? If I look at the first row here,

that's the non-post-stratified object.

You see all the weights are the same.

So I've got an equal probability sample.

In the second row, those are the post-stratified weights, and

you can see those are spread out from 30.7 up to 53.93.

And why is that? It's because the sample itself is not

proportionally allocated among these school types.

So post stratifying, in a sense, corrects that and

we hope that, that will reduce variances.

If I had coverage errors, we also hope that it will reduce those.

5:57

So let's look at a couple of results just to see how the point estimates can change.

The first thing that I call for

here is the mean for a variable called enroll;

svymean is the function that I use to do that.

enroll is the variable, and it has a tilde there again.

It has to be specified as a formula, and here's the design object, dclus1.

So this is the before-post-stratification version of this.

I get a point estimate of 549.72 students

per school enrolled, standard error 45.19.

If I do the same thing on the post-stratified object,

you can see my point estimate of the mean changed some.

I'm up to 594.27, and the standard error got bigger too.
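The two mean estimates can be sketched as below; the quoted numbers are the values read out in the lecture.

```r
# Mean enrollment per school, before and after post-stratification
svymean(~enroll, dclus1)   # mean 549.72, SE 45.19 (per the lecture)
svymean(~enroll, dclus1p)  # mean 594.27, with a larger SE
```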

7:00

Now, let's take a look at the total.

But first note this: things can go in any direction.

If I compare the standard error before post-stratification and

after, post-stratifying actually made things worse in terms of the standard error.

The mean got bigger, so the coefficient of variation could actually be smaller, but

in this case it's not.

But there's no guarantee that you're going to improve estimates

of the mean with post stratification or of the total, although you may.

So let's look at the total, and

we can do that with svytotal on enroll again, same variable.

So here, the answer's there, and

I've got a total of about 3.4 million,

standard error of 932 thousand and some.

And if I use the post stratified version dclus1p,

then what I get is this line right here.

And so you see the total changed a bit, not tremendously,

but the standard error did change quite a lot.

If I compare these two values:

before post-stratification, I was at 932,000;

after post-stratification, I go to 406,000.

So I cut the standard error by over 50% by post-stratifying, and

that's on the estimated total.
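The two total estimates can be sketched as below; again, the quoted numbers are the values read out in the lecture.

```r
# Total enrollment, before and after post-stratification
svytotal(~enroll, dclus1)   # about 3.4 million, SE about 932,000 (per the lecture)
svytotal(~enroll, dclus1p)  # total changes only a bit; SE about 406,000
```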

Now, I can look at cvs, and the function that will

do that is called cv, lower case, in the survey package.

So here, I just collect together the coefficient of variation for

the mean of enrollment, first from the dclus1 object, not post-stratified,

and then from the post-stratified object.

So you see right here, I go from a cv of 0.0822 to 0.1103,

so either in terms of standard error or

cv, I made things worse by post-stratifying here.

If I look at totals here, on the other hand, and

compare the post-stratified and the non-post-stratified objects,

I go from .2737 or so to .1103.

In other words, I gained quite a lot

in terms of cv and standard error by post stratifying.
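Those cv comparisons can be sketched as below; cv here is the survey package's extractor applied to an estimate object.

```r
# Coefficients of variation, before and after post-stratification.
# Per the lecture, the cv of the mean gets worse while the cv of
# the total improves.
cv(svymean(~enroll, dclus1));  cv(svymean(~enroll, dclus1p))
cv(svytotal(~enroll, dclus1)); cv(svytotal(~enroll, dclus1p))
```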

So notice also,

that these two are the same.

The cv on the mean and the total.

Now, why is that?

10:06

It's because, when I divide by the sum of the weights to get the mean,

post-stratification forces the sum of the weights

to equal the population count.

Now, that's going to be true in every sample, so there's no extra variation.

After I post-stratify, that estimated population count is like a constant.

So what that leads to is that the standard error, relative to what you're estimating,

is exactly the same for the mean and the total.

10:47

Now, how do you decide whether this post-stratifying is a good idea or

not? There are different ways of doing it.

But one way to think about this is every estimator has an implied model behind it.

And by model, I mean a structural model that relates y to whatever

covariates you're using in your estimator.

So in the post stratification case, it's really simple.

A common mean in every poststratum, call it beta sub gamma, and a common variance for

every element in a given poststratum, call it sigma squared sub gamma.

11:35

Think of this as the way I would predict an additional

school within poststratum gamma: I take the mean of what I saw and

predict the next value would be equal to that mean.

If the common mean is a good predictor, then you get variance reduction.

If it's not, you don't and that's what this bullet says.

The one thing about the post stratified estimator is it'll be approximately

design-unbiased, meaning in repeated sampling, if you do it over and

over again, you'll average out to the right thing even if the model is wrong.

13:07

So suppose I cross age by gender, and that creates a number

of post strata, the number of age groups times the two or three gender categories.

But suppose that you had two other variables that you should've considered.

Race-ethnicity and income level.

Because those are good predictors of the y's that you're analyzing from your data.

Then, you'll have the wrong model, and

post-stratification will be less efficient than it could be.

How do you take up the slack for that, or improve your estimator?

You could think about using raking, where you include race-ethnicity and

income level as margins to rake to.

Or you could use GREG which, if you had a quantitative

income value, would accommodate both that and the

categorical age, gender, and race-ethnicity variables.

So, you can make post-stratification fairly flexible, but

if you do want to include both qualitative and quantitative variables,

then your best choice may be this thing called GREG.
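As a sketch of the raking idea on the same api data, the survey package's rake function takes the design object plus lists of sample and population margins. The stype counts below are the ones from the lecture; sch.wide is just a convenient second categorical variable in apiclus1, and its Freq split is a made-up illustration, not a real population figure.

```r
library(survey)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# Margin totals: stype counts are from the lecture; the sch.wide
# split is hypothetical, chosen only so both margins sum to 6194.
pop.stype   <- data.frame(stype    = c("E", "H", "M"),
                          Freq     = c(4421, 755, 1018))
pop.schwide <- data.frame(sch.wide = c("No", "Yes"),
                          Freq     = c(1194, 5000))

# Iteratively adjust the weights so both margins match their totals
draked <- rake(dclus1,
               sample.margins     = list(~stype, ~sch.wide),
               population.margins = list(pop.stype, pop.schwide))
```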