0:06

Welcome back to course six on combining and analyzing survey data.

In this module, we'll continue basic estimation with another example,

using R to show you how to define a more complex sample

design than in the last video.

So what we're going to do is to use another dataset out of

the R PracTools package.

And this one's called nhis.large,

again it's from the US National Health Interview Survey here,

but it's the full sample from that dataset.

I'm not treating this as a population.

It's got 21,588 persons.

It's got 75 strata, and 2 PSUs per stratum, so a total of 150 PSUs.

So you can see that the number of persons for

analysis is a lot bigger than the number of PSUs.

So there's a substantial amount of clustering here, and,

as in the simpler example that we saw earlier,

we've got to define a design object for R, so it knows how to handle the data.

So the first thing I do is require the PracTools package, so I can get the data.

I require the survey package, so I can analyze it.

And then, I use the data function to specify which dataset I'm going to use.
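As a minimal sketch of the setup just described, assuming the PracTools and survey packages are already installed, the loading steps might look like this. The quick checks at the end are an addition for illustration, not part of the lecture code.

```r
# Load the packages and the dataset described above.
library(PracTools)  # supplies the nhis.large dataset
library(survey)     # design-based estimation functions

data(nhis.large)

# Quick checks against the counts quoted in the lecture
nrow(nhis.large)                                  # number of persons
length(unique(nhis.large$stratum))                # number of strata
nrow(unique(nhis.large[, c("stratum", "psu")]))   # number of stratum-by-PSU combinations
```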

1:34

Now, this is a multi-stage survey, so I've got PSUs.

And the first-stage sampling unit is called psu.

So I specify that here in the ids parameter.

In the strata parameter, I give it the name of the stratum variable, which is stratum.

This is a field in the dataset.

And then the survey weight is called svywt.

Now note that R expects these things to be formulas, so

you put a tilde in front of the field name.

2:04

You can use more complicated expressions to define these things sometimes,

but this is fairly easy with one field.

We tell it that the data is nhis.large.

And then another parameter is nest = TRUE. What that means is

the PSUs are not numbered consecutively across the whole dataset;

they're renumbered within strata: one, two, one, two, one, two, and so forth.

If you leave out the nest statement, survey will actually detect the fact that

they're not numbered consecutively and it will suggest you use the nest statement,

so you'll hear about it if you don't put it in initially.
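Putting those parameters together, the design object could be defined roughly as follows. This is a sketch: the object name nhis.dsgn is an assumption, since the transcript does not say what the design object is called.

```r
library(PracTools)
library(survey)
data(nhis.large)

# One-sided formulas (leading ~) name columns of nhis.large.
# nhis.dsgn is an assumed name for the design object.
nhis.dsgn <- svydesign(
  ids     = ~psu,        # first-stage sampling units
  strata  = ~stratum,    # design strata
  weights = ~svywt,      # survey weights
  data    = nhis.large,
  nest    = TRUE         # PSU codes repeat within each stratum
)

nhis.dsgn  # prints a short description of the design
```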

Now, what kind of variance estimator are we going to get, given

the amount of information we provided?

Survey is going to use the ultimate cluster variance estimator,

which assumes that PSUs are selected with replacement.

That's the default. If we were able to specify more detailed information,

then the survey package has other variance estimators available.

But this is kind of typical in a public use dataset, where the only choice you've

got is to use the ultimate cluster with-replacement variance estimator.
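For reference, one common textbook way to write the ultimate cluster with-replacement variance estimator for an estimated total is sketched below. The notation is an assumption, not taken from the lecture: h indexes strata, i indexes PSUs within a stratum, n_h is the number of sample PSUs in stratum h (2 in this dataset), s_hi is the set of sample persons in PSU hi, and w_k and y_k are the weight and analysis value for person k. For proportions and other nonlinear estimates, the same form is usually applied to linearized values rather than to y itself.

```latex
v(\hat{t}) = \sum_{h=1}^{H} \frac{n_h}{n_h - 1}
             \sum_{i=1}^{n_h} \left( \hat{t}_{hi} - \bar{t}_h \right)^2,
\qquad
\hat{t}_{hi} = \sum_{k \in s_{hi}} w_k y_k,
\qquad
\bar{t}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} \hat{t}_{hi}
```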

3:29

So we'll do a table of proportions, and the variable I'm going

to analyze is something called delayed medical care because of cost.

It's an indicator variable:

whether a person delayed getting medical treatment for something

because it was too expensive in the prior year, yes or no.

So we'll do a table by age.

So to do that, I use the svyby function.

4:03

I send the first parameter, which is a formula; that's our analysis variable.

And to make sure it treats that yes/no variable as a factor,

I say factor here, and then delay.med in parentheses.

So factor is a function that's receiving delay.med.

And then the stub of the table is going to be age groups.

So there's a variable called age.grp in the file, I used that.

The function FUN here is svymean, so I specify that.

There are other possibilities, svytotal, for example.

You tell it the design object, which I just created.

And then, and this is critical, you include this na.rm=TRUE,

which means if the analysis variable or

the stub of the table has missing values, just take those out.

Otherwise, you're not going to get a table.

Now, the survey package does not tabulate those missings separately.

It might be nice if it did, but it doesn't; you'd have to code them as

something other than NA in order to get those to be tabulated.

So I save all that in an object called age.mns, mns for means.

And then it turns out the two columns out of this object that I want to look at for

the proportion and the standard error are the second and the fourth.

So I'm extracting those here.

And then just to make my table a little more readable, I specify rownames and

colnames for this age.mns object,

the second and fourth columns, which is what I extracted, and

then I print those out here with the round function, rounded to four decimal places.
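The table steps just described can be sketched as below. Assumptions: the design object name nhis.dsgn, the helper names res and age.tab, and the row and column labels are all chosen here for illustration. The sketch converts the svyby result to a plain data frame before picking out columns 2 and 4, which is a cautious choice rather than something the lecture states.

```r
library(PracTools)
library(survey)
data(nhis.large)

# Assumed name for the design object
nhis.dsgn <- svydesign(ids = ~psu, strata = ~stratum, weights = ~svywt,
                       data = nhis.large, nest = TRUE)

# Proportions of each delay.med level within each age group
age.mns <- svyby(~factor(delay.med), by = ~age.grp, FUN = svymean,
                 design = nhis.dsgn, na.rm = TRUE)

# Column 2 is the proportion and column 4 the standard error
# for the first level of delay.med (the "yes" code)
res <- as.data.frame(age.mns)
age.tab <- as.matrix(res[, c(2, 4)])
rownames(age.tab) <- paste0("age.grp=", res$age.grp)  # illustrative labels
colnames(age.tab) <- c("prop", "SE")                  # illustrative labels

round(age.tab, 4)
```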

So you can see in the proportion column here, the proportions are

lower for young people under 18 years old, and older people,

65 or more, than they are for people in the working years.

And the reason for that is that in the US the old and the young tend to have

medical insurance at a higher rate than the working age people.

So because they've got insurance, they tend not to delay treatment.

So here are the standard errors.

You can see they're a bit different.

And those are the with-replacement standard errors.

6:44

Now, just for comparison, let's think about what would happen if

we just ignored the sample design and assumed I had a simple random

sample with no weights, where everybody's got a weight of one.

So I'll do that by hand essentially.

I'm going to save my output in an object called age.mns.srs.

And I'm just using the by function here.

So in this string here, I'm taking

nhis.large$delay.med, and I've got a dollar

sign in there to separate the object name from the field within it, the column.

I subtract two and take the absolute value. I did that

because delay.med is coded as one or two, for yes or no.

So if I subtract two and take the absolute value, I recode it to zero-one,

which is how I want to deal with it.
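To see what that recode does, here is a tiny base R illustration on made-up values. The toy vector and the name delayed01 are hypothetical and only there to show the arithmetic.

```r
# Toy values only: 1 = yes, 2 = no, NA = missing
delay.med <- c(1, 2, 2, 1, NA)

# Subtract 2 and take the absolute value: 1 becomes 1, 2 becomes 0, NA stays NA
delayed01 <- abs(delay.med - 2)
delayed01

# The mean of a zero-one variable is the proportion of ones
mean(delayed01, na.rm = TRUE)
```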

Now, another parameter in the by function is INDICES.

That's just the stub of the table again.

So I say age.grp again.

The function I'm going to do is just the simple mean and

I also say take the missing values out.

And then, for the standard error, I compute that by hand,

the standard error of a proportion.

So this age.mnsB,

I haven't shown you the separate line of code, but it's this thing in the braces.

It's the recode of delay.med to a zero-one variable.

So I'm taking the mean of the zero-one variable times 1 minus that.

8:42

And I'm dividing by a table

of the counts in the stub, age.grp.

So I should say this age.mnsB is the proportion

who delayed medical care in a table, and

then I round that and combine a couple of things.

So cbind means put two columns together.

9:14

So my first column is the age.mns.srs,

the simple random sample proportions divided by the complex

sample estimates which I saved in age.mns[, 1].

So that's just taking the ratio of the estimated proportions.

So I can see whether using weights made any difference there.

And I named that ratio p.hats.

And then I do the same thing for the standard errors.

So here's the standard error for the srs version which I just computed up here.

And here are the standard errors which I extracted

from the complex sample estimate object.

And I round those to two decimal places, so we don't have as much to look at.
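The whole comparison can be sketched as follows. Several names here are assumptions for illustration (nhis.dsgn, delayed01, p.srs, n.grp, se.srs, ratios, std.errs), and the srs standard error is written as sqrt(p(1 - p)/n). One choice in this sketch that the lecture does not spell out: the group counts use only persons with a non-missing delay.med.

```r
library(PracTools)
library(survey)
data(nhis.large)

# Complex-sample estimates (design object name is assumed)
nhis.dsgn <- svydesign(ids = ~psu, strata = ~stratum, weights = ~svywt,
                       data = nhis.large, nest = TRUE)
age.mns <- as.data.frame(
  svyby(~factor(delay.med), by = ~age.grp, FUN = svymean,
        design = nhis.dsgn, na.rm = TRUE)
)[, c(2, 4)]                       # proportion and SE for delay.med = 1

# Unweighted srs-style estimates by age group
delayed01   <- abs(nhis.large$delay.med - 2)            # 1 = delayed, 0 = not
age.mns.srs <- by(delayed01, INDICES = nhis.large$age.grp,
                  FUN = mean, na.rm = TRUE)
p.srs  <- as.vector(age.mns.srs)
n.grp  <- as.vector(table(nhis.large$age.grp[!is.na(delayed01)]))
se.srs <- sqrt(p.srs * (1 - p.srs) / n.grp)             # srs SE of a proportion

# Ratios of srs to complex-sample results
ratios <- cbind(p.hats   = p.srs  / age.mns[, 1],
                std.errs = se.srs / age.mns[, 2])
round(ratios, 2)
```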

So here's what I get.

The ratios of p.hats in this column are around 1; this one's a bit off, 1.09.

But using weights doesn't make a tremendous difference in

the point estimate.

On the other hand, it makes a tremendous difference in the standard errors.

The first one, for example, the srs standard error

is 70% of the complex sample standard error, and

the other three here are also less for

srs; it's almost the same for the fifth category.

Now, that doesn't mean you ought to use the srs estimates because they're

more precise, what it means is you're getting a deceptively low

estimate of standard error by ignoring the clustering in this design.

So here you'd have,

if you were to put confidence intervals on these estimated proportions,

you'd have confidence intervals that were much too short.

They'd be like this, but they ought to be like that.

And you'd be fooling yourself that you're getting that much

precision from this complicated sample design.

11:30

Now, we can also do a test of independence here,

just to show you another analytic technique.

You might be interested in whether delayed medical care and

age are independent of each other.

Now, we saw in the table of proportions that they're pretty different for

the young and the old compared to the middle ages.

So you'd expect that this test of independence would

reject the hypothesis of independence.

Now, what happens here is we need to account for the complex sample design.

So the function called svychisq, c-h-i-s-q, is the thing you use.

And then you specify a table this way.

Begin a formula with ~ delay.med + age.grp.

Here's my design object.

There are a couple of choices of statistics.

We're going to use one that's specified by F in quotes,

which is called the Rao-Scott adjusted Pearson chi-square test statistic.

So this statistic amounts to calculating Pearson's chi-square,

which would be appropriate for simple random sampling, but

then multiplying it by something to account for the complex design.

So the function echoes back the way I called it and here's the output.

F is 48.295, numerator degrees of freedom

3.69, denominator is 276.89.

Now, notice that these are fractional which is okay.

You don't have to have integer degrees of freedom to deal with an F distribution.

So we refer that to the F table, and the software does that for us.

And the p-value's essentially zero.

So we reject this hypothesis of independence

pretty handily, in this case.
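A sketch of the test call described above, again assuming the design object is named nhis.dsgn. The name chisq.res is also an assumption.

```r
library(PracTools)
library(survey)
data(nhis.large)

# Assumed name for the design object
nhis.dsgn <- svydesign(ids = ~psu, strata = ~stratum, weights = ~svywt,
                       data = nhis.large, nest = TRUE)

# Rao-Scott adjusted Pearson chi-square, referred to an F distribution
chisq.res <- svychisq(~delay.med + age.grp, design = nhis.dsgn,
                      statistic = "F")
chisq.res

# The pieces quoted in the lecture can be pulled out of the result
chisq.res$statistic   # F statistic
chisq.res$parameter   # numerator and denominator degrees of freedom
chisq.res$p.value     # p-value
```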