We'll finish module one on Basic Estimation with an example of estimating quantiles in R.

Now, the thing with quantiles is that you need special methods in order to get precision estimates;

the formulas that we've seen earlier don't work because a quantile estimate is a nonlinear kind of statistic,

which means that you've got to use a different approach.

So, the standard approach is to compute a confidence interval on a quantile first,

and then back into a standard error, if you really want one.

So the way it's done is you calculate the length of the confidence interval,

and then you pretend it's a symmetric interval based on a t-distribution.

The length of a t-interval will be twice the multiplier from the t-distribution times the standard error,

because you're adding and subtracting for a symmetric interval.

You know that length from the calculation that one of these methods gives you for a quantile,

and then you solve for the standard error.

So that's how it goes.

So for example, if one minus alpha is 0.95 and there's a huge number of degrees of freedom,

so the t is the same as the normal,

then the standard error is estimated as the length of the confidence interval divided by two times 1.96.
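Just to illustrate that back-calculation, here's a quick sketch in R, using the first-quartile interval endpoints that come up later in this example:

    ci <- c(1.047, 7.314)                # CI endpoints, in millions
    se <- diff(ci) / (2 * qnorm(0.975))  # qnorm(0.975) is about 1.96
    se                                   # roughly 1.6 million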

Now, there are a couple of options for getting confidence intervals.

These are named after the people who invented them -- Woodruff and Francisco-Fuller --

and both those are available in the R survey package.

So let's take a look at an example.

The first thing I'm gonna do

is require the PracTools package because I want to use a dataset in there.

Then I require sampling because I'm going to select a sample from a particular dataset.

The one that I'll use here comes from the Survey of Mental Health Organizations;

it's in PracTools, and it's called smho.N874.

The 874 is just the number of hospitals in that dataset.

So I'm gonna select a pps sample just so I'll

have a sample with different weights for the units.

First, I set the measure of size equal to the number of inpatient beds in each hospital.

Now, some hospitals have zero inpatient beds because they don't treat inpatients.

So to take care of that,

I'm re-coding all the sizes that are less than or equal to 5 to be equal to 5.

That way, all cases have nonzero measures of size.
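A sketch of that setup, assuming BEDS is the name of the inpatient-bed variable in smho.N874 (check the PracTools documentation for the exact name):

    require(PracTools)        # for the smho.N874 dataset
    require(sampling)         # for pps sample selection
    data(smho.N874)

    size <- smho.N874$BEDS    # measure of size: inpatient beds (assumed name)
    size[size <= 5] <- 5      # recode so every hospital has a nonzero MOS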

I compute inclusion probabilities pk with

this function from the sampling package called inclusionprobabilities.

You know, it's a long name but it's clear what it means.

So I assume that the measure of size is size which I calculated up here,

and I'm going to pick a sample of 100.

The inclusionprobabilities function is nice because it figures out if you've got any units that are so big

they'll be selected with certainty.

So when I summarize pk, I get a small minimum of 0.007,

all the way up to one, which corresponds to units selected with certainty.
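In code, that step might look like this:

    pk <- inclusionprobabilities(size, 100)  # pps inclusion probabilities, n = 100
    summary(pk)                              # min near 0.007, max of 1 for certainties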

Now, you might not want to design a sample where the weights vary that much;

it might not be that efficient.

But this is just an example we're doing here.

Now, how do I select a pps sample?

First, I use the set.seed statement with a constant number in here.

That just starts the random number generator at a particular place so I can repeat this if I want to.

And then I use the UPsystematic(pk) function to draw a sample.

So, in whatever order the smho population is,

this will take a random start and then skip systematically down the list of hospitals and select 100.

This is after extracting those that are selected with certainty.

So I save the result of that in a vector called sam,

which is a string of zeros and ones, 874 long,

indicating whether each unit is in the sample or not.
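A sketch of the selection step; the seed value here is an arbitrary constant, and any fixed number makes the draw repeatable:

    set.seed(20150611)       # arbitrary constant for reproducibility
    sam <- UPsystematic(pk)  # 0/1 indicator vector, length 874
    sum(sam)                 # 100 units selected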

I use the getdata function to retrieve the data,

and then I'm gonna bind on the survey weights.

So I use this cbind function with the data I just extracted,

samdat, and then I append a new column that I'm gonna call svywt.

And that's just one over the selection probability.

And as you see here,

I extract only the pieces of pk where sam is equal to one;

in other words, the sample cases.
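That data-handling step might look like this:

    samdat <- getdata(smho.N874, sam)                  # rows where sam == 1
    samdat <- cbind(samdat, svywt = 1 / pk[sam == 1])  # weight = 1/selection prob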

And then I'm going to do a calculation using expenditure totals.

These are very large numbers in the data file,

so I'm converting the variable called EXPTOTAL to millions by dividing by 10^6.

I create a new variable and append it to samdat,

calling it EXPTOTm, with a lowercase m for millions.
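In code, assuming the new column is named EXPTOTm as just described:

    samdat$EXPTOTm <- samdat$EXPTOTAL / 1e6  # expenditures in millions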

Now, to define my sample design, I use the svydesign function again.

I've got no PSUs, so ids says there's no clustering.

I've got no strata, so strata is NULL.

I do have weights which I just calculated, svywt.

And I tell it where my sample data is, samdat.

The survey package echoes back to me if I type

smho.dsgn at the prompt: "Independent Sampling Design (with replacement)."

So it's gonna use the ultimate cluster variance estimator,

which assumes with-replacement sampling.
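A sketch of that design call; here ids = ~0 is the survey package's way of saying there are no clusters:

    require(survey)
    smho.dsgn <- svydesign(ids = ~0,         # no PSUs/clusters
                           strata = NULL,    # no strata
                           weights = ~svywt, # base weights from the pps draw
                           data = samdat)
    smho.dsgn  # echoes: Independent Sampling design (with replacement)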

Now, I'm not really giving myself credit for the fact that I selected 100 out of 874 units,

which is a sizable sampling fraction.

So I could include an ad hoc fpc in here,

which would probably be better than ignoring that substantial sampling fraction.

You might want to modify this example on your own and see how that comes out.
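As one hypothetical version of that modification, you could attach the population count of 874 to each sampled row and pass it to svydesign as the fpc:

    samdat$fpc <- 874  # ad hoc population size for an SRS-style fpc
    smho.dsgn.fpc <- svydesign(ids = ~0, strata = NULL,
                               weights = ~svywt, data = samdat,
                               fpc = ~fpc)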

Now, how do I compute quantiles?

There's svyquantile, that's the function I use.

I'm gonna do it on that expenditure total in millions.

I tell it what design I've got, smho.dsgn.

And then in the quantiles parameter,

I can send a vector of quantiles that I want estimates for.

So I'm doing first quartile,

median, and third quartile.

And ci equals TRUE says give me confidence intervals on these things.
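The call looks roughly like this; note that this is the older svyquantile interface, and the function's arguments changed in later versions of the survey package:

    svyquantile(~EXPTOTm, design = smho.dsgn,
                quantiles = c(0.25, 0.50, 0.75),
                ci = TRUE)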

So here are the point estimates.

The first quartile is 4.52 million,

median is 7.255 million,

third quartile is 11.67 million.

And here are my confidence intervals.

So on the first quartile,

I've got a confidence interval that goes from 1.047 million up to 7.314 million.

So the point estimate's in the middle here,

and this interval is not quite symmetric.

In fact, if you take a look at the third quartile,

it's definitely not symmetric there.

It goes from 7.5 million up to 23.74 million,

and the point estimate is 11.67 million,

so it sits more toward the lower end of the confidence interval.

That's just a characteristic of the algorithm used to calculate these confidence intervals.

It's perfectly fine, mathematically justified.

Now, there's a parameter called interval.type which has got

a default value of "Wald" that gives the Woodruff method.

So if I add that parameter up here,

interval.type, and set it equal to "score",

that would give me the Francisco-Fuller method,

which is more computationally intensive but potentially a little more fine-tuned than Woodruff.
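That variant would look something like this, again with the older svyquantile interface:

    svyquantile(~EXPTOTm, design = smho.dsgn,
                quantiles = c(0.25, 0.50, 0.75),
                ci = TRUE, interval.type = "score")  # Francisco-Fuller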

So in summary, we don't have quite as many software choices as for simpler things like means, totals, and proportions.

The R survey package has the svyquantile function, which we just saw.

If you're using SAS,

you can get quantiles with the SURVEYMEANS procedure.

If you're using WesVar, which is free software from Westat,

it will do standard errors for quantiles using replicate variance estimates only.

And at the present time,

Stata does not have a facility for calculating standard errors for quantiles.

Although, if users demanded it,

I'm sure they would figure out how to do it.

So that concludes module one.

We'll move on now to module two which has to do with estimating model parameters.