0:02

We'll finish Module One on Basic Estimation with an example of estimating quantiles in R.

Now, the thing about quantiles is that you need special methods to get precision estimates; the formulas we've seen earlier don't work because a quantile estimate is a nonlinear kind of statistic, which means you've got to use a different approach. So the standard thing is to compute a confidence interval on a quantile first, and then back into a standard error, if you really want that. The way it's done is you calculate the length of the confidence interval and then pretend it's a symmetric interval based on a t-distribution. The length of a t-interval is twice the multiplier from the t-distribution times the standard error, because you're adding and subtracting for a symmetric interval. You know that length from the calculation one of these methods gives you for a quantile, and then you solve for the standard error. So that's how it goes.

Â So for example, if one minus alpha is 0.95,

Â there's a huge number of degrees of freedom,

Â so the t is the same as normal,

Â then the standard error is gonna be estimated as the length of the confidence interval

Â divided by two times 1.96.

Â Now, there are a couple of options for getting confidence intervals.

Â These are named after the people who invented them -- Woodruff and Francisco-Fuller --

Â and both those are available in the R survey package.

Â So let's take a look at an example.

Â The first thing I'm gonna do

Â is require the PracTools package because I want to use a dataset in there.

Â Then I require sampling because I'm going to select a sample from a particular dataset.

Â And the one that I'll use here is called the Survey of Mental Health Organizations and,

Â it's in PracTools, it's called smho.N874.

Â 874 is just the number of hospitals in that dataset.

Â So I'm gonna select a pps sample just so I'll

Â have a sample with different weights for the units.

Â First, I set the measure of size equal to the number of inpatient beds in each hospital.

Â Now, some hospitals have zero inpatient beds because they don't treat inpatients.

Â So to take care of that,

Â I'm re-coding all the sizes that are less than or equal to 5 to be equal to 5.

Â That way, I get all cases have nonzero measures the size.

Â I compute inclusion probabilities pk with

Â this function from the sampling package called inclusionprobabilities.

Â You know, it's a long name but it's clear what it means.

Â So I assume that the measure of size is size which I calculated up here,

Â and I'm going to pick a sample of 100.

Â So, the inclusionprobabilities is

Â nice because it figures out if you've got any units that are so big,

Â they'll be selected with certainty.

Â So when I summarize pk, I get a small minimum,

Â 0.007, all the way up to one, which are units selected with certainty.

Â Now, you might not want to design a sample with the weights vary that much,

Â it might not be that efficient.

Â But this is just an example we're doing here.

Â Now, how do I select a pps sample?

Â First, I used the set.seed statement with a constant number in here.

Â That just starts the random number generator at

Â a particular place so I can repeat this if I wanted to.

Â And then I use the UPsystematic(pk) function to draw a sample.

Â So, in whatever order the smho population is,

Â this will take a random start and then it'll skip

Â systematically down the interval or the list of hospitals and select 100.

Â This is after extracting those that are selected with certainty.

Â So I save that, the result of that,

Â in a vector called sam which is a string of zeros

Â and ones for whether you're in the sample or not, 874 long.

Â I use the getdata function to retrieve the data,

Â and then I'm gonna bind on the survey weights.

Â So I use this cbind function with the data I just extracted,

Â samdat, and then I append a new column that I'm gonna call svywt.

Â And that's just one over the selection probability.

Â And I'm extracting, as you see here,

Â I only extract the pieces of pk where sam is equal to one.

Â In other words, the sample cases.

Â And then I'm going to do a calculation using expenditure totals.

Â So what I do is,

Â these are very large numbers in the data file,

Â so I'm converting them to millions by dividing by 10^6 the variables called EXPTOTAL.

Â So I create a new variable and I append it on the samdat this

Â way EXPTot lowercase m for million.

Â Now, to find my sample design and I use the svydesign function again.

Â I got no PSUs, so ids is NULL.

Â I got no strata, so strat is NULL.

Â I do have weights which I just calculated, svywt.

Â And I tell it where my sample data is, samdat.

Â The survey package echoes back to me if I type

Â smho.dsgn at the prompt: "Independent Sampling Design (with replacement)."

Â So it's gonna use the ultimate cluster variance estimator

Â which is assuming with replacement sampling.

Â Now, I'm not really giving myself credit for

Â the fact that I selected 100 out of 874 units,

Â which is a, you know, sizable sampling fraction.

Â So I could include an ad hoc fpc in here that would

Â probably be better than ignoring the fact that I've got a substantial sampling fraction.

Â So you might want to modify this example on your own and see how that comes out.

Â Now, how do I compute quantiles?

Â There's svyquantile, that's the function I use.

Â I'm gonna do it on that expenditure total in millions.

Â I tell it what design I've got, smho.dsgn.

Â And then in the quantiles parameter,

Â I can send a vector of quantiles that I want estimates for.

Â So I'm doing first quartile,

Â median, and third quartile.

Â And ci equals TRUE says give me confidence intervals on these things.

Â So here are the point estimates.

Â First quarter mile is 4.52 million,

Â median is 7.255 million,

Â third quartile is 11.67 million.

Â And here are my confidence levels.

Â So on the first quartile,

Â I've got a confidence interval -- it goes from 1.047 million seven up to 7.314 million.

Â So the point estimate's in the middle here,

Â and this is not quite symmetric.

Â You know, in fact, if you take a look at the third quartile,

Â it's definitely not symmetric there.

Â It goes from 7.5 million up to 23.74 million.

Â Point estimate's 11.6 million.

Â So, more toward the lower end of the confidence interval,

Â and that's just a characteristic of the way that

Â the algorithm used to calculate these confidence intervals.

Â You know, it's perfectly fine,

Â you know mathematically justified.

Â Now, there's a parameter called interval.type which has got

Â a default value of "Wald" that gives the Woodruff method.

Â So if I put that parameter up here,

Â interval.type, and set it equal to "score",

Â that would give me the Francisco-Fuller method,

Â which is more computationally intensive but potentially

Â is a little more fine-tuned than Woodruff.

Â So in summary, we don't have quite as many choices for

Â software as in simpler things like means and totals and proportions.

Â The R survey package has got the svyquantile function which we just saw.

Â If you're using SAS,

Â you can get quantiles with the proc surveymeans procedure.

Â If you're using WesVar which is free software from Westat,

Â it will do standard errors for quantiles using replicate variance estimates only.

Â And at the present time,

Â Stata does not have this facility for calculating standard errors.

Â Although, if users demanded it,

Â I'm sure they would figure out how to do it.

Â So that concludes module one.

Â We'll move on now to module two which has to do with estimating model parameters.
