0:07

Let's continue now with the mice example for multiply imputing missing data.

Now this is what's called a margin plot,

which is a handy way of looking at data sets, at least small ones, and

seeing the pattern of missingness.

So the first thing I do to draw one of these is require the VIM package,

and it has a function in it called marginplot, which is nice.

You can find this example in the van Buuren paper that was

referenced in the last video.

So I feed it the nhanes2 data set.

I'm going to plot total serum cholesterol and BMI.

I set some colors here using parameters that marginplot takes, and then the cex

parameters set the size of the labeling in the plot.

pch = 19 is just a particular plotting character

that gets filled in with a color.
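A call along the lines described above might look like the following. This is a sketch, not necessarily the exact call from the video: the specific colors and cex values are assumptions chosen for illustration.

```r
library(mice)  # supplies the nhanes2 data set
library(VIM)   # supplies marginplot()

# Scatterplot of cholesterol (x) against BMI (y), with boxplots in the
# margins split by whether the other variable is observed or missing.
# The colors and cex sizes here are illustrative assumptions.
marginplot(nhanes2[, c("chl", "bmi")],
           col = c("blue", "red"),   # observed, missing
           cex = 1.2, cex.lab = 1.2, cex.numbers = 1.3,
           pch = 19)                 # filled plotting character
```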

So what have I got here?

On the horizontal axis, I've got total serum cholesterol.

On the vertical axis, I've got body mass index.

1:35

Then, along the sides, I've got two boxplots.

So the blue one here on the left,

this one right here, is the BMI boxplot for

cases that have both BMI and serum cholesterol present.

The red boxplot right here is for cases that have got BMI present,

but are missing serum cholesterol.

And then you've got the corresponding sort of thing down here for

cases that have got serum cholesterol present, but missing BMI.

And we've got a total of 9 cases that are

missing BMI, 10 cases that

are missing serum cholesterol, and

then we've got 7 cases right here that are missing both.
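The counts printed on the plot can be cross-checked directly from the data. A minimal sketch, assuming the nhanes2 data set as shipped with mice; the object names are just for illustration.

```r
library(mice)  # supplies the nhanes2 data set

miss_bmi <- is.na(nhanes2$bmi)
miss_chl <- is.na(nhanes2$chl)

# Number of cases missing each variable, and missing both at once
n_miss <- c(bmi  = sum(miss_bmi),
            chl  = sum(miss_chl),
            both = sum(miss_bmi & miss_chl))
print(n_miss)

# Cross-tabulation of the two missingness indicators
print(table(bmi_missing = miss_bmi, chl_missing = miss_chl))
```

The cross-tabulation also shows how many cases are missing only one of the two variables, which is the group each red boxplot is drawn from.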

2:40

Now another thing that is interesting to note from this plot is that

if I had MCAR, missing completely at random, in other words,

the missingness is just a random draw from the total sample,

I would expect to see boxplots that looked alike.

So for the cases that have got BMI and have got serum cholesterol,

their boxplot should look the same as the one for cases that have got BMI,

but are missing serum cholesterol.

3:16

That is not the case here, on either axis.

You can see on the vertical axis that the two boxplots I have drawn arrows to look

substantially different, and the same thing down here on these two boxplots.

So that means I need to account for covariates to have

any hope of getting imputations that I would consider to

be unbiased predictions of what the actual values are.
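A rough numeric counterpart to eyeballing the boxplots is to summarize one variable separately by whether the other is missing. This is only an informal sketch, not a formal test of MCAR, and the group labels are made up for illustration.

```r
library(mice)  # supplies the nhanes2 data set

# Split observed BMI values by whether cholesterol is missing
chl_status <- ifelse(is.na(nhanes2$chl), "chl missing", "chl observed")
bmi_by_chl <- split(nhanes2$bmi, chl_status)

# Five-number summaries for each group; under MCAR these should look similar
print(lapply(bmi_by_chl, summary))
```

With a data set this small, the groups with missing values contain only a handful of cases, so any comparison is suggestive at best.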

3:51

Now how do I do the imputation?

In this simple little example it's easy. I call the mice function,

I send it nhanes2, I set a seed for the random number generator

in case I want to generate the same set of imputed values again, and

I store the whole thing in nhanes2.imp, for imputations.

The summary function operating on this will print information about

the number of multiple imputations, the imputation method for each variable, and

the covariates used to impute each variable.
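In code, that step could look something like this. The seed value is an arbitrary assumption, and printFlag = FALSE is added only to keep the iteration log quiet.

```r
library(mice)

# Multiply impute nhanes2 with the defaults (m = 5 imputed data sets).
# The seed is arbitrary; fixing it makes the imputations reproducible.
nhanes2.imp <- mice(nhanes2, seed = 23109, printFlag = FALSE)

# Prints the call, number of imputations, methods, and predictor matrix
summary(nhanes2.imp)
```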

4:29

If I want to look at the complete datasets one at a time,

I can use the complete function.

So if I call complete on nhanes2.imp, with action = k as a parameter,

I would retrieve the kth completed dataset.

So that's handy, you can see what you actually did.
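For instance, pulling out the first completed data set might be sketched as follows. The seed is again an arbitrary assumption, and the mice:: prefix is used only to avoid clashes with other packages that define a complete function.

```r
library(mice)

nhanes2.imp <- mice(nhanes2, seed = 23109, printFlag = FALSE)

# Retrieve the first of the five completed data sets (action = 1)
first_completed <- mice::complete(nhanes2.imp, action = 1)

head(first_completed)
anyNA(first_completed)  # checks whether any missing values remain
```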

4:51

Now let's take a look at what summary gives us.

So there are a few things to note, it echoes the call to the function here.

The number of imputations by default is 5, but you can control it.

You could do more than 5, if you wanted to.

The number of missing cells or values for

each column in the data set is reported here, and

then it gives you in this row here the imputation methods that are used.

So age is not missing, so I don't need to impute for that.

BMI is continuous, so the default is predictive mean matching.

Hypertension is a two-category variable, so the default is logistic regression.

Serum cholesterol is also continuous, so

I get predictive mean matching as the default there, but you can control that.

If you've got a better idea of how to do it,

you can use one of the other methods that are available.
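As a sketch of overriding a default, the method argument takes one method name per column, in column order. Switching BMI to "norm" (Bayesian linear regression) is purely an illustrative choice here, not a recommendation, and the seed is arbitrary.

```r
library(mice)

# One entry per column of nhanes2: age, bmi, hyp, chl.
# "" means the variable is not imputed (age has no missing values).
meth <- c("", "norm", "logreg", "pmm")

imp_norm <- mice(nhanes2, method = meth, seed = 23109, printFlag = FALSE)

# Confirm which method was used for each variable
print(imp_norm$method)
```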

So the VisitSequence, as it shows here, is that

I impute BMI first, hypertension second, and

serum cholesterol third in the sequence of imputing.

Now the other piece of information here is a matrix

that tells us what covariates were used to impute each variable.

So what you see here is, this row of zeros means that age did not need to be imputed,

so nothing was used, no covariates there.

For BMI on the other hand, age and hypertension and

serum cholesterol were all used to form a model to impute for BMI.

So everything except itself was used to impute BMI.

Hypertension, we see a similar thing, age and BMI and

total cholesterol were used, and then for

total serum cholesterol, age, BMI and hypertension.

Now you can control that if you want.

If you know a better model involves just the subset of the variables,

you can specify that to the function.
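One common way to do that is to start from the default predictor matrix and switch individual entries off. In this sketch, dropping hypertension from the BMI model is an arbitrary illustration, not a claim that it gives a better model; the seed is arbitrary too.

```r
library(mice)

# Dry run (maxit = 0) just to obtain the default predictor matrix
setup <- mice(nhanes2, maxit = 0, printFlag = FALSE)
pred <- setup$predictorMatrix

# Rows are variables being imputed, columns are the covariates used.
# Illustration only: do not use hyp when imputing bmi.
pred["bmi", "hyp"] <- 0

imp_custom <- mice(nhanes2, predictorMatrix = pred,
                   seed = 23109, printFlag = FALSE)
print(imp_custom$predictorMatrix)
```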

7:18

So just to show you a little example of how you'd go about using this,

once you do the imputing, I fit a linear model.

So you can say with the nhanes2.imp object that contains the imputations,

I want to fit a linear model using the lm function.

So here I'm regressing cholesterol on age plus BMI,

and I then do a summary on the fit.

But interposed in here is the pool statement,

which comes with mice, so the pool statement

8:01

essentially results in those multiple imputation formulas being used to

summarize the data, so the estimates will be averages across the completed data sets.

The standard errors that are reported will use that variance formula that

involves the average of the direct variance estimates plus an increment due to the variation

between completed data sets in the estimates that use imputations.

8:32

And the variance formula was put together with that special multiple

imputation formula.
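The fit-then-pool sequence can be sketched like this. The seed is an arbitrary assumption, and the exact layout of the printed summary (for example, whether fmi and lambda appear in it) depends on the installed version of mice.

```r
library(mice)

nhanes2.imp <- mice(nhanes2, seed = 23109, printFlag = FALSE)

# Fit the same linear model on each of the completed data sets
fit <- with(nhanes2.imp, lm(chl ~ age + bmi))

# Combine the per-data-set results with the multiple imputation formulas
pooled <- pool(fit)

summary(pooled)
```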

So I round things off to two decimals just to give us less to look at.

And you see here, the regression parameter estimates are in this column.

Here are the standard error estimates using that particular multiple

imputation variance formula, the square root of it, then t stats, and

then I get some other things on the end here called fmi and lambda.

fmi is the fraction of missing information,

which has to do with the size of the B term, the between-imputation variance, in

that variance formula, and then lambda is similar.

It is the proportion of total variance attributable to imputations.

So let's look at lambda.

This says that of the variance for the intercept,

27% of that is attributed to imputations.

So in a big data set where you've got a relatively small number or

proportion of imputations, these numbers will be much smaller,

but this is just a toy example.
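To make the variance formula and lambda concrete, here is a small base R sketch of the combining rules for a single coefficient. The five estimates and standard errors are made-up numbers for illustration, not output from the nhanes2 fit.

```r
# Made-up results for one coefficient from m = 5 completed data sets
est <- c(10.2, 9.8, 10.5, 10.1, 9.9)   # point estimates
se  <- c(1.0, 1.1, 0.9, 1.0, 1.05)     # their standard errors

m     <- length(est)
q_bar <- mean(est)        # pooled estimate: average across data sets
u_bar <- mean(se^2)       # average of the direct variance estimates
b     <- var(est)         # between-imputation variance (the B term)

# Total variance: within part plus an increment for between-imputation variation
t_var <- u_bar + (1 + 1 / m) * b

# lambda: share of the total variance attributable to the imputations
lambda <- (1 + 1 / m) * b / t_var

print(c(estimate = q_bar, std.error = sqrt(t_var), lambda = lambda))
```

fmi is computed from the same ingredients plus a degrees-of-freedom adjustment, which is left out of this sketch.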

So that summarizes the simple example of how you would use the mice software.

So mice is nice.

I'd recommend it.

It's very flexible, and it's quite popular for both doing the imputations and

properly reflecting the effect of those imputations on variances.