0:08

Like all the other lectures in this class,

we're gonna contrast idealized settings with what happens in real life.

In real life, data can be very messy.

And when we talk about the data pull, we're talking about all the different

components that go into going from a raw form of data into an analytic data set.

1:02

In the area that I work in, often we're working with large,

complex brain imaging data, and we have to summarize that down into

a handful of numbers that represent certain regions of the brain, for example.

Another very common problem is going from data that is convenient for

one purpose into a convenient form for analysis.

So for example the way we might store the data for

archival purposes is generally not what's right for analysis purposes.

So in the process of converting there can always be errors.

1:41

So as a manager, you're probably not going to actually

execute any of these solutions yourself.

Maybe you did that in your past, but now you're managing groups of people or

individuals that are doing this for you.

So in this lecture, we're gonna give you some simple tools that you can use to help

ensure data quality in real life when you have these kinds of messy data issues.

2:05

In researching this lecture, I noticed that there were a lot of tools for

the practitioners, for the people that are actually going to be doing the merging and

doing the actual data manipulations.

But relatively few for managers, so that's what we're gonna focus on here.

The first thing I'd like to mention is the simple act of creating what I like to call

Table 1.

These are basically just summary tables, and

they're a great way to catch errors.

And in Epi and Biostatistics we call these tables Table 1.

And the reason we do that is because whenever you write an Epi or

a biostatistics paper, there's always a first table whose

main purpose is just to summarize demographic information in your sample.
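
A Table 1 of this kind can be sketched in a few lines of Python. This is only a minimal illustration, not the lecturer's own tooling: the function name `table_one`, the column names, and the four sample rows are all made up for the example, and only the standard library is assumed.

```python
import statistics

def table_one(rows, columns):
    """Summarize each numeric column: count, missing, mean, sd, min, max."""
    summary = {}
    for col in columns:
        # Treat None (or an absent key) as a missing measurement.
        values = [r[col] for r in rows if r.get(col) is not None]
        summary[col] = {
            "n": len(values),
            "missing": len(rows) - len(values),
            "mean": statistics.mean(values) if values else None,
            "sd": statistics.stdev(values) if len(values) > 1 else None,
            "min": min(values) if values else None,
            "max": max(values) if values else None,
        }
    return summary

# Hypothetical sample rows, invented purely for illustration.
sample = [
    {"age": 61, "brain_volume_cc": 1180.0},
    {"age": 74, "brain_volume_cc": 1095.0},
    {"age": 68, "brain_volume_cc": None},  # missing measurement
    {"age": 70, "brain_volume_cc": 1130.0},
]

summary = table_one(sample, ["age", "brain_volume_cc"])
for column, stats in summary.items():
    print(column, stats)
```

Putting the unit in the column name, as in the hypothetical `brain_volume_cc`, is one cheap way to keep units visible every time the table is reviewed.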

3:03

Another important one, and

I'll give an anecdote about this in a minute to show exactly what I mean, is to check your units.

And then the other important thing I would say is, usually, you're meeting about

these data tables maybe weekly, maybe monthly, but keep the old ones and

always refer back to them, because even subtle changes in these Table 1s can often

highlight important, meaningful changes, an errant record or something like that.

If means that are usually quite stable jump up from one report to the next,

that's often valuable information.
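
Comparing one report's Table 1 against the previous one can be automated in a rough way. The sketch below assumes each report has been reduced to a dictionary of variable name to mean; the function name `flag_changes`, the 10% tolerance, and the numbers are assumptions chosen for illustration, not a recommended threshold.

```python
def flag_changes(previous, current, tolerance=0.10):
    """Return variables whose mean moved by more than `tolerance` (relative change)."""
    flagged = {}
    for name, old_mean in previous.items():
        if name not in current:
            flagged[name] = "missing from current report"
            continue
        new_mean = current[name]
        if old_mean == 0:
            changed = new_mean != 0
        else:
            changed = abs(new_mean - old_mean) / abs(old_mean) > tolerance
        if changed:
            flagged[name] = f"mean went from {old_mean} to {new_mean}"
    return flagged

# Hypothetical means from two successive reports (made-up numbers).
last_month = {"age": 68.2, "brain_volume": 1135.0}
this_month = {"age": 68.4, "brain_volume": 1135000.0}  # looks like a unit change

flags = flag_changes(last_month, this_month)
print(flags)
```

A jump by a factor of a thousand, as in the made-up example, is the kind of change that usually points to a unit mix-up rather than a real shift in the sample.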

3:59

We were interested in people's longitudinal change in brain volumes over

time, and when we started creating our Table 1s, we weren't keeping track of

our units.

It wasn't until much later in the process,

thankfully well before we had finished the paper,

that we realized we weren't keeping track of our units,

and that people's brains were declining at a rate that was way too fast

to be reasonable.

They were going from the size of a house to the size of a pea, for

example, too quickly.

That was only because we hadn't kept good track of our units.

But once we caught it, we had a very helpful discussion about what the units were, and

because of that discussion,

we went on a long literature review on standard rates of brain decline in

healthy adults, and this really helped inform the paper.

But more than anything, it also helped us with data quality,

since we realized that there were some errant measurements driving up results.

5:04

So, a second tool at your disposal comes after the modeling stage, just like

we had Table 1, which we usually create before we do any modeling.

After you've done some regression modeling,

regression diagnostics are an extremely useful way to detect data errors.

And my favorite examples of regression diagnostics are from Len Stefanski,

who's at North Carolina State.

And if you look at this data set on the left, it actually goes on for

a while after that.

If you were to look at it and do some basic scatter plots,

you would see nothing out of the ordinary in the data.

It would just look kind of like noise.

However, if you do a residual plot, which is plotting the residuals,

the differences between the actual observed outcome values and the fitted values of the regression model,

versus the predicted values, you see this crazy plot with Homer Simpsons on it.

And this is his way of showing how residual plots can highlight

hidden systematic patterns in our data.

Okay, so residual plots are always a good thing to look at,

especially for

systematic patterns, errant observations, outliers, and those sorts of things.

Of course, these always occur after you've done your modeling.

You have to have a model in order to look at a residual.

So let me give you some definitions of some of the things you might want to look

at especially from regression modeling which is my favorite kind of modeling.

So residuals, like I just said, these are the difference between the response,

the outcome that you're studying and

the fitted value, what your model predicts it would be.

So, what you wanna look for in residual plots, you just wanna make

sure that there aren't any systematic patterns and you wanna look for extremely

large values that are well outside the general pattern of the residuals.
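
The residual definition above can be sketched for the simplest case, a straight-line regression with one predictor. Everything here is an assumption made for illustration: the helper names `fit_line`, `residuals`, and `flag_large`, the cutoff of two standard deviations, and the made-up data in which one outcome was deliberately corrupted.

```python
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals(x, y):
    """Residual = observed response minus fitted value."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def flag_large(res, k=2.0):
    """Indices whose residual exceeds k sample standard deviations in size."""
    sd = statistics.stdev(res)
    return [i for i, r in enumerate(res) if abs(r) > k * sd]

# Made-up data: roughly y = 2x + 1, with the outcome at index 7 corrupted.
x = list(range(1, 11))
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.0, 30.0, 19.1, 20.9]

res = residuals(x, y)
suspects = flag_large(res)
print("rows worth a closer look:", suspects)
```

A numeric cutoff like this only catches large values. Systematic patterns, like the hidden picture in the Stefanski example, still need a plot of residuals against fitted values.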

There's another quantity called the hat values, and they measure how far

a row of explanatory variables sits from the rest of the rows in the space of the predictors.

So all hat values are going to tell you is simply how outside of the norm,

the normal range of variability,

this particular row of data is relative to all its other colleagues.
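
For a straight-line regression with one predictor, the hat value of row i has a closed form: 1/n plus the squared distance of that row's predictor from the mean, divided by the total squared distance. The sketch below uses that formula. The function name, the made-up ages, and the 2p/n cutoff (a commonly quoted rule of thumb, with p = 2 coefficients here) are assumptions for illustration.

```python
import statistics

def hat_values(x):
    """Hat values for simple linear regression: 1/n + (x_i - mean)^2 / Sxx."""
    n = len(x)
    x_bar = statistics.mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Made-up predictor values; the last one sits far from the others.
ages = [61, 63, 65, 66, 68, 70, 71, 73, 74, 19]

hats = hat_values(ages)
cutoff = 2 * 2 / len(ages)  # rule-of-thumb 2p/n with p = 2 (intercept and slope)
unusual_rows = [i for i, h in enumerate(hats) if h > cutoff]
print("unusual predictor rows:", unusual_rows)
```

Note that hat values only look at the explanatory variables, not the outcome, so a flagged row is unusual in its inputs, which may or may not mean it is wrong.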

7:16

So DFFITS, DFBETAS, and Cook's distance are all measures of

things that are kind of related, sort of like cross-validation.

And in all these cases what you do is you fit the model with the point deleted,

and then you fit the model with that point included.

With a row of your data entirely deleted, and then with that row included, okay?

And then you look at how much your results changed

by virtue of leaving that row in or out.

If you compare the fitted values,

you get what are called the DFFITS.

If you compare the slope estimates, then you get what are called the DFBETAS.

And if you compare the slope estimates aggregated into one single number,

one single metric, that's called Cook's distance.

But all of these are influence measures.

All they're trying to do is tell you how much did things change

when I deleted this data point from my model fitting.
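
The delete-and-refit logic can be sketched directly for a one-predictor regression. This is a simplified illustration under stated assumptions: it reports the raw change in the fitted value and the raw change in the slope, whereas the standard influence measures additionally rescale these changes by a standard error (and Cook's distance combines them into one number). The helper names and the data are made up.

```python
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def leave_one_out_influence(x, y):
    """Refit with each row deleted and record how much the results move."""
    full_intercept, full_slope = fit_line(x, y)
    results = []
    for i in range(len(x)):
        x_rest, y_rest = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        intercept, slope = fit_line(x_rest, y_rest)
        fit_with = full_intercept + full_slope * x[i]
        fit_without = intercept + slope * x[i]
        results.append({
            "row": i,
            "change_in_fit": fit_with - fit_without,  # unscaled fitted-value change
            "change_in_slope": full_slope - slope,    # unscaled slope change
        })
    return results

# Made-up data: roughly y = 2x + 1, with the outcome at index 7 corrupted.
x = list(range(1, 11))
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.0, 30.0, 19.1, 20.9]

influence = leave_one_out_influence(x, y)
most_influential = max(influence, key=lambda r: abs(r["change_in_fit"]))
print(most_influential)
```

In practice an analyst would normally take the scaled versions from a statistics package rather than refit by hand. The loop is only meant to make the idea concrete.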

That's also the same logic behind the PRESS residuals,

which are just residuals of the same sort.

How big was the residual for

this data point if I left it in versus if I left it out?

These are often called leave-one-out residuals or cross-validated residuals.

All these things are useful, and you should work with your analyst,

or whoever you are working with, so

that they produce these kinds of residual diagnostic plots with your regression.

Not only do they help you evaluate things about your regression model, but

they help you evaluate things about errant data points and outlying observations.

And then you can drill down and see what went wrong to cause

those specific observations to have such a large impact on the model fit, for

example, or to be so outside of the norm of all their peers.
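
Leave-one-out (PRESS) residuals can be sketched the same way: predict each row from a model that never saw it, and compare with the ordinary residual. The helper names and the data below are made up for illustration, and the regression is restricted to one predictor to keep the code short.

```python
import statistics

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def ordinary_residuals(x, y):
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def press_residuals(x, y):
    """Residual of each row against a model fitted without that row."""
    out = []
    for i in range(len(x)):
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        out.append(y[i] - (intercept + slope * x[i]))
    return out

# Made-up data: roughly y = 2x + 1, with the outcome at index 7 corrupted.
x = list(range(1, 11))
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.0, 30.0, 19.1, 20.9]

ordinary = ordinary_residuals(x, y)
press = press_residuals(x, y)
for i, (e, p) in enumerate(zip(ordinary, press)):
    print(f"row {i}: left in {e:8.3f}   left out {p:8.3f}")
```

The left-out residual is never smaller in size than the left-in one, because a point that takes part in the fit pulls the line toward itself. A large gap between the two is a sign that the row deserves a closer look.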

9:00

Another kind of interesting, in my mind fascinating,

way to consider data quality is to use Benford's Law.

And Benford's Law is the so-called First-Digit Law.

And I just mention it here; you can read up on it, and sometimes people find it useful.

And it's a so-called phenomenological law about the frequency distribution of

leading digits in many, but of course not all, real-life sets of data.

And so what this means, what Benford's Law says, is if you look at the final digit

of, say, telephone numbers, for example, they don't follow a uniform distribution.

They follow a distribution that looks something like this.

And you can try it; try a Benford's Law experiment.

And it's basically saying that the numbers should be uniform on a log scale

rather than on the natural scale.

So what you can do in terms of data quality sometimes is this: if the

phenomenon that you're studying,

whatever outcome that you're studying, has digits that should follow

Benford's Law, and you've checked that in the past, then you can check it again and

see if there were some systematic errors causing

the distribution of your points not to follow this law.

Again, this only is applicable when the digits should follow Benford's Law.

As an example, if you're looking at a particular city, the first digit,

the leftmost digit of telephone numbers, isn't gonna

follow Benford's Law, because it's going to be related to the area codes.

But the rightmost digit does tend to follow Benford's Law.

So at any rate, it's a subtle little tool that you can use for

assessing a systematic data quality error

that's perturbing the distribution of the digits of your data.
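
As a first-digit law, Benford's Law gives leading digit d (from 1 to 9) the probability log10(1 + 1/d), so about 30% of values start with 1 and under 5% start with 9. A comparison of observed and expected frequencies can be sketched as below. The function names are made up, and the powers of 2 are used only as a convenient stand-in sequence that is known to track the law closely, not as real data.

```python
import math
from collections import Counter

def benford_expected():
    """Expected leading-digit frequencies: P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(value):
    """First nonzero digit of a nonzero number."""
    # Scientific notation puts the leading digit first, e.g. '4.2...e-03'.
    return int(f"{abs(value):.15e}"[0])

def observed_frequencies(values):
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Stand-in data: the first 100 powers of 2.
values = [2 ** n for n in range(100)]

expected = benford_expected()
observed = observed_frequencies(values)
for d in range(1, 10):
    print(f"digit {d}: observed {observed[d]:.3f}   expected {expected[d]:.3f}")
```

If a variable has matched this pattern in earlier reports and then stops matching, that is a prompt to look for a systematic problem, not proof of one.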

10:58

Of course you can't check every single point because that's usually too

time-consuming.

However, you can check some.

And you can use statistical procedures like random sampling to get a random

sample of rows of your data, check those rows individually, and because it's a nice

random sample, you can get an estimate of the proportion of bad rows in your data.

Okay, so you can use statistics to sample and

understand how good or bad in general your

dataset is from a relatively small sample that you check by hand, okay?
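
The sampling idea can be sketched as follows. The function name `estimate_bad_fraction`, the `is_bad` callback (which stands in for a person checking a row by hand), and the simulated rows are all assumptions for illustration. The interval is the simple normal-approximation one, which is rough when the sample is small or the bad fraction is close to 0 or 1.

```python
import math
import random

def estimate_bad_fraction(rows, is_bad, sample_size, seed=0):
    """Estimate the share of bad rows from a simple random sample."""
    rng = random.Random(seed)               # fixed seed so the audit is repeatable
    sample = rng.sample(rows, sample_size)  # sampling without replacement
    bad = sum(1 for row in sample if is_bad(row))
    p_hat = bad / sample_size
    # Normal-approximation 95% interval for a proportion.
    half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Simulated data set: 10,000 rows, every 20th one marked bad (made up).
rows = [{"id": i, "bad": i % 20 == 0} for i in range(10_000)]

estimate, low, high = estimate_bad_fraction(rows, lambda r: r["bad"], sample_size=200)
print(f"estimated bad fraction: {estimate:.3f} (roughly {low:.3f} to {high:.3f})")
```

The point of the design is that only the 200 sampled rows need a manual check, while the estimate speaks to the whole data set.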

So these are some simple techniques that can help you try and

stay on top of the real-life circumstance where your data is actually a messy

mishmash of merging errors and that sort of thing,

since we all know that the true idealized setting doesn't ever occur in real life.