0:00
In this lecture we're going to talk about a particular skill,
which is reading the output from a regression.
Now remember early on in this course, I said that one
reason to model is to be an intelligent citizen of the world.
That there's just a certain amount of modeling understanding that
you have to have to get by and really contribute.
Well you may be doing some project, and what's going to happen is somebody's
going to give you some sort of regression output that looks like this.
And what you've got to be able to do is, read it,
understand it, know what this is saying.
So we see some words we already know, like R
squared, and we see things here like, say, coefficient and intercept.
What we would like to do is
fully understand what this output is saying.
Okay, so the first thing to notice, we've got to think about what's really going on.
We see regression output, but it really is this, it's a
linear model, but it's a linear model based on multiple variables.
So remember
before we had Y equals mx plus b, right that was our model.
The linear model where Y depended on x.
What we're going to do now, typically when we see regression output, is that Y, which is
your dependent variable, depends on more than one variable,
so it's like m1x1 plus m2x2 plus b.
So for example, suppose, again let's go back to the health example, suppose Y
is health outcomes.
Well x1 might be how many hours of exercise you get, x2
might be how many hours of sleep you get, and so on, right.
So what you've got is your one dependent variable, Y, depends on a lot of stuff.
So when you look at regression output, like when you look at stuff that looks
like this, you see there's more than one x, there's an x1, and there's an x2.
1:35
Let's do an example, you know, to flesh this out.
Supposing we're looking at student test scores.
Right?
So that's y, that's our dependent variable.
And it could depend on a couple of things.
It could depend on t, which is teacher quality,
and it could depend on z, which is class size.
So we just write down a simple linear model that says
y, equals c times t, plus d times z, plus b.
B is again our intercept, right?
That's what I called my intercept before, just like Y equals mx plus b.
Now when we think of this model, what should we expect?
What we should expect
is that as teacher quality gets better, scores get better.
So we should expect c to be bigger than zero.
But we should expect as class size gets bigger,
that the school performance, or class performance should fall.
So therefore we should expect d, to be less than zero.
So, when you see a model like this one of the first things you want to
do, is you want to sort of come at it with some expectations, some preconceived
ideas about what you think is going to be true.
That way when you look at your output you can decide, is it surprising?
Right? Or is it not surprising?
So, let's go back, and take just a generic model,
where we have Y equals aX1 plus bX2, plus c.
So c here is going to be our intercept, right?
That's our intercept, and a and b, these are the coefficients of
our independent variables.
What we want to do is look at
the output, which is going to tell us something about those coefficients.
So here we go, here's some regression output.
Looks a little scary, but let's, just relax a second.
So let's look first and see what we see, first
we see this thing that says R squared is 0.72.
What is that telling us?
Well, we already know, that's saying, there's a whole bunch of variation
in the data, and 72% of it, was explained by our model.
That means a linear model, in this case is explaining 72% of the variation.
That's totally great.
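As a rough sketch of what that 72% means, R squared can be computed as one minus the ratio of unexplained variation to total variation. The function name `r_squared` and the toy numbers below are made up for illustration; they are not the data behind the output on the slide.

```python
def r_squared(observed, predicted):
    """Share of the variation in `observed` that the predictions account for."""
    mean = sum(observed) / len(observed)
    # Total variation: squared distances of the data from its own mean.
    total = sum((y - mean) ** 2 for y in observed)
    # Unexplained variation: squared distances of the data from the model.
    residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - residual / total


# Hypothetical toy data, only to show the mechanics.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(observed, predicted), 2))  # 0.98
```

An R squared of 0.72 would mean the unexplained variation is 28% of the total.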
Standard error, 24.21, is telling us on average,
what was the standard deviation of the model.
So how far from the mean were things, so this is telling us on average about 24.
And then this observations thing, 50, says we had 50 data points.
So, we had 50 data points, on average it was 24
away from the mean, and we could explain 72% of that variation.
So, you know, not a bad model.
Alright, now we look down at this part down here.
This whole part of it, this is telling us something about what
our linear regression model is saying about the coefficients and the intercept.
So the first thing we notice is the
25 here is the intercept, and so that's saying
our final regression equation is going to look like
y equals something times X1, plus something times X2
plus 25. Alright,
4:18
now this next term here, this 20, is telling us that the coefficient of X1 is 20, so
it's going to be 20 X1. And then this 10 corresponds to X2, right?
Plus 10 X2. So what this is basically telling us is,
our equation is Y equals 20 X1 plus 10 X2 plus 25.
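Written as code, the equation read off the output looks like the sketch below. The coefficients 20, 10 and 25 come from the output; the function name `predicted_y` and the input values are just placeholders for illustration.

```python
def predicted_y(x1, x2):
    """Fitted equation from the regression output: Y = 20*X1 + 10*X2 + 25."""
    return 20 * x1 + 10 * x2 + 25


print(predicted_y(0, 0))  # 25: the intercept alone
print(predicted_y(1, 0))  # 45: one unit of X1 adds 20
print(predicted_y(1, 1))  # 55: one further unit of X2 adds 10
```

Each coefficient is the change in Y for a one unit increase in that variable, holding the other fixed.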
Now let's suppose, let's go back to our,
the previous example, which I was talking about.
Let's suppose that X1 was teacher quality.
So this would say Y, which was student test scores,
are increasing in teacher quality so that's what we'd expect.
But suppose X2 is class size and here we have a
ten, then we get the test scores actually increasing in class size.
Well then we say hmm, this is sort of surprising to me because I
expected class size to have a negative
coefficient, and it actually has a positive coefficient.
So, we might think maybe
our data's wrong, maybe my intuition's wrong, so let's look a little bit deeper.
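One way to make that habit concrete is to write the expected signs down first and then compare them with the estimates. This is only a sketch: the variable names `teacher_quality` and `class_size` are labels chosen here for X1 and X2, and the helper `sign` is hypothetical.

```python
def sign(value):
    """Return +1, -1 or 0 depending on the sign of value."""
    return (value > 0) - (value < 0)


# Expectations formed before looking at the output.
expected_sign = {"teacher_quality": +1, "class_size": -1}
# Coefficients read off the regression output.
estimated = {"teacher_quality": 20, "class_size": 10}

surprises = [
    name for name, coef in estimated.items()
    if sign(coef) != expected_sign[name]
]
print(surprises)  # ['class_size']
```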
So, some things to look at. First,
we should note, we only have 50 observations.
So 50 observations isn't very many.
So it could be that maybe these coefficients aren't right.
Well, how do we know that?
Well, this is where you want to look at this column right here, with SE.
5:28
SE stands for standard error. And what it means is, sort of,
what's the error in our coefficients?
So, for example, here we've got a coefficient
of 25, and it says the standard error
is 2, so what that's meaning is, let's go back, remember, think of our bell curve.
It's sort of saying, our model is guessing that the coefficient of this thing is 25.
Right, that the coefficient is 25?
But we've got a standard error of 2, so that means, if we said between 23 and 27,
we'd be right
68% of the time.
So what it says is this 25 is a guess based on the data, and what this standard error of
2 is telling us is that, well, 68% of the
time, the coefficient will actually be between 23 and 27.
So it's sort of saying, you know maybe its 25, but
you know it's probably almost for sure between 21 and 29.
Well, let's look at our X1.
We've got a coefficient of 20, but the standard error is only 1.
So what that's saying is,
we can be really sure, that the coefficient on X1 is between like, let's
say, 17 and 23, and we can be incredibly sure it's between 16 and 24.
But now let's look at this last one, X2.
The coefficient was 10, and the standard error was 4.
So if I draw my bell curve, and let's go over here and
draw a big bell curve here, make sense of this for a second, right.
What it's saying is from the data I'm estimating that
coefficient of 10, but I've got a standard error, right, of 4.
So that means there's a 68% chance that it lies between 6 and 14.
And there's a 95% chance it lies between two and 18.
And there is actually at least a 2% chance it is you know, below two.
Right?
So there's some chance that this coefficient actually, instead of being
ten, could be negative.
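The bell curve arithmetic here can be sketched in a few lines. This assumes, as the lecture does, a normal bell curve centered on the estimate, with one standard error covering about 68% and two covering about 95%. The helper names `interval` and `prob_below` are made up for this example.

```python
import math


def interval(estimate, se, k):
    """Range from k standard errors below the estimate to k above it."""
    return (estimate - k * se, estimate + k * se)


def prob_below(threshold, estimate, se):
    """Chance the value lies below threshold, under a normal bell curve."""
    z = (threshold - estimate) / se
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# (estimate, standard error) pairs read off the regression output.
for name, est, se in [("intercept", 25, 2), ("X1", 20, 1), ("X2", 10, 4)]:
    print(name, "68%:", interval(est, se, 1), "95%:", interval(est, se, 2))

print(round(prob_below(2, 10, 4), 3))  # 0.023: chance X2's coefficient is below 2
print(round(prob_below(0, 10, 4), 3))  # 0.006: chance it is below 0
```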
7:17
So you think, well, why don't they tell us that?
And the answer is, they do.
So this column right here, this is the P-value, and that
tells you the probability that the sign is correct.
Or, I'm sorry, that the sign is wrong.
So what this is saying is, there's no way those two signs are wrong.
If we think about drawing this bell curve where we get
an estimate of 25, there's no way the real coefficient is
negative.
But when we get down to X2, it's saying look, there's about a 1.5% chance that
this coefficient of ten, instead of being positive, it actually should be negative.
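For a rough sense of where a number like that comes from, a P-value can be approximated from just the coefficient and its standard error. The sketch below uses a normal approximation, which is an assumption made here. Regression software typically uses a t distribution instead, which with 50 observations gives a slightly larger value. The function name `two_sided_p` is hypothetical.

```python
import math


def two_sided_p(estimate, se):
    """Two-sided P-value under a normal approximation.

    This is the chance of seeing an estimate at least this many standard
    errors from zero if the true coefficient were zero.
    """
    z = abs(estimate / se)
    return math.erfc(z / math.sqrt(2))


print(round(two_sided_p(10, 4), 4))  # 0.0124 for X2
print(two_sided_p(20, 1) < 1e-10)    # True: X1 is effectively zero
print(two_sided_p(25, 2) < 1e-10)    # True: so is the intercept
```

The result for X2 is in the same range as the roughly 1.5% shown in the output.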
8:02
Alright, take a step back for a second, what do we got?
So we see this
regression output, it tells us a bunch of stuff.
It tells us first what the r squared is, how much of the data did we explain.
Second it tells us how many observations do we have.
A lot of observations or not many.
Third, it tells us how much variation was there in the data to begin with.
And the answer is, on average, about 24.21.
So, you know, quite a bit of variation.
Then it tells us the values of the intercept and the coefficients,
right? These are estimates: 25, 20 and 10.
And so
8:35
it tells us you know, this is probably positive and this
is probably positive, and it gives us a sense of magnitude.
So it tells us sign and magnitude.
But it also tells us, in this P-value thing,
how sure we are that those coefficients are correct.
Now, keep in mind,
we can't be sure it's 25, but it can
tell us how sure we are that the coefficient is actually positive.
So here it's saying, we're really sure it's positive.
Right?
Because there's almost no chance we're making a mistake.
But for X2,
well there's a one and a half percent
chance that maybe there is a mistake there.
So if this were, again, a regression of test scores on, teacher quality
and class size, what we can
say is, teacher quality definitely improves performance.
And there's a lot of evidence that that's true.
And we could say with class size, well, even though this study goes
the wrong way, it's possible since we only have 50 data points, right?
That maybe if we did another study it could go in the opposite direction.
And that's true as well, there's a lot of studies on class size that do in fact show
that as the class size gets bigger the students
do better, even though that is sort of counter-intuitive.
There's more studies that show the opposite, right?
That as the class size gets smaller, the students do better.
9:41
So big things to take away when you look at that regression output.
The first thing is look at the sign, look at
every coefficient and ask, does Y increase or decrease in X?
Now before you look at it though, when somebody says, oh, I've
run a regression model, you should say well what are your variables?
And then what you should do is you should form expectations about
what you think the signs of those variables are, the coefficients are.
Then when you look at it you can say hmm, does
the variable have the effect that I thought it would have?
So if you're looking at sales for your firm, you might say, well geez,
you know, the coefficient on advertising is negative,
the more we advertise the less we sell.
That would be totally counter-intuitive. Right?
But if the
coefficient on advertising were positive, that the more you
advertise the more you sell, then it would make sense.
Then we want to look at magnitude, and we want to say okay, how
big of an effect on Y, does a one unit increase of X have?
And if it's got a big coefficient, that means,
wow this is something I should pay attention to.
And if it's got a small coefficient, it's
something maybe you shouldn't pay attention to.
10:59
So the regression tells us what the sign and magnitude of those coefficients are, right?
So is the sign positive, and what's the magnitude of that coefficient?
And we also get that P-value thing, which tells
us what's the probability that the coefficient's actually wrong.
[LAUGH]
That maybe, you know, the data's so noisy we can't say for sure.
So what's great is, you know, if you've got data out
there, you can throw that data into a linear regression model.
Right?
If you've, if you've got an idea of what variables you want to include.
And you can get some output, and then from that output you
can get an understanding of, you know, how good is the model?
What's the R squared?
What are the sign and the magnitude of the coefficients,
and how confident can we be that those coefficients are right?
So it's actually a really useful way
for making sense of the world, and as we've shown in the previous lecture,
right, it's usually better than we are
at figuring out how the world's going to work.
Thank you.