0:02

Okay, let's go through yet a third derivation of least squares,

but this time we're going to demonstrate

why least squares is thought of as sort of an adjustment mechanism.

So I'm going to write the criterion out a little bit differently, as y − x₁β₁

− … − xₚβₚ, where each of these x's is a vector.

So when I estimate the beta one coefficient, in what sense is it

adjusted for the presence of all the other variables in the model?

Before we begin, let me define a residual function for

two vectors a and b as e(a, b) = a − b⟨a, b⟩/⟨b, b⟩.

Here b⟨a, b⟩/⟨b, b⟩ is merely the prediction of

a from b based on a regression through the origin,

and e(a, b) is the residual left over after subtracting that prediction off.
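This residual function is easy to sketch numerically. Here is a minimal version in NumPy (NumPy itself is my assumption; the lecture doesn't prescribe any code), along with a check that the residual is orthogonal to b:

```python
import numpy as np

def e(a, b):
    """Residual of a after regressing it on b through the origin:
    e(a, b) = a - b * <a, b> / <b, b>."""
    return a - b * (a @ b) / (b @ b)

rng = np.random.default_rng(0)
a = rng.normal(size=5)
b = rng.normal(size=5)

# the fitted part b<a,b>/<b,b> is the projection of a onto b,
# so the residual is orthogonal to b
print(np.allclose(e(a, b) @ b, 0.0))  # True
```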

1:11

Suppose we hold β₂ up through βₚ fixed, as if they were known.

Then the vector y − x₂β₂ − … − xₚβₚ

could be considered a single outcome,

and x₁β₁ could be thought of as the predictor.

So I've simply rewritten asterisk as

a regression through the origin with a single predictor.

So we know that asterisk has to be larger than or

equal to what we get when we plug in the optimal β₁,

where that β₁ depends on β₂ up through βₚ.

The β₁ that satisfies this criterion

is ⟨y − x₂β₂ − … − xₚβₚ, x₁⟩

divided by ⟨x₁, x₁⟩.

2:39

Well, before I do that,

notice this is equal to ⟨y, x₁⟩/⟨x₁, x₁⟩,

minus ⟨x₂, x₁⟩/⟨x₁, x₁⟩ times β₂,

minus all the way up to ⟨xₚ, x₁⟩/⟨x₁, x₁⟩ times βₚ.

3:20

If you plug that β₁ in there, then I'd like you to churn through the calculations.

What you get is

‖e(y, x₁) − e(x₂, x₁)β₂

− … − e(xₚ, x₁)βₚ‖².

This has to have gotten smaller,

because we'll have plugged in the optimal β₁ holding the other ones fixed.
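Rather than churning through the algebra by hand, you can check this identity numerically. A small sketch (assuming NumPy; the variable names are mine) with p = 3 and arbitrary fixed values for β₂ and β₃:

```python
import numpy as np

def e(a, b):
    # residual of a after regressing it on b through the origin
    return a - b * (a @ b) / (b @ b)

rng = np.random.default_rng(1)
y, x1, x2, x3 = rng.normal(size=(4, 50))
b2, b3 = 0.5, -1.3   # arbitrary fixed values for beta2 and beta3

# optimal beta1 given beta2, beta3: regression through the origin
# of the partial residual on x1
r = y - x2 * b2 - x3 * b3
b1 = (r @ x1) / (x1 @ x1)

# criterion with the optimal beta1 plugged in ...
lhs = np.sum((y - x1 * b1 - x2 * b2 - x3 * b3) ** 2)
# ... equals the criterion written in terms of residuals against x1
rhs = np.sum((e(y, x1) - e(x2, x1) * b2 - e(x3, x1) * b3) ** 2)
print(np.isclose(lhs, rhs))  # True
```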

4:02

So now we're back at the exact same form of equation with one fewer coefficient.

Instead of having p coefficients,

we now have p minus one, having gotten rid of β₁.

And instead of y, we have the residual from regressing x₁ out of y;

instead of x₂, we have the residual from regressing x₁ out of x₂; and so on.

So in essence, we've made the criterion smaller

by taking the linear association with x₁ out of every other variable,

out of every other predictor and out of the outcome.

So we could repeat this process, because notice,

this is just the same exact starting equation with one fewer regressor.

4:45

Our vectors now look a little bit weirder,

because they're the outputs of this residual function, but,

ostensibly, we're back at the same thing.

So we know that we can get smaller if we repeat the exact same process.

So this is going to be greater than or equal to.

If I simply take the residual of the residual,

where I've now held β₃

up through βₚ fixed

and plugged in the optimal β₂

to get rid of that term,

then I get ‖e(e(y, x₁), e(x₂, x₁))

− e(e(x₃, x₁), e(x₂, x₁))β₃

− … − e(e(xₚ, x₁), e(x₂, x₁))βₚ‖².

That is, I'm now regressing e(x₂, x₁) out of every remaining term.

So now we've gotten rid of this regressor and

this coefficient by regressing it out of every term.

And then you can see, as you iterate through this process, regressing out

variable after variable until you get to βₚ, that the βₚ

estimate is then a regression through the origin, with just that variable

6:43

and the outcome, where we've iteratively regressed out the linear association

of every other regressor.

So we took x₁ and regressed it out of everything.

Then we took those residuals, took the residual for x₂, and

regressed it out of all the other residuals.

Then we took the residual for x₃ and

regressed it out of everything that remained, and so on.

In that sense, when we finally get down to the βₚ effect,

it's just the remaining regression through the origin.

So you can see this would actually be an easy way to do linear regression,

where all you needed to know how to do was compute this residual function.

No matrix inversion required.

So that's kind of a neat result.
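As a sketch of that "no matrix inversion" claim, here is one way to recover the last coefficient purely by iterating the residual function, checked against NumPy's least-squares solver (assuming NumPy; the helper name `last_coef` is mine, not the lecture's):

```python
import numpy as np

def e(a, b):
    # residual of a after regressing it on b through the origin
    return a - b * (a @ b) / (b @ b)

def last_coef(y, xs):
    """Coefficient of xs[-1]: iteratively regress each earlier
    variable out of the outcome and out of the later variables,
    then finish with a regression through the origin."""
    while len(xs) > 1:
        x1, xs = xs[0], xs[1:]
        y = e(y, x1)
        xs = [e(x, x1) for x in xs]
    return (y @ xs[0]) / (xs[0] @ xs[0])

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = rng.normal(size=n)

# full least squares fit for comparison
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.isclose(last_coef(y, list(X.T)), beta[-1]))  # True
```

Only inner products and subtraction are used on the way to the coefficient; the same loop run in a different variable order recovers each of the other coefficients.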

But also, a couple of other things come to mind.

The first thing that comes to mind is that we did it in an arbitrary order;

we just picked working towards the last coefficient.

We could also have worked towards the first coefficient,

or done it in any order, so

you can see that it doesn't matter which order we take the residuals in.

All that matters is that we're iteratively taking residuals.
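A quick numerical check of that order-invariance claim (again assuming NumPy): regressing out x₁ and then the residual of x₂, or x₂ and then the residual of x₁, yields the same estimate for β₃.

```python
import numpy as np

def e(a, b):
    # residual of a after regressing it on b through the origin
    return a - b * (a @ b) / (b @ b)

rng = np.random.default_rng(3)
y, x1, x2, x3 = rng.normal(size=(4, 60))

def beta3(y, first, second, x3):
    # regress `first` out of everything, then the residual of `second`
    ey = e(e(y, first), e(second, first))
    ex3 = e(e(x3, first), e(second, first))
    return (ey @ ex3) / (ex3 @ ex3)

b3_a = beta3(y, x1, x2, x3)   # x1 first, then x2
b3_b = beta3(y, x2, x1, x3)   # x2 first, then x1
print(np.isclose(b3_a, b3_b))  # True
```

Both orders project out the same span of {x₁, x₂}, which is why the two estimates agree.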

And I find this process really helps me understand in what sense

linear regression is adjusting for these other variables.

Because it's taking out the linear association of all the other variables

from everything else.

And I should note that

8:12

thinking this way is not restricted to separating out every vector individually.

So, suppose for example I had y − x₁β₁ − x₂β₂,

where now x₁ is n by p₁ and x₂ is n by p₂.

So I've broken my design matrix into two parts, x₁ and x₂.

8:51

Okay, so now this term, y − x₁β₁, is a single outcome,

and I know what my solution for β₂ would have to be.

My β₂ hat, as it depends on β₁,

would simply have to be (x₂ᵀx₂)⁻¹x₂ᵀ times my outcome.

But because x₁ and β₁ are being held fixed,

that works out to be (x₂ᵀx₂)⁻¹x₂ᵀy,

minus (x₂ᵀx₂)⁻¹x₂ᵀx₁ times β₁.


9:52

The first term, (x₂ᵀx₂)⁻¹x₂ᵀy, is the coefficient I would get if I had only regressed y on x₂.

And (x₂ᵀx₂)⁻¹x₂ᵀx₁ is the collection of coefficients I would get if I regressed every single

column of x₁, as an outcome, on x₂ as a predictor.

When I plug this estimate back in for β₂, what do I get?

I get y minus the hat matrix for x₂,

x₂(x₂ᵀx₂)⁻¹x₂ᵀ, times y,

minus the same sort of term applied to x₁β₁.

I'm going to just write this out in a more

convenient form: (I − x₂(x₂ᵀx₂)⁻¹x₂ᵀ)y,

minus (I − x₂(x₂ᵀx₂)⁻¹x₂ᵀ)x₁

times β₁.

Okay.

So now, what is this?

The first piece is the residual of y having regressed out x₂,

and the resulting criterion is of course no larger than our original one.

11:13

And the second piece is the residual from

having regressed x₂ out of every column of x₁.

Okay, so one way to think about our estimate for β₁ is: first get rid of

all of the effect associated with x₂, out of both y and

x₁, and then perform the regression with just those sets of residuals.

And again, to me this really helps me understand in what sense

regression is doing adjustment.

But notice, this is again the same argument as above;

we're just doing it with matrices now.
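The matrix version (essentially the Frisch-Waugh-Lovell theorem) can be sketched the same way: residualize y and every column of x₁ on x₂, regress the residuals, and compare with the full fit. Assuming NumPy; the names and dimensions here are mine.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p1 = 200, 2
x1 = rng.normal(size=(n, p1))                                # first block
x2 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # second block
y = rng.normal(size=n)

# full least squares on the combined design [x1 x2]
beta = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)[0]

# residualize y and every column of x1 on x2, then regress residuals
h2 = x2 @ np.linalg.solve(x2.T @ x2, x2.T)   # hat matrix for x2
beta1 = np.linalg.lstsq(x1 - h2 @ x1, y - h2 @ y, rcond=None)[0]

print(np.allclose(beta1, beta[:p1]))  # True
```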

Okay, so even though this way of thinking is a little confusing, and you

would never actually program a computer to fit least squares this way, I find

it a very useful way to think about linear regression and what it's accomplishing.

11:59

And conversely, there's nothing special about holding β₁

fixed first; we could have held β₂ fixed first.

And so what we see is that every coefficient in least squares is obtained

this way, by having regressed all the other regressors out of both y and

the predictors associated with that coefficient.