0:01

So let's talk about another instance of an orthonormal basis that comes up

quite frequently.

So imagine our x is n by p and let's still assume that p is less than or

equal to n, but imagine if we have a large number of subjects.

Or records, so n is large, and then p is also large.

So we want some way to reduce the dimension of x to make it a little

bit more manageable.

0:39

D is a p by p diagonal matrix of singular values,

and V transpose is a p by p matrix.

And these are such that u transpose u equals v transpose v equals I.
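To make this concrete, here is a minimal numpy sketch (my own illustration, not from the lecture, using a small simulated X) checking that the thin SVD factors have orthonormal columns:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))     # simulated data, p <= n

# Thin SVD: X = U D V', with U n-by-p, D p-by-p diagonal, V' p-by-p
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# U'U = V'V = I: both factors have orthonormal columns
print(np.allclose(U.T @ U, np.eye(p)))   # True
print(np.allclose(Vt @ Vt.T, np.eye(p))) # True
```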

Okay, so one thing I want to note is, imagine for the time being

that X has been centered, in the sense that all of its column means are zero.

1:08

And consider X transpose X, which is effectively

the variance-covariance matrix of the X matrix, disregarding the n minus 1.

Well, using the decomposition above,

that's equal to V D U transpose U D V transpose.

Which is equal to V D squared V transpose.

So the eigenvalue decomposition of X transpose

X is related to the singular value decomposition of X itself.
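A quick numpy check of this relationship, with a simulated centered X (my own example): the eigenvalues of X'X should be exactly the squared singular values of X.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)          # center the columns

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of X'X, sorted in decreasing order to match the SVD convention
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]

# They agree with the squared singular values: X'X = V D^2 V'
print(np.allclose(eigvals, d**2))   # True
```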

1:46

The squared singular values,

the eigenvalues from the eigenvalue decomposition, help summarize the variance

in the X transpose X matrix, which is itself a variance-covariance matrix.

And these are usually ordered so that the larger d squared values are earlier.

So they're usually in decreasing order.

So the first one is the largest.

The second one is the second-largest and so on.

And so, consider the fact that the trace of X transpose X

is equal to the trace of V D squared V transpose.

And then I can move that V over, since trace of AB is trace of BA,

and V transpose V is I.

So the trace of X transpose X is equal to the trace of D squared,

2:39

which is the sum of the squared singular values, my eigenvalues.

And so what this means is the eigenvalues summarize the variability,

in the sense that the trace is the total variability in my X transpose X matrix,

obtained by taking the sum of all the diagonals, the sum of all the variances.
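The trace identity is easy to verify numerically; here is a short sketch with my own simulated centered X:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)          # center the columns

d = np.linalg.svd(X, compute_uv=False)   # just the singular values

# trace(X'X) = trace(V D^2 V') = trace(D^2 V'V) = trace(D^2) = sum of d_i^2
print(np.isclose(np.trace(X.T @ X), np.sum(d**2)))   # True
```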

Okay, so what we could do is take the first three components of our decomposition.

3:45

X times V times D inverse is equal to U.

So a way to think about how we get at the scores,

these so-called scores, the vectors

in U, is by multiplication of X by V.

So what v does is it combines the columns of x

in such a way that it gives us these scores.

And then the D inverse is sort of a normalization term, right? The D's you can

think of as variance-like terms, so multiplication by D inverse

is sort of like normalizing, in the sense of dividing by a standard deviation.
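As a sketch of that recipe (simulated X and variable names are my own): multiplying X by V combines the columns, and dividing by the singular values normalizes, which recovers U exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 4
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)          # center the columns

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Scores: X V combines the columns of X; dividing by d normalizes each score
scores = (X @ Vt.T) / d         # this is X V D^{-1}, which equals U
print(np.allclose(scores, U))   # True
```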

Okay, so, to every column of V, which

is an eigenvector, is associated an eigenvalue, which is an element of D squared.

We might take the top three and

then only take, say, the first three columns of this U matrix, okay?

So by only taking the first three columns of V,

we'd only be taking the first three diagonal elements of D.

Or we could of course just do the singular value decomposition, which will give us U,

D, and V, and just take the first three columns of U.

So we could then try to minimize y

minus U gamma squared, let's say, okay?

Where I'm going to put a little 3 under my U, because I

just happened to grab the first three columns of U, and

what this would mean is I'm trying to regress y

on the design matrix U, but my U was selected in a way to

capture as much variation as I could in my X.

Of course, I'm just using three as an example;

you could use any number of columns of U to do this.

But you want to explore what percentage of the variation they explain, and

whether that percentage is tolerable for your goals.
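One way to do that exploration, sketched on my own simulated X: the proportion of variance explained by each component is its squared singular value over the total, and the cumulative sum tells you how many components you might keep.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6))
X = X - X.mean(axis=0)          # center the columns

d = np.linalg.svd(X, compute_uv=False)

# Proportion of total variance captured by each component, and cumulatively
prop = d**2 / np.sum(d**2)
cumulative = np.cumsum(prop)
print(np.round(cumulative, 3))  # inspect how many components you need
```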

5:52

But at any rate, our discussion of orthonormal bases notes that because of

course U is orthonormal, grabbing any three columns of U,

in particular the first three columns of U, also gives an orthonormal matrix.

And so our estimate of gamma, our gamma hat,

is just going to be U 3 transpose times y.
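Putting the pieces together, here is a short sketch of that fit (simulated X and y are my own): because the retained columns of U are orthonormal, the least squares solution collapses to a simple cross-product.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, k = 100, 6, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)          # center the columns
y = rng.normal(size=n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
U_k = U[:, :k]                  # first k principal component scores

# Orthonormal design: gamma hat = (U_k' U_k)^{-1} U_k' y = U_k' y
gamma_hat = U_k.T @ y

# Same answer as the general least squares solver
print(np.allclose(gamma_hat, np.linalg.lstsq(U_k, y, rcond=None)[0]))  # True
```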

Okay, so what we find is that the way in which we get these

principal component regressors is simply by taking

the singular value decomposition of our centered X matrix, and

taking the relevant columns of U, our left singular vectors, which,

if we think about it in terms of principal components, are our scores.

And then if we simply multiply the transpose of them times y,

we actually get the associated coefficients.

So this just goes to show how we can use these nice properties that

we get out of least squares in this particular case.

So using the singular value decomposition to come up with an orthonormal

basis represents, I think, one of the three most important basis

concepts in statistics; certainly I would describe wavelets, Fourier

transforms, and the principal component basis as the three.

And I think you can see that in this case

it fits very nicely into the topic of regression.

And it also fits very nicely if we have a large

X matrix with a lot of columns that we want to summarize.

One caveat I would suggest being careful of.

Again, we get U.

We can think of U as these linear combinations of our columns of X.

If the units of x don't make sense to combine

then this procedure may not make a lot of sense to do.

So if the first column of X is in one kind of units and

the second column of X is in a different kind of units, then the interpretability

of your scores may really suffer as a result.

So again, there are a lot of intricacies to doing this.

And I think if you wanted to learn more about this,

a class on multivariate statistics would be the way to go.

But I just wanted to reinforce the point that when we have a design matrix that's

orthonormal, we wind up with a really simple solution for the coefficients.

Okay, and next we'll go through a coding example where we work through some

of these sorts of examples.