0:00

So one issue with either the SVD or principal components analysis is missing values. Real data will typically have missing values, and the problem is that if you try to run the SVD on a data set that has some missing values, like the one I've created here, you can see that you get an error. You just can't run it on a data set that has missing values, so you need to do something about the missing values before you run an SVD or a PCA.

0:26

So one possibility, and there are many others, is to use the impute package, which is available from the Bioconductor project, and just impute the missing data points so that you have a value there, and then you can run your SVD.

This code here uses the impute.knn function, which takes missing values in a row and imputes them from the k nearest neighbors to that row.

So if k, for example, is five, then it will take the five rows that are closest to the row with the missing data, and impute the missing values in that row with, roughly, the average of the other five.

And so, once we've imputed the data with this impute.knn function, we can run the SVD, and you can see it runs without error. Then we can plot the first singular vector from each data set.

On the left-hand side, I've got the first singular vector from the original data matrix, and on the right-hand side, I've got the first singular vector from the data matrix where the missing data was imputed.

Now, you can see that they're roughly similar. They're not exactly the same, but the imputation didn't seem to have a major effect on the result of the SVD.
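The course's own code is in R, using impute.knn from the Bioconductor impute package. As a rough sketch of the same idea (not the actual course code), here is a hypothetical Python version with numpy: each incomplete row is filled in from the average of its k nearest complete rows. The helper name knn_impute and the synthetic data are my own assumptions.

```python
import numpy as np

def knn_impute(X, k=5):
    """Fill each row's missing entries with the average of the k nearest
    complete rows, with distance measured on the observed columns only.
    A simplified stand-in for the impute.knn function used in the lecture."""
    X = X.astype(float).copy()
    complete = X[~np.isnan(X).any(axis=1)]            # rows with no NAs
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        row, miss = X[i], np.isnan(X[i])
        # Euclidean distance to each complete row over the observed columns
        d = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        neighbors = complete[np.argsort(d)[:k]]
        X[i, miss] = neighbors[:, miss].mean(axis=0)  # average the neighbors
    return X

rng = np.random.default_rng(0)
data = rng.normal(size=(40, 10))
data[rng.integers(0, 40, size=5), rng.integers(0, 10, size=5)] = np.nan

# np.linalg.svd fails (or returns NaNs) on data with missing values;
# after imputation it runs cleanly.
filled = knn_impute(data, k=5)
u, s, vt = np.linalg.svd(filled, full_matrices=False)
```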

1:46

So this is a final example here; it's just kind of an interesting one. I want to show how you can take an actual image, which is represented as a matrix, and develop a lower-dimensional, or lower-rank, representation of that image.

So here's a picture of a face. It's a relatively low-resolution picture, but you can see that there is a nose, two ears, two eyes, and a mouth there.

And so what we're going to do is run the SVD on this face data and look at the variance explained. You can see that the first singular vector explains about 40% of the variation, the second about twenty-some percent, and the third maybe 15%. And if you look at, say, the first five to ten singular vectors, they capture pretty much all of the variation in the data set.
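The proportion of variance explained by each singular vector is just its squared singular value over the sum of all the squared singular values. As an illustration, here is how one might compute it in numpy, with a synthetic low-rank matrix standing in for the face image (the setup is an assumption, not the course data):

```python
import numpy as np

# Build a roughly rank-5 "image" plus a little noise, standing in for the face.
rng = np.random.default_rng(1)
signal = sum(np.outer(rng.normal(size=32), rng.normal(size=32)) for _ in range(5))
img = signal + 0.1 * rng.normal(size=(32, 32))

# Variance explained by the i-th singular vector: d_i^2 / sum_j d_j^2
s = np.linalg.svd(img, compute_uv=False)
var_explained = s**2 / np.sum(s**2)

# The first few components dominate, just as in the lecture's variance plot.
```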

And so we can actually look at the image that's generated by, say, the first singular vector, or the first five, or the first ten.

2:53

That is, an image that uses fewer components than the original data set. So here I'm creating one approximation that uses just the first principal component, the first singular vector; one that takes the first five altogether; and another one that takes the first ten. And so we can take a look at what these approximations look like.
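A rank-k approximation keeps only the first k singular vectors and values. The lecture builds these in R; as a minimal illustrative sketch, the numpy equivalent looks like this, with a random matrix standing in for the face image (the helper low_rank is my own name):

```python
import numpy as np

def low_rank(u, s, vt, k):
    """Rank-k approximation: keep only the first k singular vectors/values."""
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

rng = np.random.default_rng(2)
img = rng.normal(size=(32, 32))               # stand-in for the face matrix
u, s, vt = np.linalg.svd(img, full_matrices=False)

approx1, approx5, approx10 = (low_rank(u, s, vt, k) for k in (1, 5, 10))

# Reconstruction error shrinks as more singular vectors are kept,
# and keeping all of them recovers the original matrix exactly.
errors = [np.linalg.norm(img - a) for a in (approx1, approx5, approx10)]
```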

So the first image here, all the way on the left, uses just a single singular vector. You can see that it's not a pretty picture, so to speak; there's not really a face there, and there's not much you can see. But it's asking a lot to represent an entire image using just a single vector.

If we move on to the second one from the left, which uses the first five singular vectors, you can see that most of the key features are already there: clearly there's a face, with two eyes, a nose, a mouth, and two ears.

3:49

If you move on to the next picture, letter C here, you can see that it has a little bit more definition. This one uses the first ten singular vectors, but it's not very different from the second one, which only used five.

And then the very last one here on the right is the original data set. So you can see that if you use just a few singular vectors, maybe up to five or ten, you can get a reasonable approximation of this face without having to store all of the original data.

So this is an example of the kind of data compression that the singular value decomposition can provide. Data compression and statistical summaries are, in a way, two sides of the same coin: if you want to summarize a data set with a smaller number of features, the singular value decomposition is useful for that as well.

4:44

So just a couple of notes and further resources for the singular value decomposition and principal components analysis. One of the issues is that the scale of your data matters. For example, it's common to measure lots of different variables that come on different scales, and that can cause a problem: if one variable is much larger than another simply because its units are different, it will tend to drive the principal components and the singular vectors, and that may not be particularly meaningful to you. So you want to check that the scales of the different columns or rows are roughly comparable to each other.
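To see why, here is a small numpy illustration (an assumed setup, not from the lecture): one column measured on a much larger scale dominates the first right singular vector until the columns are standardized.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
X[:, 0] *= 1000                  # one variable measured in much larger units

# SVD of the merely centered data: the first right singular vector points
# almost entirely at the large-scale column.
Xc = X - X.mean(axis=0)
_, _, vt_raw = np.linalg.svd(Xc, full_matrices=False)

# After centering AND scaling each column to unit standard deviation,
# no single column dominates just because of its units.
Xs = Xc / X.std(axis=0)
_, _, vt_std = np.linalg.svd(Xs, full_matrices=False)
```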

5:21

As we saw in the example with the two different patterns, the principal components and the singular vectors may mix together real patterns. So the patterns that you see may not be separable patterns; they may be patterns that are mixed together.

The singular value decomposition can be computationally intensive if you have a very large matrix, so that's something to keep in mind. We used relatively small matrices here, but of course computing power is getting ever more

5:48

powerful, and there are some highly optimized and specialized matrix libraries out there for computing the singular value decomposition. So this can be done on lots of practical problems.

6:04

And so here are a couple of links to further resources on how to use principal components analysis and the singular value decomposition. There are also other approaches that are similar to these but differ in many of the details. You may hear about approaches like factor analysis, independent components analysis, and latent semantic analysis. These are worth exploring, and they are related to the basic idea behind principal components analysis and the singular value decomposition: you want to find a lower-dimensional representation that explains most of the variation in the data that you see.