0:00

Let's talk a little bit about unsupervised learning techniques.

Remember, that the difference from

supervised learning is the absence of a target variable.

In general, these techniques are applied to

reveal a hidden structure in the data if it exists.

It's particularly useful in

multidimensional settings when our intuition fails us miserably.

And we can't just plug the data and see everything straight away.

Often, these techniques become a part of a larger task

helping to direct further analysis. Let's be more specific.

If we don't have clear targets,

then it's sensible to have some questions about

the data that we'd like to answer using unsupervised learning techniques.

Here, you can see a few examples of such questions.

The first one, can we reduce dimensions while preserving information?

It makes sense when we are faced with

too many features and expect that not all of them contain useful information.

The word information here is little bit tricky

because it's not really defined and based on the definition,

you might want to do different things.

The second question is about clusters in the data.

Sometimes, the data at hand is not homogeneous and

there could be groups of points that can be clearly distinguished.

Usually, these groups should be treated

differently and it's important to understand if they exist.

Each of these tasks usually gets set in

formal terms and is solved by an appropriate method.

Now, let's focus on the first question and discuss one of the basic methods to handle it.

Before going into formalities,

let's build some intuition.

Take a look at this picture that shows points in two dimensional space.

Imagine that we have to throw away one of these dimensions.

Which one would you pick for throwing away?

Note that effectively, we are choosing which of

the two projections is more meaningful for us.

Well I guess that many of you would throw

away the X_2 as the size of projection on X_1 seems to be larger.

Okay, so far so good.

But what if I ask you the same question for this picture?

I personally would be puzzled,

as our previous logic doesn't seem to work here.

Looks like none of the initial axes is a good candidate for elimination.

But if you don't like the initial axes,

let's change them first.

Here you can see my suggestion on how we could do that.

In this new axis,

our logic starts working again and we can easily disregard X_2 prime.

Let's analyze this little example as it illustrates a few important points.

One of them is that our initial axis might not present the information in the best way.

Typically, the measurement are given to us

and we seldom can affect the process they were collected.

And other point is that intuitively we can see the future to be more

informative in case it accounts for more variance in the data.

That is after projection on that axis.

The variability in the data doesn't change much.

Turns out there is a method that rotates

the space and finds a good directions with this property.

The method is called PCA or

principal components analysis and we'll discuss it in more detail.

The main parameter characterizing how data is spread around its mean is variance.

By now, you all should be familiar with it,

but we'll go further and define covariance which is

basically a measure of linear dependency between variables.

By the way, correlation is just a properly normalized covariance,

and variance is covariance of a feature with itself.

Calculating covariances for all the pairs officious reveal spatial structure of

the data and allows to identify those desired directions we've discussed earlier.

Let's see how it works more formally.

Remember, that we are dealing with a multidimensional space,

so we have a matrix of all features which is denoted as X in this slide.

Each row of a matrix is a point.

So this matrix shows four points in three dimensional space.

Computation of all the covariances among

all the variables is a simple matrix multiplication.

The covariance matrix has the size of number of

features by number of features which is three by three in my example.

It is a symmetric matrix which has variances of corresponding features on the diagonal.

As I said previously,

this matrix is all we need to find those directions,

projection on which the variance is preserved in the most effective way.

To do it, we rely on the so-called eigenvector decomposition of a matrix.

This is a mathematical technique that can always be

applied to a symmetric, non-negatively defined matrix.

Effectively, it decomposes the matrix in a product

of three matrices the way it is shown in the slide.

You can see, that the third matrix is just a transposition of the first

one and the second matrix has a particularly simple diagonal structure.

The values of numbers on the diagonal are always non-negative.

And I suggest rearranging them in non-increasing order.

So number one is the maximum among them and number N is the minimum.

The colons of the first matrix are called eigenvectors or principal components.

While value of the corresponding numbers are called eigenvalues.

For our task, the principal components are exactly those directions we are looking for.

Let's take the first principal component.

Turns out that if we project all the data in

this direction and calculate the variance of the projection,

we are going to get exactly number one.

Moreover, you can't find

a better direction projecting on which you'd get a better variance.

Go back to our picture,

X_1 prime and X_2 prime are exactly the

first and the second principle components that this method finds for us.

Additional property of these vectors is that they are diagonal and thus,

forming your bases that we can consider instead of the initial one.

So, now we are ready to approach

the task of dimensionality reduction in the following way.

First, we get principal components applying this cause technique.

Then, we just take a few first principal components

and project everything on the subset that defined.

By the way, it's exactly what we did in our small example.

And so, the only question left is how many components to pick?

To answer this question,

we need to see how much variance out of

the total is going to be left after the projection.

This value is usually called explained variance.

Total variance is the sum of all numbers

while they explained variance is the sum of selected ones.

Of course, it makes sense to pick the components with larger numbers first.

For example, we might want to take a subset that explain at least 80% of

variance or just stop at the point when they explain variance starts growing rather slow.

If you take a look at the graph that shows

how not explained variance depends on number of components.

Typically, it resembles a hand of a person stretching from shoulder to wrist.

There is often a point where the slope of the graph changes.

This points correspond to an elbow in

our imaginary hand and picking the number of components called an elbow method.

Important note on PCA is that,

it's better applied to standardized data.

It effectively means that we apply

decomposition to correlation matrix rather than covariance matrix.

By standardization, we make features to be of

comparable scale which eliminates biases towards the features with large absolute values.

So, in this lesson,

we've learned about one of the approaches to the task of dimensionality reduction.

The approach is called PCA and it tries to preserve the total variance in the data.

The method identified this set of new directions called principal components.

The dimensions are reduced by simply picking the subset of these components.

And this procedure is guaranteed to preserve variance in the most effective way.

The last thing that I want to note is that there are other procedures for

dimensionality reduction that strive to preserve

other properties of the data like relative prioritize distance between the points.

There is no other right method and the choice of technique depends on your final goal.