Hello and welcome to this lesson, in which we'll introduce manifold learning.

Manifold learning is similar to

dimensionality reduction in that it is trying to reduce the dimensionality of the dataset.

However, in this case,

we're trying to determine a manifold or

a high-dimensional curve that captures the signal in

the data and then to map

that high-dimensional curve or manifold

to a much lower-dimensional manifold that can be visualized.

Typically, these algorithms are going to be used

to enable visualization of high-dimensional data,

and we'll see that in the Notebook when I demonstrate this to you.

So, in this lesson, I really want you to understand what

manifold learning is and why it can be an effective tool.

You should be able to explain the differences between

the primary manifold learning algorithms we are going to use,

including LLE, Isomap, and the t-SNE algorithm.

You should be able to apply manifold learning to do

this two-dimensional visualization of these high-dimensional datasets.

And lastly, there's another related concept that I'm

going to introduce called FeatureUnion,

which allows us to combine the feature selection done

by different algorithms into a new set of features.

So, for instance, we might want to apply PCA and select only a few components,

and we also might want to apply recursive feature elimination.

And that way, we can use different techniques to try to

build a more representative set of features.

And so, we'll show how that works.
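As a rough sketch of the idea just described, and not the Notebook's actual code, the following assumes scikit-learn's `FeatureUnion` wrapping a `PCA` step and a recursive feature elimination (`RFE`) step. The component count, the number of selected features, the step size, and the decision tree used to rank features are all illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.pipeline import FeatureUnion
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Assumed settings for illustration: 5 principal components plus
# 10 original pixels kept by recursive feature elimination.
combined = FeatureUnion([
    ("pca", PCA(n_components=5)),
    ("rfe", RFE(DecisionTreeClassifier(random_state=0),
                n_features_to_select=10, step=8)),
])

# RFE is supervised, so the labels are passed along; PCA ignores them.
X_new = combined.fit_transform(X, y)
print(X_new.shape)  # (1797, 15): 5 PCA columns followed by 10 selected pixels
```

The two blocks of columns are simply placed side by side, so a downstream model sees both the PCA components and the selected original features.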

The readings or activities for this particular lesson include a reading,

the Notebook Manifold Learning Project, by Jake VanderPlas.

You can skip the example section.

And then, of course, there's the Notebook.

So, just quickly, on the discussion here.

He does a very nice job discussing what

manifolds are and these manifold learning techniques,

and uses a special type of data,

which is called Hello. And you'll see it here.

You can see it's points that are colored, showing you the word Hello.

Obviously, this is a nonlinear representation of the data.

And so, you can apply MDS algorithms to try to capture this signal.

And so, he shows you this and how it can all be computed

and then goes through different algorithms and tries to recover that distribution.

In our Notebook, we're going to focus on looking at the digit dataset.

We're going to show how PCA with just two components can

sort of provide a representative view or visualization of this data.
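A minimal sketch of that two-component PCA view, assuming scikit-learn's bundled `load_digits` data is the digit dataset meant here. The variable names are illustrative, and the plotting step is left out.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
X, y = digits.data, digits.target
print(X.shape)      # (1797, 64): each 8x8 image flattened to 64 pixel values

# Project the 64-dimensional data onto its first two principal components.
pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X)
print(X_pca.shape)  # (1797, 2)

# Fraction of the total variance that survives in the 2-D view.
kept = pca.explained_variance_ratio_.sum()
print(f"variance kept by two components: {kept:.2f}")
```

The labels `y` are loaded but never passed to PCA; they would only be used afterwards to color the points.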

And then, we're going to walk through several different algorithms

including LLE or locally linear embedding,

multidimensional scaling, isometric mapping or Isomap,

and then the t-SNE algorithm.

We're going to see all of them can generate

two-dimensional representations of this high-dimensional dataset.

Remember, the handwritten digit dataset is actually

64 dimensions since they are eight by eight pixels,

and we're going to see how well this actually separates the data out.

Can we recover the original signal?

Remember, this is an unsupervised learning algorithm.

So we don't use labels in the actual generation of the output.

We simply use the labels at the end to determine whether we did a good job of taking

this high-dimensional space where these digit data were located and keeping them together,

because that's the fundamental idea that you're going to generate

a mapping from this high-dimensional space to this low-dimensional, say,

two-dimensional space, where points that are near each other in

this high-dimensional space remain near each other in the low-dimensional space.
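One way to make "near points stay near" concrete is to count how many of each point's nearest neighbors in the original space are still among its nearest neighbors after the mapping. The helper `neighbor_overlap` below is a hypothetical function written for this illustration, not something from the Notebook, and PCA stands in for whichever mapping is being checked.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors


def neighbor_overlap(X_high, X_low, k=25):
    """Mean fraction of each point's k nearest neighbors kept by the mapping.

    Hypothetical helper for illustration only.
    """
    # kneighbors() with no query returns neighbors of the fitted points,
    # excluding each point itself.
    idx_high = NearestNeighbors(n_neighbors=k).fit(X_high).kneighbors(
        return_distance=False)
    idx_low = NearestNeighbors(n_neighbors=k).fit(X_low).kneighbors(
        return_distance=False)
    kept = [len(set(h) & set(l)) / k for h, l in zip(idx_high, idx_low)]
    return float(np.mean(kept))


X, y = load_digits(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)

overlap = neighbor_overlap(X, X_2d, k=25)
print(f"share of 25 nearest neighbors preserved in 2-D: {overlap:.2f}")
```

A value near 1 would mean local neighborhoods survive the mapping almost intact; note that the labels play no part in this check.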

And then, lastly, we'll look at feature unions.

So, again, we start the Notebook off the same way. We read in our data.

We've got code here that plots these digit data,

and we'll see what this plot looks like in just a second.

Basically, we're just trying to represent

the different data in a way that makes it easy to distinguish them.

So, we load our data. There is the digit dataset.

I've now inverted it so that it's black on white,

just to sort of give a different view.

We're going to be using 25 neighbors.

You can, of course, change this value and see how it impacts the rest of the Notebook.

So, this now is what that plot method does.

You can see that it plots all of the digit dataset.

It colors each class differently.

The algorithms themselves don't see this labeled data.

But when you apply PCA,

you can see that the light blue or zero digits stay fairly well-clustered.

Six is fairly well-isolated as well,

two and three seem to be as well.

But some of them like five,

which is this red dot,

reds are all over the place.

They're kind of intermixed. If you think about that,

it should make some sense.

Zero is different than the rest of the numbers.

It's pretty easy to distinguish that.

Three and two, they're different as well.

Five looks a lot like all of these others that it's mixed in here with.

And so, that's part of the challenge.

Can we separate those out in a way that we're able to pull that information out?

So, one last thing I want to complete here.

When you see the number five,

this is actually the number five displayed on

the plot at the exact center of the distribution of its data.

So, the zero, this is the exact center of the distribution of these light blue points,

same with six, four, et cetera.

So, this gives you a feel for where the centers of the individual clusters

of points that represent these different digits are located.
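A small sketch of how such cluster centers could be computed. It assumes "center" means the per-class mean of the 2-D coordinates; the Notebook's plot function may use a different statistic, and the drawing itself is omitted here.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)

# Assumption: the center of each digit's cloud is the mean of its points.
# The labels are used only here, after the unsupervised mapping is done.
centers = {int(d): X_2d[y == d].mean(axis=0) for d in np.unique(y)}

for d, (cx, cy) in centers.items():
    # A plotting library could draw str(d) at (cx, cy) on the scatter plot.
    print(f"digit {d}: center at ({cx:.1f}, {cy:.1f})")
```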

So, let's take a look at the output of the rest of these algorithms.

I'm not going to talk in this video about these algorithms.

You can read about them. I just want to look at what that plot looks like.

So the first one here is LLE.

And if you change the parameters,

it changes the mapping.

And here you notice that they're not

nice spherical clusters. They're all different kinds.

So here you can see seven is this long spike.

Six is this long spike.

Zeroes over here are a very tight little spike.

So, in this particular choice of hyperparameters,

you can see that six,

and even eight to a degree, zero,

four, and seven, those are well-separated, but one, two,

three, five, and nine are all really clumped together tightly.

So, LLE didn't do a great job of pulling those apart,

but it did do a nice job with

this particular set of hyperparameters of pulling out these.

As you tune those hyperparameters,

you might see different results.
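For reference, a minimal LLE sketch using scikit-learn's `LocallyLinearEmbedding` with the 25 neighbors mentioned earlier. The dense eigen-solver and the fixed random seed are assumptions made here for a small, repeatable run and may differ from the Notebook's settings.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding

X, y = load_digits(return_X_y=True)

# n_neighbors is the hyperparameter to experiment with; changing it
# changes the mapping. eigen_solver="dense" is an assumed choice.
lle = LocallyLinearEmbedding(n_neighbors=25, n_components=2,
                             eigen_solver="dense", random_state=0)
X_lle = lle.fit_transform(X)
print(X_lle.shape)  # (1797, 2)
```

Rerunning with, say, 10 or 50 neighbors is a quick way to see how sensitive the spikes and clumps are to that one setting.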

Let's take a look at the next algorithm, MDS.

MDS does a different job.

You notice that the overall distribution of points is

sort of spherical or circular in this two-dimensional space.

Zero, six, and four are nicely pulled out.

Three and two, nicely pulled out.

The nine is not too bad,

but fives are all over the place again.

And ones aren't too bad as well.

So you see that it did a better job than PCA and

the first algorithm LLE in terms of actually separating out the clusters,

but there are still some, particularly five,

that are intermixed with the rest.
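A hedged sketch of the MDS step with scikit-learn's `MDS`. To keep the run short, it uses only the first 500 digits, a single initialization, and fewer iterations; those three choices are assumptions for this example, not the Notebook's settings.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import MDS

X, y = load_digits(return_X_y=True)

# Assumed subset for speed: MDS works on all pairwise distances,
# so its cost grows quickly with the number of points.
X_sub, y_sub = X[:500], y[:500]

mds = MDS(n_components=2, n_init=1, max_iter=100, random_state=0)
X_mds = mds.fit_transform(X_sub)

print(X_mds.shape)  # (500, 2)
print(f"final stress: {mds.stress_:.1f}")
```

The stress value summarizes how far the 2-D pairwise distances are from the original ones, so lower is better for a given dataset.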

We can look at Isomap,

which is actually a form of MDS, and see what it does.

And here, interestingly enough,

you can now start to see four, six,

and zero as well; those are pulled out, well-separated.

Two and three, not too badly separated, seven, not too bad.

And remember, this is a 64-dimensional space

that we've now projected down to two dimensions.

And so, that's actually pretty impressive what this algorithm has done.
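A minimal Isomap sketch with scikit-learn's `Isomap`, again assuming the 25-neighbor setting from earlier in the Notebook. Isomap measures distances along a nearest-neighbor graph rather than straight through the 64-dimensional space, and then embeds those distances in an MDS-like step.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X, y = load_digits(return_X_y=True)

# The neighbor graph is built from each point's 25 nearest neighbors.
iso = Isomap(n_neighbors=25, n_components=2)
X_iso = iso.fit_transform(X)
print(X_iso.shape)  # (1797, 2)
```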

The last one I'll look at is t-SNE.

t-SNE is actually one that some people just get blown away by how well it does.

It's very powerful at taking

a very high-dimensional space and generating this two-dimensional visualization.

The challenge is it's much more computationally complex.

And so, a lot of times people will do, say,

PCA initially to take a very high-dimensional space and reduce it down to, say,

50 or 100 dimensions and then use

t-SNE to visualize that because it's going to be much faster.
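The two-stage approach described here can be sketched as follows. The digit data has only 64 dimensions, so reducing to 50 saves little in this case; the example is meant to show the pattern, and the 500-point subset and the seed are assumptions to keep the run short.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X_sub = X[:500]  # assumed subset, for speed only

# Stage 1: a cheap linear reduction to 50 dimensions.
X_50 = PCA(n_components=50, random_state=0).fit_transform(X_sub)

# Stage 2: the expensive nonlinear step runs on the smaller representation.
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_50)

print(X_50.shape)  # (500, 50)
print(X_2d.shape)  # (500, 2)
```

On data with thousands of dimensions, the first stage is where most of the time saving would come from.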

So let's take a look at t-SNE on that digit dataset.

And here you understand now what I'm talking about.

Look at the zeroes. The zeroes are all pulled out.

The nines are pulled out.

There's just these little clumps of them remaining,

six, four, et cetera.

Two, nicely pulled out. Three, nicely pulled out.

And even where the data are not part of the main cluster,

they're still clustered themselves.

So, it does a nice job of maintaining that clustering of the data.

And again, these are just the default hyperparameters.

We haven't really tried to do a lot of tuning here,

and you could still see that it's done

a really amazing job of separating these clusters out.
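A short sketch of t-SNE on the digit data with scikit-learn's default hyperparameters. Only the random seed is set, which is an assumption added here so that repeated runs give the same picture.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# Defaults are kept for perplexity, learning rate, and so on.
tsne = TSNE(n_components=2, random_state=0)
X_tsne = tsne.fit_transform(X)

print(X_tsne.shape)  # (1797, 2)
print(f"final KL divergence: {tsne.kl_divergence_:.2f}")
```

Perplexity is usually the first setting to vary if the clusters look too fragmented or too merged.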

We wouldn't expect perfect separation. I mean,

fives and sixes sometimes look alike,

and you can see there's a five over here.

And the ones are sort of spread out a little bit.

You can see that they're not very well-bunched.

But then, again, a one can look like a seven,

and it can maybe look like a five or something.

So you can understand where some of these are occurring.

But again, if you think about it,

this algorithm separated these data with no guidance from us.

We simply told it there's a high-dimensional space,

generate a two-dimensional space,

and preserve that geometry in the high-dimensional space so

that points stay together when they're down here in this lower-dimensional space.

The last thing, of course, was feature unions.

This Notebook walks through how to apply this.

I'm not going to talk a lot about this in the video.

It's a nice way of taking different techniques and combining them together to

hopefully end up with a better set

of features than you might have gotten with just one technique.

So, I'm going to go ahead and stop the video with

that particular mention of feature unions.

If you have any questions, let us know.

And of course, good luck.