
Hello and welcome to the Gaussian Process lesson.

A Gaussian process is a machine learning algorithm that uses a set of Gaussian, or normal, distributions to infer an underlying functional relationship among a set of data.

This is a little bit more complex

mathematically than some of the other algorithms that we've seen.

But it's a very popular and useful algorithm, so I wanted to include it in the course.

And you can actually understand a lot of what this algorithm is trying to do visually,

by seeing the plots of the actual underlying function,

the sample data, as well as the fitted function.

And this makes it a little bit easier to

introduce than simply diving straight into the mathematics.

So by the end of this lesson, you should understand the basic idea of a Gaussian Process.

You should be able to communicate that to somebody else.

I want you to understand the probabilistic underpinning behind the Gaussian Process.

And you should be able to apply a Gaussian Process by using the scikit-learn library.

There are two readings for this particular lesson as well as the course notebook.

So let's jump straight into the readings. The first one is by a well-known Pythonista, Chris Fonnesbeck.

He is actually the lead developer and creator of a popular probabilistic programming language called PyMC3, which we will actually use in Accounting 571.

This talks about fitting Gaussian Processes in Python,

and he demonstrates three different ways to do it.

I only want you to look at the first way.

If you do skim through the rest, you'll see some pretty complex math, so I'm not too concerned about you understanding it. But you might be able to follow along; it's actually not as complex as it looks. The main thing I want you to get out of this is the overall idea.

First, he shows how you would do some of this directly in Python. But then he gets into demonstrating how to apply Gaussian Processes by using the scikit-learn library.

And that of course is what we're interested in.

He also demonstrates two other ways to do this. I want you to look at the scikit-learn example, which is the first part; you can ignore the GPflow and PyMC3 sections. As I said, we'll actually look at the PyMC3 version in Accounting 571.
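Since the scikit-learn part of that reading is the one to focus on, here's a minimal sketch of what fitting a Gaussian Process regressor with scikit-learn looks like. The toy sine data is my own, not Fonnesbeck's example:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy data: noisy samples of an underlying sine signal
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(20, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=20)

# An RBF kernel; alpha adds the noise variance to the kernel diagonal
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.1 ** 2)
gpr.fit(X, y)

# Predict both the mean and the standard deviation (the confidence band)
X_new = np.linspace(0, 5, 50).reshape(-1, 1)
y_mean, y_std = gpr.predict(X_new, return_std=True)
```

The `return_std=True` argument is what gives you the confidence intervals that make the fitted-function plots in this lesson possible.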

So, the second reading is a little web page that talks about Gaussian Process Regression and tries to build intuition for it.

So when you look at this, if you go to this web page,

you're just going to see this curve hopping around.

What it is, is an attempt to perform Gaussian Process Regression on a set of data, but we haven't actually put any data points in.

There's some sliders down here that you can

use to control how crazy the function is behaving.

So if I slide this up, you can see it gets a little more crazy,

you can increase the noise,

you can also change the characteristic length scale, and see, it's just going crazy.

So what does this have to do with Gaussian Processes?

Well, right now there's no data to constrain the fit. It's just flopping around, as you can see.

But as soon as I put in one data point,

let's say I go right here,

you notice it's changed.

What's happened is that the fit now has to go near this point, right? There's still probabilistic fitting going on here, so it can vacillate around that point a little bit, but the point restricts where the fit has to lie in that part of parameter space.

So now I come down and say, well, let's come over here and put another point right here. And now you notice that we have two points restricting the fit of our data.

And not only that, you notice that here, the black line is the actual fit, our estimate of the original signal, and these are the confidence interval limits.

I can continue to put some points out here,

and I can put another point out here,

and maybe another one here.

And you can see as I continue to add points,

the fit gets more and more constrained.

And so, here I have just, what?

Six points, seven points.

And you could see that they've already greatly constrained that function.

It was flopping around in all of this space.

And I add a few more points.

You could see that pretty soon this function is very well constrained,

and we've defined exactly where

the fit function should be as well as a confidence interval.

That's a really powerful example of this entire fitting process.
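You can mimic that interactive demo in a few lines of scikit-learn. This sketch (my own toy sine data, not the web page's code) shows that adding observations shrinks the average predictive uncertainty; the optimizer is turned off so the length scale stays fixed, just to keep the comparison clean:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def mean_uncertainty(n_points):
    """Fit a noise-free GP to n_points samples of a sine signal and
    return the average predictive standard deviation over [0, 5]."""
    rng = np.random.default_rng(42)
    X = rng.uniform(0, 5, size=(n_points, 1))
    y = np.sin(X).ravel()
    # optimizer=None keeps the length scale fixed at 1.0
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                   optimizer=None)
    gpr.fit(X, y)
    grid = np.linspace(0, 5, 100).reshape(-1, 1)
    _, std = gpr.predict(grid, return_std=True)
    return std.mean()

# Because the seed is fixed, the 2-point set is a subset of the 12-point
# set, so more points can only constrain the fit further.
```

Calling `mean_uncertainty(2)` and `mean_uncertainty(12)` shows the fit getting more and more constrained as points are added, just as in the demo.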

One other thing I want to mention before I leave this particular site.

Notice how the ends, where there are no data points constraining the fit, are where it still has the greatest ability to vacillate or move around.

This is a characteristic example of the challenge we've talked about before: interpolation, where we have data points and we're trying to say what the value of the function is between them, versus extrapolation, where we're asking what the value of the function is past the limits of our data.

And there's very little information out here so there's very little

to constrain the actual function that we're looking at.
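This interpolation-versus-extrapolation behavior is easy to check numerically. In this sketch (toy data of my own), the predictive standard deviation is small between observations and large beyond them:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Training data only covers the interval [2, 4]
X = np.linspace(2, 4, 10).reshape(-1, 1)
y = np.sin(X).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), optimizer=None)
gpr.fit(X, y)

# Interpolation: a point between observations.
# Extrapolation: a point well past the limits of our data.
_, std_inside = gpr.predict(np.array([[3.0]]), return_std=True)
_, std_outside = gpr.predict(np.array([[6.0]]), return_std=True)
```

The uncertainty at 6.0 is far larger than at 3.0, because there is very little information out there to constrain the function.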

The last thing of course is the notebook,

and this notebook is going to introduce Gaussian Processes.

This is a very powerful probabilistic technique.

It's based on the idea of Kernel functions that we're going to be using to approximate the actual underlying signal.

So first we have to talk about those.

Then we're going to use the Iris data set to introduce classification.

We'll demonstrate decision surfaces with Gaussian Processes, as well as hyperparameters and their effect on the overall fitting process.

We'll then see an example of a more complex dataset being used for classification.

And we'll also look then at regression and how

Gaussian Processes can be used to estimate a continuous value.

And for that, we will use the auto MPG data.

The rest of this notebook just walks through this.

The only thing I really wanted to emphasize here was

this idea of the Gaussian Process and how it's fitting.

This is now showing exactly what we saw on that website but in a static plot.

Here now are our data points,

and you can see here's our functional fit in the blue line.

The actual signal that we're trying to fit is this purple dashed line.

And you see that this doesn't do a very good job and we only have two observations.

We've fit the data at the endpoints but we don't do very well in between.

As soon as we start adding some more points however,

the function does a much better job.

We're starting to zero in on the actual fit.

And as we add more and more,

you can see that it starts getting better.

Realize, this is only eight observations.

That's not very much data.

And as soon as we go to 12,

it's doing quite well, right?

So you can see how quickly you can start to approximate a function with very little data using Gaussian Processes.

And that demonstrates one of the reasons people are very excited about using them: it doesn't take a lot of data to get a reasonable model for your underlying signal, and you also have these confidence intervals that tell you the limits of your knowledge.

The next thing was Kernel functions.

The idea that we have to employ

Kernel functions in order to approximate this underlying signal.

There are a number of different ones that can be used.

The scikit-learn library provides a number of them that you can actually apply.

So we talk a little bit about them.

There's a ConstantKernel; a Sum kernel that allows you to combine two different Kernel functions; a Product kernel that multiplies two different Kernel functions; a WhiteKernel that allows you to include something that estimates the noise in the signal; a Radial Basis Function, which is something we've seen before, a non-linear function that allows us to actually get non-linear features; and then a Matern Kernel, which is a generalization of that RBF.

These are discussed in more detail in the online scikit-learn documentation, and you can see them there.
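As a quick sketch of what that kernel algebra looks like in scikit-learn (these particular combinations are illustrations of my own, not the notebook's exact kernels), Sum and Product kernels are built with the ordinary + and * operators:

```python
from sklearn.gaussian_process.kernels import (
    ConstantKernel, RBF, Matern, WhiteKernel, DotProduct)

# A ConstantKernel scales other kernels (the signal amplitude)
amplitude = ConstantKernel(constant_value=1.0)

# Product kernel (via *) plus a Sum with a WhiteKernel noise term
noisy_rbf = amplitude * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)

# The Matern kernel generalizes the RBF via its smoothness parameter nu
matern = Matern(length_scale=1.0, nu=1.5)

# A DotProduct combined with a constant gives linear-style features
linear_like = ConstantKernel(1.0) + DotProduct(sigma_0=1.0)
```

Any of these composite kernels can be passed straight to a Gaussian Process estimator as its `kernel` argument.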

The rest of this notebook simply walks through

applying Gaussian Processes to first the Iris data set.

So we can see that our accuracy is pretty good right off the bat. And then we can see our confusion matrix; again, pretty good. We can then look at the decision surface.

And remember that we have to employ one of these kernels,

and so we can get something that's quite non-linear,

which this example shows.
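Here's a minimal sketch of that Iris workflow with scikit-learn; the particular split and kernel are my own choices, not necessarily the notebook's:

```python
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An RBF kernel gives a smooth, non-linear decision surface
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
gpc.fit(X_train, y_train)

y_pred = gpc.predict(X_test)
acc = accuracy_score(y_test, y_pred)        # accuracy, good right off the bat
cm = confusion_matrix(y_test, y_pred)       # 3x3 matrix, one row per class
```

This is the same fit-predict-evaluate pattern as the other scikit-learn classifiers in this course; only the estimator and its kernel change.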

We can then change hyperparameters. And for this, we're actually going to use the Kernel itself as our hyperparameter.

And so here we're defining different Kernels.

You can see that we have an RBF Kernel used for the first one.

That's an isotropic RBF because we do not specify different length scales. The second one is anisotropic, because it has different values for the length scales.

So you can think of this as an ellipse: the first length scale is a long axis, and the second is a shorter axis, so it's a very narrow, cigar-shaped ellipse. The next kernel is a quadratic function, built from a ConstantKernel along with a DotProduct.

And then lastly, a Matern Kernel.

The details of these are less important.

I'm not asking you to become an expert in Gaussian Processes, but I do want you to be aware of the richness of the Kernels that you can apply to this particular problem.

Then we just compute these and apply them, and as you'll see as we go through these decision surfaces, they're very different.

Particularly, from what we've seen before.

So this is the very first one we did, the isotropic radial basis function.

This is an anisotropic radial basis function.

And then you notice how it's suddenly started to classify

the bottom as the same as the top. It's very different.

And then again, notice how nonlinear these decision surfaces are?

Then we have the dot product one.

And then lastly, we have Matern Kernel.

So again, changing the parameters changes the ability of

the algorithm to make classifications or to

perform regressions as the rest of this notebook will show.

You can see that that kernel has a lot of input into how the algorithm actually operates.
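To see the kernel's influence concretely, here's a sketch that fits a Gaussian Process classifier with several different kernels on the first two Iris features. The kernel list loosely mirrors the notebook's four, but the exact parameters are my own guesses:

```python
from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import (
    RBF, Matern, ConstantKernel, DotProduct)

# Use the first two Iris features, as in a 2-D decision-surface plot
X, y = load_iris(return_X_y=True)
X = X[:, :2]

# Four hypothetical kernels: isotropic RBF, anisotropic RBF
# (a narrow "cigar"), a quadratic dot product, and a Matern
kernels = {
    "isotropic RBF": 1.0 * RBF(length_scale=1.0),
    "anisotropic RBF": 1.0 * RBF(length_scale=[1.0, 0.1]),
    "quadratic dot product": ConstantKernel(1.0) * DotProduct(sigma_0=1.0) ** 2,
    "Matern": 1.0 * Matern(length_scale=1.0, nu=1.5),
}

# Each kernel yields a different decision surface and training accuracy
scores = {}
for name, kernel in kernels.items():
    gpc = GaussianProcessClassifier(kernel=kernel, random_state=0)
    gpc.fit(X, y)
    scores[name] = gpc.score(X, y)
```

Comparing the entries of `scores` (and plotting each model's predictions over a grid) shows how strongly the choice of kernel shapes what the algorithm can do.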

So with that, I'm going to go ahead and stop,

allow you to read through the material that we

introduced as well as run through this notebook.

This is a very powerful algorithm and something you'd want to be aware of, although I don't expect you to become an expert in this course.

It's the sort of thing that's very useful to be aware of, so that if you're working in a team and people start talking about techniques, you'll be familiar with them and able to converse about their strengths and weaknesses and why you might want to use them for a particular problem.

With that, I'm going to go ahead and stop.

If you have any questions let us know in the course forums and good luck.