[MUSIC]

In this video, we'll see different tricks that are very useful for

training Gaussian processes.

And the first one is, what you should do when you see noisy observations?

As you remember, the mean would go exactly through the data points,

and the variance would be zero at the data points.

And if you fit a Gaussian process to a noisy signal

like this, you will get this quickly changing function.

But actually you can see that there is something like a parabola here,

and also some noise component.

So let's modify our model in a way that it will have some

notion of the noise in the data.

The simplest way to do this is to add the independent Gaussian

noise to all random variables.

So we have some new random variable, f hat.

It would be equal to the original random process f of x plus some new

independent Gaussian noise.

This means that we'll sample it independently for

each point of our space R^d, that is, for different x's.

We'll say that the mean of the noise is 0, and the variance would be s squared.

In this case, the mean of the new random process is still 0,

since it is the sum of two means,

the mean of f of x and the mean of epsilon, and those sum up to 0.

And the covariance would change in the following way.

The new covariance would be the old covariance K of xi minus xj,

plus s squared, the variance of the noise,

times an indicator that the points xi and xj are the same.

This happens since there's no covariance between

the noise samples in different positions.
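
Written out as formulas, the noisy model described above could look as follows. The notation is an assumption made for illustration: epsilon is the noise, K hat is the new covariance, and the square bracket is the indicator that equals 1 when the two points coincide and 0 otherwise.

```latex
\hat{f}(x) = f(x) + \varepsilon(x), \qquad \varepsilon(x) \sim \mathcal{N}(0, s^2)

\mathbb{E}\,\hat{f}(x) = \mathbb{E}\,f(x) + \mathbb{E}\,\varepsilon(x) = 0

\hat{K}(x_i - x_j) = K(x_i - x_j) + s^2 \, [x_i = x_j]
```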

And if we fit the model using this covariance matrix, that is, using this kernel,

we will get the following result.

So as you can see, we don't have zero variance at the data points anymore.

And also the mean function became a bit smoother.
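
To make this concrete, here is a minimal NumPy sketch of the posterior with and without the noise term. The Gaussian kernel form, the function names `rbf_kernel` and `gp_posterior`, and the toy parabola data are all assumptions made for illustration; this is not the code that produced the plots in the video.

```python
import numpy as np

def rbf_kernel(a, b, sigma2=1.0, length=1.0):
    """Gaussian (RBF) kernel between two sets of 1-D points (assumed form)."""
    diff = a[:, None] - b[None, :]
    return sigma2 * np.exp(-diff**2 / (2.0 * length**2))

def gp_posterior(x_train, y_train, x_test, sigma2=1.0, length=1.0, s2=0.0):
    """Posterior mean and variance of the noise-free process f at x_test."""
    # Training covariance: the old kernel plus s^2 on the diagonal (the indicator term).
    C = rbf_kernel(x_train, x_train, sigma2, length) + s2 * np.eye(len(x_train))
    k = rbf_kernel(x_train, x_test, sigma2, length)  # cross-covariance
    mean = k.T @ np.linalg.solve(C, y_train)
    var = sigma2 - np.sum(k * np.linalg.solve(C, k), axis=0)
    return mean, var

# Toy data (hypothetical): a parabola plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 7)
y = x**2 + rng.normal(scale=1.0, size=x.shape)

# Evaluate the posterior at the training points themselves.
mean_clean, var_clean = gp_posterior(x, y, x, s2=0.0)  # no noise in the model
mean_noisy, var_noisy = gp_posterior(x, y, x, s2=0.5)  # noise variance s^2 = 0.5

print("variance at data points, s2 = 0.0:", var_clean)
print("variance at data points, s2 = 0.5:", var_noisy)
```

Without the noise term the mean passes through every observation and the variance there collapses to zero; with s squared greater than zero the variance at the data points stays positive.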

However, this still isn't the best we can have.

We can change the parameters of the kernel a bit, and

find the optimal values for them, in this special case.

If for example we have the length scale equal to 0.01,

the covariance will drop really quickly to 0 as we move away from the points.

And so the prediction would look like this.

And this is complete garbage.

If we take the length scale to be equal to 10, then it would be too high.

And the prediction will change really slowly.

It would be almost 0 everywhere, and

the variance would be basically

the variance of the prior process.

So here we select the l to be 2, somewhere in the middle, and

we'll have the process like this.
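
A quick way to see why the length scale matters is to look at how much covariance is left between two points one unit apart. This sketch assumes the Gaussian kernel has the form sigma squared times exp(-r^2 / (2 l^2)); the helper name `rbf` is hypothetical.

```python
import numpy as np

def rbf(r, sigma2=1.0, length=1.0):
    """Gaussian kernel value at distance r (assumed parametrisation)."""
    return sigma2 * np.exp(-r**2 / (2.0 * length**2))

# Covariance between two points at distance 1 for the three length scales
# discussed above.
for length in (0.01, 2.0, 10.0):
    print(f"l = {length:5}: K(1) = {rbf(1.0, length=length):.6f}")
```

With l equal to 0.01, neighbouring points are essentially uncorrelated, so each observation only affects a tiny neighbourhood. With l equal to 10, all points are almost perfectly correlated, so the prediction can barely bend.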

It actually has some drawbacks, since as you can see, at the positions -3 and

3, the process quite quickly starts to revert its prediction to 0.

So maybe we could tune some other parameters a bit better,

like sigma squared or s squared.

And maybe we'll be able to fit the Gaussian process better, and

actually it turns out that we can do this automatically.

Let's take all our parameters: for

the Gaussian kernel we have the sigma squared parameter and the length scale

l, and for the noise we have s squared, so three parameters.

We're going to tune them by maximizing the likelihood.

So we take our data points, f of x1, f of x2, and so on up to f of xn, and

maximize the probability of observing this data given the parameters.

Since everything is Gaussian for a Gaussian process, this probability will also be Gaussian

with mean 0 and the covariance matrix C, as we have seen in the previous video.

If you carefully write down what the probability density function is equal to,

you will see that you can optimize this value using simply gradient ascent.
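
For reference, the quantity being maximized can be written as the log-density of a zero-mean multivariate Gaussian. The symbols here are assumptions for illustration: bold f is the vector of the n observed values, and C is the covariance matrix built from the noisy kernel, so it depends on sigma squared, l and s squared.

```latex
\log p(\mathbf{f} \mid \sigma^2, l, s^2)
  = -\tfrac{1}{2}\, \mathbf{f}^{\top} C^{-1} \mathbf{f}
    - \tfrac{1}{2} \log \det C
    - \tfrac{n}{2} \log 2\pi
```

Every term is differentiable with respect to the three parameters, which is what makes gradient ascent applicable.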

And using this you will be able to automatically find the optimal values for

the variance sigma squared, the variance s squared, and the length scale parameter l.
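
Below is a rough sketch of such a procedure in NumPy. Several things here are assumptions rather than what was used in the video: the gradient is approximated by finite differences instead of being derived analytically, the step size is chosen by simple backtracking, the parameters are optimized in log-space to keep them positive, and the data is a hypothetical noisy parabola. The estimated values will therefore not necessarily match the numbers quoted in this video.

```python
import numpy as np

def log_likelihood(log_params, x, y):
    """Log marginal likelihood of y under a zero-mean GP with a noisy RBF kernel."""
    sigma2, length, s2 = np.exp(log_params)  # log-parametrisation keeps them positive
    diff = x[:, None] - x[None, :]
    C = sigma2 * np.exp(-diff**2 / (2.0 * length**2)) + s2 * np.eye(len(x))
    try:
        L = np.linalg.cholesky(C)
    except np.linalg.LinAlgError:
        return -np.inf  # numerically not positive definite: reject these parameters
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # C^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))          # equals 0.5 * log det C
            - 0.5 * len(x) * np.log(2.0 * np.pi))

def numerical_gradient(fun, p, eps=1e-5):
    """Central finite-difference gradient (stand-in for the analytic one)."""
    grad = np.zeros_like(p)
    for i in range(len(p)):
        step = np.zeros_like(p)
        step[i] = eps
        grad[i] = (fun(p + step) - fun(p - step)) / (2.0 * eps)
    return grad

def fit(x, y, n_steps=200):
    """Gradient ascent on the log-likelihood; returns (sigma2, length, s2)."""
    p = np.zeros(3)  # start from sigma^2 = l = s^2 = 1
    fun = lambda q: log_likelihood(q, x, y)
    for _ in range(n_steps):
        current = fun(p)
        grad = numerical_gradient(fun, p)
        direction = grad / max(1.0, np.linalg.norm(grad))  # cap the step length
        lr = 1.0
        # Backtracking: shrink the step until the likelihood actually increases.
        while lr > 1e-8 and not fun(p + lr * direction) > current:
            lr *= 0.5
        if lr <= 1e-8:
            break  # no improving step found: treat as converged
        p = p + lr * direction
    return np.exp(p)

# Hypothetical data: a parabola plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 30)
y = x**2 + rng.normal(scale=1.0, size=x.shape)

sigma2_hat, length_hat, s2_hat = fit(x, y)
print("sigma^2:", sigma2_hat, "l:", length_hat, "s^2:", s2_hat)
```

Because each step is only accepted when the likelihood increases, the fitted parameters are never worse than the starting point, although like any gradient method this can stop in a local maximum.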

So if you run this procedure, you will get something like this.

So we estimated l to be 2, which actually is true.

However, before we spent some time doing it by hand, and

this time the value was selected automatically.

Also, we were able to estimate that the variance of the process

should be 46.4 and the variance of the noise should be 0.7.

As you can see, also on the boundaries, the prediction became a bit better.

So the process doesn't reverse its direction very quickly.

Let's see how the fitting of this process works for different data points.

In this case, I tried to fit the Gaussian process simply to noise.

In this case, the Gaussian process estimated that the s

squared parameter, the variance of the noise, should be 0.79.

It really believes that all the data

that I gave it is simply noise.

If I try to fit a Gaussian process to data that I sampled

without noise, it will quickly understand this and

will set the noise variance parameter to almost 0.

So in this case it was like 5 times 10 to the power of -17,

which is actually really close to 0.

If however I have the process that has some signal but it also has noise,

it will automatically find that the noise variance should

be somewhere in between 0 and some larger values.

In this case it was estimated to be 0.13.

All right, now let's see how Gaussian processes can be

applied to classification problems.

Previously we saw how they can be used for regression, and for

classification it is a bit harder.

So we have two possible labels, plus 1 or -1.

We can use a latent process f of x, which will show something like

how sure we are in predicting this or that label.

And if we fit somehow the latent process f of x, we will be able to do

predictions by passing the latent process through a sigmoid function.

So the probability of the label y given f will be simply

1 over 1 plus the exponent of -yf,

which is the sigmoid function of the product y times f.
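
As a small sketch, this likelihood is a one-liner in NumPy. The function name `label_probability` is hypothetical.

```python
import numpy as np

def label_probability(y, f):
    """p(y | f) = 1 / (1 + exp(-y * f)) for a label y in {+1, -1}."""
    return 1.0 / (1.0 + np.exp(-y * f))

# A latent value of 0 means no preference between the two labels,
# and a large positive latent value strongly favours the label +1.
for f in (0.0, 2.0, 10.0):
    print(f, label_probability(+1, f), label_probability(-1, f))
```

For any latent value the probabilities of the two labels add up to 1, so a single latent number is enough to describe the prediction.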

So to train this model, you will first have to estimate the latent process.

You'll have to estimate the probability of the latent process

at some arbitrary points given the labels that we already know.

So y of x1, for example, could be plus 1,

y of x2 could be -1, and so on; those are just binary decisions.

Once we have estimated the latent process,

we can use it to compute the predictions.

We could do this by marginalizing the joint probability of labels and

the latent process.

This would be just a simple integral:

the probability of the label given the latent process,

times the probability of the latent process,

integrated over all possible latent processes.
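
In symbols, this marginalization could be written as follows. The notation is assumed for illustration: x star is the new point, y star is its label, f star is the value of the latent process at x star, and the latent process is conditioned on the labels we already know.

```latex
p(y_* \mid x_*, y_1, \dots, y_n)
  = \int p(y_* \mid f_*) \; p(f_* \mid x_*, y_1, \dots, y_n) \, df_*
```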

So the math here is a bit complex, so I'll skip it for

now, and let's just see how the prediction works.

So the first step as I said is estimation of the latent process.

So in this case I have the data points marked as crosses here,

some have the value plus 1, some have the value -1.

And if we fit the latent process, it will look like this.

So as you can see, as we go to the area where all points have the labels +1,

the values for the latent process would be positive.

And for

the negative examples the values of the latent process will be negative.

And here are our predictions,

I just took the latent process and [INAUDIBLE].

So as you can see it is almost one in the positions where there

are many positive points nearby.

The same happens for the negative examples. And

at the points where the targets change from plus 1 to -1, for example,

the variance would be high and the prediction

won't be so certain at these points.

So for example, somewhere in between -1 and -2,

the value of the prediction would be somewhere around 0.5.

The model is almost completely unsure about the prediction there.
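
To illustrate how the latent mean and variance turn into a prediction, here is a small Monte Carlo sketch. It assumes that the latent value at a test point is approximately Gaussian with some mean and variance, which is an assumption made here because the math of the latent posterior is skipped in this video. The function name `predictive_probability` and the numbers are hypothetical.

```python
import numpy as np

def predictive_probability(mean, var, n_samples=100_000, seed=0):
    """Monte Carlo estimate of p(y = +1) = E[sigmoid(f)] for f ~ N(mean, var)."""
    rng = np.random.default_rng(seed)
    f = mean + np.sqrt(var) * rng.standard_normal(n_samples)
    return np.mean(1.0 / (1.0 + np.exp(-f)))

# Far inside the positive region: large latent mean, small variance.
print(predictive_probability(3.0, 0.01))
# Same mean, but much higher latent variance: the prediction is less certain.
print(predictive_probability(3.0, 9.0))
# Near the boundary between the classes: latent mean around 0.
print(predictive_probability(0.0, 4.0))
```

A latent mean near zero gives a probability near 0.5, and a higher latent variance pulls the prediction towards 0.5 even when the mean is clearly positive.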