[MUSIC] In this video, we'll see different tricks that are very useful for training Gaussian processes. The first one is what you should do when you have noisy observations. As you remember, the mean goes exactly through the data points, and the variance is zero at the data points. So if you fit a Gaussian process to a noisy signal like this, you will get this quickly changing function. But actually, you can see that there is something like a parabola here, plus some noise component. So let's modify our model so that it has some notion of the noise in the data. The simplest way to do this is to add independent Gaussian noise to all random variables. So we have a new random process f hat: it equals the original random process f(x) plus some new independent Gaussian noise epsilon. This means that we sample the noise independently for each point of our space R^d. We'll say that the mean of the noise is 0, and its variance is s squared. In this case, the mean of the new random process is still 0: it is the sum of two means, the mean of f(x) and the mean of epsilon, and those sum up to 0. The covariance changes in the following way: the new covariance is the old covariance K(xi − xj), plus s squared, the variance of the noise, times an indicator that the points xi and xj coincide. This happens because there is no covariance between noise samples at different positions. And if we fit the model using this kernel, we get the following result. As you can see, we no longer have zero variance at the data points, and the mean function became a bit smoother. However, this still isn't the best we can have. We can change the parameters of the kernel a bit, and find the optimal values for them in this special case.
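The noisy-observation model above can be sketched in a few lines of numpy. This is a minimal illustration, not the course's own code: it assumes an RBF (squared-exponential) kernel with variance sigma2 and length scale l, and adds the noise variance s2 on the diagonal, which is exactly the s squared times the indicator term, since the indicator is 1 only when xi and xj are the same point.

```python
import numpy as np

def rbf_kernel(a, b, sigma2=1.0, l=1.0):
    """Squared-exponential kernel: sigma^2 * exp(-(x - x')^2 / (2 l^2))."""
    d = a[:, None] - b[None, :]
    return sigma2 * np.exp(-d**2 / (2 * l**2))

def gp_predict(x_train, y_train, x_test, sigma2=1.0, l=1.0, s2=0.1):
    """GP posterior mean and variance with Gaussian observation noise s2."""
    # Covariance of the noisy targets: K + s^2 * I (the indicator term).
    K = rbf_kernel(x_train, x_train, sigma2, l) + s2 * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, sigma2, l)
    K_ss = rbf_kernel(x_test, x_test, sigma2, l)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.diag(cov)
```

With s2 > 0 the posterior variance no longer collapses to zero at the training points, and the mean no longer interpolates them exactly, which is the smoothing effect described above.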
If, for example, we take the length scale equal to 0.01, the covariance drops to 0 really quickly as we move away from the points, and the prediction will look like this. And this is complete garbage. If we take the length scale equal to 10, it will be too high: the prediction will change really slowly, it will be almost 0 everywhere, and the variance will be close to the variance of the prior process. So here we select l to be 2, somewhere in the middle, and we get a process like this. It still has some drawbacks: as you can see, at positions −3 and 3, the process starts to revert its prediction towards 0. So maybe we could tune some other parameters a bit better, like sigma squared or s squared, and maybe we would be able to fit the Gaussian process better. And actually, it turns out that we can do this automatically. We have three parameters for the Gaussian kernel: sigma squared, l, and s squared. We're going to tune them by maximizing the likelihood. So we take our data points f(x1), f(x2), ..., f(xn), and maximize the probability of observing this data given the parameters. Since everything is Gaussian for a Gaussian process, this probability will also be Gaussian, with mean 0 and covariance matrix C, as we have seen in the previous video. If you carefully write down what this probability density function is equal to, you will see that you can optimize it using simple gradient ascent. Using this, you will be able to automatically find the optimal values for the variance sigma squared, the noise variance s squared, and the length scale parameter l. So if you run this procedure, you will get something like this. We estimated l to be 2, which is actually correct. However, earlier we spent some time finding it by hand, while here this value was selected automatically.
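The tuning procedure above can be sketched as follows. This is an assumed implementation, not the lecture's code: it optimizes the log of the three parameters (so they stay positive) by minimizing the negative log marginal likelihood of the data under N(0, C), which is equivalent to the gradient ascent on the likelihood described above.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, x, y):
    """-log N(y | 0, C) up to an additive constant, with
    C = sigma^2 * RBF(l) + s^2 * I."""
    sigma2, l, s2 = np.exp(log_params)  # log space keeps parameters positive
    d = x[:, None] - x[None, :]
    C = sigma2 * np.exp(-d**2 / (2 * l**2)) + s2 * np.eye(len(x))
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * (logdet + y @ np.linalg.solve(C, y))

def fit_gp_params(x, y):
    """Maximize the marginal likelihood over (sigma^2, l, s^2)."""
    res = minimize(neg_log_marginal_likelihood, np.zeros(3), args=(x, y))
    return np.exp(res.x)  # sigma^2, l, s^2
```

Starting from all parameters equal to 1, the optimizer moves to a (possibly local) maximum of the marginal likelihood; in practice one would restart from several initial values.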
Also, we were able to estimate that the variance of the process should be 46.4 and the variance of the noise should be 0.7. As you can see, the prediction also became a bit better at the boundaries: the process doesn't revert its direction as quickly. Let's see how fitting this process works for different data sets. In this case, I tried to fit the Gaussian process to pure noise. The Gaussian process estimated the s squared parameter, that is, the noise variance, to be 0.79: it really believes that all the data I gave it is just noise. If I try to fit a Gaussian process to data that I sampled without noise, it will quickly understand this and set the noise variance parameter to almost 0. In this case it was about 5 times 10 to the power of −17, which is really close to 0. If, however, I have a process that has some signal but also some noise, it will automatically find that the noise variance should be somewhere in between 0 and some larger value; in this case it was estimated to be 0.13. All right, now let's see how Gaussian processes can be applied to classification problems. Previously we saw how they can be used for regression; for classification it is a bit harder. So we have two possible labels, +1 and −1. We can use a latent process f(x); it shows something like how sure we are in predicting one label or the other. And if we somehow fit the latent process f(x), we will be able to make predictions by passing it through a sigmoid function. So the probability of the label y given f will simply be 1 over 1 + exponent of −yf, which is the sigmoid function of the product y times f. To train this model, you first have to estimate the latent process: you have to estimate the probability of the latent process at some arbitrary points, given the labels that we already know.
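The sigmoid likelihood above is easy to write down directly. As a small sketch (the Monte Carlo averaging in the second function is one assumed way to approximate the predictive integral over the latent Gaussian, not necessarily the method used in the course):

```python
import numpy as np

def label_likelihood(y, f):
    """p(y | f) = sigmoid(y * f) for labels y in {-1, +1}."""
    return 1.0 / (1.0 + np.exp(-y * f))

def predict_proba(f_mean, f_var, n_samples=10000, seed=0):
    """Approximate p(y = +1) = E[sigmoid(f)] under a latent Gaussian
    posterior N(f_mean, f_var) by Monte Carlo sampling."""
    rng = np.random.default_rng(seed)
    f = f_mean + np.sqrt(f_var) * rng.standard_normal(n_samples)
    return label_likelihood(1, f).mean()
```

Note that a symmetric latent posterior around 0 gives a prediction near 0.5, which matches the uncertain regions between positive and negative examples described below.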
So y(x1), for example, could be +1, y(x2) could be −1, and so on; those are just binary labels. Once we have estimated the latent process, we can use it to compute predictions. We do this by marginalizing the joint probability of the labels and the latent process. This is just a simple integral: the probability of the label given the latent process, times the probability of the latent process, integrated over all possible latent processes. The math here is a bit complex, so I'll skip it for now; let's just see how the prediction works. The first step, as I said, is the estimation of the latent process. In this case, I have the data points marked as crosses here; some have the label +1, some have the label −1. And if we fit the latent process, it looks like this. As you can see, as we go into the area where all points have the label +1, the values of the latent process are positive, and for the negative examples the values of the latent process are negative. And here are our predictions: I just took the latent process and passed it through the sigmoid function. As you can see, the prediction is almost one in the positions where there are many positive points nearby. The same happens for the negative examples, and at the points where the targets change from +1 to −1, the variance is high and the prediction won't be so certain. For example, somewhere between −1 and −2, the value of the prediction is around 0.5: the model is almost absolutely not sure about the prediction. One last thing I want to tell you about is inducing inputs. When you train a Gaussian process, it turns out to be quite computationally expensive. If you have n points, then computing the prediction will cost you on the order of n cubed, since you have to invert the covariance matrix. There is one simple idea called inducing inputs to speed up the Gaussian process.
What you could do is replace the dataset with a small number of points. Those would be something like the support vectors in an SVM. Then you fit the Gaussian process using only those points. So you select m points as inducing points. The precomputation will cost us on the order of m squared times n, which is quite fast when we select a small number of points. Computing the mean will be on the order of m, which is almost instant, and predicting the variance at each point will cost us on the order of m squared. You can optimize the positions of the inducing points, and the values of the process at them, using maximum likelihood estimation. So let's see how it works. Here I have 100 points, and I fitted the Gaussian process using them. And here I selected 10 inducing points and fitted the Gaussian process to them. As you can see, the values of the Gaussian process didn't change much. However, the cost of the prediction is much lower, since we have 10 times fewer points than we had before. [MUSIC]
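The 100-points-versus-10-points experiment above can be imitated with the simplest possible version of this idea: just fit the GP on a small subset of the data. This is a subset-of-data sketch under assumed parameters (RBF kernel, fixed l and s2, evenly spaced inducing points); proper sparse methods additionally correct the covariance and optimize the inducing locations, which this toy version does not.

```python
import numpy as np

def rbf(a, b, l=1.0):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * l**2))

def sparse_gp_mean(x, y, x_test, m=10, s2=0.1, l=1.0):
    """Fit the GP mean using only m inducing points taken from the data.

    Inverting the m x m matrix costs O(m^3) instead of O(n^3) for the
    full covariance matrix.
    """
    idx = np.linspace(0, len(x) - 1, m).astype(int)  # m evenly spaced points
    xm, ym = x[idx], y[idx]
    K = rbf(xm, xm, l) + s2 * np.eye(m)
    return rbf(x_test, xm, l) @ np.linalg.solve(K, ym)
```

On smooth data, the mean fitted from 10 well-spread points stays close to the mean fitted from all 100, which mirrors the result shown in the lecture.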