So we've seen how Gaussian processes can be used for regression and classification problems.

Now let's see how they can be used for optimization problems.

So, what is black-box optimization?

You have some function f(x) that you know nothing about, and you want to find its maximum.

In the case when the gradient is known, you can use gradient descent with restarts.

In the case when the gradient is unknown, you can, for example, numerically estimate the gradient, or, if the function is not differentiable, you can use grid search or random search.

However, there is one special class of optimization problems where evaluating the function at each point is really expensive.

For example, you want to find out where the oil is, and you have to do some probe drilling to estimate the amount of oil at certain points.

So x in this case would be geographic coordinates, and f(x) would be, for example, the amount of oil.

In this case the cost of one sample could be around $1 million.

If you would like to tune the hyperparameters of a neural network, then x would be a vector of hyperparameters, for example the size of the layers, the number of layers, and so on.

And f(x) would be some objective function.

In this case the cost of one sample would be around 10 hours, since you would have to wait that long for the network to converge.

In drug discovery, it is even more important.

Here x could be the drug, and f(x) would be its effect against some severe disease.

In this case one sample, that is, one test of a proposed drug, would cost you two months of clinical trials, $10,000, for example, to pay the assistants, and maybe also the life of a rat if you run experiments on rats.

So these are cases where evaluating the function at any point is really, really expensive.

And we will see how Gaussian processes can be used to find the maximum of the function while minimizing the number of samples you have to make.

So again, our goal is to optimize a function, that is, to find its maximum, while also minimizing the number of trials.

The key here is to define a surrogate model.

A surrogate model, we'll call it f̂, doesn't have to be exactly equal to the original function.

However, we want it to be cheap to evaluate, and it should approximate the function well.

In this case we'd be able to find, for example, the maximum of the surrogate model and then sample the original function at that point.

We'll also need an acquisition function; we will call it μ(x).

It shows how desirable it is for us to sample next at a given point.

It should estimate the profit for the optimization from sampling the point x.

And it should use the surrogate model, since the surrogate model is really cheap to evaluate.

So first of all,

let's see which surrogate model we will use.

The model should be able to approximate arbitrarily complex functions, so the natural choice here is a nonparametric model.

It should also be possible to estimate uncertainty, so a Gaussian process is a really good choice for this case.
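As a minimal sketch of such a surrogate, here is a Gaussian process fit to a handful of "expensive" evaluations using scikit-learn; the sine function and the sample locations are made up for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# A few points where we already paid the cost of evaluating f(x).
X = np.array([[0.0], [2.0], [5.0]])   # sampled locations
y = np.sin(X).ravel()                 # pretend these were expensive to obtain

# Fit the GP surrogate; it interpolates the data and models uncertainty.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X, y)

# The surrogate is cheap to evaluate and reports its own uncertainty.
mu, sigma = gp.predict(np.array([[1.0]]), return_std=True)
```

The posterior mean `mu` plays the role of f̂, and `sigma` is the uncertainty that the acquisition functions below will exploit.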

Now let's see what we need from an acquisition function.

It should balance between exploration and exploitation.

Exploration means searching in regions with high uncertainty; since the uncertainty there is high, there is some chance that you'll find a new maximum there.

Exploitation is searching in regions where the value is already estimated to be high.

The uncertainty there may be low, but you will still be able to pin down the position of the maximum more precisely.

So, now let's see three different kinds of acquisition functions.

And the first is the maximum probability of improvement or MPI for short.

So, we have the current best value.

Let's write it down as f*.

And the acquisition function is the probability that after sampling the point x, the value of the function will be greater than or equal to the current best value.

So this is the probability that f̂(x), the surrogate model, is greater than or equal to f*, the current best, plus some small parameter ε.

So, since everything is normal, it is really easy to compute this value.

And it can be written using the cumulative distribution function of the normal distribution, as shown on the right.
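A minimal sketch of this acquisition, assuming the surrogate's posterior at a point is Gaussian with mean `mu` and standard deviation `sigma` (the function name `mpi` and the parameter `eps` are my own labels):

```python
from scipy.stats import norm

def mpi(mu, sigma, f_star, eps=0.01):
    """Maximum probability of improvement:
    P(fhat(x) >= f* + eps), where fhat(x) ~ N(mu, sigma^2)."""
    return norm.cdf((mu - f_star - eps) / sigma)
```

With ε = 0 and the mean exactly at the current best, the probability is 1/2, as the symmetry of the Gaussian suggests.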

Another one is called the upper confidence bound.

Here you take the mean value and add η times the standard deviation of the surrogate model at that point.

So, by varying the value of η, you can favor sampling at points with a high mean or at points with high variance.
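This one is a one-liner; again `mu` and `sigma` stand for the surrogate's posterior mean and standard deviation, and `eta` is the exploration/exploitation trade-off parameter (illustrative names):

```python
def ucb(mu, sigma, eta=2.0):
    """Upper confidence bound: posterior mean plus eta times
    the posterior standard deviation at a point."""
    return mu + eta * sigma
```

A large η chases high-variance regions (exploration); η = 0 greedily chases the highest mean (pure exploitation).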

The third acquisition function that we will see is called an Expected Improvement.

As the title suggests, you just compute the expected value of the gain that you'll get after sampling the new point.

Also, again, since everything is normal, you can compute it analytically, and the formula looks like this.
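The closed form for a Gaussian posterior can be sketched like this, with `mu` and `sigma` again denoting the surrogate's posterior mean and standard deviation (names are illustrative):

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, f_star):
    """Closed-form EI for a Gaussian posterior:
    E[max(fhat(x) - f*, 0)] with fhat(x) ~ N(mu, sigma^2)."""
    if sigma == 0.0:
        return max(mu - f_star, 0.0)   # no uncertainty: the gain is deterministic
    z = (mu - f_star) / sigma
    return (mu - f_star) * norm.cdf(z) + sigma * norm.pdf(z)
```

Note that EI is positive wherever there is uncertainty, even if the mean is below f*, which is what drives exploration.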

So let's see an example.

We start with a few points, in this case three points.

And while the process hasn't converged, you train the Gaussian process.

You find the maximum of the acquisition function, for example using gradient descent or some other optimization technique.

And since computing the values of the surrogate model, the Gaussian process, is relatively cheap, this step won't take much time.

And then you evaluate the true function at the position of the maximum of the acquisition function.

Then again you train the Gaussian process with the new point, find the maximum of the acquisition function again, evaluate at the new point, and so on.
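Putting these steps together, a minimal sketch of the loop might look like the following; a toy quadratic stands in for the expensive black-box function, a fixed grid of candidates replaces gradient descent, and expected improvement is used as the acquisition (all names and numbers are illustrative):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def f(x):                        # toy stand-in for the expensive black box
    return -(x - 2.0) ** 2       # true maximum at x = 2

X = [[0.0], [1.0], [4.0]]        # start from three evaluated points
y = [f(x[0]) for x in X]
candidates = np.linspace(0.0, 5.0, 201).reshape(-1, 1)

for _ in range(10):              # budget of 10 extra expensive evaluations
    gp = GaussianProcessRegressor(alpha=1e-6).fit(X, y)     # retrain surrogate
    mu, sigma = gp.predict(candidates, return_std=True)
    z = (mu - max(y)) / (sigma + 1e-9)
    ei = (mu - max(y)) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
    x_next = candidates[int(np.argmax(ei))]                 # acquisition maximizer
    X.append(list(x_next))
    y.append(f(x_next[0]))                                  # one expensive call

best = X[int(np.argmax(y))]
```

Only the line calling `f` is expensive; everything else touches the cheap surrogate, which is the whole point of the method.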

So let's see how it works.

So again here I have three points.

Now I add a new one, and below you can see the plot of the acquisition function.

We find its maximum; it is somewhere on the right.

And now you can see that the points on the left

have a relatively small value of the acquisition function.

The mean there is quite low relative to the current best point, and the variance is also small.

So it is very unlikely that we will ever get a point there with a value greater than the current optimum.

And so we'll sample near the current optimum for the rest of the experiment.

So as you see, we changed our function a bit, and now we're almost sure that the position of the maximum is somewhere near this point.

And so, finally, after sampling a few points,

we find the position of the maximum.

So, it is very important to compare random search with Gaussian process optimization.

Random search is really easy to parallelize: you just sample some random points and then compute the values of the function in parallel.

However, for a Gaussian process, you have to come up with some heuristic to sample not one point but many.

One such heuristic is to take the maximum of the acquisition function and add a new point to the Gaussian process at that position, with the value being the mean of the Gaussian process there.

Then you compute the maximum of the acquisition function again, take its position again, and repeat as many times as you want.

And so you'll have a set of points that you can then sample in parallel.
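This heuristic (sometimes called the "kriging believer" strategy) could be sketched as follows; `propose_batch` and its arguments are illustrative names, and the acquisition is passed in as a function of the posterior mean, standard deviation, and current best:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def propose_batch(X, y, candidates, batch_size, acquisition):
    """Pick a batch of points to evaluate in parallel by repeatedly
    taking the acquisition maximizer and pretending we observed the
    GP's posterior mean there, so the next pick moves elsewhere."""
    X_fake, y_fake = list(X), list(y)
    batch = []
    for _ in range(batch_size):
        gp = GaussianProcessRegressor(alpha=1e-6).fit(X_fake, y_fake)
        mu, sigma = gp.predict(candidates, return_std=True)
        best = int(np.argmax(acquisition(mu, sigma, max(y_fake))))
        batch.append(candidates[best])
        X_fake.append(list(candidates[best]))
        y_fake.append(mu[best])    # believe the surrogate's mean for now
    return batch
```

Faking the observation collapses the variance at the chosen point, so the next acquisition maximum lands somewhere else, spreading the batch out.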

Note that random search actually needs many more points on average to converge, and it also works poorly in high-dimensional spaces.

For example, in the plot below you can see that the Gaussian process converges much faster than random search, and it also converges to a better optimum.

On the other hand, random search works for any function, since sampling a random x is usually quite cheap.

However, it is important to know that Bayesian optimization with Gaussian processes is useful only when each evaluation of the function is expensive, since training the Gaussian process also takes some time.

And if that takes more time than evaluating the true function, then you'd better use random search.