0:03

In the last video of this week, let's discuss how we can apply Markov chain Monte Carlo to Bayesian neural networks.

So this is your usual neural network, and it has weights on each edge, right? So each connection has some weight, which we train while fitting our neural network to the data. Bayesian neural networks, instead of fixed weights, have distributions over the weights.

So we treat w, the weights, as a latent variable, and then to do predictions, we marginalize w out. And this way, instead of a hard-set value for w11, like three, we'll have a distribution on w, the posterior distribution, which we'll use to obtain the predictions.

And so, to make a prediction for a new data object x, using the training data set of objects X_train and Y_train, we do the following. We say that this thing equals an integral where we marginalize out w. So we consider all possible values for the weights w, and we average the predictions with respect to them.
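Written out in the lecture's notation, the prediction just described is the following integral:

```latex
p(y \mid x, X_{\text{train}}, Y_{\text{train}})
  = \int p(y \mid x, w)\; p(w \mid X_{\text{train}}, Y_{\text{train}})\, \mathrm{d}w
```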

So here, p of y given x and w is your usual neural network output. So, you have your image x, for example, and you pass it through your neural network with parameters w, and then you record its predictions. And you do that for all possible values of the parameters w. So there are infinitely many values for w, and for each of them you pass your image through the corresponding neural network and write down the prediction.

And then you average all these predictions with weights, where the weights are given by the posterior distribution on w, which basically tells us how probable this particular w is according to the training data set. So, you have kind of an infinitely large ensemble of neural networks with all possible weights, with the importance of each one being proportional to the posterior distribution on w. And this is full Bayesian inference applied to neural networks, and this way we can get some of the benefits of probabilistic programming in neural networks.

So again, we can estimate uncertainty, we may tune some hyperparameters naturally, and stuff like that. And we may notice here that this prediction, this integral, equals the expected value of the output of your neural network with respect to the posterior distribution on w. So, basically, it's the expected output of your neural network with weights defined by the posterior.

And so to solve this problem, let's use your favorite Markov chain Monte Carlo procedure. So let's approximate this expected value with sampling, for example with Gibbs sampling. And if we acquire a few samples from the posterior distribution on w, we can use those ws as weights of neural networks. So if we have, for example, 10 samples, then each sample is a set of weights for some neural network.

And then, for a new image, we can just pass it through all these 10 neural networks and average their predictions to get an approximation of the full Bayesian inference with the integral.
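As a minimal sketch of this averaging step, here is a toy softmax "network" in NumPy; the 10 random weight matrices stand in for real posterior samples, and the shapes and seed are illustrative assumptions, not from the lecture:

```python
import numpy as np

def predict(x, w):
    # Toy stand-in for a neural network: one linear layer plus softmax.
    logits = x @ w
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
# Pretend these 10 weight matrices were sampled from the posterior p(w | data).
posterior_samples = [rng.normal(size=(4, 3)) for _ in range(10)]

x_new = rng.normal(size=4)
# Pass the new object through each sampled network and average the predictions:
# a Monte Carlo estimate of the predictive integral.
p_avg = np.mean([predict(x_new, w) for w in posterior_samples], axis=0)
```

Since every individual prediction is a valid probability vector, their average is one too, which is exactly what the predictive integral returns.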

And how can we sample from the posterior? Well, we know it up to the normalization constant, as usual. So, here this posterior distribution on w is proportional to the likelihood, so basically the predictions of the neural network with parameters w on the training data set, times the prior, p of w, which you can define as you wish, for example, just a standard normal distribution. And you would have to divide by the normalization constant, which you don't know.
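In symbols, that proportionality is:

```latex
p(w \mid X_{\text{train}}, Y_{\text{train}})
  \propto p(Y_{\text{train}} \mid X_{\text{train}}, w)\; p(w)
```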

But it's okay because Gibbs sampling doesn't care, right? So it's a valid approach, but I think the problem here is that Gibbs sampling, or Metropolis-Hastings sampling for that matter, depends on the whole data set to make its steps, right?

We discussed at the end of the previous video that sometimes Gibbs sampling is okay with using mini-batches to make moves, but sometimes it's not. And as far as I know, in Bayesian neural networks it's not a good idea to use Gibbs sampling with mini-batches.

So, we'll have to do something else. You know, when we run our Bayesian neural network on a large data set, we don't want to spend time proportional to the size of the whole data set at each iteration of training. We want to avoid that. So let's see what else we can do.

And here comes the really nice idea of something called Langevin Monte Carlo. It works as follows. Say we want to sample from the posterior distribution p of w given some data, so the training data X_train and Y_train. Let's start from some initial value for the weights w, and then on each iteration do updates like this. So here, we update our w to be our previous w, plus epsilon, which is kind of a learning rate, times the gradient of the logarithm of the posterior, plus some random noise.
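As a sketch of this update rule, here is Langevin sampling on a toy model where the gradient of the log posterior is easy to write down: Bayesian linear regression with a standard normal prior. The model, data sizes, step size, and burn-in length are all illustrative assumptions, not from the lecture, and I use the common convention of epsilon over two for the drift term:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y = X @ w_true + observation noise.
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(50, 3))
y = X @ w_true + 0.1 * rng.normal(size=50)

def grad_log_posterior(w):
    # Gradient of log p(w | data) for a Gaussian likelihood (unit variance)
    # and a standard normal prior: X^T (y - X w) - w.
    return X.T @ (y - X @ w) - w

eps = 1e-3          # the "learning rate" epsilon; must stay small
w = np.zeros(3)
samples = []
for t in range(5000):
    noise = np.sqrt(eps) * rng.normal(size=3)   # Gaussian noise with variance eps
    w = w + 0.5 * eps * grad_log_posterior(w) + noise
    if t >= 1000:                               # discard burn-in iterations
        samples.append(w.copy())

w_mean = np.mean(samples, axis=0)               # estimate of the posterior mean
```

With enough data, the posterior concentrates near the true weights, so the sample average lands close to w_true; dropping the noise term would turn this back into plain gradient ascent on the log posterior.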

So the first part of this expression is actually the usual gradient ascent applied to train the weights of your neural network. And you can see it here clearly. So if you look at the logarithm of your posterior p of w given data, it will be equal to the logarithm of the prior, plus the logarithm of the conditional distribution p of y given x and w. You can write it as follows by using the property of logarithms that the logarithm of a product is the sum of logarithms. And you should also have a normalization constant here, but it's a constant with respect to our optimization problem, so we don't care about it, right?

And in practice, the first term, the prior, if you take the logarithm of a standard normal distribution, for example, just gives some constant times the squared Euclidean norm of your weights w. So it's your usual weight decay, which people often use in neural networks. And the second term is the usual cross-entropy, the usual objective that people use to train neural networks.
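To make the two terms concrete, here is a small sketch with toy numbers (the weights and logits are made up for illustration): the log of a standard normal prior gives an L2 weight-decay term, and the log-likelihood of a class label under a softmax is minus the cross-entropy of that example.

```python
import numpy as np

def log_prior(w):
    # log N(w; 0, I) up to an additive constant: -0.5 * ||w||^2.
    # Maximizing this term is exactly L2 weight decay on the weights.
    return -0.5 * np.sum(w ** 2)

def log_likelihood(logits, label):
    # log p(y = label | x, w) for a softmax classifier
    # = minus the cross-entropy loss on this one example.
    shifted = logits - logits.max()
    return shifted[label] - np.log(np.sum(np.exp(shifted)))

w = np.array([0.5, -1.0, 2.0])          # toy weights
logits = np.array([2.0, 0.1, -1.0])     # toy network output for one image
# Unnormalized log posterior = log prior + log likelihood
# (the normalization constant is dropped; it doesn't affect the gradient).
log_post = log_prior(w) + log_likelihood(logits, label=0)
```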

So this particular update is actually a gradient descent, or ascent, with step size epsilon, applied to your neural network to find the best possible values for the parameters. But on each iteration, you add some Gaussian noise with variance epsilon, so proportional to your learning rate. And if you do that, and if you choose your learning rate to be infinitely small, you can prove that this procedure will eventually generate samples from the desired distribution, p of w given data.

So basically, if you omit the noise, you will just have the usual gradient ascent. And if you use an infinitely small learning rate, then you will simply go to the local maximum around the current point, right? But if you add the noise on each iteration, theoretically you can end up at any point in the parameter space, like any point. But of course, with higher probability, you will end up somewhere around the local maximum. If you do that, you will actually sample from the posterior distribution. So you will end up at points with high probability more often than at points with low probability.

In practice, you will never use an infinitely small learning rate, of course. But one thing you can do about it is to correct this scheme with Metropolis-Hastings. So you can say: theoretically, I should use an infinitely small learning rate. I use one that is not infinitely small, like 0.1, so I have to correct, since I'm sampling from the wrong distribution. And I can do a Metropolis-Hastings correction to reject some of the moves, and then guarantee that I will sample from the correct distribution.

But, since we want to do some large-scale optimization here and to work with mini-batches, we will not use this Metropolis-Hastings correction, because it's not scalable, and we'll just use a small learning rate and hope for the best. So this way, we will not actually get samples from the true posterior distribution on w, but we will be close enough if the learning rate is small enough, close enough to infinitely small, right?

So the overall scheme is as follows. We initialize the weights of your neural network, then we do a few iterations or epochs of your favorite SGD. But on each iteration, you add some noise, some Gaussian noise with variance equal to the learning rate, to your update. And notice here also that you can't change the learning rate at all, at any stage of your sampling, or you will break the properties of this Langevin Monte Carlo idea.

And then after doing a few iterations, like a hundred of them, you may say: okay, I believe that I have already converged. So let's collect the samples from the following iterations and use them as actual samples from the posterior distribution. That's the usual idea of Monte Carlo. And then finally, for a new point, you can just average the predictions of your hundred slightly different neural networks on this new object to get the prediction for your object.

But this is really expensive, right? So there is a really nice and cool idea that we can use a separate neural network that will approximate the behavior of this ensemble. So we are simultaneously training this Bayesian neural network and, simultaneously with that, using its behavior to train a student neural network, a usual one, that will try to mimic the behavior of the Bayesian neural network. There are quite a few details on how to do this efficiently, but it's really cool. So if you're interested in these kinds of things, check it out.
