0:00

In this video, I'm going to talk about the Bayesian interpretation of weight

penalties. In the full Bayesian approach, we try to

compute the posterior probability of every possible setting of the parameters of a

model. But there's a much reduced form of the

Bayesian approach, where we simply say, I'm going to look for the single set of

parameters that is the best compromise between fitting my prior beliefs about

what the parameters should be like, and fitting the data I've observed.

This is called Maximum a Posteriori learning, and it gives us a nice

explanation of what's really going on, when we use weight decay to control the

capacity of a model. I'm now going to talk a bit about what's

really going on, when we minimize the squared error during supervised maximum

likelihood learning. Finding the weight vector that minimizes

the squared residuals, that is the differences between the target value and

the value predicted by the net, is equivalent to finding a weight vector

that maximizes the log probability density of the correct answer.

1:14

In order to see this equivalence, we have to assume that the correct answer is

produced by adding Gaussian noise to the output of the neural net.

So the idea is, we make a prediction by first running the neural net on the input

to get the output, and then adding some Gaussian noise.

And then we ask, what's the probability that when we do that, we get the correct

answer? So the model's output is the center of a

Gaussian, and what we're interested in is for the target value to have high probability under that Gaussian, because the probability of producing the value t, given that the network gives an output of y, is just the probability density of t

under a Gaussian centered at y. So the math looks like this: let's suppose

that the output of the neural net on training case c is yc and this output is

produced by applying the weights W to the input c. The probability that we'll get

the correct target value when we add Gaussian noise to that output yc is given

by a Gaussian centered on yc. So we're interested in the probability

density of the target value, under a Gaussian centered at the output of the

neural net. And on the right here, we have that

Gaussian distribution, with mean yc. We also have to assume some variance, and

that variance will be important later. If we now take logs and put in a minus

sign, we see that the negative log probability density of the target value tc, given that the network outputs yc, is a constant that comes from the normalizing term of the Gaussian, plus the exponent taken with a minus sign, which is just (tc - yc)^2 divided by twice the variance of the Gaussian. So what you see is that, if our cost function is the negative log probability of getting the right answer, that turns into minimizing a squared distance. It's helpful to know that whenever you see a squared error being minimized, you can make a probabilistic interpretation of what's going on, and in that probabilistic interpretation, you'll be maximizing the log probability under a Gaussian.
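
In symbols (my rendering of the formula the lecture describes, with sigma squared the assumed variance of the output noise):

```latex
p(t_c \mid y_c) = \frac{1}{\sqrt{2\pi\sigma^2}}
                  \exp\!\left(-\frac{(t_c - y_c)^2}{2\sigma^2}\right),
\qquad
-\log p(t_c \mid y_c) = \frac{(t_c - y_c)^2}{2\sigma^2} + \tfrac{1}{2}\log\!\left(2\pi\sigma^2\right).
```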

So the proper Bayesian approach is to find the full posterior distribution over all possible weight vectors.

4:06

If there's more than a handful of weights, that's hopelessly difficult when you have

a non-linear net. Bayesians have a lot of ways of

approximating this distribution, often using Monte Carlo methods.

But for the time being, let's try and do something simpler.

Let's just try to find the most probable weight vector.

So the single setting of the weights that's most probable given the prior

knowledge we have and given the data. So what we're going to try and do is find

an optimal value of W by starting with some random weight vector, and then

adjusting it in the direction that improves the probability of that weight

vector given the data. It will only be a local optimum.
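
To make that procedure concrete, here is a minimal sketch (mine, not from the lecture) of MAP learning by gradient descent, using a linear model as a stand-in for the network and the Gaussian likelihood and prior that the rest of the lecture assumes; the names, step size, and variances are hypothetical.

```python
import numpy as np

def map_estimate(inputs, targets, noise_var=1.0, prior_var=1.0,
                 lr=0.01, steps=1000, seed=0):
    """Find a locally most-probable weight vector by gradient descent
    on the negative log posterior (squared error plus weight penalty)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=inputs.shape[1])          # start from a random weight vector
    for _ in range(steps):
        preds = inputs @ w                        # stand-in for running the net
        # Gradient of the negative log likelihood (Gaussian output noise)
        grad_data = -(inputs.T @ (targets - preds)) / noise_var
        # Gradient of the negative log prior (zero-mean Gaussian on the weights)
        grad_prior = w / prior_var
        w -= lr * (grad_data + grad_prior)        # move towards a local optimum
    return w
```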

4:55

Now, it's going to be easier to work in the log domain than in the probability

domain. So if we want to minimize a cost, we

better use negative log probs. Just an aside about why we maximize sums

of log probabilities, or minimize sums of negative log probs:

What we really want to do is maximize the probability of the data, which is

maximizing the product of the probabilities of producing all the target

values that we observed on all the different training cases.

5:30

If we assume that the output errors on different cases are independent, we can

write that down as the product over all the training cases, of the probability of

producing the target value, tc, given the weights.

That is the product of the probability of producing tc, given the output that we're

going to get from our network, if we give it input c and it has weights W.
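
In symbols (my transcription, writing the net's output on case c as y_c):

```latex
p(\mathrm{data} \mid W) \;=\; \prod_{c} p(t_c \mid W)
                        \;=\; \prod_{c} p(t_c \mid y_c),
\qquad
y_c = \text{the net's output on input } c \text{ with weights } W.
```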

5:58

The log function is monotonic, so it can't change where the maxima are. So instead of maximizing a product of probabilities, we can maximize the sum of log probabilities, and that typically works much better on a computer; it's much more stable. So we maximize the log probability of the data given the weights, which is simply maximizing the sum, over all training cases, of the log probability of the output for that training case, given the input and given the weights.
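
A quick illustration of the numerical stability point above (mine, with made-up probabilities):

```python
import numpy as np

# A thousand hypothetical per-case probabilities of about 0.01 each
probs = np.full(1000, 1e-2)

print(np.prod(probs))         # the product underflows to 0.0 in double precision
print(np.sum(np.log(probs)))  # the sum of logs is about -4605.2 and perfectly stable
```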

In Maximum a Posteriori learning, we're trying to find the set of weights that optimizes the trade-off between fitting

our prior and fitting the data. So that's Bayes' theorem.
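
Bayes' theorem for the weights, written out:

```latex
p(W \mid \mathrm{data}) \;=\; \frac{p(W)\, p(\mathrm{data} \mid W)}{p(\mathrm{data})}
```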

6:44

If we take negative logs to get a cost, we get that the negative log of the

probability of the weights given the data is the sum of the negative log of the prior term, the negative log of the data term, and an extra term.
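
Taking negative logs of Bayes' theorem gives the cost the lecture is describing:

```latex
-\log p(W \mid \mathrm{data}) \;=\; -\log p(\mathrm{data} \mid W) \;-\; \log p(W) \;+\; \log p(\mathrm{data})
```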

So, that last extra term is an integral over all possible weight vectors.

And so that doesn't affect W. So we can ignore it when we're optimizing

W. The term that depends on the data is the

negative log probability of the data given W, and that's our normal error term.

7:21

And the term that only depends on W is the negative log probability of W under

its prior. Maximizing the log probability of a weight

is related to minimizing a squared distance, in just the same way as

maximizing the log probability of producing the correct target value is related to minimizing the squared distance. So minimizing the squared weights is

equivalent to maximizing the log probability of the weights under a

zero-mean Gaussian prior. So here's a Gaussian.

It's got a mean zero, and we want to maximize the probability of the weights,

or the log probability of the weights. And to do that, we obviously want w to be

close to the mean zero. The equation for the Gaussian is just like

this, where the mean is zero so we don't have to put it in.
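
Here is that equation written out, using sigma squared W for the variance of the prior (the subscript is my labelling, matching the sigma squared D and sigma squared W used later in the lecture):

```latex
p(w) \;=\; \frac{1}{\sqrt{2\pi\sigma_W^2}}\,
           \exp\!\left(-\frac{w^2}{2\sigma_W^2}\right),
\qquad
-\log p(w) \;=\; \frac{w^2}{2\sigma_W^2} + \tfrac{1}{2}\log\!\left(2\pi\sigma_W^2\right).
```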

And the negative log probability of w is then the squared weight divided by twice the variance, plus a constant that comes from the normalizing term of the Gaussian and isn't affected when we change w. So finally, we can get to the Bayesian

interpretation of weight decay or weight penalties.

We're trying to minimize the negative log probability of the weights given the data, and that involves minimizing a term that depends on both the data and the weights, namely how well we fit the targets, and a term that depends only on the weights. The term that depends on how well we fit the targets is derived from the negative log probability of the data given the weights, which, if we assume Gaussian noise is added to the output of the model to make the prediction, is the squared distance between the output of the net and the target value, divided by twice the variance of that Gaussian noise. Similarly, if we assume we have a Gaussian

prior for the weights, the negative log probability of a weight under the prior is the squared value of that weight divided by twice the variance of the Gaussian prior.

9:40

So now let's take that equation and multiply through by two sigma squared D, that is, twice the variance of the Gaussian output noise.

So we get a new cost function. And the first term, when we multiply through, turns into simply the sum over all training cases of the squared difference between the

output of the net and the target. That's the squared error that we typically

minimize in the neural net. The second term now becomes the ratio of

two variances times the sum of the squares of the weights.
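
In symbols, with sigma squared D the variance of the output noise and sigma squared W the variance of the weight prior (my transcription of the two steps just described):

```latex
C \;=\; \frac{1}{2\sigma_D^2}\sum_{c}(y_c - t_c)^2 \;+\; \frac{1}{2\sigma_W^2}\sum_{i} w_i^2,
\qquad
2\sigma_D^2\, C \;=\; \sum_{c}(y_c - t_c)^2 \;+\; \frac{\sigma_D^2}{\sigma_W^2}\sum_{i} w_i^2 .
```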

And so what you see is, the ratio of those two variances is exactly the weight

penalty. So we initially thought of weight

penalties as just a number you make up to try and make things work better, where you fix the value of the weight penalty by using a validation set.

But now we see that if we make this Gaussian interpretation where we have a

Gaussian prior and we have a Gaussian model of the relation of the output of the

net to the target, then the weight penalty is determined by

the variances of those Gaussians. It's just the ratio of those variances.

It's not an arbitrary thing at all within this framework.
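
To close, a minimal sketch (mine, not from the lecture) of that cost as code; the names are hypothetical, and the two variances would come from your modelling assumptions rather than a validation set:

```python
import numpy as np

def map_cost(preds, targets, weights, noise_var, prior_var):
    """Cost after multiplying through by 2 * noise_var: the usual squared
    error plus a weight penalty whose coefficient is the ratio of variances."""
    weight_penalty = noise_var / prior_var        # not an arbitrary knob in this framework
    squared_error = np.sum((preds - targets) ** 2)
    return squared_error + weight_penalty * np.sum(weights ** 2)
```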