0:00

So far we've talked about MRF and CRF learning purely in the context of maximum likelihood estimation, where our goal is to optimize the likelihood function or the conditional likelihood function. But as for Bayesian networks, maximum likelihood estimation is not a particularly good regime, and it's very susceptible to over-fitting of the parameters to the specifics of the training data.

So what we'd like to do is utilize some of the ideas that we exploited also in the context of Bayesian networks, such as parameter priors to smooth out our estimates of the parameters, at least in the initial phases, before we have a lot of data to really drive us into the right region of the space. In the context of Bayesian networks, that was all great, because we could have a conjugate prior on the parameters, such as the Dirichlet prior, that we could then integrate with the likelihood to obtain a closed-form conjugate posterior, and it was all computationally elegant. But in the context of MRFs and CRFs, even the likelihood itself is not computationally elegant and can't be maintained in closed form. And so, therefore, the posterior is also not something that's going to be computationally elegant.

And so the question is, how do we then incorporate ideas such as priors into MRF

and CRF learning, so as to get some of the benefits of regularization?

So the idea here, in this context, is to use what's called MAP estimation, where we have a prior, but instead of maintaining a posterior in closed form, we're computing what's called the maximum a posteriori (MAP) estimate of the parameters. And this, in fact, is the same notion of MAP that we saw when we did MAP inference in graphical models, where we were computing a single MAP assignment. Here, continuing this theme of viewing Bayesian learning as a form of inference, we're computing a MAP estimate of the parameters.

So concretely, how is MAP estimation implemented in the context of MRF or CRF learning? A very typical solution is to define a Gaussian distribution over each parameter theta i separately, and that is usually a zero-mean, univariate Gaussian with some variance sigma^2.

And the variance sigma^2 dictates how firmly we believe that the parameter is close to zero. So for small variances, we are very confident that the parameter is close to zero and are going to be unlikely to be swayed by a limited amount of data, whereas as sigma gets larger, we're going to be more inclined to believe the data early on and move the parameter away from zero. So we have such a parameter prior over each theta i separately, and they're multiplied together to give us a joint parameter prior. So the prior over each parameter is going to look like this.

This sigma^2 is called a hyperparameter, and it's exactly the same kind of beast as we had for the Dirichlet hyperparameters in the context of learning Bayesian networks.
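As a minimal sketch of this prior (the function name is illustrative, not from the lecture), the log of a zero-mean univariate Gaussian can be computed as:

```python
import math

def gaussian_log_prior(theta, sigma2):
    """Log-density of a zero-mean univariate Gaussian N(0, sigma2) at theta."""
    return -theta**2 / (2 * sigma2) - 0.5 * math.log(2 * math.pi * sigma2)

# A smaller variance expresses a firmer belief that theta is near zero,
# so the same parameter value is penalized more heavily:
assert gaussian_log_prior(1.0, sigma2=0.1) < gaussian_log_prior(1.0, sigma2=10.0)
```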

The alternative prior that's also in common use is what's called the Laplacian parameter prior. And the Laplacian parameter prior looks kind of similar to the Gaussian, in that it has an exponential term that falls off as the parameter moves away from zero. But in this case, the rate of fall-off depends on the absolute value of theta i, and not on theta i squared, which is the behavior that we would have with the Gaussian.

And so this function looks as we see over here, with a much sharper peak around zero that effectively corresponds to a discontinuity in the derivative at theta i equals zero. And we have again such a Laplacian prior over each of the parameters theta i, which are multiplied together. Just like the Gaussian, this distribution has a hyperparameter, which in this case is often called beta.

And this hyperparameter, just like the variance sigma^2 in the Gaussian distribution, dictates how tight the distribution is around zero, where tighter distributions correspond to cases where the model is going to be less inclined to move away from zero based on a limited amount of data.
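In the same spirit, a sketch of the Laplacian log-density, using the standard zero-mean parameterization (1/(2 beta)) exp(-|theta|/beta) (again, the function name is mine, not the lecture's):

```python
import math

def laplacian_log_prior(theta, beta):
    """Log-density of a zero-mean Laplace distribution with scale beta."""
    return -abs(theta) / beta - math.log(2 * beta)

# A tighter prior (smaller beta) resists moving the parameter away from zero:
assert laplacian_log_prior(1.0, beta=0.1) < laplacian_log_prior(1.0, beta=1.0)
```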

So now, let's consider what MAP estimation would look like in the context of these two distributions. So here we have these two parameter priors rewritten, the Gaussian and the Laplacian. And now, MAP estimation corresponds to the arg max over theta of the joint distribution P(D, theta); that is, we're trying to find the theta that maximizes this joint distribution. And by the simple rules of probability theory, this joint distribution is the product of P(D | theta), which is our likelihood,

5:25

Multiplied by the prior P(theta). And because of the monotonicity of the logarithm function, this is the same as doing arg max over theta of our log-likelihood function, which is just the log of the likelihood, plus the log of the parameter prior.
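This decomposition can be sketched as follows (the function names are illustrative, and for a real MRF or CRF, evaluating the log-likelihood term would itself require inference over the model):

```python
def map_objective(theta, log_likelihood, sigma2):
    """MAP objective log P(D | theta) + log P(theta), here with a zero-mean
    Gaussian prior of variance sigma2 on each parameter (up to an additive
    constant that does not affect the arg max)."""
    log_prior = sum(-t**2 / (2 * sigma2) for t in theta)
    return log_likelihood(theta) + log_prior
```

Maximizing this objective over theta, for example by gradient ascent, then yields the MAP estimate of the parameters.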

So now let's look at what these logarithms look like in the context of these two priors. And here I'm actually drawing the negative log of P(theta); the negative log of P(theta) for a Gaussian distribution is simply a quadratic. And so we have a quadratic penalty that pushes the parameters towards zero. And this is the same, for those of you who've seen this in a machine learning class for example, as doing L2 regularization over the parameter theta i.

If you now do the same for the Laplacian distribution, we're going to get a penalty that pushes the parameter towards zero in a way that depends on the absolute value of the parameter. And this is equivalent to L1 regularization, which again some of you might have seen in a machine learning class. So this is a linear penalty. Now these penalties, although they both push the parameters towards zero, are actually quite different in terms of their behavior. So let's look at what L2 does.

When the parameter is far away from zero, sort of towards the edges, the quadratic penalty is actually quite severe and pushes the parameter down dramatically. But as the parameter gets small, closer to the case of theta i equals zero, the pressure to move down decreases. So effectively, it's quite legitimate for the model to have a lot of parameters that are close to zero but not quite zero, and there isn't a lot of push at that point to get the parameters to hit zero exactly. And so that means that models that use L2 regularization are going to be dense; that is, a lot of the theta i's are going to be non-zero. Conversely, when you look at L1

regularization, you see that there's a consistent push towards zero, regardless of where in the parameter space we are. And so it's in the interest of the model to keep pushing towards zero the parameters that don't significantly impact the log-likelihood component. And so for L1 regularization, we're going to get models that are sparse, and sparsity has its own value.

Because the sparser the model, the easier it is to do inference in general: sparsity corresponds, in principle, to removing features from the graphical model, which in turn can remove edges and make inference more efficient. And so L1 regularization, or the corresponding Laplacian penalty, is often used to make the model both more comprehensible as well as computationally more efficient.

So to summarize: when we're doing learning in undirected models, the parameter coupling that we have in the likelihood function prevents us from doing Bayesian estimation efficiently, that is, from maintaining a posterior distribution over the parameters. However, we can still use the notion of parameter priors to avoid the over-fitting that we would get with maximum likelihood estimation. And specifically, we're going to use MAP estimation as a substitute for MLE. The most typical priors used in this context are L2 and L1, or Gaussian and Laplacian, and both of these drive the parameters towards zero, which has the effect of preventing the model from going wild and using very drastic parameter values that allow it to over-fit the specific training data that we've seen. Now, while this behavior is shared between these two priors, there's also a significant difference between them, in that the L1, or Laplacian, prior can be shown, both empirically and provably, to induce sparse solutions, and therefore it performs for us a form of feature selection, or equivalently structure learning, in the context of MRFs and CRFs.