So far, we've talked about MRF and CRF learning purely in the context of maximum likelihood estimation, where our goal is to optimize the likelihood function or the conditional likelihood function. But as for Bayesian networks, maximum likelihood estimation is not a particularly good regime, and it's very susceptible to over-fitting of the parameters to the specifics of the training data.

So what we'd like to do is utilize some of the ideas that we exploited also in the context of Bayesian networks, such as parameter priors that smooth out our estimates of the parameters, at least in the initial phases, before we have a lot of data to really drive us into the right region of the space. In the context of Bayesian networks that was all great, because we could have a conjugate prior on the parameters, such as the Dirichlet prior, that we could then integrate with the likelihood to obtain a closed-form conjugate posterior, and it was all computationally elegant. But in the context of MRFs and CRFs, even the likelihood itself is not computationally elegant and can't be maintained in closed form, and so therefore the posterior is also not going to be computationally elegant.

And so the question is: how do we then incorporate ideas such as priors into MRF and CRF learning, so as to get some of the benefits of regularization?

So the idea here, in this context, is to use what's called MAP estimation, where we have a prior, but instead of maintaining a posterior in closed form, we're computing what's called the maximum a posteriori estimate of the parameters. This is, in fact, the same notion of MAP that we saw when we did MAP inference in graphical models, where we were computing a single MAP assignment. Here, continuing in this thread of viewing Bayesian learning as a form of inference, we're computing a MAP estimate of the parameters.

So, concretely, how is MAP estimation implemented in the context of MRF or CRF learning? A very typical solution is to define a Gaussian distribution over each parameter theta_i separately, usually a zero-mean, univariate Gaussian with some variance sigma^2.

The variance sigma^2 dictates how firmly we believe that the parameter is close to zero. For small variances, we are very confident that the parameter is close to zero and are unlikely to be swayed by a limited amount of data, whereas as sigma gets larger, we're going to be more inclined to believe the data early on and move the parameter away from zero. So we have such a parameter prior over each theta_i separately, and they're multiplied together to give us a joint parameter prior. So the prior over each parameter is going to look like this.
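For reference, a zero-mean, univariate Gaussian prior over theta_i with variance sigma^2 has density

    p(\theta_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\theta_i^2}{2\sigma^2} \right)

and the joint parameter prior is the product of these terms over all i.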

This sigma^2 is called a hyperparameter, and it's exactly the same kind of beast as the Dirichlet hyperparameters that we had in the context of learning Bayesian networks.

An alternative prior that's also in common use is what's called the Laplacian parameter prior. The Laplacian parameter prior looks kind of similar to the Gaussian, in that it has an exponential that falls off as the parameter moves away from zero. But in this case, the rate of falloff depends on the absolute value of theta_i rather than on theta_i^2, which is the behavior that we would have with the Gaussian.

And so this function looks as we see over here, with a much sharper peak around zero, which effectively corresponds to a discontinuity in the derivative at theta_i equals zero. Again, we have such a Laplacian prior over each of the parameters theta_i, which are multiplied together. Just like the Gaussian, this distribution has a hyperparameter, which in this case is often called beta.
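For reference, a zero-mean Laplacian prior over theta_i with hyperparameter beta has density

    p(\theta_i) = \frac{1}{2\beta} \exp\left( -\frac{|\theta_i|}{\beta} \right)

which is presumably the form written on the slide.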

The hyperparameter beta, just like the variance sigma^2 in the Gaussian distribution, dictates how tight this distribution is around zero, where tighter distributions correspond to cases where the model is going to be less inclined to move the parameter away from zero based on a limited amount of data.
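As a small numerical sketch (an illustration, not code from the course; the function names and hyperparameter values are made up for the example), the following Python snippet evaluates the log of each prior on a grid of parameter values. It makes visible that the Gaussian penalizes theta quadratically while the Laplacian penalizes it linearly in |theta|, and that smaller hyperparameters give a tighter prior around zero:

    import numpy as np

    def gaussian_log_prior(theta, sigma2):
        # log N(theta | 0, sigma2): penalty grows quadratically in theta
        return -0.5 * np.log(2.0 * np.pi * sigma2) - theta ** 2 / (2.0 * sigma2)

    def laplacian_log_prior(theta, beta):
        # log Laplace(theta | 0, beta): penalty grows linearly in |theta|
        return -np.log(2.0 * beta) - np.abs(theta) / beta

    thetas = np.linspace(-3.0, 3.0, 7)
    # Smaller sigma2 / beta -> tighter prior -> larger penalty for moving away from zero.
    print(gaussian_log_prior(thetas, sigma2=0.5))
    print(laplacian_log_prior(thetas, beta=0.5))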

So now, let's consider what MAP estimation would look like in the context of these two distributions. Here we have these two parameter priors rewritten, the Gaussian and the Laplacian. Now, MAP estimation corresponds to the arg max over theta of the joint distribution, P of D comma theta; that is, we're trying to find the theta that maximizes this joint distribution. And by the simple rules of probability theory, this joint distribution is the product of P of D given theta, which is our likelihood.