We've previously talked about maximum likelihood estimation for Markov random fields. When we first defined Markov random fields, we also defined an extension of MRFs called conditional random fields, or CRFs. So now let's think about how the ideas that we developed in the context of MRFs can be utilized for maximum likelihood estimation of a conditional random field.

So, as a reminder, the point of the CRF was to compute the probability of a particular set of target variables Y given a set of observed variables X. The idea behind this was that the unnormalized density, p tilde of X, Y, is defined with exactly the same kind of parameterization as in the MRF case, whether as a product of factors or as a log-linear model. The difference here is that we have an X-specific partition function that guarantees that what we have is a normalized distribution over the Y variables given a particular X.
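As a rough sketch of this X-specific normalization, here is a tiny hypothetical log-linear CRF in which Z(x) is computed by enumerating assignments to Y only; the features, weights, and variable sizes below are invented for illustration, not taken from the lecture:

```python
import itertools
import math

# Toy log-linear CRF: p_tilde(x, y) = exp(sum_i theta_i * f_i(x, y)), and the
# partition function Z(x) = sum_y p_tilde(x, y) normalizes over Y only, with x fixed.

def features(x, y):
    # Two made-up features over one observed value x and two binary labels y = (y1, y2).
    return [x * y[0],                      # f1: observed value paired with label y1
            1.0 if y[0] == y[1] else 0.0]  # f2: agreement between the two labels

def conditional_prob(theta, x, y):
    score = lambda yy: math.exp(sum(t * f for t, f in zip(theta, features(x, yy))))
    Z_x = sum(score(yy) for yy in itertools.product([0, 1], repeat=2))  # sums over Y only
    return score(y) / Z_x

theta = [1.0, 0.5]
probs = [conditional_prob(theta, 1.0, y) for y in itertools.product([0, 1], repeat=2)]
print(sum(probs))  # the probabilities over Y given this x sum to 1
```

Note that Z(x) would have to be recomputed for a different observed x, which is exactly the point made below about the cost of CRF learning.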

So this partition function is the sum over all Ys of the unnormalized measure p tilde theta of X, Y, and so it normalizes only over the Ys and not over the Xs.

When learning CRFs, our data is now a set of pairs: a set of input variables, or observed features, X, and a set of target variables Y. In the training data, both of these are observed, and we'd like to use this to learn the conditional distribution of Y given X.

So the right objective function to apply in this context is what's often called the conditional log-likelihood, though it really should be called the log conditional likelihood, because the conditioning is inside the log. What we have in terms of this objective is the sum over all data instances m of the log of the conditional probability of Y given X.

So, if we look at the conditional log-likelihood and open it up, we end up with an expression that looks very similar to what we saw in the case of MRFs.

For the case of a log-linear model, we have a sum over i of the parameter theta i multiplied by the value of the feature fi. This is for a particular instance (Xm, Ym), so it's theta i times fi of Xm, Ym. And over here, we have a negative log of the partition function for Xm, and this uses the exact same derivation that we had for the case of an MRF.

And so now we can sum up over multiple data instances m and compute the derivative, relative to a particular parameter theta i, of one over M times the log conditional likelihood.

What we end up with is again an average, in this case of a difference of two terms: we have a summation over m, where the first term is the value of the feature function fi of Xm, Ym, and the second term is the expectation of that feature fi relative to Xm and Y.
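This per-instance gradient can be sketched by brute-force enumeration. The toy features, weights, and data below are made up for illustration; a real implementation would compute the model expectation by inference rather than enumeration:

```python
import itertools
import math

# Sketch of the CRF gradient: for each instance m,
#   d/d(theta_i) log P(y_m | x_m) = f_i(x_m, y_m) - E_theta[ f_i(x_m, Y) ],
# where the expectation fixes x_m and ranges over assignments to Y only.

def features(x, y):
    # Same made-up feature pair as before: observed value times y1, label agreement.
    return [x * y[0], 1.0 if y[0] == y[1] else 0.0]

def gradient(theta, data):
    grad = [0.0] * len(theta)
    for x, y in data:
        ys = list(itertools.product([0, 1], repeat=2))
        scores = [math.exp(sum(t * f for t, f in zip(theta, features(x, yy))))
                  for yy in ys]
        Z = sum(scores)  # x-specific partition function
        for i in range(len(theta)):
            expected = sum(s / Z * features(x, yy)[i] for s, yy in zip(scores, ys))
            grad[i] += features(x, y)[i] - expected  # empirical minus model expectation
    return [g / len(data) for g in grad]             # one over M average

data = [(1.0, (1, 1)), (0.0, (0, 0))]
print(gradient([0.0, 0.0], data))  # at theta = 0 the model expectations are uniform
```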

Now, it's important to look carefully at the second expression over here, E theta of fi of Xm, Y. Note that here the Xm is fixed, and the expectation is taken only over the variables Y. This is in contrast to the case of MRFs, where we were taking the expectation relative to the entire set of variables.

Let's see how this pans out in the context of a particular example. We're going to look at a really brain-dead model for image segmentation, where we just have two features, f1 and f2.

The first feature, f1, is a singleton feature, and it takes the value of the average green value of the pixels in a superpixel s, but only in cases where that superpixel is labeled with g for grass. Notice that this feature is repeated across all superpixels s in the image, so this is a model with shared parameters.

The second feature, f2, is a pairwise feature, and here we get a one in cases where two adjacent superpixels s and t are labeled with the same value, say both grass or both cow. This allows the model to learn how likely it is for superpixels that are adjacent to each other to have the same class label.
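The two shared features can be sketched as simple sums over the image; the superpixel adjacency, greenness values, and labels below are made-up illustrations:

```python
# Sketch of the two shared segmentation features from the example.

def f1_total(greenness, labels):
    # f1 summed over all superpixels s: average greenness of s, counted only
    # when s is labeled 'grass' (the parameter is shared across superpixels).
    return sum(g for g, lab in zip(greenness, labels) if lab == 'grass')

def f2_total(edges, labels):
    # f2 summed over adjacent pairs (s, t): one when the labels agree, shared
    # across pairs and across classes (cow/cow counts the same as grass/grass).
    return sum(1 for s, t in edges if labels[s] == labels[t])

greenness = [0.9, 0.8, 0.2]          # average green value per superpixel
labels = ['grass', 'grass', 'cow']   # one label per superpixel
edges = [(0, 1), (1, 2)]             # adjacency between superpixels

print(f1_total(greenness, labels))   # greenness summed over grass-labeled superpixels
print(f2_total(edges, labels))       # only the (0, 1) pair has matching labels
```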

Notice that this feature, too, is shared, both across pairs of superpixels and across different classes. That is, we have the same parameter for cow, cow as we do for grass, grass, or sky, sky.

So let's plug these two features into the model for the gradient that we had on the previous slide. This is the general equation, and let's see what happens for theta one.

So the partial derivative relative to the parameter theta one is again the difference of two expectations: an empirical expectation and a model expectation. This is for a particular instance m, so we're not currently averaging across multiple training images; we have a single image m that we're working over.

Here, because the model has shared parameters, the empirical expectation sums over all superpixels s in the image the product of the indicator function that superpixel s is labeled grass, multiplied by the average greenness of that superpixel.

The model-based expectation has exactly the same form: we're also summing over all superpixels s, and we similarly have a product of two terms. Only here, because in the model-based expectation we don't have the Y labels, we're computing the probability of the Y label taking the value grass. That is, for each superpixel s, we compute the probability that Ys takes the value grass, given what we know about the image, that is, the values Xm, and that gets multiplied by the average greenness again. So that's the gradient for the first parameter.

The gradient for the second parameter is very similar in spirit. Here, we're looking at the sum over all pairs of adjacent superpixels s and t, and the empirical expectation sums, over all those pairs, an indicator function of whether Ys is equal to Yt. So we get a one if Ys is the same as Yt, say they're both labeled grass or they're both labeled cow, and a zero otherwise. And in the model-based expectation term, we have the probability of that event, that is, the probability that Ys is equal to Yt given the image. Once again, that's summed up over all pairs of adjacent superpixels s and t.

So, again, in both cases we have a difference of two expectations: an empirical expectation and a model-based expectation.

Taking that, and thinking about it, let's compare the computational cost of two training regimes: one is MRF training, and the second is CRF training.

So in MRF training, remember that our gradient relative to a parameter theta i was the difference of two expectations. We've already pointed out that the computation of the second expectation, the model-based expectation, requires that we run inference at each gradient step, and we noted that could be pretty expensive.

Now let's see what happens for CRFs. Let's look at that gradient one more time, for the case of CRFs, and notice that here we have a summation over m, that is, a summation over all of the data instances, of a difference of two terms. The first is the feature value in that particular instance, and the second is the expected feature value in that particular instance, Xm.

The important observation to come out of this is that the second expectation, the expectation relative to theta, is different for each of our input instances. That means we need to rerun this inference for each and every one of our Xm's at each gradient step. So whereas MRF training requires one inference at each gradient step, CRF training requires M inferences at each gradient step, where M is the number of training instances. And so here, the inference required to do learning is considerably more expensive.

However, one needs to weigh the balance of all of the factors that contribute to computational cost.
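The contrast in inference calls per gradient step can be sketched schematically; the counter below is a stand-in for a real inference engine, and the loop structure is an illustration, not an actual training procedure:

```python
# Schematic comparison of inference calls per gradient step.
calls = {'mrf': 0, 'crf': 0}

def run_inference(kind):
    calls[kind] += 1  # stand-in for computing model expectations by inference

def mrf_gradient_step(data):
    # One inference serves every instance: E_theta[f_i] does not depend on any x_m.
    run_inference('mrf')

def crf_gradient_step(data):
    # E_theta[f_i(x_m, Y)] is different for each instance, so we infer per instance.
    for x_m, y_m in data:
        run_inference('crf')

data = [(None, None)] * 5   # M = 5 training instances (placeholder contents)
for _ in range(3):          # three gradient steps
    mrf_gradient_step(data)
    crf_gradient_step(data)

print(calls)  # MRF: 3 inferences; CRF: 3 * 5 = 15
```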

So, when we're doing inference over the conditional probability, P of Y given X, we only need to worry about the Y variables. The Xs are all fixed; their values are instantiated. And so the factors in the resulting model are factors only over the variables Y.

If we wanted to use MRF training instead, because, say, we decided that CRF training is too expensive due to the additional cost of inference, we would need to compute the joint probability P of Y, X, which might be much more complicated. And it might be much more complicated not only because we now have more variables in the model, both the Ys and the Xs, but also because, in many cases where we would like to apply CRFs, the distribution that we might have over the Xs becomes quite unmanageable if we want to compute a distribution over it.

To understand that, let's go back to the previous example and consider this very simple image segmentation model. Notice that the variable X sub s here is the average greenness of the superpixel. Now, in the context of CRFs, we don't need to compute a distribution over average greenness, because we're conditioning on Xs. The greenness of s is now fixed; it's instantiated. And we're only computing the probability distribution over the Ys.

But if we wanted to train this model as an MRF, we would need to somehow maintain a distribution over this average greenness. And since that's a continuous quantity, it actually requires that we think about parametric forms. Is it a Gaussian, or a mixture of Gaussians? It becomes quite hairy to manage.

And so, although CRF-based training in this case might have additional cost, it can also reduce the cost of any particular inference step, as well as avoiding a bunch of thorny issues regarding the parameterization of the distribution over various features.

So, to summarize: CRF learning, in

terms of the mathematical formulation, is very similar to MRF learning. The likelihood function has the same form, and it's concave. It's similarly optimized using gradient ascent, usually with the same L-BFGS algorithm. However, the gradient computation is now much more expensive, because it requires inference not just once per gradient step, but once per gradient step per data instance, as compared with once per gradient step in the context of MRFs.

But as we just said, the conditional model is often much simpler, so the inference calls for a CRF and for an MRF are not the same, and we're really comparing apples and oranges. And so, in the context of any given

model, one really needs to think about which of these models is more appropriate, based not only on the computational costs of training, but also on the overall performance and generalization.