
Local structure that doesn't require full table representations is important in both directed and undirected models. How do we incorporate local structure into undirected models? The framework for that is called log-linear models, for reasons that will be clear in just a moment.

So whereas in the original representation of the unnormalized density we defined P tilde to be a product of factors phi_i(D_i), each of which is potentially a full table, now we're going to shift to a representation that uses a linear form. Here is a linear form that is subsequently exponentiated, and that's why it's called log-linear: because the logarithm is a linear function. So what is this form? It's a linear function built from things called coefficients and things called features.

Features, like factors, each have a scope, which is the set of variables on which the feature depends, but different features can have the same scope: you can have multiple features, all of which are over the same set of variables. Notice that each feature has just a single parameter wj that multiplies it.

So what does this give rise to? If we have a log-linear model, we can push the exponent through the summation, and that gives us something that is a product of exponential functions. You can think of each of these as effectively a little factor, but it's a factor that has only a single parameter wj.
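As a concrete sketch (not from the lecture; the variables, features, and weights here are made up for illustration), the log-linear form P_tilde(x) = exp(-sum_j wj fj(x)) can be computed either as one exponentiated sum or, equivalently, as a product of little exp(-wj fj) factors:

```python
import math

def unnormalized_density(assignment, features, weights):
    """Log-linear model: P_tilde(x) = exp(-sum_j w_j * f_j(x))."""
    energy = sum(w * f(assignment) for f, w in zip(features, weights))
    return math.exp(-energy)

# Hypothetical features over two binary variables X1, X2.
features = [
    lambda a: a["X1"],                          # scope {X1}
    lambda a: a["X2"],                          # scope {X2}
    lambda a: 1 if a["X1"] == a["X2"] else 0,   # scope {X1, X2}
]
weights = [0.5, -0.3, 1.2]   # one coefficient w_j per feature

p = unnormalized_density({"X1": 1, "X2": 0}, features, weights)

# Pushing the exponent through the sum gives a product of small factors,
# one per feature, each with its single parameter w_j:
prod = 1.0
for f, w in zip(features, weights):
    prod *= math.exp(-w * f({"X1": 1, "X2": 0}))
```

The two computations agree exactly, which is the equivalence the lecture describes.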

Since these are a little bit abstract, let's look at an example.

Specifically, let's look at how we might represent a simple table factor as a log-linear model. Here is a factor phi over two binary random variables, X1 and X2; a full table factor would have four parameters, a00, a01, a10, and a11. We can capture this factor in a log-linear model using a set of features that are indicator functions. This is an indicator function: it takes the value 1 if X1 is 0 and X2 is 0, and it takes 0 otherwise.

This is the general notion of an indicator function: it looks at the event or constraint inside the curly braces and returns a value of 0 or 1 depending on whether that event is true or not.

And so, if we wanted to represent this factor as a log-linear model, we can simply sum over all four combinations of values k and l, each of which is either 0 or 1, so we're summing over all four entries here. And we have a parameter, or coefficient, wkl that multiplies this feature. So we would have a summation of wkl terms, where w00 contributes only in the case that X1 is 0 and X2 is 0. So we would have exp(-w00) when X1 = 0 and X2 = 0, we would have exp(-w01) when X1 = 0 and X2 = 1, and so on and so forth.

And it's not difficult to convince ourselves that if we define wkl to be the negative log of the corresponding entries in this table, then that gives us right back the factor that we defined to begin with.

So this shows that this is a general representation, in the sense that we can take any factor and represent it as a log-linear model simply by including all of the appropriate features. But we don't generally want to do that; generally, we want a much finer-grained set of features. So let's look at some examples of features that people use in practice.
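Before moving on, the table-factor construction above can be sketched in code (the table entries here are made up): setting wkl = -log(a_kl) over indicator features recovers the original factor exactly.

```python
import math

# A hypothetical full table factor phi(X1, X2) over binary variables.
a = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 1.5, (1, 1): 3.0}

# Define w_kl as the negative log of the corresponding table entry.
w = {kl: -math.log(v) for kl, v in a.items()}

def phi_loglinear(x1, x2):
    # Sum of w_kl * indicator{X1 = k, X2 = l}; for any assignment
    # exactly one indicator fires, selecting its single coefficient.
    energy = sum(w_kl * (1 if (x1, x2) == kl else 0)
                 for kl, w_kl in w.items())
    return math.exp(-energy)
```

For every assignment, phi_loglinear(x1, x2) equals the table entry a[(x1, x2)].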

So here are the features used in a language model that we've discussed previously. First, let's just remind ourselves: we have two sets of variables. We have the variables Y, which represent the annotations for each word in a sequence, corresponding to the category of that word. So this is the beginning of a person name, this is the continuation of a person name, the beginning of a location, the continuation of a location, and so on, as well as a bunch of words that are none of person, location, or organization and are all labeled "other". So the value of Y tells us, for each word, what category it belongs to, because we're trying to identify people, locations, and organizations in a sentence. We have another set of variables, X, which are the actual words in the sentence. Now, we could use a full table representation that tries to relate each and every Y, with a full factor, to every possible word in the English language, but that is going to be very expensive, with a very large number of parameters.

And so instead we're going to define features that look, for example, at f of a particular Yi, which is the label for the ith word in the sentence, and Xi, which is that ith word. That feature might say, for example: the indicator function for Yi equals person and Xi is capitalized. That feature doesn't look at the individual words; it just looks at whether the word is capitalized. Now we have just a single parameter that looks only at capitalization, and it parameterizes how important capitalization is for recognizing that something is a person. We could also have another feature.
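A minimal sketch of the capitalization feature just described (the label string "person" and the capitalization test are illustrative choices, not from the lecture):

```python
def f_person_capitalized(y_i, x_i):
    """Indicator{Y_i = "person" and X_i is capitalized}."""
    return 1 if y_i == "person" and x_i[:1].isupper() else 0

# A single weight w_j multiplies this feature wherever it fires, so one
# parameter captures how strongly capitalization suggests a person name.
```

Note the feature never enumerates individual words; it only asks one yes/no question about the word.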

This is a different feature, which could be part of the same model, that says Yi is equal to location, or, to be a little more precise, this might be "beginning of person" and this might be "beginning of location", and Xi appears in some atlas. Now, there are things that appear in an atlas other than locations, but if a word appears in an atlas, there's presumably a much higher probability that it's actually a location. And so we might have, again, a weight for this feature that increases the probability of Yi being labeled in this way. So you can imagine constructing a very rich set of features, all of which look at certain aspects of the word, rather than enumerating all possible words and giving a parameter to each and every one of them. Let's look at some other examples of feature-based models. So, this is an example from statistical

physics, it's called the Ising model.

The Ising model looks at pairs of adjacent variables, so it's a pairwise Markov network, and it gives us a coefficient for their product. This is a case where the variables are binary, but not in the space {0, 1}; rather, they take values -1 and +1. So now we have a model parameterized by features that are simply the products of the values of adjacent variables. Where might this come up?

It comes up, for example, in modeling the spins of electrons in a grid. Here you have a case where the electrons can rotate in one direction or the other: the atoms marked with a blue arrow have one rotational axis, and those marked with a red arrow rotate in the opposite direction. This basically says that the probability distribution over the joint set of spins in the model depends on whether adjacent atoms have the same spin or opposite spins. Notice that 1 * 1 is the same as -1 * -1, so this feature really just looks at whether the two spins are the same or different, and there is a parameter that weighs same versus different.

That's what this feature represents. Depending on the value of this parameter, if it goes one way, we're going to favor systems where adjacent atoms spin in the same direction; if it goes the opposite way, we're going to favor atoms that spin in different directions. Those two cases are called ferromagnetic and antiferromagnetic, respectively.

Furthermore, you can define in these systems the notion of a temperature. The temperature here says how strong the connections are. Notice that as the temperature T grows, the wij's get divided by T and all shrink toward 0, which means that the strength of the connection between adjacent atoms effectively vanishes and they become almost decoupled from each other. On the other hand, as the temperature decreases, the wij's grow in magnitude and the coupling between adjacent atoms becomes stronger.

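A sketch of the Ising energy with a temperature parameter (the grid, the weights, and the sign convention here are assumptions for illustration; sign conventions vary across texts):

```python
import math

def ising_unnormalized(spins, pair_weights, temperature=1.0):
    """exp(sum over edges of (w_ij / T) * x_i * x_j), spins in {-1, +1}.

    With this sign convention, positive w_ij favors equal spins
    (ferromagnetic) and negative w_ij favors opposite spins
    (antiferromagnetic).
    """
    energy = sum((w / temperature) * spins[i] * spins[j]
                 for (i, j), w in pair_weights.items())
    return math.exp(energy)

# Hypothetical 2x2 grid with ferromagnetic (positive) weights.
edges = {("a", "b"): 1.0, ("a", "c"): 1.0, ("b", "d"): 1.0, ("c", "d"): 1.0}
aligned = {"a": 1, "b": 1, "c": 1, "d": 1}     # all spins agree
checker = {"a": 1, "b": -1, "c": -1, "d": 1}   # every edge disagrees
```

As T grows, the preference for the aligned configuration over the checkerboard one shrinks toward nothing: the spins decouple, exactly as described above.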

Another kind of feature that's used in lots of practical applications is the notion of a metric feature in an MRF. So what's a metric feature? This is something that comes up

mostly in cases where you have a bunch of random variables X that all take values in some joint label space V. For example, they might all be binary, or they might all take values one, two, three, four. What we'd like, when Xi and Xj are connected to each other by an edge, is for Xi and Xj to take similar values. In order to enforce that, we need a notion of similarity.

We're going to encode that using a distance function mu that takes two values, one for Xi and one for Xj, and says how close they are to each other. So what does a distance function need to be? First of all, it needs to be a non-negative function. Beyond that, it needs to satisfy the standard conditions on a distance function, or metric. The first is reflexivity, which means that if the two variables take on the same value, then the distance must be zero.

Symmetry means that the distances are symmetrical: the distance between two values v1 and v2 is the same as the distance between v2 and v1. And finally there is the triangle inequality, which says that the distance between v1 and v2 is no greater than the distance between v1 and v3 plus the distance from v3 to v2 — the standard triangle inequality.

If a distance satisfies the first two conditions, it's called a semimetric; if it satisfies all three, it's called a metric. Both are actually used in practical applications.
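On a finite label space these conditions can be checked mechanically; here is a sketch (the label spaces and distance functions in the usage examples are illustrative):

```python
import itertools

def classify_distance(labels, mu):
    """Return "metric", "semimetric", or "neither" for mu on `labels`."""
    pairs = list(itertools.product(labels, repeat=2))
    # Non-negativity, reflexivity, and symmetry -> at least a semimetric.
    semimetric = (all(mu(v, v) == 0 for v in labels) and
                  all(mu(a, b) >= 0 and mu(a, b) == mu(b, a)
                      for a, b in pairs))
    # Adding the triangle inequality -> a full metric.
    triangle = all(mu(a, b) <= mu(a, c) + mu(c, b)
                   for a, b, c in itertools.product(labels, repeat=3))
    if semimetric and triangle:
        return "metric"
    return "semimetric" if semimetric else "neither"
```

For instance, absolute difference on {0, 1, 2} is a metric, while squared difference satisfies the first two conditions but violates the triangle inequality (4 > 1 + 1), making it only a semimetric.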


So, conversely, the farther two values are from each other under the distance metric, the lower the probability in the model.

So here are some examples of metric MRFs. The simplest possible metric is one that gives a distance of 0 when the two classes are equal to each other and a distance of 1 everywhere else; it's just like a step function. That gives rise to a potential that looks like this: we have zeros on the diagonal, so we get a bump in the probability when the two adjacent variables take on the same label, and otherwise we get a reduction in the probability, but it doesn't matter which particular values they take.

That's one example of a simple metric. A somewhat more expressive example comes up when the values V are actually numerical, in which case you can look at the difference between them: when vk is equal to vl, the distance is 0, and then you have a linear function that increases as vk and vl grow apart; this is the absolute value of vk minus vl. A more interesting notion that comes up a lot in practice is when you don't want to arbitrarily penalize things that are far away from each other in label space. This is what's called a truncated linear penalty: beyond a certain threshold, the penalty just becomes constant, so it plateaus. There is a penalty, but it doesn't keep increasing as the labels get

further from each other. One example where metric MRFs are used is image segmentation, where we tend to favor segmentations in which adjacent superpixels take the same class. Here we have no penalty when the superpixels take the same class and some penalty when they take different classes. This is a very common, albeit simple, model for image segmentation.
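The three distance functions discussed above can be sketched as follows (numerical labels are assumed for the linear variants):

```python
def zero_one(vk, vl):
    """Distance 0 when the labels agree, 1 otherwise (a step function)."""
    return 0 if vk == vl else 1

def linear(vk, vl):
    """Absolute difference |vk - vl| for numerical labels."""
    return abs(vk - vl)

def truncated_linear(vk, vl, threshold):
    """Linear in |vk - vl| up to a threshold, then a constant plateau."""
    return min(abs(vk - vl), threshold)
```

In the MRF, each pairwise potential would then be something like exp(-w * mu(xi, xj)) for the chosen distance mu, so larger distances mean lower probability.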

Let's look at a different MRF, also in the context of computer vision. This is an MRF used for image denoising. Here we have a noisy version of a real image: you can see this kind of white noise overlaid on top of the image, and what we'd like to do is get a cleaned-up version of the image. So we have a set of variables X that correspond to the noisy pixels and a set of variables Y that correspond to the clean pixels, and we'd like a probabilistic model that relates X and Y. We'd like to have two effects on the pixels Y. First, we'd like Yi to be close to Xi; but if you just do that, then you're just going to stick with the original image. So the main constraint we can employ on the image in order to clean it up is the fact that adjacent pixels tend to have the same value. In this case, we're going to constrain the Yi's to be close to their neighbors, and the further away a pixel is from its neighbors, the bigger the penalty. That's a metric MRF. Now, we

could use just a linear penalty, but that would be a very fragile model, because obviously the right answer isn't an image where all pixels are equal to each other in their actual intensity values; that would just be a single grayish-looking image. What you'd like is to let one pixel depart from its adjacent pixel if it's getting pulled in a different direction, either by its own observation or by other adjacent pixels. And so the right model to use here is actually the truncated linear model, which is commonly used and very successful for image denoising.
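A sketch of how the two effects combine into a denoising energy, on a 1D row of pixels with made-up weights (lower energy corresponds to higher probability):

```python
def denoising_energy(x, y, w_data=1.0, w_smooth=1.0, threshold=2.0):
    """Data term pulls each y_i toward its noisy observation x_i; a
    truncated linear smoothness term pulls adjacent y_i together but
    plateaus, so genuine intensity edges are not smoothed away.
    """
    data = sum(abs(yi - xi) for yi, xi in zip(y, x))
    smooth = sum(min(abs(y[i] - y[i + 1]), threshold)
                 for i in range(len(y) - 1))
    return w_data * data + w_smooth * smooth

noisy = [0, 1, 9, 10]        # hypothetical noisy pixel row with an edge
keep_edge = [0, 0, 10, 10]   # reconstruction that preserves the edge
flatten = [5, 5, 5, 5]       # reconstruction that smooths everything gray
```

Because the smoothness penalty is truncated, the one legitimate intensity jump costs only the threshold, so the edge-preserving reconstruction wins over the flat gray one.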

Interestingly, almost exactly the same idea is used in the context of stereo reconstruction. There, the values you'd like to infer, the Yi's, are the depth disparity for a given pixel in the image: how deep it is. Here, also, we have spatial continuity: we'd like the depth of one pixel to be close to the depth of an adjacent pixel. But once again, we don't want to enforce this too strongly, because you do have depth discontinuities in the image, and eventually you'd like things to be able to break away from each other. So once again, one typically uses some kind of truncated linear model for stereo reconstruction, often augmented by other little tricks. For example, we might look at the actual pixel appearance, say, the color and texture: if the color and texture of adjacent pixels are very similar to each other, we might want a stronger similarity constraint, whereas if they are very different from each other, the pixels are more likely to belong to different objects, and we don't want to enforce quite as strong a similarity constraint.