0:01

One very important variant of Markov networks, probably more commonly used at this point than anything else of this type, is what's called a conditional random field. You can think of a conditional random field as something that looks very much like a Markov network, but intended for a somewhat different purpose. So let's think about what we are trying

to do here. This class of models is intended to deal with what we call task-specific prediction: we have a set of input, or observed, variables X, and a set of target variables Y that we're trying to predict. The class of models is designed for cases where we always have the same types of variables as the inputs and the same types of variables as the targets, so that we are always solving what is essentially the exact same type of problem.

So let's think about two examples along these lines. Again, going back to the example of image segmentation: here the input is always a set of pixels with their values, which we can process to produce more expressive features like histograms of colors and textures; this is not really relevant at this point, although it will be in a minute. So my input X is the pixel values and the features processed from them, and my target variables Y are a class for every pixel, where by class I mean, you know, grass, sky, cow, water, and so on.

And we design the model so as to solve this problem of going from X to Y, and that is the problem we want to solve. In some of the text processing or NLP problems that we've discussed, the input X is the words in a sentence.

2:30

And the output that we're trying to derive is, for example, the labels of the words, where the labels might in this case be things like person, location, organization, and so on. Once again, the words are always our input, and the labels are always the output. So, what is the problem? Fine, we want to solve this kind of prediction problem.

Why not use just a regular old graphical model in the same way that we've done so

far? So let's think about what some of the

issues might be with that. Imagine that we are trying to predict Ci, the label of a particular superpixel i. We're going to process the features of that superpixel to produce a range of different color and texture histograms that represent different ways of measuring its appearance. So in this case, in the notation of the previous slide, Ci is my target Y, and these guys down here are my X's.

So what's the problem with this? The problem is that these features, which are often very informative about the classes of the pixels, are also very correlated with each other. For example, the texture histograms, which tell us roughly the direction in which lines go in a particular superpixel, are very redundant in terms of the kind of texture that they measure. And so if you have features that are very correlated with each other, where two variables or two features carry a lot of redundant information, and you represent this in a naive Bayes model as I drew here, where the features are independent

4:52

given the label, then I'm effectively ignoring that correlation structure. Now, why is that bad? It's bad because it allows me to count the same feature again and again and again. If I have five copies of a feature, or five very correlated features that effectively measure the same thing, I count it five times, which means I get really confident because of this one feature. If I have a hundred copies of that feature, it gets counted a hundred times. And so it pushes me towards very skewed probability distributions that are not really good reflections of my true beliefs, because they make incorrect independence assumptions. Well, fine.

So let's make correct independence assumptions. Let's add a bunch of edges to capture the correlations.

5:58

Well, that turns out to be really hard, because once you move beyond trivial features, figuring out how the features are correlated is hard; and furthermore, putting in all of these edges between the features gives rise to densely connected models.
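To make the overcounting concrete, here is a minimal sketch; the likelihood ratio of 2 is a made-up value standing in for one moderately informative feature:

```python
def naive_bayes_posterior(likelihood_ratio, n_copies, prior=0.5):
    """Posterior P(Y=1 | evidence) when one feature with likelihood ratio
    P(f | Y=1) / P(f | Y=0) is counted n_copies times, as if the redundant
    copies were independent given Y (the naive Bayes assumption)."""
    odds = (prior / (1 - prior)) * likelihood_ratio ** n_copies
    return odds / (1 + odds)

# One moderately informative feature: a mild push towards Y = 1.
print(naive_bayes_posterior(2.0, 1))    # 0.666...
# Five redundant copies of the same feature: already quite confident.
print(naive_bayes_posterior(2.0, 5))    # 0.969...
# A hundred copies: essentially certain, for no good reason.
print(naive_bayes_posterior(2.0, 100))  # ~1.0
```

The evidence is the same in all three calls; only the number of times the model counts it changes.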

So a completely different solution to this problem basically says: I don't care about the image features. I'm not trying to do image synthesis, so I don't want to model the probability distribution over those pixels; I don't care about the probability of having, you know, a green pixel next to another green pixel next to a brown pixel. I'm trying to use something that is already given to me, the features X, and the only thing I really care about is to model the distribution over Y. So I reformulate my problem: instead of modeling the joint distribution of X and Y together,

7:18

Instead of doing this, I'm going to model the conditional distribution of Y given X. Now we're not trying to capture the distribution over X, and if we're not trying to capture the distribution over X, then I don't really care about the correlations between the features. We'll see that a little bit more clearly when we look at an example, but before we do that, let's first give

ourselves the formal definition of this thing, which is called a CRF, standing for conditional random field. A conditional random field at first glance looks just like a Gibbs distribution: we have a set of factors, each with its scope, and just like in the Gibbs distribution, I multiply the factors to get an unnormalized measure. So far it's exactly the same, except that I have an X and a Y as the arguments of the unnormalized measure, rather than just a single set of random variables; this is where the difference comes in.

If I want to model a conditional distribution of Y given X, I'm going to have to put X on the right-hand side of the conditioning bar, which means I'm going to have a separate normalization constant, a partition function, that is a function of X. And so what does that mean?
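In symbols, with factors phi_i over scopes D_i as in a Gibbs distribution, this construction is:

```latex
\tilde{P}_\Phi(X, Y) = \prod_{i=1}^{k} \phi_i(D_i), \qquad
Z_\Phi(X) = \sum_{Y} \tilde{P}_\Phi(X, Y), \qquad
P_\Phi(Y \mid X) = \frac{\tilde{P}_\Phi(X, Y)}{Z_\Phi(X)}
```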

It means that for any given X, I sum over Y,

9:23

which means that for any given X, I'm going to take the sum over all of the Y's that go with that X, and then I'm going to construct the conditional distribution over Y for any given X (this one shouldn't have a tilde) by dividing by this X-specific partition function. So, to be a little bit more concrete,

let's imagine that I have a table here, and to make life simple for myself, this is my table over X and Y, with entries for X0Y0, X0Y1, X1Y0, and X1Y1. For any given x, say X0 with these two entries right here, I'm going to sum them up; that gives me the normalizing constant for these two entries together, and I'm going to divide by that Z of X0. And that guarantees that these two entries now sum to one, because I just divided them by their appropriate normalizing constant.

And similarly, I'm going to add up these two entries over here and divide each of them by Z of X1. And so now what I have defined is a family of conditional distributions,

11:10

one for each value of X. Now, it turns out that CRFs are highly related to a model that we have previously seen in many machine learning classes, and which also comes up as a CPD representation in Bayesian networks: the logistic function, or logistic model. In order to understand that connection, let's look at a very simple case where the Xi's and Y are all binary random variables, taking values in {0, 1}. And we're going to use in our log-linear

model a simple representation of the features, which is just an indicator function: the feature is 1 when Xi and Y are both 1, and 0 otherwise, and it has a parameter Wi. And now let's actually plug that into the

CRF representation and see what happens. So, let's look at phi i of Xi and Y = 1.

Now, in this case Y is always 1, and we have two choices: either Xi is 1, in which case the indicator function is 1, or Xi is 0, in which case the indicator function is 0. So you can rewrite this factor as exp(Wi Xi), because if Xi is 0 you get exp(Wi times 0), which is 1, and if Xi is 1 you get exp(Wi). On the other hand, if I make Y = 0, then my indicator function is 0, and at that point I get exp(0), which is equal to 1. So these are my two cases: the value of this factor for Y = 1, and its value for Y = 0. And now I can go ahead and compute my

unnormalized measure, and I'm again going to do two separate cases: the case where Y = 0 and the case where Y = 1. My unnormalized measure in the case Y = 1 is just the product of all of these factors up here, which gives rise to this exponential over here, because the product of a bunch of exponentials is the exponential of the sum. And in the case Y = 0, the product of a bunch of ones is clearly one, so that's pretty straightforward.

And so now, putting this all together to produce a normalized distribution, remember that my goal is P of Y, my binary variable Y, given all of the features X. And if you just put this together, the numerator is the unnormalized measure for Y = 1 with X, and it's divided by the unnormalized measure for Y = 1 with X plus the unnormalized measure for Y = 0 with X, which is my normalizing constant. Well, this is a very familiar shape: if we call this exponent Z, it's e to the Z over one plus e to the Z, which is the sigmoid function.
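This equivalence is easy to check numerically; here is a small sketch with made-up weights and binary inputs:

```python
import math

def crf_posterior(w, x):
    """P(Y = 1 | x) for the binary CRF with indicator features
    1{Xi = 1, Y = 1}: unnormalized measure exp(sum of Wi * Xi) for Y = 1,
    and 1 (a product of ones) for Y = 0."""
    unnorm_y1 = math.exp(sum(wi * xi for wi, xi in zip(w, x)))
    unnorm_y0 = 1.0
    return unnorm_y1 / (unnorm_y1 + unnorm_y0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [0.5, -1.2, 2.0]  # made-up weights
x = [1, 0, 1]         # made-up binary inputs
z = sum(wi * xi for wi, xi in zip(w, x))
print(crf_posterior(w, x), sigmoid(z))  # the two values agree
```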

15:31

So, broadening from this particular mathematical derivation: if we were to think about the model for the joint distribution of Y and X together, it would be a kind of naive Bayes model, because all we have are pairwise terms that relate Y to the Xi's. That is, if we weren't doing this conditional normalization, we would have a model that looks like this: the Y, and a pairwise feature that relates Y to X1, Y to X2, up to Y to Xn. That would be effectively a naive Bayes model because, since we have no pairwise features that connect the Xi's to each other, we would effectively have a model where the Xi's are independent given the Y.

So if I weren't doing a conditional model, if I just used this set of potentials to represent the joint distribution of X and Y, it would effectively be a naive Bayes model, and it would make very strong independence assumptions. But because I am modeling a conditional distribution, like this one, I have effectively removed from the analysis any notion of the correlations between the X's, and I'm just modeling how the X's come together to affect the probability of Y. And that's really the difference between a naive Bayes model and a logistic regression model.

And that same intuition extends to much richer classes of models, where I don't just have binary variables and a single Y, but rather a very rich set of Y's and X's. Nevertheless, this ability to ignore the distribution over the features and focus on the target variables allows me to exploit rich, correlated features without worrying about whether they are independent of each other or not. So, for example, going back to our notion

of CRFs for image segmentation: here we typically have very rich features of the variables. For example, when we think about the individual node potentials that relate the features Xi to the class label Yi, we don't worry about how correlated the features are.

We can have color histograms and texture features. You can have discriminative patches, like one that looks for the eye of a cow, for example. And all of these are going to be really correlated with each other, but I don't care.

You can even look at features that are outside of the superpixel. You can say: well, if it's green underneath, in a completely different superpixel, then maybe this one is more likely to be a cow or a sheep, because those tend to be on grass. That's cool, too. Those features are definitely correlated, because you're counting the same feature for my superpixel as well as for a different superpixel. That's fine; I don't care, because I'm not worried about the correlations between the superpixels.

So the correlations don't matter. You can even, and this is very commonly done, train your favorite discriminative classifier, a support vector machine, boosting, random forests, anything that you like, to predict the probability of Yi given a whole bunch of image features X. And that's fine, too. In fact, that is how one achieves high performance on most of these tasks: by training very strong classifiers for, in most cases, your node potentials, that is, the predictors for the individual variables, and then adding on top of that pairwise features, or not just pairwise but even higher-order features.

20:16

And note that the word "features" is unfortunately used for both: the image features X, and the f's that we use to define the log-linear model. So, hopefully, the intended meaning will be clear from context.
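As a sketch of that design (every name and number here is hypothetical), one can take per-superpixel class probabilities from any trained classifier and use their logs as node potentials, together with a simple pairwise agreement potential:

```python
import itertools
import math

classes = ["grass", "cow", "sky"]

# Hypothetical class probabilities for two adjacent superpixels, as they
# might come out of any discriminative classifier (SVM, boosting, ...).
node_probs = [
    {"grass": 0.7, "cow": 0.2, "sky": 0.1},
    {"grass": 0.3, "cow": 0.6, "sky": 0.1},
]
edges = [(0, 1)]  # pairs of adjacent superpixels

def log_score(labels):
    """Unnormalized log-measure: classifier outputs as node potentials,
    plus a pairwise potential that rewards agreeing neighbors."""
    node = sum(math.log(node_probs[i][y]) for i, y in enumerate(labels))
    pair = sum(0.5 for i, j in edges if labels[i] == labels[j])
    return node + pair

# Exhaustive MAP over the tiny label space (real models use better inference).
best = max(itertools.product(classes, repeat=len(node_probs)), key=log_score)
print(best)  # ('grass', 'cow')
```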

The same kind of idea applies when we do CRFs for language. Again, we usually have features that are very highly correlated with each other. So, for example, whether a word is capitalized is correlated with whether the word appears in some atlas or name list, and both are definitely correlated with whether the previous word is Mrs. or Mr. There are all sorts of other features that are often very correlated with each other. You can even see that the same word is used as a feature for more than one label. That's fine, it's not a problem, because we don't try to model the distribution over the words in the sentence, but rather the probability of the labels given the words.
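For instance, a hypothetical sketch of such correlated per-word indicator features (the name list is made up):

```python
NAME_LIST = {"smith", "jones"}  # a made-up stand-in for a name list

def word_features(words, i):
    """Indicator features for the word at position i. These tend to be
    highly correlated with one another, which is fine in a CRF that models
    P(labels | words) rather than the distribution over the words."""
    w = words[i]
    return {
        "is_capitalized": w[:1].isupper(),
        "in_name_list": w.lower() in NAME_LIST,
        "prev_is_title": i > 0 and words[i - 1] in {"Mr.", "Mrs."},
    }

feats = word_features(["Mrs.", "Smith", "spoke"], 1)
print(feats)  # all three correlated indicators fire for "Smith"
```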

21:17

So to summarize: a CRF looks deceptively like any other Gibbs distribution, but a subtle yet critical difference is that it is normalized differently. It is normalized so that you're creating a conditional distribution of Y given X. As we've seen, it subsumes the standard logistic regression model as a special case, but it has much richer expressive power. A key feature is that we don't need to model the distribution over variables that we don't care about, only the ones that we actually care about predicting. And a critical benefit of this is that you can design really, really expressive predictors for pieces of the model without worrying about incorrect independencies between different variables, which would be inevitable if you actually tried to model the joint distribution over these expressive features.