Plus b_k. Since we don't want to write this expression down multiple times,

let's say that it equals some constant M,

and then find the unknown constant.

So now, I want to estimate the distribution q_k, but for now only up to a constant.

Let's take the exponent of both parts,

and also remember that the distribution q_k should sum to one.

In this case, it means that Q of plus one,

plus Q of minus one should be equal to one.

We can plug in this formula here.

So for y_k equal to plus one, I will have e to the power of M times some constant,

let's write it down as C,

plus the same constant C times e to the power of minus M, and this should be equal to one.

C here should be equal to one over the sum of e to the power of M

and e to the power of minus M. This is the value of the constant.

And finally, we can compute the probabilities.

So the probability that y_k equals plus one

equals e to the power of M over this normalization, that is,

e to the power of M plus e to the power of minus M. What is this function?

What do you think? We can multiply the numerator and the denominator by e to the power of minus M,

and we'll have one over one plus e to the power of minus 2M,

which actually equals the sigmoid function of 2M.
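As a quick numerical sanity check, here is a small sketch in plain Python (the value of M is arbitrary, chosen just for illustration):

```python
import math

def sigmoid(x):
    # Standard logistic sigmoid: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

M = 0.7  # an arbitrary value of the constant M

# q(y_k = +1) = e^M / (e^M + e^(-M))
q_plus = math.exp(M) / (math.exp(M) + math.exp(-M))

# Multiplying numerator and denominator by e^(-M) gives
# 1 / (1 + e^(-2M)), which is the sigmoid of 2M.
assert abs(q_plus - sigmoid(2 * M)) < 1e-12
```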

All right, so now we can update the distributions,

however, we need to do one more thing.

Notice here that we used the value of μj.

For the other nodes to be able to use these constants μ_j,

we need to update the value μ_k for our node.

We need to compute μ_k. This is the expectation of y_k.

It is simply q at plus one,

minus q at minus one.

We can again plug in similar formulas.

This would be e to the power of M minus e to

the power of minus M, over the normalization constant.

As you may notice,

this actually equals the hyperbolic tangent of M.
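The same identity can be checked numerically (plain Python, arbitrary M):

```python
import math

M = 0.7  # an arbitrary value of the constant M

# mu_k = E[y_k] = q(+1) - q(-1) = (e^M - e^(-M)) / (e^M + e^(-M))
mu_k = (math.exp(M) - math.exp(-M)) / (math.exp(M) + math.exp(-M))

# This ratio is exactly the hyperbolic tangent of M.
assert abs(mu_k - math.tanh(M)) < 1e-12
```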

Here's our final formula.

Let's see again how it works.

We iterate our nodes,

we select some node.

We compute the probabilities Q,

and then update the value of μk.

And when we update the probabilities q, we also use the values μ_j.

Note that here it is q_k,

which is consistent, since we're estimating the values for μ_k.
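The loop just described can be sketched as follows. This is not the course's reference implementation, just a minimal version assuming a 2D grid with 4-neighborhoods, coupling strength J, and per-node external fields b_k:

```python
import math

def mean_field_update(mu, b, J, n_iters=50):
    # mu: dict mapping grid node (i, j) -> current mean mu_k
    # b:  dict mapping grid node (i, j) -> external field b_k
    # J:  coupling strength between neighboring nodes
    nodes = list(mu)
    for _ in range(n_iters):
        for k in nodes:                      # iterate over the nodes
            i, j = k
            # 4-neighborhood on a 2D grid (an assumption of this sketch)
            neighbors = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            # M uses the current values mu_j of the neighbors, plus b_k
            M = b[k] + J * sum(mu[n] for n in neighbors if n in mu)
            # q_k(+1) = sigmoid(2M), and the new mean is mu_k = tanh(M)
            mu[k] = math.tanh(M)
    return mu

# Tiny 3x3 demo; the field values here are made up for illustration.
mu0 = {(i, j): 0.0 for i in range(3) for j in range(3)}
b = {k: 1.0 for k in mu0}                    # uniform positive field
result = mean_field_update(mu0, b, J=0.5)    # all means end up positive
```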

Now that we've derived an update formula,

let's see how this one will work for different values of J.

Here's our setup.

We have two areas,

the white one corresponds to the positive external field,

and the black one corresponds to the negative external field.

If J is 0,

then with probability one on the white area,

the spins would tend to be plus one.

On the black area,

the probability would be one for having minus one.

And everywhere else we'll have the probability 0.5 for each possible state.

It happens because there is no interaction between neighboring points since J is 0.
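In the J = 0 case the neighbor term drops out, so q_k(+1) reduces to the sigmoid of 2·b_k, which reproduces all three regimes just described (the field magnitudes below are illustrative):

```python
import math

def q_plus(b_k, J=0.0, neighbor_mu_sum=0.0):
    # With J = 0, M reduces to the external field b_k alone.
    M = b_k + J * neighbor_mu_sum
    return 1.0 / (1.0 + math.exp(-2.0 * M))

white = q_plus(5.0)    # strong positive field: probability close to 1
black = q_plus(-5.0)   # strong negative field: probability close to 0
empty = q_plus(0.0)    # no field: exactly 0.5 for each state
```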

If we have a negative J,

we'll get a chess-like field.

The neighboring points would try to have opposite signs,

so there will be black and white points side by side.

As we go further from the external field,

the interaction becomes weaker,

and so when we're really far away from the field,

the probability is actually nearly 0.5,

which indicates that each point can be either plus one or minus one.

All right. The final example is a strong positive J.

In this case, we'll get a picture like this,

one part would be white that means that we'll

have plus one with probability one on the left upper corner,

and everywhere else we'll have minus one with probability one.

So actually, this situation should be symmetric.

Why didn't we get the opposite picture, where the right lower corner

would be black, and everything else would be white?

This is actually a property of the KL divergence.

Here I have a [inaudible] bimodal distribution,

and I try to approximate it by minimizing the KL divergence.

There could be two possible cases.

One, on the left, is that minimizing the KL divergence would fit one mode,

and the second one is that we fit something in the middle.

What do you think will happen when we minimize the KL divergence

between the approximating distribution and [inaudible]?

Let's first of all see what are the properties of those two things.

The second one captures the statistics,

so it would have for example the correct mean.

However, the first one has

the very important property that the mode has high probability.

In the second example,

the probability of the mode is really low.

It seems that the mode is actually impossible, and so,

for many practical cases,

the first fit would be nicer and actually this is the case.
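A tiny numeric illustration of this preference (the distributions below are made-up five-state examples, not from the lecture):

```python
import math

def kl(q, p):
    # KL(q || p) = sum_z q(z) * log(q(z) / p(z))
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# A toy bimodal target over five discrete states.
p_star = [0.45, 0.04, 0.02, 0.04, 0.45]

# Candidate 1: concentrate all the mass on one mode.
q_one_mode = [0.70, 0.20, 0.06, 0.03, 0.01]
# Candidate 2: sit in the middle, matching the mean instead.
q_middle = [0.05, 0.20, 0.50, 0.20, 0.05]

# Minimizing KL(q || p*) prefers the single-mode fit.
assert kl(q_one_mode, p_star) < kl(q_middle, p_star)
```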

Let's see why it happens. All right.

So here's our KL divergence.

It is an integral of q of z times the logarithm of the ratio of q over p-star.

Let's see what happens if we assign

non-zero probability under Q to a point that has zero probability under P-star.

In this case, the KL divergence would take the value plus infinity.

And so, the KL divergence would try to avoid

giving non-zero probability to the regions that are impossible under P-star.

This is called the zero-avoiding property of the KL divergence,

and it turns out to be useful in many cases.
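This behavior is easy to see on a toy discrete example (made-up three-state distributions):

```python
import math

def kl(q, p):
    # KL(q || p): +infinity as soon as q puts mass where p is zero.
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0:
            continue            # 0 * log(0 / p) contributes nothing
        if pi == 0:
            return math.inf     # q > 0 on a region impossible under p
        total += qi * math.log(qi / pi)
    return total

p_star = [0.5, 0.5, 0.0]        # third state is impossible under p*
q_bad  = [0.4, 0.4, 0.2]        # puts mass on the impossible state
q_ok   = [0.5, 0.5, 0.0]        # avoids the zero of p*

assert kl(q_bad, p_star) == math.inf
assert kl(q_ok, p_star) == 0.0
```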