0:00

In this video, I'm going to explain how a Boltzmann machine models a set of binary data vectors. I'm going to start by explaining why we might want to model a set of binary data vectors, and what we could do with such a model if we had it. And then I'm going to show how the probabilities assigned to binary data vectors are determined by the weights in a Boltzmann machine.

Stochastic Hopfield nets with hidden units, which we also call Boltzmann machines, are good at modelling binary data. So, given a set of binary training vectors, they can use the hidden units to fit a model that assigns a probability to every possible binary vector.

There are several reasons why you might like to be able to do that. If, for example, you had several distributions of binary vectors, you might like to look at a new binary vector and decide which distribution it came from. So you might have different kinds of documents, and you might represent a document by a number of binary features, each of which says whether there is more than zero occurrences of a particular word in that document. For different kinds of documents, you would expect to see different words, and maybe you'll see different correlations between words. And so you could use a set of hidden units to model the distribution for each kind of document. Then you could assign a test document to the appropriate class by seeing which class of document is most likely to have produced that binary vector.

You could also use Boltzmann machines for monitoring complex systems to detect unusual behavior. Suppose, for example, that you have a nuclear power station, and all of the dials on it are binary. So you get a whole bunch of binary numbers that tell you something about the state of the power station. What you'd like to do is notice that it's in an unusual state, a state that's not like states you've seen before. And you don't want to use supervised learning for that, because you really don't want to have any examples of states that cause it to blow up. You'd rather be able to detect that it's going into such a state without ever having seen such a state before. And you could do that by building a model of the normal states and noticing that this state is different from the normal states.

If you have models of several different distributions, you can compute the posterior probability that a particular distribution produced the observed data by using Bayes' theorem. So, given the observed data, the probability that it came from model i, under the assumption that it came from one of your models, is the probability that model i would have produced that data, divided by the same quantity summed over all models.

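Written out, and assuming equal prior probabilities for the models, that's:

p(\mathrm{Model}\ i \mid \mathrm{data}) = \frac{p(\mathrm{data} \mid \mathrm{Model}\ i)}{\sum_j p(\mathrm{data} \mid \mathrm{Model}\ j)}
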
Now I want to talk about two ways of producing models of data, in particular binary vectors. The most natural way to think about generating a binary vector is to first generate the states of some latent variables, and then use the latent variables to generate the binary vector. So in a causal model, we use two sequential steps. These are the latent variables, or hidden units, and we first pick the states of the latent variables from their prior distribution. Often in a causal model, these will be independent in the prior. So their probability of turning on, if they were binary latent variables, would just depend on some bias that each one of them has. Then, once we've picked a state for those, we would use those to generate the states of the visible units by using the weighted connections in the model. So this is a kind of neural-network causal generative model. It uses logistic units, and it uses biases for the hidden units and weights on the connections between hidden and visible units to assign a probability to every possible visible vector.

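Here's a rough sketch of that two-step generation in code. It isn't from the lecture; the network here (its size, biases, and weights) is a made-up placeholder, just to show the two sampling steps:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_causal_model(b_hid, b_vis, W, rng):
    # Step 1: pick the hidden states independently from their prior (biases only).
    p_hid = sigmoid(b_hid)
    h = (rng.random(b_hid.shape) < p_hid).astype(int)
    # Step 2: generate the visible units from the hidden states via the weights.
    p_vis = sigmoid(b_vis + W.T @ h)
    v = (rng.random(b_vis.shape) < p_vis).astype(int)
    return v, h

rng = np.random.default_rng(0)
b_hid = np.array([0.5, -0.5])      # hypothetical hidden biases
b_vis = np.zeros(3)                # hypothetical visible biases
W = rng.normal(size=(2, 3))        # hypothetical hidden-to-visible weights
v, h = sample_causal_model(b_hid, b_vis, W, rng)
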
The probability of generating a particular vector v is just the sum, over all possible hidden states, of the probability of generating that hidden state times the probability of generating v given that you've already generated that hidden state.

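In symbols, that sum is:

p(v) = \sum_h p(h)\, p(v \mid h)
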
So that's a causal model; factor analysis, for example, is a causal model using continuous variables. And it's probably the most natural way to think about generating data. In fact, some people, when they say generative model, mean a causal model like this. But a Boltzmann machine is a completely different kind of model. It's an energy-based model, and in this kind of model, you don't generate data causally.

4:55

It's not a causal generative model. Instead, everything is defined in terms of the energies of joint configurations of the visible and hidden units. There are two ways of relating the energy of a joint configuration to its probability. You can simply define the probability, by saying that the probability of a joint configuration of the visible and hidden variables is proportional to e to the negative energy of that joint configuration. Or you can define it procedurally, by saying we're going to define the probability as the probability of finding the network in that state after we've been updating all the stochastic binary units for long enough that we've reached thermal equilibrium. The good news is that those two definitions agree.

The energy of a joint configuration of the visible and hidden units has five terms in it. I've written the negative energy, to save having to put lots of minus signs. So the negative energy of the joint configuration (v, h), with vector v on the visible units and h on the hidden units, has bias terms, where v_i is the binary state of the i-th unit in vector v.

6:23

So those are the first two terms. Then there are the visible-visible interactions. And to avoid counting each of those interactions twice, we can just say we're going to run over indices i and j and make sure that i is always less than j. That avoids counting the interaction of something with itself, and also avoids counting pairs twice, so we don't have to put a half in front. Then there are the visible-hidden interactions, where w_ik is a weight on a visible-hidden connection. And then there are the hidden-hidden interactions.

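Putting those five terms together (the two bias terms, then the three pairwise terms), the negative energy is:

-E(v, h) = \sum_{i \in \mathrm{vis}} v_i b_i + \sum_{k \in \mathrm{hid}} h_k b_k + \sum_{i<j} v_i v_j w_{ij} + \sum_{i,k} v_i h_k w_{ik} + \sum_{k<l} h_k h_l w_{kl}
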
So the way we use the energies to define probabilities is that the probability of a joint configuration over v and h is proportional to e to the minus E(v, h). To make that an equality, we need to normalize the right-hand side by summing over all possible configurations of the visible and hidden units, and that's what the divisor there is. That's often called the partition function; that's what physicists call it. And notice it has exponentially many terms. To get the probability of a configuration of the visible units alone, we have to sum over all possible configurations of the hidden units. So p(v) is the sum, over all possible h's, of e to the minus the energy you get with that h, normalized by the partition function.

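So, writing that out:

p(v, h) = \frac{e^{-E(v, h)}}{Z}, \qquad Z = \sum_{u, g} e^{-E(u, g)}, \qquad p(v) = \frac{\sum_h e^{-E(v, h)}}{Z}

where the sum in Z runs over all joint configurations of the visible and hidden units.
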
I want to give you an example of how we compute the probabilities of the different visible vectors, because that'll give you a good feel for what's involved. It's all very well to see the equations, but I find that I understand it much better when I've worked through the computation. So let's take a network with two hidden units and two visible units. And we'll ignore biases, so we've just got three weights here. To keep things simple, I'm not going to connect the visible units to each other. So the first thing we do is write down all possible states of the visible units. I've put them in different colors, and I'm going to write each state four times, because for each state of the visible units, there are four possible states of the hidden units that could go with it. So that gives us sixteen possible joint configurations. Now, for each of those joint configurations, we're going to compute its negative energy, minus E. So if you look at the first line, when all of the units are on, the negative energy will be plus two, minus one, plus one, which is plus two.

9:12

So e to those negative energies are the un-normalized probabilities of the configurations; the probabilities are proportional to these. If we add all those up, we get 39.7, and then when we divide everything by 39.7 we get the probabilities of the joint configurations. There they all are. Now, if we want the probability of a particular visible configuration, we have to sum over all the hidden configurations that could go with it. And so we add up the numbers in each block. And now we've computed the probability of each possible visible vector in a Boltzmann machine that has these three weights in it.

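Here's a small sketch of that enumeration in code. I'm assuming the wiring of the example: a weight of +2 between the first visible and first hidden unit, -1 between the two hidden units, and +1 between the second visible and second hidden unit, which reproduces the +2 negative energy for the all-ones configuration and the 39.7 total above:

import itertools
import numpy as np

# Assumed weights for the example: v1-h1 = +2, h1-h2 = -1, v2-h2 = +1.
w_v1h1, w_h1h2, w_v2h2 = 2.0, -1.0, 1.0

configs, neg_energies = [], []
for v1, v2, h1, h2 in itertools.product([0, 1], repeat=4):
    # Negative energy: each weight times the product of the two units it connects.
    neg_E = w_v1h1 * v1 * h1 + w_h1h2 * h1 * h2 + w_v2h2 * v2 * h2
    configs.append((v1, v2, h1, h2))
    neg_energies.append(neg_E)

unnorm = np.exp(neg_energies)      # un-normalized probabilities e^(-E)
Z = unnorm.sum()                   # partition function, about 39.7
p_joint = unnorm / Z               # probability of each joint configuration

# Probability of each visible vector: sum over the hidden configurations in its block.
p_visible = {}
for (v1, v2, h1, h2), p in zip(configs, p_joint):
    p_visible[(v1, v2)] = p_visible.get((v1, v2), 0.0) + p
print(round(Z, 1), p_visible)
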
Now let's ask how we get a sample from the model when the network's bigger than that. Obviously, in the network we just computed, we can figure out the probability of everything because it's small. But when the network's big, we can't do these exponentially large computations. So if there are more than a few hidden units, we can't actually compute that partition function; there are too many terms in it. But we can use Markov chain Monte Carlo to get samples from the model, by starting from a random global configuration and then picking units at random and updating them stochastically based on their energy gaps, those energy gaps being determined by the states of all the other units in the network. If we keep doing that until the Markov chain reaches its stationary distribution, then we have a sample from the model. And the probability of that sample is related to its energy by the Boltzmann distribution; that is, the probability of the sample is proportional to e to the minus the energy.

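Here's a minimal sketch of that sampling procedure, assuming the whole network is given as one symmetric weight matrix W and bias vector b over all units, visible and hidden together. The weights at the bottom are just the three weights from the small example above, with the units ordered v1, h1, h2, v2:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_from_model(W, b, num_steps, rng):
    # Start from a random global configuration.
    s = rng.integers(0, 2, size=len(b))
    for _ in range(num_steps):
        i = rng.integers(len(b))                      # pick a unit at random
        # Energy gap for unit i (off versus on), determined by all the other units.
        energy_gap = b[i] + W[i] @ s - W[i, i] * s[i]
        # Turn the unit on with probability given by the logistic of the energy gap.
        s[i] = int(rng.random() < sigmoid(energy_gap))
    # If num_steps is large enough for the chain to reach its stationary
    # distribution, s is (approximately) a sample from the model.
    return s

rng = np.random.default_rng(0)
W = np.array([[0.,  2.,  0., 0.],
              [2.,  0., -1., 0.],
              [0., -1.,  0., 1.],
              [0.,  0.,  1., 0.]])   # symmetric weights, zero diagonal
b = np.zeros(4)                      # no biases, as in the example
sample = sample_from_model(W, b, num_steps=1000, rng=rng)
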
What about getting a sample from the posterior distribution over hidden configurations, given a data vector? It turns out we're going to need that for learning.