
In this video I'm going to describe how to use an RBM to model real-valued data.

The idea is that we make the visible units, instead of being binary stochastic units, linear units with Gaussian noise.

When we do this, we get problems with learning.

And it turns out a good solution to those problems is to then make the hidden units

be rectified linear units. With linear Gaussian units for the

visibles, and rectified linear units for the hiddens, it's quite easy to learn a restricted Boltzmann machine that makes a good model of real-valued data.

We first used restricted Boltzmann machines with the images of handwritten

digits. For those images, intermediate intensities caused by a pixel being only partially inked can be modelled quite well by probabilities, that is, numbers between zero and one that are actually the probability of a logistic unit being on.

So we treat partially inked pixels as having a probability of being inked.

This is incorrect but it works quite well.

However, it won't work for real images. In a real image, the intensity of a pixel is almost always almost exactly the average of its neighbours. So it's got a very high probability of being very close to that average, and a very small probability of being a little further away.

And you can't achieve that with a logistic unit.

Mean-field logistic units are unable to represent things like: the intensity is probably 69, but very unlikely to be 71 or 67. So we need some other kind of unit.

The obvious thing to use is a linear unit with Gaussian noise.

So we model pixels as Gaussian variables. We can still use alternating Gibbs sampling to run the Markov chain required for contrastive divergence learning, but we need to use a much smaller learning rate, otherwise the learning will tend to blow up.

The equation looks like this. The first term on the right-hand side is a kind of parabolic containment function that stops things blowing up.
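For reference, here is a reconstruction of the energy function being described, assuming the standard Gaussian-binary RBM form with σ_i the standard deviation of visible unit i:

```latex
E(\mathbf{v},\mathbf{h}) \;=\; \sum_{i\,\in\,\text{vis}} \frac{(v_i - b_i)^2}{2\sigma_i^2}
\;-\; \sum_{j\,\in\,\text{hid}} b_j h_j
\;-\; \sum_{i,j} \frac{v_i}{\sigma_i}\, h_j\, w_{ij}
```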

So the term in that sum contributed by the i-th visible unit is parabolic in shape. It looks like this: a parabola with its minimum at the bias of the i-th unit. As the i-th visible unit departs from that value, energy is added quadratically. So that term tries to keep the i-th visible unit close to its bias, b_i.

The interaction term between the visible and the hidden units looks like this. If you differentiate it with respect to v_i, you can see that you get a constant: it's the sum over all j of h_j w_ij, divided by σ_i. So that term, with its constant gradient, looks like this. And when you add together that top-down contribution to the energy, which is linear, and the parabolic containment function, you get a parabolic function, but with the mean shifted away from b_i.

And how much it's shifted depends on the slope of that blue line.

So the effect of the hidden units is just to push the mean to one side.
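Putting those two terms together, the conditional distribution of a visible unit given the hidden states is a Gaussian whose mean has been pushed away from the bias (this follows from the energy form assumed above):

```latex
p(v_i \mid \mathbf{h}) \;=\; \mathcal{N}\!\Big(\,b_i + \sigma_i \sum_j h_j w_{ij},\;\; \sigma_i^2\,\Big)
```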

It's easy to write down an energy function like this.

And it's easy to take derivatives of it. But when we try learning with it, we

often get problems. There were a lot of reports in the

literature that people could not get these Gaussian-binary RBMs to work.

And it is indeed extremely hard to learn tight variances for the visible units.

It took us a long time to figure out why it's so hard to learn those visible

variances. This picture helps.

Consider the effect that visible unit i has on hidden unit j. When visible unit i has a small standard deviation σ_i, that has the effect of exaggerating the bottom-up weights. That's because we need to measure the activity of i in units of its standard deviation.

So when the standard deviation is small, we need to multiply the weight by a lot.

If you look at the top-down effect of j on i, that's multiplied by σ_i.

So when the standard deviation of visible unit i is very small, the bottom-up effects get exaggerated and the top-down effects get attenuated.
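In symbols, the asymmetry being described, taken from the interaction term of the energy function assumed above:

```latex
\text{bottom-up input to } h_j:\;\; \sum_i \frac{v_i}{\sigma_i}\, w_{ij}
\qquad\qquad
\text{top-down input to } v_i:\;\; \sigma_i \sum_j h_j\, w_{ij}
```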

The result is that we have a conflict where either the bottom-up effects are much too big or the top-down effects are much too small.

And the result is that the hidden units tend to saturate and be firmly on or off

all the time, and this will mess up learning.

So the solution is to have many more hidden units than visible units.

That allows small weights between the visible and hidden units to have big top-down effects, because there are so many hidden units.

But of course, we really need the number of hidden units to change as that

standard deviation σ_i gets smaller. And on the next slide, we'll see how we

can achieve that. I'm going to introduce stepped sigmoid

units. The idea is we make many copies of each

stochastic binary hidden unit. All the copies have the same weights, and

the same learned bias, b. But in addition to that adaptive bias b, they each have a different fixed offset to the bias. The first copy has an offset of -0.5, the second copy has an offset of -1.5, the third copy has an offset of -2.5,

and so on. If you have a whole family of sigmoid

units like that, with the bias changed by one between neighbouring members of the

family, the response curve looks like this.

If the total input is very low, none of them are turned on.

As it increases, the number that get turned on increases linearly.

This means that as the standard deviation on the previous slide gets smaller, the

number of copies of each hidden unit that get turned on gets bigger, and we achieve just the effect we wanted: we get more top-down effect to drive those visible units that have small standard deviations.

Now it's quite expensive to use a big population of binary stochastic units

with offset biases, because for each one of them, we need to put the total input

through the logistic function, but we can make some fast approximations which work

just as well. So the sum of the activities of a whole

bunch of sigmoid units with offset biases, which is shown in that summation, is approximately equal to log of one plus e to the x, and that in turn is approximately equal to the maximum of nought and x. And we can add some noise to x if we want.

So the first term in the equation looks

like this. The second term looks like that.

And you can see that the sum of all those sigmoids in the first term will be a

curve like that. And we can approximate that by a linear

threshold unit that has a value of zero unless it's above threshold, in which case its value increases linearly with its input.
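Written out, the approximation being described, with σ(·) the logistic function and the offsets spaced by one as assumed above:

```latex
\sum_{n=1}^{N} \sigma\!\left(x - n + 0.5\right) \;\approx\; \log\!\left(1 + e^{x}\right) \;\approx\; \max(0,\, x)
```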

Contrastive Divergence Learning works well for the sum of a bunch of stochastic

logistic units with offset biases. And in that case, you get a noise variance in the output of that sum that's equal to the logistic function.

Alternatively, we can use that green curve

and use rectified linear units. They're much faster to compute because

you don't need to go through the logistic many times.

And contrastive divergence works just fine with those.
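As a rough illustration, here is a minimal sketch of one CD-1 update for an RBM with Gaussian visible units and noisy rectified linear hidden units, assuming all visible standard deviations are fixed at 1; the function and variable names are illustrative, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def noisy_relu(x):
    # Rectified linear unit with noise whose variance is the logistic of the input,
    # approximating a large population of logistic units with offset biases.
    return np.maximum(0.0, x + rng.normal(scale=np.sqrt(sigmoid(x))))

def cd1_step(v_data, W, b_vis, b_hid, lr=1e-4):
    """One CD-1 update for a Gaussian-visible / ReLU-hidden RBM (all sigma_i = 1)."""
    # Positive phase: hidden activities driven by the data.
    h_pos = noisy_relu(v_data @ W + b_hid)
    # Reconstruction: visibles are Gaussian with mean b_vis + h W^T and unit variance.
    v_recon = b_vis + h_pos @ W.T + rng.normal(size=v_data.shape)
    # Negative phase: hidden activities driven by the reconstruction.
    h_neg = noisy_relu(v_recon @ W + b_hid)
    # Contrastive divergence updates: <v h>_data - <v h>_recon.
    n = v_data.shape[0]
    W += lr * (v_data.T @ h_pos - v_recon.T @ h_neg) / n
    b_vis += lr * (v_data - v_recon).mean(axis=0)
    b_hid += lr * (h_pos - h_neg).mean(axis=0)
    return W, b_vis, b_hid
```

Note the small learning rate, which matches the earlier point that Gaussian visible units need a much smaller learning rate to keep the learning from blowing up.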

One nice property of rectified linear units is that if they have a bias of

zero, they exhibit scale equivariance. This is a very nice property to have for

images. What scale equivariance means is that if

you take an image x and you multiply all the pixel intensities by a scalar a,

then the representation of ax in the rectified linear units would be just a

times the representation of x. In other words, when we scale up all the

intensities in the image, we scale up the activities of all the hidden units but

all the ratios stay the same. Rectified linear units aren't fully

linear because if you add together two images, the representation you get is not

the sum of the representations of each image separately.
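A quick numerical check of both points, assuming zero biases (the array shapes and names here are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))                 # hypothetical weight matrix; biases are zero
relu = lambda z: np.maximum(0.0, z)

x, y, a = rng.normal(size=8), rng.normal(size=8), 3.0
print(np.allclose(relu(a * x @ W), a * relu(x @ W)))                # True: scale equivariance
print(np.allclose(relu((x + y) @ W), relu(x @ W) + relu(y @ W)))    # generally False: not additive
```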

This property of scale equivariance is quite similar to the property of

translational equivariance that convolutional nets have.

So if we ignore the pooling for now, then in a convolutional net, if we shift an image and look at the representation, the representation of the shifted image is just a shifted version of the representation of the unshifted image.

So in a convolutional net without pooling, translations of the input just

flow through the layers of the net without really affecting anything.

The representation of every layer is just translated.