0:00

In this video, we're going to look at the issue of training deep autoencoders.

People thought of these a long time ago, in the mid-1980s.

But they simply couldn't train them well enough for them to do significantly

better than principal components analysis.

There were various papers published about them, but no good demonstrations of

impressive performance. After we developed methods of pre-training deep networks one layer at a time, Russ Salakhutdinov and I applied these methods to pretraining deep autoencoders,

and for the first time, we got much better representations out of deep

autoencoders than we could get from principal components analysis.

Deep autoencoders always seemed like a really nice way to do dimensionality

reduction because it seemed like they should work much better than principal

components analysis. They provide flexible mappings in both

directions, and the mappings can be non-linear.

Their learning time should be linear or better in the number of training cases.

And after they've been learned, the encoding part of the network is fairly

fast because it's just a matrix multiply for each layer.
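To make that concrete, here is a minimal sketch of the encoding pass as one matrix multiply plus a logistic nonlinearity per layer. The function names and the use of NumPy are purely illustrative, not code from the original work:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(pixels, weights, biases):
    """Map a batch of images to codes: one matrix multiply
    (plus a pointwise nonlinearity) per layer."""
    h = pixels
    for W, b in zip(weights, biases):
        h = logistic(h @ W + b)
    return h  # activities of the central code layer
```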

Unfortunately, it was very difficult to optimize deep autoencoders using back

propagation. Typically people tried small initial weights, and then the back-propagated gradient died, so for deep networks, they never got off the ground. But now we have much better ways to

optimize them. We can use unsupervised layer-by-layer pre-training, or we can simply initialize the weights sensibly, as in echo state networks.
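A rough numerical illustration of the problem: if we push an error signal back through a deep logistic net whose weights start out very small, the signal shrinks at every layer. The depth, width, and weight scales below are made up for the demo, not taken from any real experiment:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_norms(weight_scale, n_layers=10, width=100, seed=0):
    """Send an error signal back through a deep logistic net and record
    how large it is after each layer."""
    rng = np.random.default_rng(seed)
    Ws = [weight_scale * rng.standard_normal((width, width)) for _ in range(n_layers)]

    # Forward pass, keeping the activations for the sigmoid derivatives.
    acts = [rng.random(width)]
    for W in Ws:
        acts.append(logistic(acts[-1] @ W))

    # Backward pass: each layer multiplies the signal by sigmoid'(a) and W^T.
    delta = np.ones(width)
    norms = []
    for W, a in zip(reversed(Ws), reversed(acts[1:])):
        delta = (delta * a * (1.0 - a)) @ W.T
        norms.append(np.linalg.norm(delta))
    return norms

# With tiny initial weights the signal shrinks by orders of magnitude per layer
# and effectively vanishes; with a ten-times-larger scale it is many orders of
# magnitude stronger after the same depth.
print(gradient_norms(weight_scale=0.01)[-1])
print(gradient_norms(weight_scale=0.1)[-1])
```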

The first really successful deep autoencoders were learned by Russ Salakhutdinov and me in 2006.

We applied them to the MNIST digits. So we started with images with 784

pixels. And we then encoded those via three

hidden layers, into 30 real valued activities in a central code layer.

We then decoded those 30 real valued activities,

back to 784 reconstructed pixels. We used a stack of restricted Boltzmann machines to initialize the weights used for encoding, and we then took the transposes of those weights and initialized the decoding network with them.
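A greatly simplified sketch of that layer-by-layer pre-training, using one step of contrastive divergence (CD-1) per restricted Boltzmann machine. The learning rate, the full-batch update, the all-logistic units, and the hidden layer sizes in the usage comment are assumptions for illustration, not the settings of the original experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, epochs=10, lr=0.1):
    """Train one restricted Boltzmann machine with CD-1 (full-batch, for brevity)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)
    b_h = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: drive the hidden units from the data.
        h_prob = logistic(data @ W + b_h)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one reconstruction step.
        v_recon = logistic(h_sample @ W.T + b_v)
        h_recon = logistic(v_recon @ W + b_h)
        # Contrastive divergence updates.
        n = data.shape[0]
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n
        b_v += lr * (data - v_recon).mean(axis=0)
        b_h += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_v, b_h

def pretrain_stack(data, hidden_sizes):
    """Train a stack of RBMs, each one on the hidden activities of the one below."""
    weights, hidden_biases = [], []
    x = data
    for n_hidden in hidden_sizes:
        W, _, b_h = train_rbm(x, n_hidden)
        weights.append(W)
        hidden_biases.append(b_h)
        x = logistic(x @ W + b_h)  # hidden activities feed the next RBM
    return weights, hidden_biases

# Hypothetical usage with pixel values scaled to [0, 1]; the hidden layer
# sizes here are illustrative, not the ones from the original experiment:
# weights, hidden_biases = pretrain_stack(mnist_images, [1000, 500, 250, 30])
```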

So initially, the 784 pixels were reconstructed using a weight matrix that was just the transpose of the weight matrix used for encoding them. But after the four restricted Boltzmann machines had been trained and unrolled to give the transposes for decoding,

we then applied back propagation to minimize the reconstruction error of the

784 pixels. In this case we were using a

cross-entropy error, because the pixels were represented by logistic units.

So that error was back propagated through this whole deep net.
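Continuing the toy code above, here is a sketch of that unroll-and-fine-tune step: the decoder starts as the transposes of the encoder weights, and plain gradient descent on the cross-entropy reconstruction error is back-propagated through the whole unrolled net. The zero decoder biases, the learning rate, and the use of logistic units everywhere (including the code layer, which the lecture describes as linear) are simplifications of this sketch:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def unroll(enc_weights, enc_biases):
    """Build the full autoencoder: decoding weights start out as the
    transposes of the encoding weights (decoder biases start at zero here)."""
    weights = enc_weights + [W.T.copy() for W in reversed(enc_weights)]
    biases = enc_biases + [np.zeros(W.shape[1]) for W in weights[len(enc_weights):]]
    return weights, biases

def finetune_step(x, weights, biases, lr=0.01):
    """One step of gradient descent on the cross-entropy reconstruction error,
    back-propagated through the whole unrolled net."""
    # Forward pass.
    acts = [x]
    for W, b in zip(weights, biases):
        acts.append(logistic(acts[-1] @ W + b))
    recon = acts[-1]

    # With logistic output units and cross-entropy error, the error signal
    # at the output layer is simply (reconstruction - input).
    delta = (recon - x) / x.shape[0]
    for l in reversed(range(len(weights))):
        grad_W = acts[l].T @ delta
        grad_b = delta.sum(axis=0)
        if l > 0:  # propagate the error signal one layer further down
            delta = (delta @ weights[l].T) * acts[l] * (1.0 - acts[l])
        weights[l] -= lr * grad_W
        biases[l] -= lr * grad_b

    # Cross-entropy reconstruction error, averaged over the batch.
    eps = 1e-12
    return -np.mean(np.sum(x * np.log(recon + eps)
                           + (1 - x) * np.log(1 - recon + eps), axis=1))
```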

And once we started back-propagating the error, the weights used for reconstructing the pixels became different from the weights used for encoding the pixels, although they typically stayed fairly

similar. This worked very well.

3:24

So if you look at the first row, that's one random sample from each digit class.

If you look at the second row, that's the reconstruction of the random sample by

the deep autoencoder that uses 30 linear hidden units in its central layer.

So the data has been compressed to 30 real numbers and then reconstructed.

If you look at the eight, you can see that the reconstruction is actually better than the original eight. It's got rid of the little defect in the

eight because it doesn't have the capacity to encode it.

If you compare that with linear principal components analysis, you can see it's much

better. A linear mapping to 30 real numbers

cannot do nearly as good a job of representing the data.
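For comparison, the 30-dimensional linear baseline can be reproduced with ordinary PCA. This sketch uses scikit-learn and a random stand-in for the image matrix, purely to show what a linear mapping down to 30 numbers and back looks like; it is not the code used for the figure:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: one flattened 784-pixel image per row
# (replace with real MNIST images scaled to [0, 1]).
X = np.random.rand(1000, 784)

pca = PCA(n_components=30)
codes = pca.fit_transform(X)            # linear mapping down to 30 numbers
X_recon = pca.inverse_transform(codes)  # linear mapping back to 784 pixels

# Mean squared reconstruction error of the 30-dimensional linear code.
print(np.mean((X - X_recon) ** 2))
```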