0:00
In this video, I'm going to talk about applying deep autoencoders to document
retrieval. There was a method developed some time
ago called latent semantic analysis, which amounts to applying principal components
analysis to vectors of word counts extracted from documents.
The codes produced by latent semantic analysis can then be used for judging
similarity between documents, and so can be used for document retrieval.
Obviously, if deep autoencoders work much better than PCA, we would expect to be
able to extract much better codes using a deep autoencoder than using latent
semantic analysis. And Rus-lan Salakhutdinov and I showed
that, that was indeed the case. Using a big database of documents, we
showed that ten components extracted with a deep autoencoder are actually worth
more than 50 components extracted with a linear method, like latent semantic
analysis. We also showed that if you make the code
very small, having just two components, you can use those two components for
visualizing each document as a point in a two-dimensional map.
And this, again, works much better than just extracting the first two principal
components. To find documents that are similar to a
query document, the first thing we do is convert each
document into a big bag of words. In other words, we have a vector of word
counts that ignores the order of the words.
This clearly throws away quite a lot of information.
But it also retains a lot of information about the topic of the document.
We ignore words like "the" or "over", which are called stop words, because they
don't have much information about the topic.
So if you look on the right, I've done the counts for various words, and they're
actually the counts for the document on the left.
So, if you look at what words we have nonzero counts for,
they are "vector", "count", "query", "reduce", "bag", and "word", and
that tells you quite a lot about what the document is about.
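To make that concrete, here is a minimal Python sketch of the bag-of-words step; the vocabulary, stop-word list, and tokenizer are my own illustrative choices, not the ones used in the actual experiments.

```python
from collections import Counter
import re

# A tiny, illustrative stop-word list; real systems use a much longer one.
STOP_WORDS = {"the", "a", "an", "of", "over", "and", "or", "to", "in", "we"}

def bag_of_words(text, vocabulary):
    """Return a vector of word counts over `vocabulary`, ignoring word order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [counts[word] for word in vocabulary]

vocab = ["vector", "count", "query", "reduce", "bag", "word"]
doc = "We reduce each query to a bag of word counts, i.e. a count vector."
print(bag_of_words(doc, vocab))   # -> [1, 1, 1, 1, 1, 1]
```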
Now, we could compare the word counts of the query document with the word counts
of millions of other documents. But that would involve comparing quite
big vectors. In fact, we use vectors of size 2000.
So, that would be slow. Alternatively, we could reduce each query
vector to a much smaller vector that still contains most of the information
about the content. So, here's how we do the reduction.
We take the deep autoencoder and we compress the 2,000 word counts down to
ten real numbers, from which we can reconstruct the 2,000 word counts,
although we can't reconstruct them very well.
We train the neural network to reproduce its input vector as its output vector as
well as possible. And that forces it to put as much
information about the input into those ten numbers as possible.
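As a concrete illustration, here is a minimal sketch of that kind of autoencoder in Keras. The intermediate layer sizes, the optimizer, and training end-to-end with backpropagation alone (rather than the layer-by-layer pretraining described later) are my own simplifying assumptions, not the original setup.

```python
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(2000,))                 # word-count vector (normalized to probabilities, as described below)
h = layers.Dense(500, activation="sigmoid")(inputs)  # illustrative hidden-layer sizes
h = layers.Dense(250, activation="sigmoid")(h)
code = layers.Dense(10, name="code")(h)              # the ten-number code
h = layers.Dense(250, activation="sigmoid")(code)
h = layers.Dense(500, activation="sigmoid")(h)
outputs = layers.Dense(2000, activation="softmax")(h)  # reconstructed word probabilities

autoencoder = Model(inputs, outputs)                 # trained to reproduce its input
encoder = Model(inputs, code)                        # used later to get ten-number codes
autoencoder.compile(optimizer="adam", loss="categorical_crossentropy")
# autoencoder.fit(probs, probs, ...)   # input and target are the same probability vectors
```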
We can then compare documents using just ten numbers.
That's going to be much faster. So, there's one problem with this, which
is that word counts aren't quite the same as pixels or real values.
What we do is we divide the counts in a bag of words by the total number of
non-stop words and that converts the vector of counts into a probability
vector where the numbers add up to one. You can think of it as the probability of
getting a particular word if you picked a word at random in the document, as long
as that is not a stop word. So, at the output of the autoencoder, we're
using a great big 2,000-way softmax. And our target values are the word
probabilities we get when we reduce the count vector to a probability vector.
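A minimal sketch of that conversion, assuming the stop words have already been removed so that the total is the number of non-stop words:

```python
import numpy as np

def counts_to_probabilities(counts):
    """Divide a bag-of-words count vector by its total so the entries sum to one."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

# These probabilities are used as the targets for the 2,000-way softmax.
target = counts_to_probabilities([3, 0, 1, 0, 4])   # -> [0.375, 0., 0.125, 0., 0.5]
```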
There's one further trick we have to do. We treat the word counts as probabilities
when we're reconstructing them. But when we're using them to activate the
first hidden layer, we multiply all the weights by N.
And that's because we have N different observations from that probability
distribution. If we left them as probabilities, the
input units would have very small activities and wouldn't provide much
input to the first hidden layer. So, we have this funny property that for
the first restricted Boltzmann machine, the bottom-up weights are N times bigger
than the top-down weights. So, how well does this work?
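Before answering that, here is a sketch of the scaling just described; the weight matrix size and random values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(2000, 500))           # first-layer weights (illustrative size)
counts = rng.integers(0, 3, size=2000).astype(float)   # a made-up bag of words
N = counts.sum()                                       # number of non-stop words in the document
p = counts / N                                         # the probability vector

# Bottom-up: drive the first hidden layer with weights that are N times bigger,
# as if we had N observations from the probability distribution.
hidden_input = p @ (N * W)                             # exactly equivalent to counts @ W

# Top-down reconstruction uses the unscaled weights, treating the outputs as
# word probabilities via the 2,000-way softmax.
```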
We trained using bags of 2,000 words on 4,000 business documents from the Reuters
data set. And these documents had been
hand-labelled with about 100 different categories.
We first trained a stack of restricted Boltzmann machines, and then we
fine-tuned with backpropagation, using a 2,000-way softmax as the output.
And then we tested it on a different set of 4,000 documents.
5:09
And to test, you pick one document to be the query, one of the test documents,
and then you rank-order all the other test documents by using the cosine of the
angle between the ten-dimensional code vectors that the autoencoder gives you.
You repeat this for each of the 4,000 possible test documents, and then you plot
the number of documents you're going to retrieve, that is, how far down that rank
list you're going to go, against the proportion that are in the same hand-labelled
class as the query document. This is not a very good measure of the
quality of the retrieval. But we're going to use the same measure
for comparing with LSA. And so, at least, it's a fair comparison.
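A minimal sketch of that ranking step, assuming the ten-dimensional codes have already been extracted (the `codes` array below is hypothetical):

```python
import numpy as np

def rank_by_cosine(query_code, other_codes):
    """Rank documents by the cosine of the angle between their codes and the query's code."""
    query_code = np.asarray(query_code, dtype=float)
    other_codes = np.asarray(other_codes, dtype=float)
    cosines = (other_codes @ query_code) / (
        np.linalg.norm(other_codes, axis=1) * np.linalg.norm(query_code) + 1e-12)
    return np.argsort(-cosines)                # document indices, most similar first

# codes = encoder.predict(probability_vectors)   # e.g. 4,000 x 10 from the autoencoder
# ranking = rank_by_cosine(codes[0], codes[1:])
```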
So, here's the accuracy of the retrieval as a function of the number of retrieved
documents. You can see that the autoencoder, just
using a code with ten real numbers, is doing better than latent semantic
analysis using 50 real numbers. And, of course, it's five times less work
per document after you've got the code. Latent semantic analysis with ten real
numbers is much worse. We can also do the same thing where we
reduce to two real numbers, and then, instead of doing retrieval, we're just
going to plot all the documents in a map. But we're going to color each
two-dimensional point, given by the two numbers produced by PCA, according to the
class of the document. So, we took the major classes of the
documents. We gave those major classes different
colors. And then, we used PCA on log of one plus
the count. The point of doing that is that it
suppresses the influence of very big counts, which tends to make PCA work better.
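A sketch of that PCA baseline; using scikit-learn here is my own choice, not necessarily the original implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_two_numbers(count_vectors):
    """Map each bag-of-words count vector to two numbers for plotting."""
    squashed = np.log1p(np.asarray(count_vectors, dtype=float))  # log(1 + count)
    return PCA(n_components=2).fit_transform(squashed)

# points = pca_two_numbers(count_vectors)        # shape (n_documents, 2)
# plt.scatter(points[:, 0], points[:, 1], c=document_classes)   # color by class
```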
7:11
This is the distribution you get. As you can see, there is some separation
of the classes. The green class is in one place.
The red class is in a slightly different place.
But the classes are very mixed up. Then, we did the same thing by using a
deep autoencoder to reduce the documents to two numbers, and, again, plotting the
documents in a two-dimensional space using those two numbers as the
coordinates. And here's what we got.
It's a much better layout. It tells you much more about the
structure of the data set. You can see the different classes, and
you can see that they're quite well-separated.
We assume that the documents in the middle are ones which didn't have many
words in them, and therefore, it was hard to distinguish between the classes.
A visual display like this could be very useful. If, for example, you saw that one of
the green dots was the accounts and earnings reports from Enron, you probably wouldn't
want to buy shares in a company that has a green dot nearby.