0:00

The next few videos are about using the back propagation algorithm to learn a

feature representation of the meanings of words.

I'm going to start with a very simple case from the 1980s, when computers were very

slow. It's a small case, but it illustrates the

idea about how you can take some relational information, and use the back

propagation algorithm to turn relational information into feature vectors that

capture the meanings of words. This diagram shows a simple family tree,

in which, for example, Christopher and Penelope marry, and have children Arthur

and Victoria. What we'd like is to train a neural

network to understand the information in this family tree.

We've also given it another family tree of Italian people which has pretty much the

same structure as the English tree. And perhaps when it tries to learn both

sets of facts, the neural net is going to be able to take advantage of that analogy.

The information in these family trees can be expressed as a set of propositions,

if we make up names for the relationships depicted by the trees.

So we're going to use the relationships son/daughter, nephew/niece, father/mother,

uncle/aunt, brother/sister, and husband/wife.

And using those relationships we can write down a set of triples such as: Colin has

father James, Colin has mother Victoria, and James has wife Victoria.
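In code, such triples are just tuples. Here is a minimal Python sketch of the representation and of the completion task the lecture goes on to describe, using only the three facts just given (the helper function name is mine):

```python
# The facts from the lecture, stored as (person 1, relationship, person 2) triples.
triples = [
    ("Colin", "father", "James"),
    ("Colin", "mother", "Victoria"),
    ("James", "wife", "Victoria"),
]

def complete(person, relationship, facts):
    """The learning task: given the first two terms, find the third."""
    return [p2 for p1, rel, p2 in facts if p1 == person and rel == relationship]
```
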

And in the nice simple families depicted in the diagram, the third proposition

follows from the previous two. Similarly, the third proposition in the

next set follows from the previous two. So the relational learning task, that is,

the task of learning the information in those family trees, can be viewed as

figuring out the regularities in a large set of triples that express the

information in those trees. Now the obvious way to express

regularities is as symbolic rules. For example, X has mother Y, and Y has

husband Z, implies X has father Z. We could search for such rules, but this

would involve a search through quite a large space, a combinatorially large

space, of discrete possibilities. A very different way to try and capture

the same information is to use a neural network that searches through a continuous

space of real valued weights to try and capture the information.
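To make the symbolic-rule alternative concrete, here is a toy Python sketch (the code is mine, not from the lecture) of applying the rule just stated, X has mother Y and Y has husband Z implies X has father Z, to a small fact set:

```python
# The symbolic rule from above:
#   X has mother Y, and Y has husband Z, implies X has father Z.
def infer_fathers(facts):
    """Apply the rule to a set of (person, relationship, person) triples."""
    husbands = {y: z for (y, rel, z) in facts if rel == "husband"}
    return {(x, "father", husbands[y])
            for (x, rel, y) in facts if rel == "mother" and y in husbands}

facts = {("Colin", "mother", "Victoria"), ("Victoria", "husband", "James")}
```

A rule learner would have to search the combinatorially large space of all such rules; the network instead adjusts real-valued weights by gradient descent, as described next.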

And the way it's going to do that is we're going to say it's captured the information

if it can predict the third term of a triple from the first two terms.

So at the bottom of this diagram, we're going to put in a person and a

relationship and the information is going to flow forwards through this neural

network. And what we are going to try to get out of

the neural network after it's learned is the person who's related to the first

person by that relationship. The architecture of this net was designed

by hand; that is, I decided how many layers it should have.

And I also decided where to put bottlenecks to force it to learn interesting

representations. So what we do is we encode the information

in a neutral way, because there are 24 possible people.

So the block at the bottom of the diagram that says, local encoding of person one,

has 24 neurons, and exactly one of those will be turned on for each training case.
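A minimal sketch of this local encoding in Python/NumPy. The twelve relationship names are the ones given earlier; the person names are placeholders, since the lecture doesn't list all 24 individuals:

```python
import numpy as np

# The person names are placeholders (the lecture doesn't list all 24 people);
# the relationship names are from the lecture.
PEOPLE = [f"person_{i}" for i in range(24)]
RELATIONS = ["son", "daughter", "nephew", "niece", "father", "mother",
             "uncle", "aunt", "brother", "sister", "husband", "wife"]

def one_hot(index, size):
    """Local encoding: exactly one unit on, so all codes are equally dissimilar."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

person_input = one_hot(PEOPLE.index("person_3"), len(PEOPLE))        # 24 units
relation_input = one_hot(RELATIONS.index("father"), len(RELATIONS))  # 12 units
```
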

Similarly there are twelve relationships. And exactly one of the relationship units

will be turned on. And then for a relationship that has a

unique answer, we would like exactly one of the 24 output people at the top to

turn on to represent the answer. By using a representation in which exactly

one of the neurons is on, we don't accidentally give the network any

similarities between people. All pairs of people are equally

dissimilar. So, we're not cheating by giving the

network information about who's like who. The people, as far as the network is

concerned, are uninterpreted symbols. But now in the next layer of the network,

we've taken the local encoding of person one, and we've connected it to a small set

of neurons, actually six neurons for this. And because there are 24 people, it can't

possibly dedicate one neuron to each person.

It has to re-represent the people as patterns of activity over those six

neurons. And what we're hoping is that when it

learns these propositions, the way in which it encodes a person, in that

distributed pattern of activities, will reveal structure in the task, or

structure in the domain. So what we're going to do is we're going

to train it up on 112 of these propositions.

And we go through the 112 propositions many times.

Slowly changing the weights as we go, using back propagation.

And after training we're going to look at the six units in that layer that says

distributed encoding of person one to see what they are doing.

So here are those six units as the big gray blocks.

And I laid out the 24 people, with the twelve English people in a row along the

top, and the twelve Italian people in a row underneath.

So each of these blocks, you'll see, has 24 blobs in it.

And the blobs tell you the incoming weights for one of the hidden units in

that layer. So going back to the previous slide.

If you look at that layer that says distributed encoding of person one,

There are six neurons there. And we're looking at the incoming weights

of each of those six neurons. If you look at the big gray rectangle on

the top right, you'll see an interesting structure to the weights.

The weights along the top that come from English people are all positive.

And the weights along the bottom are all negative.

That means this unit tells you whether the input person is English or Italian.

We never gave it that information explicitly.

But obviously, it's useful information to have in this very simple world.

Because in the family trees that we're learning about, if the input person is

English, the output person is always English.

And so just knowing that someone's English already allows you to predict one bit of

information about the output, which is equivalent to saying it halves the

number of possibilities. If you look at the gray blob immediately

below that, the second one down on the right, you'll see that it has four big

positive weights at the beginning. Those correspond to Christopher and Andrew

or their Italian equivalents. Then it has some smaller weights.

Then it has two big negative weights, that correspond to Colin, or his Italian

equivalent. Then there's four more big positive

weights, corresponding to Penelope or Christine, or their Italian equivalents.

And right at the end, there's two big negative weights, corresponding to

Charlotte, or her Italian equivalent. By now you've probably realized that that

neuron represents what generation somebody is.

It has big positive weights to the oldest generation, big negative weights to the

youngest generation, and intermediate weights which are roughly zero to the

intermediate generation. So that's really a three-value feature,

and it's telling you the generation of the person.

Finally, if you look at the bottom gray rectangle on the left hand side, you'll

see that it has a different structure. Now look at the

negative weights in the top row of that unit.

It has negative weights to Andrew, James, Charles, Christine and Jennifer. And if

you look at the English family tree, you'll see Andrew, James, Charles, Christine, and

Jennifer are all in the right-hand branch of the family tree.

So that unit has learned to represent which branch of the family tree someone is

in. Again, that's a very useful feature to

have for predicting the output person, because if you know it's a close family

relationship, you expect the output to be in the same branch of the family tree as

the input. So the neurons in the bottleneck have

learned to represent features of people that are useful for predicting the answer.

And notice, we didn't tell it anything about what features to use.

We never mentioned things like nationality, branch of the family tree, or generation.

It figured out that those are good features for expressing the regularity in

this domain. Of course, those features are only useful

if the other bottlenecks, the one for relationships, and the one near the top of

the network before the output person, use similar representations.

And the central layer is able to say how the features of the input person and the

features of the relationship predict the features of the output person.

So for example, if the input person is of generation three, and the relationship

requires the output person to be one generation up, then the output person is of

generation two. But notice that to capture that rule, you have

to extract appropriate features at the first hidden layer, and the last hidden

layer of the network. And you have to make the units in the

middle relate those features correctly. Another way to see that the network works

is to train it on all but a few of the triples.

And see if it can complete those triples correctly.

So does it generalize? There are 112 triples, and I trained it

on 108 of them and tested it on the remaining four. I did that several times,

and it got either two or three of those right.

That's not bad for a 24-way choice. It's true it makes mistakes, but it didn't

have much training data; there aren't enough triples in this domain to really

nail down the regularities very well. And it does much better than chance.
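That held-out evaluation is easy to sketch in Python (the function name is mine):

```python
import random

def leave_out_split(triples, n_test=4, seed=0):
    """Hold out n_test triples for testing and train on the rest,
    mirroring the 108/4 split of the 112 propositions."""
    shuffled = triples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]
```
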

If you train it on a much bigger data set, it can generalize from a much smaller

fraction of the data. So if you have thousands and thousands of

relationships you only need to show a small percentage before it can start

guessing the other ones correctly. That research was done in the 1980s, and

was a way of showing that back-propagation could learn interesting features.

And it was a toy example. Now we have much bigger computers, and we

have databases of millions of relational facts.

Many of which are of the form (A, R, B), meaning A has relationship R to B. We could imagine

training a net to discover feature vector representations of A and R, that allow it

to predict the feature vector representation of B.

If we did that, it would be a very good way of cleaning a database.

It wouldn't necessarily be able to make perfect predictions.

But it could find things in the database that it thought were highly implausible.

So if the database contained information like, for example, Bach was born in 1902,

it could probably realize that was wrong, because Bach is a much older kind of person,

and everything else he's related to is much older than 1902.

Instead of actually using the first two terms to predict the third term, we could

use the whole set of terms, three of them in this case, but possibly more, and

predict the probability that the fact is correct.

To train a net to do that, we'd need examples of a whole bunch of correct

facts, and we'd ask it to give a high output.

We'd also need a good source of incorrect facts, and we'd ask it to give a low

output when it's shown something that's false.
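A hedged sketch of such a triple-scoring network in Python/NumPy. The lecture doesn't specify an architecture, so the layer sizes, the tanh hidden layer, and the corruption scheme for generating incorrect facts are all my illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# All sizes here are illustrative; the lecture specifies no architecture
# for the triple-scoring network.
N_ENT, N_RELS, N_EMB, N_HID = 1000, 12, 8, 16

E = rng.normal(0.0, 0.1, (N_ENT, N_EMB))    # feature vectors for entities (A and B)
R = rng.normal(0.0, 0.1, (N_RELS, N_EMB))   # feature vectors for relationships
W1 = rng.normal(0.0, 0.1, (3 * N_EMB, N_HID))
w2 = rng.normal(0.0, 0.1, N_HID)

def plausibility(a, r, b):
    """Score a whole (A, R, B) triple: probability that the fact is correct."""
    x = np.concatenate([E[a], R[r], E[b]])
    h = np.tanh(x @ W1)
    return 1.0 / (1.0 + np.exp(-(h @ w2)))

def corrupt(a, r, b):
    """Make an incorrect fact (a negative example) by swapping in a random B."""
    wrong = int(rng.integers(N_ENT))
    while wrong == b:
        wrong = int(rng.integers(N_ENT))
    return a, r, wrong
```

Training would push plausibility toward 1 on correct facts and toward 0 on corrupted ones, giving a network that can flag highly implausible database entries.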