In this video, you'll start to learn

some concrete algorithms for learning word embeddings.

In the history of deep learning as applied to learning word embeddings,

people actually started off with relatively complex algorithms.

And then over time,

researchers discovered they can use

simpler and simpler and simpler algorithms and still

get very good results, especially for a large dataset.

But what happened is,

some of the algorithms that are most popular today,

they are so simple that if I present them first,

it might seem almost a little bit magical,

how can something this simple work?

So, what I'm going to do is start off with some of

the slightly more complex algorithms because I think it's actually

easier to develop intuition about why they should work,

and then we'll move on to simplify these algorithms and show

you some of the simple algorithms that also give very good results.

So, let's get started.

Let's say you're building

a language model and you do it with a neural network.

So, during training, you might want your neural network to do something like input,

I want a glass of orange,

and then predict the next word in the sequence.

And below each of these words,

I have also written down the index in the vocabulary of the different words.

So it turns out that building

a neural language model is a reasonable way to learn a set of embeddings.

And the ideas I present on this slide were due to Yoshua Bengio,

Réjean Ducharme, Pascal Vincent, and Christian Jauvin.

So, here's how you can build a neural network to predict the next word in the sequence.

Let me take the list of words,

I want a glass of orange,

and let's start with the first word I.

So I'm going to construct a one-hot vector corresponding to the word I.

So this is a one-hot vector with a one in position 4343.

So this is going to be a 10,000 dimensional vector.

And what we're going to do is then have a matrix of parameters E,

and take E times o4343 to get an embedding vector e4343,

and this step really means that

e4343 is obtained by multiplying the matrix E by the one-hot vector o4343.

And then we'll do the same for all of the other words.

So the word want is at index 9665, so we take its one-hot vector

and multiply it by E to get the embedding vector.

And similarly, for all the other words.

A is the first word in the dictionary,

since alphabetically it comes first, so there is o1, which gives e1.

And similarly, for the other words in this phrase.

So now you have a bunch of 300 dimensional embeddings,

so each of these is a 300 dimensional embedding vector.
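
The lookup described above can be sketched in a few lines of NumPy. This is only an illustration: the matrix E is filled with random numbers rather than learned values, the helper name `one_hot` is made up for this sketch, and index 4343 is treated as a 0-based array position, which is an assumption about how the vocabulary is numbered.

```python
import numpy as np

VOCAB_SIZE = 10_000   # vocabulary size used in the lecture
EMBED_DIM = 300       # embedding dimension used in the lecture

rng = np.random.default_rng(0)
# E holds one 300-dimensional column per vocabulary word.
# Random values stand in for the learned parameters.
E = rng.normal(scale=0.01, size=(EMBED_DIM, VOCAB_SIZE))

def one_hot(index, size=VOCAB_SIZE):
    """Return a vector of zeros with a single 1 at `index` (hypothetical helper)."""
    o = np.zeros(size)
    o[index] = 1.0
    return o

i_index = 4343              # index of the word "I" in the lecture's vocabulary
o_4343 = one_hot(i_index)
e_4343 = E @ o_4343         # matrix times one-hot vector

# The product simply selects one column of E, so a direct lookup matches it.
assert np.allclose(e_4343, E[:, i_index])
print(e_4343.shape)  # (300,)
```

Because the product only picks out a column, practical implementations usually index into E directly instead of building the 10,000 dimensional one-hot vector.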

And what we can do,

is feed all of them into a neural network. So here is the neural network layer.

And then this neural network feeds to a softmax,

which has its own parameters as well.

And a softmax classifies among the 10,000

possible outputs in the vocab for the final word we're trying to predict.

And so, if in the training set we saw the word juice, then

the target for the softmax in training would be that it should predict

that the word juice was what came after this.

So this hidden layer here will have its own parameters.

So I'm going to call these W1, and there's also b1.

The softmax layer has its own parameters W2, b2,

and if you're using 300 dimensional word embeddings,

then here we have six words.

So, this would be six times 300.

So this layer or this input will be a 1,800 dimensional

vector obtained by taking your six embedding vectors and stacking them together.
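
Here is a rough sketch of that forward pass, under several assumptions that the lecture does not specify: the hidden layer size (128), the tanh activation, and random parameter values. The indices for "I", "want", and "a" are the ones mentioned above; the last three indices are placeholders, not the real vocabulary positions.

```python
import numpy as np

VOCAB_SIZE, EMBED_DIM, HIDDEN = 10_000, 300, 128  # HIDDEN is an assumed size
rng = np.random.default_rng(0)

E = rng.normal(scale=0.01, size=(EMBED_DIM, VOCAB_SIZE))    # embedding matrix
W1 = rng.normal(scale=0.01, size=(HIDDEN, 6 * EMBED_DIM))   # hidden layer weights
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.01, size=(VOCAB_SIZE, HIDDEN))      # softmax weights
b2 = np.zeros(VOCAB_SIZE)

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def predict_next(word_ids):
    """Return a probability distribution over the vocabulary for the next word."""
    # Stack the six embeddings into one input vector: 6 * 300 = 1,800 numbers.
    x = np.concatenate([E[:, i] for i in word_ids])
    h = np.tanh(W1 @ x + b1)         # hidden layer (tanh is an assumption)
    return softmax(W2 @ h + b2)      # one probability per vocabulary word

# "I", "want", "a" use the lecture's indices; the last three are placeholders.
context = [4343, 9665, 1, 100, 200, 300]
probs = predict_next(context)
print(probs.shape)  # (10000,)
```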

Well, what's actually more commonly done is to have a fixed historical window.

So for example, you might decide that you always want to predict

the next word given say the previous four words,

where four here is a hyperparameter of the algorithm.

So this is how you adjust to

either very long or very short sentences: you decide to

always just look at the previous four words,

so you say, I will still use those four words.

And so, let's just get rid of these.

And so, if you're always using a four word history,

this means that your neural network will input a 1,200 dimensional feature vector,

go into this layer,

then have a softmax and try to predict the output.

And again, there are a variety of choices.

And using a fixed history just means that you can deal with even arbitrarily

long sentences because the input sizes are always fixed.
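
As a small illustration of the fixed window, the sketch below slides a four-word history over a sentence and produces (context, target) training examples. The function name `window_examples` is made up for this sketch. With 300 dimensional embeddings, each four-word context becomes a 4 × 300 = 1,200 dimensional input no matter how long the sentence is.

```python
def window_examples(tokens, history=4):
    """Yield (context, target) pairs where context is the previous `history` words."""
    for t in range(history, len(tokens)):
        yield tokens[t - history:t], tokens[t]

sentence = "I want a glass of orange juice".split()
examples = list(window_examples(sentence))
for context, target in examples:
    print(context, "->", target)
# ['I', 'want', 'a', 'glass'] -> of
# ['want', 'a', 'glass', 'of'] -> orange
# ['a', 'glass', 'of', 'orange'] -> juice
```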

So, the parameters of this model will be this matrix E,

and you use the same matrix E for all the words.

So you don't have different matrices for

different positions in the preceding four words,

it's the same matrix E. And then,

these weights are also parameters of the algorithm

and you can use backprop to perform gradient descent

to maximize the likelihood of

your training set, to just repeatedly predict, given four words in a sequence,

what is the next word in your text corpus?
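
To make the training step concrete, here is a toy sketch of gradient descent on this model with hand-written backprop. Maximizing the likelihood is done by minimizing the negative log probability of the correct next word. The sizes (8 dimensional embeddings, 16 hidden units), the tanh activation, the learning rate, and the one-sentence corpus are all assumptions chosen to keep the example small; they are not values from the lecture.

```python
import numpy as np

tokens = "I want a glass of orange juice to go along with my cereal".split()
vocab = sorted(set(tokens))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = [word_to_id[w] for w in tokens]

V, D, H, HISTORY = len(vocab), 8, 16, 4   # toy sizes, assumed for illustration
rng = np.random.default_rng(0)
params = {
    "E": rng.normal(scale=0.1, size=(D, V)),             # shared embedding matrix
    "W1": rng.normal(scale=0.1, size=(H, HISTORY * D)),  # hidden layer
    "b1": np.zeros(H),
    "W2": rng.normal(scale=0.1, size=(V, H)),            # softmax layer
    "b2": np.zeros(V),
}

def train_step(context, target, lr=0.1):
    """One gradient descent step on -log P(target | context); returns the loss."""
    E, W1, b1, W2, b2 = (params[k] for k in ("E", "W1", "b1", "W2", "b2"))
    # Forward pass: look up embeddings, stack them, hidden layer, softmax.
    x = np.concatenate([E[:, i] for i in context])
    h = np.tanh(W1 @ x + b1)
    z = W2 @ h + b2
    p = np.exp(z - z.max())
    p /= p.sum()
    loss = -np.log(p[target])
    # Backward pass: compute all gradients before changing any parameter.
    dz = p.copy()
    dz[target] -= 1.0              # gradient of the loss w.r.t. the logits
    dh = W2.T @ dz
    da = dh * (1.0 - h ** 2)       # derivative of tanh
    dx = W1.T @ da                 # gradient w.r.t. the stacked embeddings
    # In-place parameter updates.
    W2 -= lr * np.outer(dz, h)
    b2 -= lr * dz
    W1 -= lr * np.outer(da, x)
    b1 -= lr * da
    for k, i in enumerate(context):
        # The same matrix E is updated for every position in the window.
        E[:, i] -= lr * dx[k * D:(k + 1) * D]
    return loss

examples = [(ids[t - HISTORY:t], ids[t]) for t in range(HISTORY, len(ids))]
losses = []
for epoch in range(200):
    losses.append(np.mean([train_step(c, t) for c, t in examples]))
print(f"mean loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
```

In practice a deep learning framework would compute these gradients automatically; the manual version is only meant to show that E is trained together with the other weights.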

And it turns out that this algorithm will learn pretty decent word embeddings.

And the reason is, if you remember our orange juice, apple juice example,

it's in the algorithm's interest to learn

pretty similar word embeddings for orange and apple

because doing so allows it to fit

the training set better because it's going to see orange juice sometimes,

or see apple juice sometimes, and so,

if you have only a 300 dimensional feature vector to represent all of these words,

the algorithm will find that it fits the training set best

if apples, oranges, and grapes, and pears,

and so on, and maybe also durians, which is

a very rare fruit, end up with similar feature vectors.

So, this is one of the earlier and pretty successful algorithms

for learning word embeddings,

for learning this matrix E. But now let's

generalize this algorithm and see how we can derive even simpler algorithms.

So, I want to illustrate the other algorithms

using a more complex sentence as our example.

Let's say that in your training set,

you have this longer sentence,

I want a glass of orange juice to go along with my cereal.

So, what we saw on the last slide was

that the job of the algorithm was to predict some word juice,

which we are going to call the target word,

and it was given some context which was the last four words.

And so, if your goal is to learn

a word embedding, researchers have experimented with many different types of context.

If your goal is to build a language model, then it is

natural for the context to be a few words right before the target word.

But if your goal isn't to learn the language model per se,

then you can choose other contexts.

For example, you can pose a learning problem

where the context is the four words on the left and right.

So, you can take the four words on the left and right as the context,

and what that means is that we're posing a learning problem

where the algorithm is given four words on the left,

so, a glass of orange,

and four words on the right,

to go along with,

and it has to predict the word in the middle.

And posing a learning problem like this where you have the embeddings of

the left four words and the right four words feed into a neural network,

similar to what you saw in the previous slide,

to try to predict the word in the middle,

try to predict the target word in the middle,

this can also be used to learn word embeddings.
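
The left-and-right context can be extracted with a short helper like the one below. The function name `surrounding_context` is made up for this sketch; the window width of four is the value used in the example above.

```python
def surrounding_context(tokens, position, width=4):
    """Return up to `width` words on each side of tokens[position]."""
    left = tokens[max(0, position - width):position]
    right = tokens[position + 1:position + 1 + width]
    return left, right

sentence = "I want a glass of orange juice to go along with my cereal".split()
target_pos = sentence.index("juice")
left, right = surrounding_context(sentence, target_pos)
print(left)   # ['a', 'glass', 'of', 'orange']
print(right)  # ['to', 'go', 'along', 'with']
```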

Or if you want to use a simpler context,

maybe you'll just use the last one word.

So given just the word orange,

what comes after orange?

So this will be a different learning problem where you tell it one word,

orange, and say, well,

what do you think is the next word?

And you can construct a neural network that just feeds in

the one previous word, or the embedding

of the one previous word, to a neural network as you try to predict the next word.

Or, one thing that works surprisingly well is to take a nearby one word.

So I might tell you that, well,

the word glass

is somewhere close by.

So I might say, I saw

the word glass and then there's another word somewhere close to glass,

what do you think that word is?

So, that'll be using a nearby one word as the context.

And we'll formalize this in the next video but this is the idea of a Skip-Gram model,

and it's just an example of a simpler algorithm where the context is now much simpler,

it's just one word rather than four words,

but this works remarkably well.
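
As a rough sketch of the nearby-one-word idea, the code below pairs each word with one randomly chosen word close to it. The window of five positions on either side, the fixed random seed, and the function name `nearby_word_pairs` are assumptions for illustration; the precise setup is the subject of the next video.

```python
import random

def nearby_word_pairs(tokens, window=5, seed=0):
    """For each context word, sample one target word within `window` positions."""
    rng = random.Random(seed)
    pairs = []
    for c, context in enumerate(tokens):
        lo = max(0, c - window)
        hi = min(len(tokens) - 1, c + window)
        # Any position in the window except the context word itself.
        candidates = [t for t in range(lo, hi + 1) if t != c]
        pairs.append((context, tokens[rng.choice(candidates)]))
    return pairs

sentence = "I want a glass of orange juice to go along with my cereal".split()
pairs = nearby_word_pairs(sentence)
for context, target in pairs[:3]:
    print(context, "->", target)
```

Each pair is then a training example where the model sees only the context word and has to predict the sampled nearby word.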

So what researchers found was that if you really want to build a language model,

it's natural to use the last few words as a context.

But if your main goal is really to learn a word embedding,

then you can use all of these other contexts and they will

result in very meaningful word embeddings as well.

I will formalize the details of

this in the next video where we talk about the Word2Vec model.

To summarize, in this video you saw how the language modeling problem,

which poses a machine learning problem where you input

the context, like the last four words, and predict some target word,

how posing that problem allows you to learn a good word embedding.

In the next video,

you'll see how using even simpler context and

even simpler learning algorithms to map from context to target word,

can also allow you to learn a good word embedding.

Let's go on to the next video where we'll discuss Word2Vec.