Let's start to formalize the problem of learning a good word embedding.

When you implement an algorithm to learn a word embedding,

what you end up learning is an embedding matrix.

Let's take a look at what that means.

Let's say, as usual we're using our 10,000-word vocabulary.

So, the vocabulary has a, Aaron,

Orange, Zulu, and maybe also an unknown word as a token.

What we're going to do is learn embedding matrix E,

which is going to be a 300 by 10,000 dimensional matrix,

if you have a 10,000-word vocabulary, or maybe 10,001 if the unknown word is also a token,

so there's one extra token.

And the columns of this matrix would be the different embeddings

for the 10,000 different words you have in your vocabulary.

So, Orange was word number 6257 in our vocabulary of 10,000 words.

So, one piece of notation we'll use is that o_6257 is

the one-hot vector with zeros everywhere and a one in position 6257.

And so, this will be a 10,000-dimensional vector with a one in just one position.

So, this isn't quite drawn to scale.

If it were, this vector would be as tall as the embedding matrix on the left is wide.

And if the embedding matrix is called capital E, then notice that if you

take E and multiply it by this one-hot vector o_6257,

then this will be a 300-dimensional vector.

So, E is 300 by 10,000 and o_6257 is 10,000 by 1.

So, the product will be 300 by 1,

so a 300-dimensional vector, and notice that

to compute the first element of this vector,

of this 300-dimensional vector,

what you do is you will multiply the first row of the matrix E with this.

But all of these elements are zero except for

element 6257 and so you end up with zero times this,

zero times this, zero times this, and so on.

And then, 1 times whatever this is,

and zero times this, zero times this, zero times and so on.

And so, you end up with the first element being whatever that element is up there,

under the Orange column.

And then, for the second element of this 300-dimensional vector we're computing,

you would take the vector

o_6257 and multiply it by the second row of the matrix E. So again,

you have zero times this,

plus zero times this,

plus zero times all of these elements, and then one times this,

and then zero times everything else and add that together.

So you end up with this and so on as you go down the rest of this column.

So, that's why the embedding matrix E times this one-hot vector here winds up

selecting out this 300-dimensional column corresponding to the word Orange.

So, this is going to be equal to e_6257, which

is the notation we're going to use to represent

the embedding vector, the 300 by 1 dimensional vector, for the word Orange.
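To make this concrete, here is a small sketch in NumPy (random values stand in for a learned embedding matrix, and the word index 6257 is the one from the example), showing that multiplying E by a one-hot vector picks out the corresponding column:

```python
import numpy as np

# Dimensions from the example: 300-dimensional embeddings, 10,000-word vocabulary.
n_dim, vocab_size = 300, 10_000

# Random placeholder values stand in for a learned embedding matrix E.
rng = np.random.default_rng(0)
E = rng.standard_normal((n_dim, vocab_size))

# o_6257: one-hot vector with zeros everywhere and a one in position 6257.
o_6257 = np.zeros(vocab_size)
o_6257[6257] = 1.0

# E (300 x 10,000) times o_6257 (10,000 x 1) gives a 300-dimensional vector...
e_orange = E @ o_6257

# ...which is exactly column 6257 of E, the embedding for the word Orange.
assert e_orange.shape == (n_dim,)
assert np.allclose(e_orange, E[:, 6257])
```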

And more generally, e_w for a specific word w,

this is going to be the embedding for the word w. And more generally,

E times o_j,

the one-hot vector with a one in position j,

this is going to be e_j, and that's going to be

the embedding for word j in the vocabulary.

So, the thing to remember from this slide is that

our goal will be to learn an embedding matrix E and what you see in the next video

is that you initialize E randomly and use gradient descent to learn

all the parameters of this 300 by

10,000 dimensional matrix, and E times this one-hot vector gives you the embedding vector.

Now just one note,

when we're writing the equation,

it'll be convenient to write this type of notation where you take

the matrix E and multiply it by the one-hot vector O.

But when you're implementing this,

it is not efficient to actually implement it as

a matrix-vector multiplication, because the one-hot vector

is a relatively high-dimensional vector and most of its elements are zero.

So, it's actually not efficient to use

a matrix-vector multiplication to implement this, because

you'd be multiplying a whole bunch of things by zero. So in practice,

you would actually use a specialized function to just look up

a column of the matrix E rather than do this with a matrix multiplication.

But when writing out the math, it is just convenient to write it this way.

So, in Keras for example, there is

an Embedding layer, and if you use the Embedding layer, it

more efficiently just pulls out the column you want from the embedding matrix, rather

than doing it with a much slower matrix-vector multiplication.

So, in this video you saw the notation we use to

describe algorithms for learning these embeddings, and

the key object is

this matrix capital E, which contains all the embeddings for the words of the vocabulary.

In the next video, we'll start to talk about specific algorithms

for learning this matrix E. Let's go on to the next video.