Now, those two models are essentially symmetric, and they learn similar representations up to some minor changes.

So the general idea stays the same.

And you can, again, use either of those two matrices as your word-embedding matrix.

For example, in the word-to-context model, only one row of the first matrix was used for each training sample.

So basically, you can treat that row as the vector corresponding to the word, and use the whole matrix as your word embedding.
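As a small sketch of this idea, the first weight matrix doubles as a lookup table: the vector for a word is just the row at that word's index. The vocabulary and the numbers below are made up for illustration; a real model would learn them.

```python
# Hypothetical vocabulary, mapping each word to a row index.
vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3}

# Pretend this 4x3 matrix was learned as the model's first weight matrix:
# one row per vocabulary word (real models use 100-300 dimensions).
W = [
    [0.9, 0.1, 0.4],  # king
    [0.8, 0.9, 0.4],  # queen
    [0.9, 0.1, 0.0],  # man
    [0.8, 0.9, 0.0],  # woman
]

def embed(word):
    """Look up a word's embedding: just the corresponding row of W."""
    return W[vocab[word]]

print(embed("king"))  # -> [0.9, 0.1, 0.4]
```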

If you train this model yourself, or if you use a pretrained one, you'll actually notice that it has a lot of peculiar properties on top of what we actually wanted it to have.

So of course, it does what we actually trained it for: it learns similar vectors for synonyms, and different vectors for semantically different words.

But there's also a very peculiar effect, a kind of linear word algebra.

For example, if you take the vector of king, subtract from it the vector of man, and add the vector of woman, you get something very close to the vector of queen.

So, king minus man plus woman equals queen.

Or another example, you could take moscow minus russia plus france equals paris.
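This analogy search can be sketched with cosine similarity: compute king minus man plus woman, then find the nearest word vector to the result. The 3-d vectors below are toy, made-up numbers chosen so the analogy works; trained embeddings are much higher-dimensional and noisier.

```python
import math

# Hypothetical toy embeddings (real models use 100-300 dimensions).
vectors = {
    "king":  [0.9, 0.1, 0.8],
    "queen": [0.9, 0.9, 0.8],
    "man":   [0.7, 0.1, 0.1],
    "woman": [0.7, 0.9, 0.1],
    "paris": [0.1, 0.5, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# king - man + woman, component by component.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest word to the result, excluding the query words themselves.
best = max(
    (w for w in vectors if w not in {"king", "man", "woman"}),
    key=lambda w: cosine(target, vectors[w]),
)
print(best)  # with these toy numbers, prints "queen"
```

Excluding the query words matters in practice: the raw result of the arithmetic usually stays closest to one of the input vectors themselves.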

And these relations kind of make sense, although they're underdefined in strict mathematical terms.

And this is just a side effect of how the model is trained.

So, again, as with other models we've studied previously, this is not a desired, originally intended effect.

But it's very interesting, and sometimes it's even helpful for applications of these word-embedding models.

Now, if you visualize those word vectors, for example by taking the first two principal components of your trained word vectors, you'll also see that this linear algebra carries over nicely into the structure of the embedding space.

For example, in many cases, you may expect a kind of similar direction

vector connecting all countries to their corresponding capitals.

Or all male profession names to the corresponding female profession names.

So there's a lot of those peculiar properties.
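One way to check the "same direction" claim numerically is to compare the offset vectors themselves: if country-to-capital pairs share a direction, their offsets should have high cosine similarity. The 2-d coordinates below are hypothetical stand-ins for the first two principal components of trained embeddings.

```python
import math

# Hypothetical 2-d coordinates, standing in for the first two
# principal components of trained word vectors.
pc = {
    "russia": [0.2, 0.1], "moscow": [0.9, 0.6],
    "france": [0.3, 0.3], "paris":  [0.95, 0.85],
}

def offset(a, b):
    """Vector pointing from word b to word a."""
    return [x - y for x, y in zip(pc[a], pc[b])]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Country -> capital offsets point in nearly the same direction,
# so their cosine similarity is close to 1.
sim = cosine(offset("moscow", "russia"), offset("paris", "france"))
print(round(sim, 3))
```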

Of course, you cannot expect these properties to hold 100% of the time.

So sometimes you get the desired effect, sometimes you just get rubbish.

And of course, the model isn't strictly constrained to preserve these exact distances; these peculiar structures just emerge from the way it trains.

And this coincides with the observation that, for example, autoencoders and other unsupervised learning methods also end up with a lot of unexpected properties.

So hopefully by now I've managed to convince you that having those word vectors around is really convenient, or at least cool, because they have all those nice properties.

It's later going to turn out those word vectors are really crucial for

some other deep learning applications to natural language processing,

like recurrent neural networks.

But before we cover that, let's actually find out how to train them, how to obtain those vectors, so we can start collecting the benefits from them.