So we use y to represent the target word.

And we use a one-hot representation for y hat and y here.

Then the loss would be the negative log likelihood, so negative sum from i equals 1 to 10,000 of yi log yi hat.

So that's a usual loss for softmax where

we're representing the target y as a one hot vector.

So this would be a one-hot vector with just one 1 and the rest zeros.

And if the target word is juice, then it'd be element 4834 from up here that is equal to 1, and the rest will be equal to 0.
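As a quick sketch in NumPy (the index 4834 for juice comes from the lecture; the uniform y hat is just a dummy value for illustration), the negative log likelihood with a one-hot target reduces to minus the log of the predicted probability of that one target word:

```python
import numpy as np

VOCAB_SIZE = 10_000

def softmax_loss(y_hat, target_index):
    """L(y_hat, y) = -sum_i y_i * log(y_hat_i); with a one-hot y,
    only the target term survives."""
    y = np.zeros(VOCAB_SIZE)
    y[target_index] = 1.0                 # one-hot target, e.g. 4834 for "juice"
    return -np.sum(y * np.log(y_hat))

# A dummy softmax output (uniform probabilities) just for illustration.
y_hat = np.full(VOCAB_SIZE, 1.0 / VOCAB_SIZE)
loss = softmax_loss(y_hat, target_index=4834)
assert np.isclose(loss, -np.log(y_hat[4834]))   # same as -log p(target)
```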

And similarly, y hat will be a 10,000-dimensional vector output by the softmax unit, with probabilities for all 10,000 possible target words.

So to summarize, this is the overall little model, a little neural network that basically looks up the embedding and then feeds it into a softmax unit.

And the matrix E will have a lot of parameters, so the matrix E has parameters

corresponding to all of these embedding vectors, E subscript C.

And then the softmax unit also has parameters, the theta t parameters. But if you optimize this loss function with respect to all of these parameters, you actually get a pretty good set of embedding vectors.
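A minimal sketch of this little network (the 10,000-word vocab is from the lecture; the 300-dimensional embedding size and the word index 6257 for orange are assumptions for illustration):

```python
import numpy as np

VOCAB, DIM = 10_000, 300
rng = np.random.default_rng(0)

E = rng.standard_normal((DIM, VOCAB)) * 0.01      # embedding matrix, one column e_c per word
theta = rng.standard_normal((VOCAB, DIM)) * 0.01  # softmax parameters theta_t, one row per target word

def predict(context_index):
    e_c = E[:, context_index]            # look up the embedding for the context word
    logits = theta @ e_c                 # theta_t^T e_c for every candidate target t
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()               # p(t | c) over all 10,000 words

y_hat = predict(6257)                    # 6257: hypothetical index for "orange"
assert y_hat.shape == (VOCAB,) and np.isclose(y_hat.sum(), 1.0)
```

Training would then adjust both E and theta by gradient descent on the loss above; the sketch only shows the forward pass.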

So this is called the skip-gram model because it takes as input one word like orange and then tries to predict some word skipping a few words to the left or the right, that is, to predict what comes a little bit before or a little bit after the context word.

Now, it turns out there are a couple problems with using this algorithm.

And the primary problem is computational speed.

In particular, for the softmax model, every time you want to evaluate this

probability, you need to carry out a sum over all 10,000 words in your vocabulary.

And maybe 10,000 isn't too bad, but

if you're using a vocabulary of size 100,000 or 1,000,000,

it gets really slow to sum up over this denominator every single time.

And, in fact, even 10,000 is already quite slow, and it makes it even harder to scale to larger vocabularies.
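You can see the scaling directly: each probability evaluation sums over the whole vocabulary in the softmax denominator, so the cost grows linearly with vocab size. A rough timing sketch (absolute numbers will vary by machine):

```python
import time
import numpy as np

def softmax(logits):
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()        # this sum runs over the *entire* vocabulary

rng = np.random.default_rng(0)
for V in (10_000, 100_000, 1_000_000):
    logits = rng.standard_normal(V)
    start = time.perf_counter()
    for _ in range(20):           # 20 evaluations, as in 20 training examples
        p = softmax(logits)
    elapsed = time.perf_counter() - start
    print(f"|V| = {V:>9,}: {elapsed / 20 * 1e3:.3f} ms per evaluation")
```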

So there are a few solutions to this, one which you see in the literature

is to use a hierarchical softmax classifier.

And what that means is,

instead of trying to categorize something into all 10,000 categories in one go, imagine you have one classifier that tells you: is the target word in the first 5,000 words of the vocabulary, or is it in the second 5,000 words of the vocabulary?

And let's say this binary classifier tells you it's in the first 5,000 words; then a second classifier tells you whether it's in the first 2,500 words of the vocab or in the second 2,500 words of the vocab, and so on.

Until eventually you get down to classifying exactly what word it is at the leaf of this tree. So having a tree of classifiers like this means that each of the internal nodes of the tree can be just a binary classifier. And so you don't need to sum over all 10,000 words in the vocab in order to make a single classification.

In fact, the computational cost of classifying with a tree like this scales like log of the vocab size rather than linearly in the vocab size.
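A toy sketch of the balanced-tree idea (not the lecture's code): each binary decision halves the remaining range of word indices, so reaching one word out of 10,000 takes only about log2(10,000), roughly 13-14 decisions, instead of a 10,000-term sum:

```python
VOCAB_SIZE = 10_000

def tree_path(target, lo=0, hi=VOCAB_SIZE):
    """Sequence of binary decisions (0 = left half, 1 = right half)
    a balanced hierarchical softmax would make to reach `target`."""
    path = []
    while hi - lo > 1:
        mid = (lo + hi) // 2      # first split: words 0-4,999 vs 5,000-9,999
        if target < mid:
            path.append(0)
            hi = mid
        else:
            path.append(1)
            lo = mid
    return path

path = tree_path(4834)            # 4834: the lecture's index for "juice"
# Depth is ~log2(10,000): around 14 binary classifiers, not 10,000 terms.
assert len(path) <= 14
```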

So this is called a hierarchical softmax classifier.

I should mention in practice, the hierarchical softmax classifier doesn't

use a perfectly balanced tree or this perfectly symmetric tree,

with equal numbers of words on the left and right sides of each branch.

In practice, the hierarchical softmax classifier can be developed so

that the common words tend to be on top,

whereas the less common words like durian can be buried much deeper in the tree.

Because you see the more common words more often, and so

you might need only a few traversals to get to common words like the and of.

Whereas you see less frequent words like durian much less often, so it's okay for them to be buried deep in the tree, because you don't need to go that deep very often.

So there are various heuristics for building the tree used in the hierarchical softmax classifier.