Last video, you learned about word vectors and you saw that they can capture important properties of words. This week, you will make use of word vectors to learn to align words in two different languages, which will give you your first basic translation program. Later, I'll teach you locality sensitive hashing to speed it up. Let's first get an overview of machine translation by starting with an example of English to French translation. In order to translate an English word to a French word. One way would be to generate an extensive list of English words and their associated French word. If you ask the human to do this, you would find someone who knows both languages to start making a list. If you want a machine to learn how to do this, you would calculate word embeddings associated with English and word embeddings associated with French. Next retrieve the English word embedding of a particular English word such as cat. Then find some way to transform the English word embedding into word embedding that has the meaning in the French word vector space. We will see how to convert from the English word vector space to the French word vector space in the moment. Next, you'll take the transformed word vector and search for word vectors and the French word vector space that are most similar to it. The most similar words are candidates words for your translation. If your machine does a good job, it may find the word chat which is the French word for cat. You want to find the matrix that can do this transformation for you. So let's talk about transforming vectors using matrices. To try this out for yourself, you can write the following code which is also in the notebook associated with this lecture. Define the matrix R, then defined the vector x, multiply x by R using numpy dot and the result is another two dimensional vector which is what you saw earlier. Please try this out for yourself. Now that we know that there can be matrix that transforms our English word vectors into relevant French word vectors. How do we define this transformation matrix which will denote as R? We can start with a randomly selected matrix R and then see how it performs. When you try to translate the English vectors in matrix X and compare that to the actual French word vectors which is in the matrix Y. In order for this to work, you will first need to get a subset of English words and their French equivalents. Get their respective word vectors and stack the word vectors in the respective matrices, X and Y. The key here is to keep the roads lined up, or to align the word vectors. This means that if the first row of matrix X contains the word cat, then the first row of the matrix Y should contain the French word for cat which is chat. Now you may be asking wait a minute. If I already have the English words and their French translations, why do I need to train a model to do this? Why not just save this information in a key value mapping like a python dictionary? Well, the nice thing is that you can just collect a subset of these words to find your transformation matrix. And if it works well, then the model can be used to translate words that are not part of your original training set. So you only need to train on a subset of the English French vocabulary and not the entire vocabulary. So let's see how to find the good matrix R. First we compare the translation X times are with the actual French word embeddings in Y. We do this by taking the X matrix times the R matrix and subtracting the Y matrix. I'll explain in more detail what this expression means and also what this capital F subscript means. But for now think of it as a measure of how far apart the attempted translation and the actual French vectors are. If you start with a random matrix R, you can gradually improve these matrix R in a loop. First compute the gradient by taking the derivative of this loss function with respect to the matrix R. Next update the matrix R by subtracting the gradient, but note that it's the gradient weighted by the learning rate alpha. You can either pick a fixed number of times to go through the loop or check the loss at each iteration. And break out of the loop when the loss falls between a certain threshold. Now let's explain what this notation, what the double vertical lines means. This is measuring the magnitude or the norm of a matrix. Let's see an example of calculating this norm, and then see the general formula. Let's say that the results of X times R minus Y is a matrix. We'll pretend for this example that there's only two words in this dictionary, which is the number of rows in the matrix. And the word embeddings have two dimensions. So that's the number of columns in the matrix. So matrices X, R, Y and A are all 2 by 2 matrices. If the matrix A looks like this, then to calculate its norm, we take two squared plus two squared plus two squared plus two squared, then take square roots. This gives us four. Here is the actual formula. You just take all the elements in the matrix, square them and add them up. This norm has the subscript F. Because this is called the frobenius norm. Now let's calculate the frobenius norm in code. You start with a matrix A, use numpy dot square to square all the elements. Then use numpy dot some, then numpy dot square roots to get four. Try it out for yourself. Know that in practice it's easier to minimize the square of the frobenius norm. In other words, we can cancel out the square roots by taking the square. I'll explain in a moment why it's easier to work with the square of the norm. If we go back to our example with matrix A, the squared frobenius norm is just two squared plus two squared plus two squared plus two squared. Then we take the square root, but then cancel it out by squaring that some. So the square of the frobenius norm is 16. Now let's also go into detail on how you can calculate the gradient of the loss function. The loss is defined as the square of the frobenius norm, the gradient is a derivative, the laws with respect to the matrix R. If it looks like this, the scalar, M is the number of rows or words in the subset that we're using for training. If you remember from calculus, this may look familiar to you if you pretend that R as a single variable instead of a matrix. And x and y are constants. If you don't recognize this, don't worry, you won't need to know calculus here. You'll be able to look up these derivatives online if you ever need to. Now going back to Y, it helps to use the square of the frobenius norm. It's easier to take the derivative of this expression, rather than dealing with the square roots that's in the frobenius norm. You'll implement this formula in the assignment. You have seen how with just one matrix, you can learn to align word vectors from one language to another. We will next look at the nearest neighbors.