Welcome. This week, I will talk about long short-term memory cells, which we call LSTMs. To understand why they are important, let me explain the vanishing gradient problem, so let's dive in. >> I'll introduce you to vanishing and exploding gradients, a problem common to RNNs, and then demonstrate a few ways to handle them. Let's begin with a discussion of some of the pros and cons of using a recurrent neural network. For one, the way plain or vanilla RNNs model sequences, by recalling information from the immediate past, allows you to capture dependencies, to a certain degree at least. They're also relatively lightweight compared to models like n-grams, taking up less RAM and disk space. But there are downsides. The RNN's architecture, optimized for recalling the immediate past, causes it to struggle with longer sequences. And the RNN's method of propagating information is part of how vanishing and exploding gradients arise, both of which can cause your model training to fail. Vanishing and exploding gradients are a problem that can arise because RNNs propagate information from the beginning of the sequence through to the end. Starting with the first word of the sequence, the first hidden values are computed at the far left. The network then propagates some of the computed information, takes the second word in the sequence, and computes new values. You can see that process illustrated here: the orange area denotes the values computed at the first step, and the green denotes the second word. The second set of values is computed using the older values in orange and the new word in green. After that, the network takes the third word, propagates the values computed from the first and second words, and computes another set of values from both of those, and it continues in a similar way from there. At the final step, the computations contain information from all the words in the sequence, and the RNN is able to predict the next word, which in this example is "goal".
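The step-by-step propagation just described can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's own code: the sizes, the random weights, and the tanh activation are all assumptions chosen for clarity.

```python
import numpy as np

# Minimal sketch of a vanilla RNN forward pass (sizes, weights, and
# activation are illustrative assumptions, not from the lecture).
rng = np.random.default_rng(0)
hidden_size, embed_size, seq_len = 4, 3, 5

W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden-to-hidden
W_xh = rng.normal(scale=0.5, size=(hidden_size, embed_size))   # input-to-hidden
xs = rng.normal(size=(seq_len, embed_size))                    # one embedding per word

h = np.zeros(hidden_size)  # hidden state starts empty
for x in xs:
    # Each step mixes the propagated past (W_hh @ h) with the new word (W_xh @ x),
    # like the orange and green regions in the illustration.
    h = np.tanh(W_hh @ h + W_xh @ x)

print(h)  # the final state carries (diminishing) information from every word
```

Note how the same `W_hh` is multiplied in at every step; that repeated multiplication is exactly what causes trouble during backpropagation.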
Note that in an RNN, the information from the first step doesn't have much influence on the outputs. This is why you can see the orange portion from the first step shrinking with each new step. Correspondingly, computations made at the first step don't have much influence on the cost function either. The gradients are calculated during backpropagation, the process of moving backwards from the final layer reached during the forward pass toward the initial layer. The derivatives from each layer are multiplied from back to front in order to compute the derivative for the initial layer. You can think of gradients as a measure of how much a model can improve over a series of time steps. When your network performs backpropagation, its weights receive an update that's proportional to the gradient with respect to the current weights at that time step. But in a network with many time steps or layers, having the gradient that arrives back at the early layers be the product of all the terms from the later layers makes for an inherently unstable situation, especially if the values have become so small that the early weights no longer update properly. For the problem of exploding gradients, imagine this process working in the opposite direction: the updated weights become so large that they cause the whole network to become unstable, and this situation can lead to numerical overflow. Now that you're appropriately terrified of vanishing and exploding gradients, let's discuss some solutions. I won't spend a whole lot of time on this, since this week focuses on a model that was designed to mitigate this problem. You can deal with vanishing gradients by initializing your weights to the identity matrix, which carries values of 1 along the main diagonal and 0 everywhere else, and using a ReLU activation. What this essentially does is copy the previous hidden state, add information from the current inputs, and replace any negative values with 0.
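The "product of all the terms from the later layers" can be made concrete with a toy scalar example. This is an assumed simplification, not the lecture's code: in a scalar "RNN", each backprop step multiplies in roughly one factor, so the gradient reaching the first step is that factor raised to the number of steps.

```python
# Toy sketch (an assumed simplification): the gradient arriving back at
# step 1 is a product of one factor per later step. Repeated factors
# below 1 vanish; repeated factors above 1 explode.
def gradient_at_first_step(factor, steps):
    grad = 1.0
    for _ in range(steps):
        grad *= factor  # each backprop step multiplies in d h_t / d h_{t-1} ≈ factor
    return grad

print(gradient_at_first_step(0.5, 50))  # ≈ 8.9e-16: vanishes
print(gradient_at_first_step(1.5, 50))  # ≈ 6.4e8: explodes
```

Fifty steps are enough to shrink a gradient of 0.5 per step to around 1e-15, far too small to update the early weights, while 1.5 per step grows past a hundred million.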
This has the effect of encouraging your network to stay close to the values in the identity matrix, which acts like a 1 during matrix multiplication. This method is referred to, unsurprisingly, as an identity RNN. The identity RNN approach only works for vanishing gradients, though, as the derivative of ReLU is equal to 1 for all values greater than 0. To account for values growing exponentially, you can perform gradient clipping. To clip your gradients, simply choose a relevant value to clip the gradients to, say 25. Using this technique, any value greater than 25 will be clipped to 25, which limits the magnitude of the gradients. Finally, skip connections provide a direct connection to the earlier layers. They effectively skip over the activation functions and add the value of your initial input x to your output, giving F(x) + x. This way, activations from early layers have more influence over the cost function. >> Now you understand how RNNs can have a problem with vanishing gradients. Next, I will show you a solution, the LSTM. Let's go to the next video.
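The two mitigations described above, gradient clipping and skip connections, can each be sketched in a line or two of NumPy. The threshold of 25 matches the lecture's example; the function names and sample values are illustrative assumptions.

```python
import numpy as np

def clip_gradients(grads, clip_value=25.0):
    # Element-wise clipping: any component whose magnitude exceeds
    # clip_value is replaced by ±clip_value, limiting gradient magnitude.
    return np.clip(grads, -clip_value, clip_value)

def skip_connection(f, x):
    # Skip (residual) connection: the output is F(x) + x, so the input x
    # has a direct path around the activation f.
    return f(x) + x

grads = np.array([3.0, 40.0, -100.0])
print(clip_gradients(grads))          # clips to [3., 25., -25.]
print(skip_connection(np.tanh, 0.0))  # tanh(0) + 0 = 0.0
```

Because the `+ x` term bypasses the activation entirely, its derivative with respect to x is 1, which is what lets gradients from early layers reach the cost function undiminished.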