Hi guys, and welcome back to our Deep Learning course. My name is Kate, and this week we'll talk about deep learning architectures for sequential data.

First, let's understand in which tasks we really need to work with sequential data. I think all of you know at least several examples of such data: text, video, and audio. You use them every day, and all of them are sequences. Text is a sequence of sentences, or words, or even letters. Video is a sequence of images. And audio is a sequence of sounds. But these are only the most common examples of such data; there are many more. You can find time series in many different areas: in finance, in industry, in medicine, and so on. And time series, of course, are also sequences in time. For example, in medicine, the data coming from medical sensors is usually sequential. So, as you can see, there are a lot of different tasks with sequential data, and this week we'll learn how to use deep learning to solve them.

First, I will introduce you to a simple task with sequential data called language modeling. In this task we want to train a generative model of natural language. That is, we want to be able to generate text from the probability distribution P(text). This distribution tells us which texts are probable in our language, and which are not really texts at all, but only sequences of random words. Here, we suppose that we generate a text word by word. So the probability of a text is a product of the probabilities of the words in this text, where each factor is the conditional probability of the current word given all the previous words, as written out below.

We can use a language model in many different practical tasks. For example, we can use it to generate answers in chatbots or question answering systems. Of course, here we should condition the language model on the previous conversation, because the model has to know what the question was about in order to answer it. A language model can also be useful in machine translation or speech recognition, to decide which phrase sounds more natural in the language. And we can use language models in many other text analysis tasks as well.

Now, let's try to use a standard multilayer perceptron (MLP) to build our language model. Let's say that we've already generated i words, and we want to generate the next word, i + 1. Then we can use the i words we have as the input to our MLP. Of course, here we need some numerical features for these words, for example one-hot encodings or word2vec embeddings. And as the output of our MLP we can have a distribution over our vocabulary, which says which word is most probable to be the next one in our sequence.

Okay, this model will work. But now we have i + 1 words and we want to generate the next word, i + 2. What we would like to do is give all the words we have as the input to our MLP. But we can't, because an MLP works only with a fixed number of inputs. So here we see the main problem with using a standard MLP for sequential data: sequences can have different lengths. What should we do about this? We can use a simple heuristic and give our MLP not all of the previous words, but only a fixed number of them. So we have a window of fixed size as the input; a small sketch of this windowed model is shown below. This technique works, but it is only a heuristic, and it is not clear how to choose the window size for different tasks in practice.
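To make the word-by-word factorization mentioned above concrete, it can be written as follows (the notation with words $w_1, \dots, w_n$ is mine, not from the lecture slides):

$$P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}).$$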
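And here is a minimal sketch of the fixed-window MLP language model just described, assuming PyTorch and purely illustrative layer sizes; it is not the course's reference implementation.

```python
import torch
import torch.nn as nn

class WindowMLPLanguageModel(nn.Module):
    """Predicts the next word from a fixed window of previous words."""
    def __init__(self, vocab_size, emb_dim=100, window=100, hidden=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The first layer sees the whole window at once: window * emb_dim inputs.
        self.hidden = nn.Linear(window * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, window_ids):             # window_ids: (batch, window) word indices
        x = self.embed(window_ids)             # (batch, window, emb_dim)
        x = x.flatten(start_dim=1)             # concatenate the window into one long vector
        h = torch.tanh(self.hidden(x))
        return self.out(h)                     # logits over the vocabulary for the next word
```

With window = emb_dim = hidden = 100, the `self.hidden` layer alone holds 100 · 100 · 100 = 1,000,000 weights, which is exactly the count discussed next.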
And in some tasks, we actually need a very wide window. For example, suppose we want to generate a story. On the first page we wrote that we have a character named King Louis, and now we are on the fifth page. We have just generated the word "King", and it's clear that the next word should be "Louis". But our model understands this only if the window covers more than five pages, and five pages is a very big window; there are a lot of words in it. So in this case our MLP has a lot of parameters, and that's a problem.

Let's, for example, calculate how many parameters there are in the first layer of our MLP if it contains 100 hidden neurons, we use a window of size 100, and all of our words are described by embeddings of size 100. Yes, there are more than a million parameters in the first layer of our network alone (100 inputs of size 100, fully connected to 100 hidden neurons, gives 100 × 100 × 100 = 1,000,000 weights). And here you need to understand that these sizes are actually not that large. I mean, a network with 100 hidden neurons is a very small network; in practice we need to be able to work with much larger networks. So that's a problem.

To overcome this problem we can use a recurrent architecture. In this architecture we process the sequence word by word, so we see only one word at each time step. Here our MLP has not only one output, which it passes to the next layer, but also a second output, which it passes to the next time step. Therefore, it is no longer necessary for our MLP to take all of the previous words as input at the same time. It is enough to take only the current word, together with its own second, additional output from the previous time step. As a result, our MLP now has a fixed number of inputs, but it still carries all the information about the previous words in the sequence. So now we can work with sequences of arbitrary length. Of course, at the first step we need to use some initial vector in place of the output from the previous time step, because there is no previous time step.

Additionally, the dependence between the next word that we want to generate and all of the previous ones does not actually depend on the time step. So we can use the same MLP at each time step. As a result, all the parameters of our MLP are shared between time steps, and we need far fewer parameters; a minimal sketch of such a recurrent step is given after the summary below.

Now let's return to our example and calculate the number of parameters in the first layer of our MLP if it contains 100 hidden neurons, all the words are described with word embeddings of size 100, and the thing that the MLP passes from one step to another is exactly these 100 hidden neurons from the first layer. Yes, now we have only about 20,000 parameters ((100 + 100) inputs × 100 hidden neurons = 20,000 weights). That's almost two orders of magnitude fewer than before, so this technique works.

Now, let's summarize what we've learned in this video. We know that there are a lot of different tasks with sequential data, and we need to know how to work with them. We know that feedforward neural networks are not very useful in such tasks because sequences have arbitrary lengths, and also because of the large number of parameters. And here we can use a recurrent architecture, which is much more useful. In the next video we will speak about the simple recurrent neural network and how to train it.
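As promised above, here is a minimal sketch of the recurrent step, again assuming PyTorch and the same illustrative sizes; the actual simple recurrent neural network is covered in the next video.

```python
import torch
import torch.nn as nn

class RecurrentLanguageModelCell(nn.Module):
    """One time step: takes the current word and the hidden state from the previous step."""
    def __init__(self, vocab_size, emb_dim=100, hidden=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The first layer now sees only emb_dim + hidden inputs, regardless of sequence length:
        # (100 + 100) * 100 = 20,000 weights instead of 1,000,000.
        self.hidden = nn.Linear(emb_dim + hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, word_ids, prev_hidden):   # word_ids: (batch,), prev_hidden: (batch, hidden)
        x = torch.cat([self.embed(word_ids), prev_hidden], dim=1)
        h = torch.tanh(self.hidden(x))           # new hidden state, passed on to the next time step
        return self.out(h), h                    # next-word logits and the carried state

# The same cell (same parameters) is applied at every time step:
# h = torch.zeros(batch_size, 100)   # initial vector standing in for the non-existent previous step
# for word_ids in sequence:
#     logits, h = cell(word_ids, h)
```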