Hey, everyone. In this lesson, we are going to discuss sequence tagging task. We'll start with some examples and then investigate one model that can be used to solve it. It will be called Hidden Markov Model. Let us get started. So the problem is as follows. You are given a sequence of tokens, and you would like to generate a sequence of labels for them. The examples would be part of speech tagging, named entity recognition, or semantic slot filling that we have briefly seen in introduction. Now, for example, I am given a sequence of words, and they want to produce part-of-speech tags like here. For example, saw is that the verb and I is the pronoun and so on. There are different systems to list all possible tags and different systems to decide what are important tags and not Important tags. One system is here, and you can see that there are some open class words and closed class words and some other labels. For example, we have also here some punctuation and symbols for punctuation if we see it in the texts. Another example would be named entity recognition. So, here, I have a sequence and last Friday or Winnie-the-Pooh would be some named entities. Sometimes we really need to find them in the texts and use them as features, or maybe we need them to generate answer to some question. What kind of named entities can we see? First of all, it would be some persons or organization names or locations but not only them. For example, we can also have dates and the units and any other entities that you see in the texts. What kind of approaches work well for these type of tasks? First, it would be rule-based approach, and we do not cover it in details here. And the second one would be just to take classifiers like naive Bayes classifier or maybe a logistic regression and use them separately at each position to classify the labels for this position. This is obviously not super nice because you do not use the information about sequence, right? So a better idea would be to do sequence modeling, and this is what is our lesson about. And another idea would be to do neural networks, and you will hear about them in another lesson. To come to our models, to hidden Markov models, let us define everything properly. So please remember this notation, and let us go through it. So we have a sequence. It will be a denoted by x, and we have a sequence of tags, which would be denoted by y. The length of the sequence is big T, and by T, we will denote the positions in our sequence. Now, the task is to produce y given x, to produce the most probable y. So we need to find the most probable tags for our words in our model, but to find something most probable, we should first define the model and see what are the probabilities here. So we are going to define the joint probability of x and y. Do you understand the equation in the bottom of the slide? So take a moment to see that, actually, the right-hand side and the left-hand side can be both used to find the argmax just because they are different just by probability of x, which does not depend on y. So if I multiply the left side by probability of x, it is just constant for me. So I don't care whether I do it or not. Let us define the model. This is probably the most important slide of our video because it tells us what is the hidden Markov model. So we need to produce the probabilities of x and y. x are our observable variables and y are our hidden variables. Now, first, we apply product rule to decompose it into two parts. Then every part should be farther simplified, and we can do this by applying two different assumptions. The first assumption, which is called Markov assumption, will help us to model the probability over the sequence of tags. So we say that this sequence can be factorized into probabilities of separate pieces so we just model the probabilities of the next tag given the previous tag. You have already seen this in the y-gram language models. So there, we applied exactly the same formula to model the probabilities of words. Now, we do this to model the probabilities of tags. Also there, maybe you remember that we had to have some special start token so that we could write down this like that and be with the fact that we can have T equal to zero. So, here, the first term would be the probability of the first tag given some special zero tag, which is just the start tag. Awesome. Now, the second assumption is about output probabilities. So we need to produce the probabilities of x given y and it is rather complicated. But we can say that we will factorize it into the probabilities for separate positions. And for every position, we will have the probability of the current word given the current tag. So given these two assumptions, you can write down the formula in the top of the slide. And this is the definition of hidden Markov model. Please take a moment to remember it. Now, hidden Markov model can be used to model texts. So it can be used to generate texts. Let us see how it happens. So, first, we'll need to generate a sequence of tags, and after that, we will generate some words given current tags. For example, we will start with our fake start tag, and then we'll generate some tags using those transition probabilities. So we generate them and continue like that, and after we are done, we start to generate words given this tags. So we generate a sequence, and we see that the model has generated some nice example from Winnie The Pooh here. Now, let us define once again what are the parameters of the model. So hidden Markov model is defined by the following formal five steps. So, first, we have some hidden states, those y in our previous notation. We have some finite set of this states and also we have a special start state as zero. We have transition probabilities that model the probabilities of the next state given the previous one in our model. Then we have some vocabulary of output words, and for them, we have output probabilities, so the probabilities of words given tags. Now, can it compute how many parameters do you have? Well, actually, we have just two matrices of parameters. A metrics is N plus one by N because you have also this special start token. And for this special start state, you need to model the probabilities to transit to all other states. For the output probabilities, the dimension will be the number of states multiplied by the number of output words. So we have lots of parameters. We need to learn them somehow. So how can we train the model? Well, let us imagine for now that we have a supervised training data set, which means that we have a sequence of words, and we have a sequence of tags for this words. Then we could just count, so we would count how many times we see the tag Si, which is followed by the tag Sj. And we will normalize this count by the number of times when we see the tag as i. So this way, we will get the conditional estimate to see the tag as j after the tag as i. We could do the similar thing with output probabilities. So there, we would count how many times some particular tag generates some particular output and normalize this by the number of times where we see this particular tag. So this way, we would get conditional probabilities for the output words. So I can tell you that this intuitive estimates are also maximum likelihood estimates, but let us get in a little bit more details and make sure that we understand how we compute these counts. So we have many sentences. Let us just concatenate all those sentences into one long corpus of the length big T. Then we can compute those counts just by running through the corpus from T equal to one to big T and computing the indicator functions. So there, in green, I write down that I see the tag Si on the position Yt minus one and then I see the tag as j that follows and the similar indicator function in the denominator. So this is still the same formula, just another Notation. But the thing is that in real life, usually, you do not have tags in your training data. So the only thing that you'll see is plain text, and you still need to train hidden Markov model somehow. How can you do it? Well, obviously, you cannot estimate those indicator functions because you don't see the tags of the positions, but you could try to approximate those indicators by some probabilities. So compare this formula with the formula in the bottom of this slide. The only thing that has changed is that instead of indicators, I have probabilities now. So something in between zero and one, but how can we get these probabilities? Well, if they have some trained hidden Markov model, we could try to apply it to our texts to produce those probabilities. So the E-step in the top of this slide says that given some trained hidden Markov model, we can produce the probability to see tags Si and Sj in the position T. So this is something like three-dimensional array. T, i, and j would be the indices, and it can be actually done effectively with dynamic programming. The only thing is that we need to have trained model there. So the clever idea is to alternate two steps. The E-step says that let us fix some current parameters of the Model A and B matrices and use them to generate those estimates for probabilities of tags for every certain position. And M-step says let us fix those estimates and use them to update our parameters, and those parameters updates are actually still maximum likelihood estimates in this case. So in the slide, I have everything written down for A matrix, but you can obviously do very similar things to compute B matrix. So this is just a sketch of the algorithm it is called Baum-Welch algorithm. And this is just a case over more general algorithm, which is called EM algorithm, and we will see this algorithm several times during our course every time when we have some hidden variables that we do not see in the training data and some observable variables. This is all about training the hidden Markov model. I hope you have got some flavor of how it can be done. And then in the next video, we will discuss how to apply the hidden Markov model once it's trained.