0:03

Welcome to week two of practical Bayesian methods. I'm Alexander Novikov, and this week we're going to cover latent variable models: what latent variables are, why we need them, and how to apply them to real problems.

The second topic for this week is the expectation maximization algorithm, which is a key topic of our course and a method to train latent variable models. We will see numerous extensions of the expectation maximization algorithm in the following weeks.

So, let's get started with latent variable models. A latent variable is just a random variable which is unobservable, neither in the training nor in the test phase. "Latent" is just "hidden" in Latin. As an example, some phenomena like height, length, or speed can be measured directly, while others, like intelligence or altruism, cannot: you can't just measure altruism on some quantitative scale. These variables are usually called latent.

To motivate why we need to introduce this concept into probabilistic modeling, let's consider the following example. Say you have an IT company and you want to hire an employee, so you have a bunch of candidates, and for each candidate you have some data. For example, for all of them you have their average high school grades, for some of them you have their university grades, and maybe some of them took IQ tests, and so on. You also conducted a phone screening interview: your HR manager called each of them and asked them a bunch of simple questions to make sure that they understand what your company is about.

Now you want to bring these people onsite for an actual technical interview, but the problem is that you have too many candidates. You can't invite all of them because it's expensive: you have to pay for their flights, their hotels, and so on. So a natural idea arises: let's predict the onsite interview performance for each of them and bring in only those who are predicted to be good enough. But how do we predict who will be a good fit for our company?

Well, if you have been in the business for a while, you may have some historical data. For a bunch of other people, you know their features, like their grades and their IQ scores, and you know their onsite performance because you have already conducted those interviews. Now you have a standard regression problem: you have a training data set of this kind of data, and for new people you want to predict their onsite performance and invite only those whose predicted performance is good.

However, there are two main problems that prevent us from applying standard regression methods from machine learning here. First of all, we have missing values. For example, we don't know the university grades for all of the candidates, because Jack didn't attend university. That doesn't mean he is not a good fit for your company; maybe he is, but he just never bothered to attend one. So we don't want to ignore Jack, but we still want to predict some meaningful onsite interview performance score for him.

The second reason why we don't want to use standard regression methods like linear regression or neural networks is that we may want to quantify the uncertainty in our predictions. Imagine that for some people we predict that their performance is really good: we certainly want to bring them onsite, and maybe even want to hire them right away. But for others the predicted performance is not as good. For someone, the predicted performance may be, for example, 50, which may mean that this person is not a good fit for your company. But it may also mean that we're just not sure about him: we don't know anything about him, we asked the algorithm to predict his performance, and it returned some number, but that number doesn't mean anything. So in this case, we may want to quantify the uncertainty of the algorithm's predictions.

If the algorithm is quite sure that this person will perform at a level of, say, 50 out of 100, then we may not want to bring him onsite. On the other hand, if some other guy's predicted performance is also 50 but we're really uncertain about it, then we may want to bring him in anyway: maybe we just don't know anything about him, and he may be good after all. The reason for this uncertainty may be, for example, that he has lots of missing values, or that his data is a little bit contradictory, or that our algorithm just isn't used to seeing people like that.
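As a rough illustration (not the lecture's method, and with made-up numbers): a probabilistic model would give each candidate a predictive mean and a predictive spread rather than a single score. Two candidates can share the same predicted mean of 50 while the model is confident about one and very unsure about the other, and a simple decision rule can treat them differently:

```python
# Hypothetical candidates: (predicted mean, predicted standard deviation).
# Both are predicted at 50/100, but with very different uncertainty.
candidates = {
    "confident_50": (50.0, 2.0),
    "uncertain_50": (50.0, 25.0),
}

threshold = 70.0  # hypothetical bar for an onsite invitation

for name, (mean, std) in candidates.items():
    # Optimistic upper bound: could this candidate plausibly clear the bar?
    upper = mean + 2 * std
    invite = upper >= threshold
    print(name, invite)
# The confident 50 (upper bound 54) is skipped; the uncertain 50
# (upper bound 100) is invited anyway, matching the lecture's intuition.
```

The specific rule (mean plus two standard deviations against a threshold) is just one way to act on quantified uncertainty; the point is that the single number 50 is not enough to decide.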

So these two reasons, having missing values and wanting to quantify uncertainty, bring us to the need for probabilistic modelling of the data. As we discussed in week one, one of the usual ways to build a probabilistic model is to start by drawing some random variables and then understanding what the connections between these random variables are: which random variables correlate with each other in some way?

In this particular case, it looks like everything is connected to everything. If a person's university grades are high, that directly influences our beliefs about his high school grades or his IQ score, and this is true for any pair of variables here. The situation where we have all possible edges, where everything is connected to everything, means that we've failed to capture the structure of our probabilistic model.

So we end up with the most flexible and the least structured model that we can possibly have. In this situation, to build a probabilistic model of our data, we have to assign a probability to each possible combination of our features. There are exponentially many combinations of different university grades, different IQ scores, and so on, and for each of them we have to assign a probability. This table of probabilities has billions of entries, and it's just impractical to treat these probabilities as parameters.

So we have to do something else. Well, we can always assume some parametric model, right? We can say that we have these five random variables and that the probability of any combination of them is some simple function, for example the exponent of a linear function divided by a normalization constant. In this case, you reduce your model complexity by a lot: now we have just five parameters which we want to train. But the problem here is the normalization constant. To normalize this thing, so that it is a proper probability distribution and sums up to one, we have to compute the normalization constant, which is the sum over all possible configurations. And this is a gigantic sum: we have to consider all billions of possible configurations to compute it, which means that training and inference will be impractical.
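To see why the normalization constant is the bottleneck, here's a toy brute-force sketch of such a model. The weights and the tiny 4-value domain are made up so the sum is actually computable; with realistic domains of, say, 100 values per feature, the same sum would have 100^5 = 10 billion terms:

```python
import itertools
import math

# Toy unnormalized model: p(x) proportional to exp(w . x) over 5 discrete
# features. Weights are arbitrary; the domain is tiny so brute force works.
w = [0.1, -0.2, 0.05, 0.3, -0.1]
domain = range(4)  # only 4 values per feature here

def unnormalized(x):
    return math.exp(sum(wi * xi for wi, xi in zip(w, x)))

# Normalization constant: sum over EVERY configuration of all 5 features.
Z = sum(unnormalized(x) for x in itertools.product(domain, repeat=5))

# With Z, the probabilities sum to one, as a proper distribution must.
total = sum(unnormalized(x) / Z for x in itertools.product(domain, repeat=5))
print(total)       # approximately 1.0
print(4 ** 5)      # 1024 terms here; 100 ** 5 would be 10 billion
```

Even this tiny example already loops over 1024 configurations twice; at realistic feature cardinalities the brute-force sum, and therefore naive training and inference, is infeasible.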

So what else can we do here? Well, it turns out that you can introduce a new variable which you don't actually observe, which we will call intelligence. You can assume that each person has some internal, hidden property, which we will call intelligence and, for example, measure on a scale from 1 to 100. This intelligence directly causes the IQ scores, the university grades, and so on. Of course, this connection is non-deterministic: an intelligent person can have a bad day and perform poorly on a test. But this is direct causation: intelligence directly causes all these observations.

If we assume such a model, then we reduce the model complexity by a lot: we removed lots of edges, and now our model is much simpler to work with. Now we can write down our probabilistic model by using the sum rule of probabilities: the probability of the features is the sum, over all possible values of the intelligence, of the conditional probability of the features given the intelligence times the prior probability of the intelligence. And this conditional probability factorizes into a product of small probabilities because of the structure of our model.
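Written out (with $I$ denoting the latent intelligence and $x_1, \dots, x_5$ the five observed features), the sum rule and the factorization described here look like this:

```latex
p(x_1, \dots, x_5)
  = \sum_{I} p(x_1, \dots, x_5 \mid I)\, p(I)
  = \sum_{I} p(I) \prod_{j=1}^{5} p(x_j \mid I)
```

The second equality holds because, given intelligence, the five features are assumed conditionally independent.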

So now, instead of one huge table with all the combinations of five different features, we have just five small tables that assign probabilities to pairs, like an IQ score given intelligence. This means that we're able to reduce the model complexity without reducing the flexibility of the model.
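A quick parameter count makes the saving concrete. Assuming (as in the lecture's running example) five observed features and a latent intelligence, each on a scale of roughly 100 values:

```python
# Parameter-count sketch; the sizes are illustrative assumptions.
n_features = 5
n_values = 100   # values per observed feature
n_latent = 100   # values of the latent intelligence variable

# Fully connected model: one probability per joint configuration.
full_joint = n_values ** n_features
print(full_joint)   # 10000000000 entries

# Latent variable model: one small table p(x_j | I) per feature, plus p(I).
factorized = n_features * (n_values * n_latent) + n_latent
print(factorized)   # 50100 entries
```

The drop from ten billion entries to about fifty thousand is exactly the complexity reduction the latent variable buys us.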

So to summarize, introducing latent variables may simplify our model: it can reduce the number of edges we have and, as a consequence, the number of parameters. Another positive feature of latent variables is that they are sometimes interpretable. For example, with this intelligence variable, for a new person we can estimate his intelligence on the scale from 1 to 100, and it can turn out to be, say, 80.

What does that mean? Well, it's not obvious, because you don't know what the scale means, and you're not even sure that this intelligence variable means actual intelligence: you never told your model that this variable should be intelligence, you just said that there should be some variable here. But anyway, this variable can be interpretable, and you can compare the intelligence of different people in your data set according to this scale.

A downside of latent variable models is that they can be harder to work with: to train a latent variable model, you have to rely on a lot of math. And this math is what this week is all about. So in the next videos, we will discuss methods for training latent variable models.
