This is a long lecture, the longest in the entire course and one of the most mathematical. But we have finally arrived at the point of introducing the neighborhood method. Up to this point, everything we have talked about concerned an individual row or an individual column in the movie-user rating table, or matrix. Now we are ready to bring in global structure. The way we do that is by defining a neighborhood L_i for each movie i. We could equally look at this from a user point of view, but we will focus on the movie-movie correlation. So think of two columns representing two movies; we want to quantify how different they are, in other words, define a pairwise statistical correlation between the two movies. Then we look at another pair of movies, still movie i, but compared with movie k instead of movie j, and examine the ratings in those two columns. We will then pick out all the movies that are very similar or very dissimilar to movie i, and both kinds will be useful. So this is the plan: we first define a similarity metric between two movies, then search through all the movies out there and pick out a subset that are either very similar or very dissimilar to movie i, call that subset the neighborhood L_i, and leverage that global information in predicting whether users will like or dislike movie i.

This idea can be illustrated quite nicely with a simple picture. Suppose we have just two movies and two users, so we can easily compare two points on a 2D grid: the points are the movies A and B, and the axes are users one and two. Movie A gets a rating of five from user one and one from user two, so a five-star and then a one-star from these two users, whereas movie B gets a two-star and a five-star from them, respectively. You can tell these two movies are very different: they are liked or disliked in opposite polarity by the two users. And indeed, that means the angle theta between them is large. What is this angle? It is the angle between two straight lines: one from the origin to movie A's coordinates, and the other from the origin to movie B's coordinates. If the two points are far apart, one quantification of that is this angle theta. We measure the cosine of the angle, and we use that number to describe how similar or dissimilar movies A and B are. This is called the cosine coefficient, and it is the particular instance of a similarity metric we will use here.

So how do we write down the cosine coefficient? The cosine coefficient between two movies i and j is called d_ij, and, if you remember your basic analytic geometry, it is r_i transpose r_j, where r_i is movie i's point viewed as a vector from the origin and r_j is movie j's point viewed the same way, divided by the normalization: the L2 norm of r_i times the L2 norm of r_j. In other words, we view the ratings of the two movies as two vectors r_i and r_j, and the cosine of the angle between them is the standard formula: the inner product normalized by the sizes of the two vectors.
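To make the formula concrete, here is a minimal sketch in Python of the cosine coefficient on the two-movie, two-user example above. The dict-of-ratings representation and the function name are my own illustration, not from the lecture, and this version uses the raw ratings; the lecture will shortly switch to the baseline-shifted r-tilde values.

```python
import math

def cosine_coefficient(ratings_i, ratings_j):
    """Cosine of the angle between two movies' rating vectors,
    restricted to the users who rated both movies."""
    common = set(ratings_i) & set(ratings_j)
    if not common:
        return 0.0  # no shared raters: no evidence either way
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = math.sqrt(sum(ratings_i[u] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings_j[u] ** 2 for u in common))
    return dot / (norm_i * norm_j)

# Movie A: 5 stars from user 1, 1 star from user 2.
# Movie B: 2 stars from user 1, 5 stars from user 2.
movie_a = {1: 5, 2: 1}
movie_b = {1: 2, 2: 5}
print(cosine_coefficient(movie_a, movie_b))  # ~0.55, i.e. theta ~57 degrees
```

A cosine well below one here reflects the wide angle between the two movies' rating vectors: the two users rate them in opposite polarity.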
In the longer form, d_ij means the sum of r_ui times r_uj over all users u who have rated both movies i and j — that is the inner product of the two vectors — divided by the square root of the sum of r_ui squared over that same set of users, times the square root of the sum of r_uj squared over that set. Those square roots are just the L2 norms of the two vectors, by definition. So d_ij is the cosine coefficient, defined geometrically as the cosine of the angle theta and algebraically by this formula. Now, because we have already shifted the data from r to r minus the baseline predictor r-hat for each (u, i) pair — we called this r-tilde_ui at the end of the last segment of the video — strictly speaking we should put tildes on everything here, so all of these are r-tilde values that have already subtracted out the influence of the baseline predictor.

What we need to do now is look at all pairs of movies (i, j), excluding the pairing of movie i with itself, and compute this quantity. At that point, for each movie i, we have computed d_ij for all the other movies j: a list of numbers. Let's arrange them in descending order — say, movies five, eight, ten, two, and so on with respect to movie i = 1, the first movie. Some of these are large positive numbers and some are large negative numbers; remember, d_ij can be either positive or negative. Those that are very positive and those that are very negative are both useful, because they correspond to very similar or very dissimilar movies. So let's look at the absolute values of the d_ij's and include those that are large. I will pick the top, say, five movies, where "top" means very large in either direction. And that defines what we call the neighborhood: a list of other movies with respect to the given movie i. For example, L_1 could be movies five and eight, plus, say, a movie 21 from down near the bottom of the list.

Now, who determines how big this set is — how long the list is, or equivalently the cutoff threshold on the d_ij values? You do: the recommender. There are quite a few rules and guidelines for choosing the size of this list. If you make it too big, you are indeed incorporating comparative information from many other movies, but some of them are so weakly coupled with this movie that they don't matter; you are just adding noise. On the other hand, if you include too few neighboring movies, you are not fully leveraging the available information, and you lose precious information about other movies, so that is not good either. We won't have time to go into the exact art of picking the right size, and that is indeed part of the reason it took the teams competing for the Netflix Prize so much computation. But roughly speaking, the idea is this: if you like, say, "Schindler's List" and "Life is Beautiful", then maybe you will like "Doctor Zhivago"; if you dislike "E.T." or "The Lion King", then maybe you will also dislike, say, "Toy Story". We view all these movies as points in a space, and those that are close by this cosine metric, we call neighbors; a sketch of this selection step follows below.
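Here is a minimal sketch of that neighborhood-selection step, assuming similarities[j] holds the cosine coefficient d_ij of each other movie j with the given movie i (computed on the shifted ratings r-tilde). The example d_ij values are hypothetical, chosen to mirror the L_1 = {5, 8, ..., 21} example above, and the neighborhood size is the recommender's design choice just discussed.

```python
def neighborhood(similarities, size=5):
    """Return the movies with the largest |d_ij|: the most similar
    or most dissimilar neighbors of movie i."""
    ranked = sorted(similarities,
                    key=lambda j: abs(similarities[j]),
                    reverse=True)
    return ranked[:size]

# Hypothetical d_1j values for movie i = 1:
sims = {5: 0.91, 8: 0.87, 10: 0.12, 2: -0.05, 21: -0.83}
print(neighborhood(sims, size=3))  # [5, 8, 21]
```

Movie 21 makes the cut despite its negative coefficient: a strongly dissimilar movie carries just as much predictive information as a strongly similar one.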
And therefore, for each movie i, we define a neighborhood by the absolute values of these cosine coefficients. So now that we have defined the metric, cosine similarity, and from it a neighborhood for each movie, we can finally define the predictor based on that information: the neighborhood predictor. We will denote it r-hat_ui, but since we have already used that symbol for the baseline predictor, let's add a superscript N. This N does not mean the number of users or movies; it is shorthand for "neighborhood". So we want to find r-hat^N_ui. This is basically the original baseline predictor plus what we learn from the neighbors. What we learn is this: each neighbor movie j, through its shifted rating r-tilde (again, r-tilde is just r minus r-hat), gives us some information, depending on whether j is positively or negatively correlated with movie i. We want to add up that information, summing over all movies j in the set L_i.

We have to weigh each such rating by a neighbor movie j in some way. What should the weight be? One choice is to let the weight be the cosine coefficient itself, d_ij. This may not be the best choice — we just don't have time to go into optimizing this particular choice of weight — so let's say it is d_ij. If d_ij is positive, we add to the prediction; if it is negative, we subtract from it. Then we have to normalize somehow, because we might be adding up many movies in the neighborhood: we divide by the total magnitude of the weights, that is, the sum of |d_ij| over j in the neighborhood L_i.

Now, why do we put the absolute value in the denominator but not in the numerator? In the normalization, if you don't take absolute values, the positive and negative d_ij's will cancel each other, which defeats the intention of counting their sizes. But in the numerator, if we took absolute values, we would lose track of whether j is very similar or very dissimilar to i. That is why, when we compute the influence of the neighboring movies j, we use the actual d_ij, which can be positive or negative, but when we normalize by the size, we use the absolute values of the d_ij's. You could make this even simpler by setting all the weights to one: simply add up the r-tilde's and divide by the size of the neighborhood. That turns out not to perform very well, so we use the slightly more involved weights here.

All right, so this is what we have been looking for in the last, what, 80 minutes: the neighborhood predictor that will get you quite far ahead in the Netflix Prize competition. It is composed of two parts. One part is the baseline predictor. Again, how do we get the baseline predictor? Very simply, by solving the least-squares problem for the b_u's and b_i's, because the baseline predictor r-hat is just the lazy average predictor plus b_u plus b_i for each (u, i) pair. Solve the least squares against the ground truth, train the parameters b_u* and b_i*, stick them into the predictor, and you get r-hat. The second term in the neighborhood predictor is the actual neighborhood information. Again, we use the cosine coefficient as the metric, and then for each movie i we define a neighborhood of a certain size.
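Putting the formula together, here is a minimal sketch of r-hat^N_ui, assuming we already have the baseline prediction for this (u, i) pair, the weights d_ij, and the user's shifted ratings r-tilde_uj for whichever neighbor movies they have rated. All names and numbers are illustrative.

```python
def neighborhood_prediction(baseline_ui, d_i, r_tilde_u, L_i):
    """r-hat^N_ui = r-hat_ui + sum_j d_ij * r-tilde_uj / sum_j |d_ij|,
    summing over neighbors j in L_i that user u has actually rated."""
    rated = [j for j in L_i if j in r_tilde_u]
    if not rated:
        return baseline_ui  # no neighbor evidence: fall back to baseline
    numerator = sum(d_i[j] * r_tilde_u[j] for j in rated)
    denominator = sum(abs(d_i[j]) for j in rated)
    return baseline_ui + numerator / denominator

# Hypothetical numbers: the baseline says 3.4 stars.
print(neighborhood_prediction(
    baseline_ui=3.4,
    d_i={5: 0.91, 8: 0.87, 21: -0.83},
    r_tilde_u={5: 0.6, 21: -1.0},  # user u rated movies 5 and 21
    L_i=[5, 8, 21],
))  # ~4.19
```

Here both neighbors push the prediction up: the user liked similar movie 5 more than its baseline, and disliked dissimilar movie 21, and the negative weight times the negative shifted rating contributes positively, just as the sign discussion above says it should.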
Then, based on that, every movie j in the neighborhood set is weighted with a certain weight — for example, the d_ij themselves — and normalized. That is the total neighborhood influence on the prediction for (u, i). So now, let's summarize what we have at this point. The neighborhood predictor for collaborative filtering in the Netflix problem consists of five steps. Step one: train the baseline predictor, r-hat = r-bar + b_u + b_i. We don't know what b_u and b_i should be, so we compare this prediction against the actual r_ui, look at the squared differences, and solve the least-squares problem; that gives us the b_u's and b_i's, and therefore the baseline predictor r-hat. Step two: once we have the baseline predictor, subtract r-hat from r to get the shifted r-tilde, which can be positive or negative depending on r-hat. Step three: from the r-tilde's, compute the similarity metric. If we collect all the r-tilde's into a matrix, big R-tilde, then from R-tilde we get another matrix D, where each (i, j) entry of D is simply the d_ij we computed as a function of the r-tilde's. Step four: once we have computed the d_ij's, define a neighborhood L_i for each movie i, consisting of the movies most similar or most dissimilar to movie i. Step five: once that is done, plug everything into the r-hat^N_ui formula we just showed on the previous page, since we now have all the ingredients: the baseline, the b's, and the r-tilde's. That gives us the final predictor, whose (u, i) entries we can collect into a table, or matrix, R-hat^N, N for the neighborhood method.
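To tie the five steps together, here is an end-to-end sketch on a tiny toy matrix. It assumes a fully observed 3-by-3 rating matrix so every sum runs over all entries; real Netflix data is sparse, so each step would restrict itself to the known (u, i) pairs. The matrix values, the neighborhood size, and the least-squares setup are all illustrative, not from the lecture.

```python
import numpy as np

# Toy ratings: rows are users u, columns are movies i.
R = np.array([[5., 1., 4.],
              [1., 5., 2.],
              [4., 2., 4.]])
n_users, n_movies = R.shape

# Step 1: train the baseline r-hat_ui = r-bar + b_u + b_i by least
# squares, one equation per known rating, as described above.
r_bar = R.mean()
rows, targets = [], []
for u in range(n_users):
    for i in range(n_movies):
        x = np.zeros(n_users + n_movies)
        x[u] = 1.0              # coefficient of b_u
        x[n_users + i] = 1.0    # coefficient of b_i
        rows.append(x)
        targets.append(R[u, i] - r_bar)
b, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
b_u, b_i = b[:n_users], b[n_users:]
R_hat = r_bar + b_u[:, None] + b_i[None, :]

# Step 2: shift the ratings: r-tilde = r - r-hat.
R_tilde = R - R_hat

# Step 3: similarity matrix D, with d_ij the cosine of the angle
# between columns i and j of R-tilde.
norms = np.linalg.norm(R_tilde, axis=0)
D = (R_tilde.T @ R_tilde) / np.outer(norms, norms)

# Step 4: neighborhood L_i = the movies j != i with the largest |d_ij|.
L = 1  # neighborhood size: the recommender's design choice
neighborhoods = {
    i: sorted((j for j in range(n_movies) if j != i),
              key=lambda j: abs(D[i, j]), reverse=True)[:L]
    for i in range(n_movies)
}

# Step 5: neighborhood predictor for every (u, i) entry.
R_hat_N = R_hat.copy()
for i in range(n_movies):
    js = neighborhoods[i]
    R_hat_N[:, i] += (R_tilde[:, js] @ D[i, js]) / np.abs(D[i, js]).sum()

print(np.round(R_hat_N, 2))  # the matrix R-hat^N of final predictions
```

The shape of the pipeline is the point here: least squares gives the baseline, subtraction gives R-tilde, R-tilde gives D, D gives the neighborhoods, and the weighted, normalized neighbor sum gives the final matrix R-hat^N.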