What is this Netflix Prize? In October 2006, Netflix said: look, I've got something called Cinematch, the internal recommendation engine that Netflix was using. It gives me a certain RMSE (root mean squared error), and I'm wondering if I can do better. For example, can I improve this by, say, 10%? You might say, wow, 10% is not a big improvement, is it? Well, first of all, Cinematch was not a stupid algorithm, and second, even a few percent improvement in the RMSE can change the order of the recommended movies, and therefore affect customer satisfaction, inventory control, and so on. From 1999 to 2006, Netflix collected seven years of ratings by different users on different movies, and it made available about 100 million of these ratings, coming from 480,000 users on 17,770 movies. So on average, each movie was rated by more than 5,000 users, and each user rated more than 200 movies. Now you can see that this is a huge data set, and it's all real data. Of course it's anonymized; there was some controversy over how well the privacy was protected, but that's outside our scope today. So it's big data. It's also sparse data, because only about 1% of the total number of possible ratings were actually known: if you multiply the number of movies by the number of users, you see that 100 million is actually a small fraction of that. And furthermore, it is skewed data. The average numbers of users per movie and movies per user are reasonably big, but some users rated a lot of movies, one user rated almost every single movie on there, and some users rated very few movies, just several of them. And whatever recommendation engine you use has to work for all movies and all users. So how do we deal with big, sparse, and skewed data? That is the challenge from this Netflix Prize: get to a 10% improvement over Cinematch's RMSE.
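To make these back-of-the-envelope numbers concrete, here is a small arithmetic check, using the approximate counts quoted above (the exact published figures differ slightly):

```python
# Rough Netflix Prize data-set statistics, using the approximate counts
# quoted in the lecture (not the exact published figures).
num_ratings = 100_000_000   # about 100 million ratings
num_users   = 480_000       # about 480,000 users
num_movies  = 17_770        # 17,770 movies

ratings_per_movie = num_ratings / num_movies   # on average, > 5,000 users per movie
ratings_per_user  = num_ratings / num_users    # on average, > 200 movies per user

# Sparsity: the fraction of the user-by-movie table that is actually filled.
possible_entries = num_users * num_movies      # billions of possible ratings
sparsity = num_ratings / possible_entries      # only about 1% is known

print(ratings_per_movie, ratings_per_user, sparsity)
```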
Now this was a huge attraction already, and on top of that Netflix offered $1 million, US, to whichever team or person could get to 10% first. So this was the famous Netflix Prize. It was open to people from all countries except just a couple, and it was online, so you didn't have to show up in person. Clearly both the $1 million and, perhaps more importantly, the huge data set made available were very attractive to the machine learning, data mining, and information retrieval research communities. So we're going to take a quick look at the data set involved here. The entire data set, a little over 100 million rating entries r_ui (the rating by user u on movie i), was divided into four parts: the training set, the probe set, the quiz set, and the test set. You may ask why they made it so complicated; why not just two sets, a training set and a test set? It turns out this was a pretty smart arrangement. The training and probe sets were made public, and they are just big enough to be very interesting, yet small enough to fit into a normal laptop in 2006. The quiz and test sets, on the other hand, were hidden: only Netflix had the ground truth, and the competitors in the prize did not have access to these actual data. So as far as the competitors were concerned, say you and I form a team: we have access to the training data, which is almost 100 million ratings, plus 1.4 million ratings of probe data. Since we have access to these data, we can hold the probe set out as ground truth, train our algorithm on the training set, and then look at the recommendation accuracy on the probe set. It turns out that the probe set's size and statistical characteristics, in terms of the distribution of movies and users, are very similar to those of the quiz set and the test set, which we don't know. So we can use it to test our methods as frequently as we want, many times a day. But the actual competition result is not determined by this.
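As a sketch of how a team could grade itself on the public probe set, here is the RMSE metric together with a toy evaluation; the rating numbers below are made up for illustration, and only the RMSE formula itself comes from the competition rules.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error, the metric used throughout the Netflix Prize."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Toy stand-ins for the real data: held-out probe ratings (1-5 stars) and a
# model's predictions for the same (user, movie) pairs.
probe_truth = [4, 3, 5, 2, 4]
predictions = [3.8, 3.2, 4.5, 2.5, 4.1]

print(round(rmse(predictions, probe_truth), 4))
```

A team could call this many times a day on the probe set, since it holds the ground truth locally; the quiz and test sets allow no such freedom.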
It is determined by data unknown to us and known only to Netflix. We could submit our predictions up to once a day against the quiz set. Why the limitation? Because if we could submit too many times, say every 50 seconds, we might gather enough hints about the underlying data to reverse engineer the ground truth, and then the exercise wouldn't be useful for Netflix. And once somebody hit a 10% improvement over Cinematch on the quiz set, the competition would enter the last call phase, and the final result would be based on performance, in terms of RMSE, on the test set. The leading teams were shown on the website (I think the website is still alive; you can just google Netflix Prize), ranked by their RMSE on the quiz set. So basically, you used the training set to train, the probe set to test yourself, and the quiz set to test on a regular basis if you wanted, but not too frequently, and then the eventual result was decided on the test set. And you may wonder about this 10% improvement: why not, say, 11%? Why not 9%? I don't know how Netflix decided; 10 is a good round number. But it so happens that for this metric and this data, both the training and test sets, getting to 11% would have been extremely difficult, and getting to 9% would have been a little too easy, so 10% was a good pick in hindsight. What happened in the competition is a very interesting story, and we'll just briefly present some highlights. First of all, very soon, within a week, somebody was already able to beat Cinematch. But you needed to beat it by 10%: starting from Cinematch's quiz-set RMSE of 0.9514 (results were reported to four digits of accuracy; remember, the test set was reserved for the final phase), you had to push it down by 10%, all the way to 0.8563. So beating Cinematch by a tiny bit, in terms of RMSE on the quiz set, was achievable within one week; pushing it down by a full 10% was the real challenge. Now you may wonder: these are very small numbers, less than 1, and 10% of that is very small.
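The qualifying threshold itself is just arithmetic on Cinematch's quiz-set RMSE:

```python
# Cinematch's RMSE on the quiz set, and the 10%-improvement target.
cinematch_rmse = 0.9514
target_rmse = cinematch_rmse * (1 - 0.10)
print(round(target_rmse, 4))   # about 0.8563, the qualifying RMSE
```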
Well, don't forget that this whole thing is on a scale of 1 to 5 stars, so a deviation of 1 is actually pretty big. For example, a movie recommended at 4.5 stars versus 3.5 stars: that clearly shows a pretty big difference. All right, and then in almost one year's time, by September 2007, a team called BellKor made an 8.26% improvement over Cinematch. Very good; it seemed that 10% would be eminently achievable, but that turned out not to be the case. First place changed hands a few times until, in the last hour before the first year of the competition ended, this same team retook the leading position with an 8.43% improvement. And that earned them a $50,000 USD progress prize for leading the chart, even though they hadn't reached 10%. Then, from 2007 to 2008, teams started to merge. For a mom-and-pop shop it became very difficult to be anywhere on the chart, because the smart teams were merging so that they could share their secrets. In particular, BellKor and another team, Big Chaos, merged, and this combined team won the 2008 progress prize in October of that year. At that point the RMSE got down to 0.8616. Then they merged again with a team called Pragmatic Theory, so these three teams became one team, called BellKor's Pragmatic Chaos, grabbing one word from each of the three component teams. And in June 2009, almost three years into the competition, this merged team became the first to achieve more than 10% improvement: in fact, it got a 10.06% improvement in RMSE over the benchmark. At that point, the competition entered the final, last call phase: all the teams got 30 days to make their final submissions. At the end of that time, two teams had beaten Cinematch by more than 10%. One was BellKor's Pragmatic Chaos, which got an RMSE of 0.8554.
The other, called The Ensemble, got 0.8553, actually slightly better. But remember, that's only the quiz set; the final prize would be determined by performance on the test set, which no one had seen or used yet. And here is the grand finale: both of them received 0.8567 as the test-set RMSE. The same performance up to four digits, and the fourth digit is the last one that counts. So who's going to win? It had to be decided by who submitted their final solution earlier. It turns out BellKor's Pragmatic Chaos submitted their algorithm 20 minutes before The Ensemble. Because they both achieved the same RMSE on the test set, it came down to submission time, and BellKor's Pragmatic Chaos won the competition by a differential of 20 minutes in this wonderful, almost three-year scientific quest. That's the grand finale. So you must be wondering: what did BellKor's Pragmatic Chaos use to achieve this milestone and win the prize? Well, you can actually go online and see three documents, three PDFs, where you can read all the algorithmic details. As part of the competition deal, you had to make your method public and allow Netflix to use it or its variants. So it is all public information at this point, except that if you go through it, you'll see it is a bag of many tricks, with thousands of parameters fine-tuned on the training data. Getting to the gist of the key ideas actually takes much less time, especially if you're only interested in getting to about an 8 to 9% improvement. That's what we're going to target in the remaining part of this lecture; getting to just above 10% would take many more tricks. So I will focus on how to get you to 8 to 9%, and the key ideas behind it. Here is a pictorial illustration of the problem: you've got many users and many movies, and you can put them into a matrix, or table.
You can also think of this as a bipartite graph, with the user nodes on the left and the movie nodes on the right; it's bipartite because links only pair users with movies. Then you can give weights, 1 to 5, to these links. So there are many possible angles. First we're going to view this as a table. Now, this table is actually a little misleading because it is too small: there are only 48 possible entries, whereas there are around 10 billion possible entries in the actual Netflix Prize. It's also very dense: about half of the cells here are filled with actual rating data, whereas the actual Netflix Prize data are a lot sparser. So the size and the sparsity are not accurate, but it suffices to conceptually illustrate the problem. You're given these data points, which are integers from 1 to 5, and a blank means that this user has not rated this movie. Now, she may or may not have watched the movie, but there's no rating recorded. And you see four question marks: these are entries where Netflix knows the ground truth but you and I, as teams entering the prize competition, do not. (Or maybe we do, but we have to treat them as the probe set.) In any case, we train on the data shown here, and when we test, or when Netflix tests, the hidden ground truth is used. Given this particular table, how do I predict these four entries? Maybe I'll just look at a movie's column. If this movie has got a lot of high ratings, say 5, 5, or this one, 5, 2, or this one, 4, 4, 5, I'd say it's a good movie and make the prediction 4.5. OK, this one got 2, 3, 4; that's a tough call. Maybe I'll look at a row instead. Say I'm trying to predict how user 5 would like movie 3. User 5 seems to give 3, 3 all the time, so maybe I should also give a 3 here. What about this last entry? Movie 7 got ratings 1, 3, and 4, which is kind of diverse, and this user gave a 2 and a 4.
So it's really hard to say what this one might be; it could be 1, 2, 3, 4, or 5. I don't know; maybe I'll just give the average of all the entries in the table, say 3.5. This sounds a little like filling out a Sudoku table. Well, in what we just talked about, we actually touched upon a few intuitions. One is that certain movies are, in general, better made than other movies, and certain users are very strict graders or very loose graders; we have to take these things into account. We can also, when there's no other help, just give up and take the average of the whole table as the prediction: that's a very lazy predictor. So some intuitive understanding is emerging, but none of this is good enough to get you to 8%, not even close; you have to think a little harder than that. In what we just talked about, we touched upon what's called content-based filtering: just look at an individual column or row (a column corresponds to a movie, say) and do not worry about what the other columns show. In a crude analogy, this is like Google computing a relevance score for a web page by looking at that page alone. If you want to compute the score by looking at the connectivity pattern among the web pages, or in our case at what the other movies show and tell about this movie, then you need to go from content-based to collaborative filtering. You say, why do I care about what the other movies are like? Because maybe certain pairs of movies are liked by the same users, and as far as those users are concerned, these two movies are very similar; therefore I would be able to say these two are neighbor movies. So if some user likes this movie, maybe she will like its neighbor too. We are basically defining a neighborhood relationship between movies. Now, we could also construct a neighborhood relationship among the users.
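These intuitions about well-made movies, strict or loose graders, and the lazy global average can be combined into a simple baseline predictor: global mean plus a per-user bias plus a per-movie bias. Here is a minimal sketch on a made-up toy table; it is an illustration of the intuition, not the actual prize-winning model.

```python
import numpy as np

# Toy ratings table: rows = users, columns = movies, NaN = a blank cell.
# The numbers are made up for illustration.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 3.0, 1.0],
    [np.nan, 5.0, 4.0, 2.0],
])

mu = np.nanmean(R)                       # the "lazy predictor": global average
user_bias = np.nanmean(R, axis=1) - mu   # strict vs. loose graders
movie_bias = np.nanmean(R, axis=0) - mu  # well-made vs. poorly-made movies

def predict(u, i):
    """Baseline prediction for user u on movie i, clipped to the 1-5 star scale."""
    return float(np.clip(mu + user_bias[u] + movie_bias[i], 1.0, 5.0))

print(predict(0, 2))   # fill in the blank for user 0, movie 2
```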
If certain users' existing ratings show that their tastes are very similar, then maybe if user 3 likes a movie, user 4, even though she hasn't watched that movie, will like it too. This is actually what we intuitively do a lot of the time. There's another idea, a little more involved, that we'll postpone until the advanced material part of the video: maybe there is some kind of feature set behind these movies. For example, is this a romantic movie, an action movie, a thriller, a sci-fi movie? Is it long, is it short, and so on. This feature set is a much smaller set than the set of possible movies: there might be 20,000 movies but only, say, about 20 different genres. So we try to extract the underlying reasons why certain movies are liked or disliked by certain people. These are the two key distinct approaches within collaborative filtering: the neighborhood method, and what's called the latent factor method, where the latent factors are these feature-set entries. We will focus on the neighborhood method and reserve the latent factor method for the advanced material. As we just mentioned, large and sparse data is the big challenge. One solution is to look at each row and each column in isolation: that's content-based filtering. The other is to look at the entire table: that's collaborative filtering, collaborative among the users and movies. Either way, it's a centralized computation, so we don't need distributed computation. Within collaborative filtering there are two main flavors, and it turns out the two are also interestingly connected, but we will skip that; it's in the textbook and the advanced material. One is the neighborhood method; the other is the latent factor method. On the way to the neighborhood method, we'll take a couple of detours. One is into a particular optimization problem called least squares, one of the most frequently used
optimization problems in many diverse fields: from the control of airplanes to your remote control, from how light bulbs are controlled on the factory floor to supply chains; it is used in finance and economic modeling, and in many communication systems. It's a hugely useful optimization problem, and it is a special case of something called convex optimization. So we will take about a five-minute detour to define convex optimization, because it will be a recurring theme; we'll see it a few times later in the course. We will, however, not really have time to take a detour into two very important ideas used to win, say, the first-year progress prize. One is implicit feedback: the fact that somebody has or has not watched a certain movie is also useful information, not just the explicit rating. The other is that the timestamp t_ui is also important. First, movies come in and out of fashion over that period of seven years. And second, different users may be in different moods on different days. There's also a batch processing effect: if a user decides to rate the last ten movies she watched on Netflix on the same day, those ten ratings tend to have a different statistical distribution than if they had been entered right after watching each movie, on different days. Incorporating both implicit feedback and temporal dynamics would be important for getting to, say, an 8.5% improvement over Cinematch, but we will not have time to go through these. So our focus right now will be on the methodology of collaborative filtering via the neighborhood method.
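To preview where we are headed, here is a minimal sketch of the neighborhood method in its item-item form: predict a missing rating by a similarity-weighted average of the same user's ratings on neighbor movies. The toy table and the plain cosine similarity are illustrative assumptions; the actual prize systems used more refined similarity measures and fit the interpolation weights by least squares.

```python
import numpy as np

# Toy ratings table: rows = users, columns = movies, NaN = not rated.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 4.0, 1.0],
    [1.0, 2.0, 1.0, 5.0],
    [2.0, np.nan, 1.0, 4.0],
])

def movie_similarity(i, j):
    """Cosine similarity between movies i and j, over users who rated both."""
    both = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])
    a, b = R[both, i], R[both, j]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def predict(u, i):
    """Similarity-weighted average of user u's known ratings on other movies."""
    num = den = 0.0
    for j in range(R.shape[1]):
        if j != i and not np.isnan(R[u, j]):
            s = movie_similarity(i, j)
            num += s * R[u, j]
            den += abs(s)
    # Fall back to the lazy predictor (global average) if no neighbors exist.
    return num / den if den > 0 else float(np.nanmean(R))

print(round(predict(0, 2), 2))   # fill in user 0's missing rating for movie 2
```

Note that because user 0's tastes track users 1's much more closely than user 2's, the neighborhood prediction lands near the high ratings she gave to the similar movies, which is exactly the collaborative effect the content-based view misses.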