0:00

Finally, let's take a look at a complete numerical example.

In this case again, just like the Google PageRank lecture, we are constrained to an example that fits on a slide, and therefore the example does not represent the challenges of scale or sparsity. If you recall, scale and sparsity were two of the big challenges of the Netflix Prize. So, this only really serves the purpose of illustrating the five steps that we just outlined for the neighborhood method.

Here is our starting point. We are given, as constants, the rating information. Each entry of this big matrix R is rui. And you see that this is ten users by five movies, labeled users one up to ten and movies A to E. And, it is 80 percent full, okay? So, 40 out of the 50 possible entries are already actually known.

And this is, again, very atypical. Generally speaking, only about one percent, instead of 80 percent, of the entries are filled in typical real data. But, if we only had one percent here, then we wouldn't even have a single entry of known data for the training. Now, ten of these entries are represented by bars; this means that the rating simply does not exist for that user-movie pair.

And now, we're going to say let's hide ten out of the forty known data points as test data, and use the remaining 30 as training data. So, 30 samples for training and ten to test our prediction. So, somewhat arbitrarily, just pick these circled ten entries as test data, okay? We know these ten values, but we just pretend we don't; we first train with the remaining 30 points and then see how we do on these ten points.

First of all, the very lazy predictor: just r bar, in this case 3.83; in the Netflix case, about 3.6.

All right. Step number one, baseline predictor.

Let's construct the baseline predictor, which is r hat ui equals r bar, 3.83 in this example, plus a bias for each user and a bias for each movie. And this holds for all the ui pairs. Okay?

So, how many parameters have we got? Ten of these and five of those: fifteen parameters in this case for the baseline predictor. And we're going to pick the right baseline predictor parameters, bu star and bi star, by solving a least squares problem. The least squares problem is, as you'll recall: minimize the two-norm squared of Ab minus c over b, and the solution satisfies A transpose A times b star equals A transpose c.
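Written out in the notation just described, the baseline predictor and the least squares problem it induces are:

```latex
\hat{r}_{ui} = \bar{r} + b_u + b_i,
\qquad
\min_{b}\; \lVert A b - c \rVert_2^2,
\qquad
A^{\mathsf{T}} A \, b^{\star} = A^{\mathsf{T}} c
```

Here b stacks the ten user biases and five movie biases into one fifteen-dimensional vector.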

So, what are A, b, and c? The b is easy: that's our fifteen optimization variables, the parameters under training, b1 up to b10 and then bA up to bE. What about the matrix A and the vector c? We simply construct those just like we did in the small two-movie, one-user example before. In particular, this A matrix is going to be a 30 by fifteen matrix.

30, because we have 30 training points, starting from r1A down to r10E, okay? So basically, all the points here that are not circled in red are our training data, okay? And we have 30 equations in fifteen variables corresponding to b1 up to bE. This matrix A is 0-1 binary: an entry is zero unless it corresponds to that particular user or that particular movie. Say, in this row, the first equation out of 30, we're talking about the training data point r1A, so we have to write a one in the position corresponding to b1, the first position, and in the eleventh position, which corresponds to bA. And the c vector is simply the known training data, r1A down to r10E, with r bar subtracted.

So, now that we have defined A, b, and c, we can use this equation as the solution to the least squares problem, and the resulting solutions, the vectors bu star and bi star, are just partially shown here. This is ten entries; this is five entries. And then, you can stick these back into the predictor here, with the bu, bi optimized by training, and you get the baseline predictor, okay?
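Step one can be sketched in code. This is a minimal illustration, not the lecture's actual ten-by-five matrix: the three-user, two-movie ratings below are hypothetical, with NaN marking a missing entry.

```python
import numpy as np

# Hypothetical 3-user x 2-movie ratings; NaN marks a missing rating.
R = np.array([[5.0, 4.0],
              [3.0, np.nan],
              [1.0, 2.0]])
users, movies = R.shape
known = ~np.isnan(R)                      # mask of known (training) entries
r_bar = np.nanmean(R)                     # the "lazy" average predictor

# One row of A per known rating: a 1 in the user's column and a 1 in the
# movie's column, so that (A @ b) gives b_u + b_i for each known (u, i) pair.
rows, c = [], []
for u in range(users):
    for i in range(movies):
        if known[u, i]:
            row = np.zeros(users + movies)
            row[u] = 1.0                  # selects b_u
            row[users + i] = 1.0          # selects b_i
            rows.append(row)
            c.append(R[u, i] - r_bar)     # shift by the average
A = np.array(rows)
c = np.array(c)

# Solve min ||A b - c||^2; lstsq also handles A being rank deficient.
b, *_ = np.linalg.lstsq(A, c, rcond=None)
b_user, b_movie = b[:users], b[users:]

# Baseline prediction for every (u, i), including the missing entry.
R_hat = r_bar + b_user[:, None] + b_movie[None, :]
```

Any least squares solution returned this way satisfies the normal equations A transpose A b star equals A transpose c from the lecture.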

That's step number one. Now, as a sanity check (in the textbook, you can see all the entries), you can verify that, indeed, b1 is quite positive and bA is also quite positive. That means user one tends to give high scores, and movie A is well liked. And just to make sure that's the case: indeed, user one tends to give high scores, and movie A is well liked, okay? A lot of fours and fives. So, the bias computation is indeed making some sense.

Step number two is prediction with the baseline. So, the baseline predictor, using the optimized bu, bi and the average r bar, will give you this prediction, okay? We call this r hat. If you look at this r hat, remember, some of these entries correspond to training data and some to test data. So these, for example, are the test data, okay? And let's compare with the actual data, rui, for, say, these five out of the ten. What do we have here?

We have five, three, two, two, five. Those are the actual rui's that we were hiding from ourselves. Now, let's see: are the predictions close to five, three, two, two, five? Yeah, reasonable, I guess. 4.62: if you round it up, it will be five.

But, if you compute the RMSE, then there is a difference of 0.38. 3.49 versus three: this is getting a little off. Half a star off is quite off, actually. And depending on how you treat this 3.45, you may round it to four and actually incur a significant error. And even without rounding, this difference of 0.49 is also quite big.

This difference is huge: it's between two and 2.78. This difference is also big, and this difference is quite small. So, we pretty much get this one right, this one right, and this one is sort of reasonable, but these two predictions are kind of not that good, okay? One way you can deal with this is to use regularization to avoid overfitting, as discussed in the advanced material. But in general, the baseline predictor doesn't leverage any collaborative filtering techniques or global structure, so we're not doing very well. But, we're not done yet.

So, at the end of step two, we shift R by r hat and get the shifted rating matrix. I'm just writing down two of its five columns, corresponding to movies B and C respectively, because we're going to use these to illustrate step number three, which is to construct the similarity matrix D. For each entry dij, let's look at a particular example. Let's say we look at movies B and C, okay? This is movie B; this is movie C.

Let's look at, out of the ten users, how many have already rated both movies B and C. Well, user one rated both. User two rated both. But then, for the other users, either they have not rated one of the two movies, or they have rated it but we can't use that value for training because it is reserved for testing. So, we have to go all the way down to user nine before we again have data of a user rating both movies B and C. So, there are three users, one, two, and nine, that rated both movies B and C and can therefore be used to compute the similarity between movies B and C: the cosine similarity coefficient.

10:09

In the shifted matrix, these three users' entries are all minus for movie B: -0.91, -0.11, and -0.87, against 0.9, -0.31, and 0.33 for movie C. So, the numerator is -0.91 times 0.9, plus -0.11 times -0.31, plus -0.87 times 0.33, and I have to normalize: divide by the square root of 0.91 squared plus 0.11 squared plus 0.87 squared, times the square root of 0.9 squared plus 0.31 squared plus 0.33 squared. This is the formula for the cosine similarity coefficient dBC, which turns out to be -0.84. Notice that these shifted entries can be negative because, after subtracting the baseline predictor, a rating is sometimes positive and sometimes negative relative to the baseline.

And these similarity coefficients are sometimes negative because the two movies are dissimilar from each other. And these two movies are indeed quite dissimilar, okay? This -0.84 is quite close to minus one. So, we can say movies B and C are quite dissimilar.
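The dBC computation above can be sketched as follows. The three shifted ratings per movie are read off the example as approximately -0.91, -0.11, -0.87 for movie B and 0.9, -0.31, 0.33 for movie C.

```python
import numpy as np

# Cosine similarity between two movies, computed only over the users who
# rated both movies (users 1, 2, and 9 in the example).
def cosine_similarity(x, y):
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

r_B = np.array([-0.91, -0.11, -0.87])  # shifted ratings of movie B
r_C = np.array([0.90, -0.31, 0.33])    # shifted ratings of movie C

d_BC = cosine_similarity(r_B, r_C)     # approximately -0.84
```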

Now, we can do this for all the pairs of movies to come up with this entire matrix D here. The diagonal doesn't matter, so just cross it out. The off-diagonal entries are symmetric because of the way we define it: dBC equals dCB, okay? And, indeed, this entry is dBC, or dCB, the same thing, -0.84 as we just calculated, and you simply follow the same structure for the rest of the entries. That's the end of step three.

We have constructed the similarity matrix. Now, step four is to define a neighborhood. Let's say the neighborhood for any movie is just two movies, okay? So, the top two movies with the largest absolute values of dij relative to movie j will be included in the neighborhood of movie j.
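Step four can be sketched as below. Only dDA = -0.97 and dDB = -0.73 are quoted in the example; the other two similarities here are made-up placeholders for illustration.

```python
# Similarities of movie D to the other movies; C and E values are illustrative.
d_D = {"A": -0.97, "B": -0.73, "C": 0.12, "E": -0.30}

def neighborhood(similarities, L=2):
    # Rank candidate movies by absolute similarity, descending, and keep top L.
    ranked = sorted(similarities, key=lambda m: abs(similarities[m]), reverse=True)
    return ranked[:L]
```

With these numbers, the neighborhood of movie D comes out as movies A and B, matching the example.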

So, let's take an example. For example, let's say, r hat 3D: we want to estimate user three's rating of movie D. Remember, we did not actually have that information; we can see here that 3D is one of the entries we were hiding from ourselves, so that now we can predict it and see if we can get to two, okay?

So, we want to find out what is r hat 3D based on the neighborhood method.

Well, first of all, there is the baseline predictor component of r hat 3D, which happens to be 2.78, and then there is the neighborhood prediction component. Now, what is this? Since we decided the neighborhood consists

of two movies: for movie D, what are the top two most similar or most dissimilar movies? Looking at movie D's column, we can see that movies A and B are the most similar or dissimilar in absolute dij terms. Actually, both are very dissimilar: A and B are both very dissimilar from D, okay? So, we have used, so far, the information about different people's tastes on these movies, A down to E, to compare movie D against movies A and B. And then, we take that information, okay? The fact that A and D are very dissimilar by people's record, and that B and D are also quite different by people's record, is the reason that A and B should be incorporated as we predict anyone's liking or disliking of movie D.

So, following our formula, we have -0.97 times a weighted term, where the term turns out to be the rating of user three on movie A, plus -0.73 times the rating of user three on movie B. The first rating happens to be 0.9 in the shifted domain, and the second happens to be -0.19 in the shifted domain. So, in particular, we see that the same user three, whose taste on movie D we're trying to predict, really liked movie A. But, since A and D are very different kinds of movies, they're so dissimilar, that means she will very likely not like movie D. This 0.9 is a big positive number, and this -0.97 is a big negative number; the overall impact is to make this r hat 3D a smaller number. And then, we just normalize by the absolute values of the dij's, which are 0.97 and 0.73. The result, therefore, happens to be 2.35.

In other words, the net effect of the neighborhood's help is to reduce the prediction from 2.78 to 2.35. The actual answer is two.
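Plugging in the numbers quoted above, the neighborhood adjustment is a one-liner:

```python
# Neighborhood prediction for user 3 on movie D, using the example's numbers:
# baseline 2.78, similarities d_DA = -0.97 and d_DB = -0.73, and shifted
# ratings 0.9 (user 3 on movie A) and -0.19 (user 3 on movie B).
baseline_3D = 2.78
neighbors = [(-0.97, 0.90),    # (d_Dj, shifted rating of user 3 on movie j)
             (-0.73, -0.19)]

adjustment = (sum(d * r for d, r in neighbors)
              / sum(abs(d) for d, _ in neighbors))
r_hat_3D = baseline_3D + adjustment   # about 2.35; the true rating is 2
```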

And indeed, the baseline predictor overshoots two by too much, coming in very close to three stars. Whereas, with the help of the neighborhood, knowing that A and B are very dissimilar from movie D, and that user three kind of likes A and doesn't hate B too much, we can tell that she is really not going to like D a whole lot. With that information, we push 2.78 down to 2.35, which is, indeed, much closer to the actual ground truth of two. If you compute the squared error between two and 2.78, you can see it's much bigger than that between two and 2.35.
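The comparison can be made concrete with a small root mean square error helper. This is a sketch; for a single held-out entry, the RMSE is just the absolute difference.

```python
import math

def rmse(actual, predicted):
    # Root mean square error over paired lists of ratings.
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

err_baseline = rmse([2.0], [2.78])      # 0.78
err_neighborhood = rmse([2.0], [2.35])  # 0.35
```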

17:11

So, now we can compute this for all of the ten entries that we want to predict. And it turns out that we get a much improved performance over the baseline predictor r hat, which gave errors of 0.51 and 0.58, respectively, for the 30 known training data and the ten test data. This improves under the neighborhood predictor r hat N, where the overall root mean square error for the training data is as small as 0.32, and for the ten test data is a smaller 0.54. Again, it's these ten test predictions that really matter in the Netflix Prize. You may think these are very small numbers but, again, on a scale of up to five stars, half a star is a big deal.

And even an improvement in the next decimal digit is often going to make a big difference in what we recommend, and that's what the ten percent goal of the Netflix challenge is getting at. Here, this represents a seven percent enhancement of the neighborhood method over the baseline method.

And it turns out that if you use the neighborhood method, plus temporal dynamics, plus the regularization idea for least squares, plus a couple of other tricks, you can easily get to almost nine percent. Meaning, you could actually win the entire first year of the Progress Prize, with more than enough enhancement over Cinematch already.

But, getting the other one percent, or a little over one percent, to reach the ten percent target of enhancement over Cinematch took two more years of effort by many smart people around the world. Thousands of teams, tens of thousands of different submissions, and two years' time to gain that last one and a half percent.

That's why a lot of people believe that had Netflix set the target at eleven or twelve percent, it might not even have been achievable using the known data; we will never know the exact answer. But, what you have seen so far, with a few tricks, gets you quite far in that prize, if you were participating back in 2006.

So, that's the end of this longest lecture in the entire course.

We looked at the Netflix Prize, a very interesting scientific story, as a special case of recommendation systems. In particular, we saw how collaborative filtering can leverage similarities among users, or, as in our example, among movies, to enhance the accuracy of prediction, say, as measured by root mean square error. And if you do minimize root mean square error, which you don't have to, as we saw, this often leads to a least squares problem, which is a special case of the convex optimization that we'll see over and over again in the course. This was just a very brief introduction to the terminology of convexity today.

So, this lecture is long in part because we're not only talking about collaborative filtering as a learning idea, but also starting to introduce convex, nonlinear optimization. It's also long in part because the Netflix Prize is a very interesting story, showing the timeline and some of the details of that competition. Now, this special case of recommendation systems will lead us to the next lecture, Lecture five, on the recommendation system at Amazon, in a different context and with a different question over there.

So, see you.