Finally, let's take a look at a complete numerical example.

In this case again, just like the Google PageRank lecture, we are constrained to a

slide on a page, and therefore the examples do not represent the challenge of

scale or sparsity. If you recall, scale and sparsity were two of the big challenges for the Netflix Prize. So, this only really serves the purpose of illustrating the five steps that we just outlined for the neighborhood method.

Here is our starting point. We are given, as known constants, the rating information. Each entry of this big matrix R is rui.

And you see that this is ten users by five movies, labeled users one up to ten, and movies A to E. And it is 80 percent full, okay? So, 40 out of the 50 possible entries are already actually known.

And this is, again, very atypical. Generally speaking, only about one percent, instead of 80 percent, of the entries are filled in typical real data. But if we only had one percent here, we wouldn't even have a single known entry for training. Now, ten of these entries are represented by bars; this means that the rating simply does not exist for that user-movie pair.

And now we're going to say, let's hide ten out of the forty known entries as test data, and use the remaining 30 as training data. So, 30 samples for training and ten to test our prediction.

So, somewhat arbitrarily, we just pick these circled ten entries as test data, okay? So, we know them, but we just pretend we don't at first, train with the remaining 30 points, and see how we do on these ten points.
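This hide-and-train split can be sketched as follows. The rating matrix here is randomly generated stand-in data, not the actual values on the slide:

```python
import numpy as np

# Stand-in 10x5 rating matrix with ten missing entries marked as NaN,
# mimicking the 80-percent-full matrix from the example.
rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(10, 5)).astype(float)
missing = rng.choice(50, size=10, replace=False)
R.flat[missing] = np.nan

known = np.argwhere(~np.isnan(R))           # the 40 known (user, movie) pairs
test_idx = rng.choice(len(known), size=10, replace=False)
test = known[test_idx]                      # ten "circled" entries held out
train = np.delete(known, test_idx, axis=0)  # remaining 30 used for training
```

The indices in `test` play the role of the circled entries: we know their ratings but pretend we don't until evaluation time.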

First of all, the very lazy predictor, the average r bar: in this case, 3.83; in the Netflix case, 3.6.

All right. Step number one, baseline predictor.

Let's construct the baseline predictor, which is r hat ui equals r bar, 3.83 in this example, plus a bias for each user and a bias for each movie. And this holds for all the ui pairs. Okay?

So, we've got how many? Ten of these parameters and five of these parameters: we've got fifteen parameters in this case for the baseline predictor. And we're going to pick the right baseline predictor parameters, bu star and bi star, by solving the least squares problem. The least squares problem is, as you'll recall: minimize Ab minus c in two norm squared, and the solution b star satisfies A transpose A times b star equals A transpose c.

So, what are A, b, and c? The b is easy: that's our fifteen optimization variables, the parameters under training, b1 up to b10, and then bA up to bE. What about the matrix A and the vector c?

We'll simply construct those, like what we did in the small two-movie, one-user example before. In particular, this A matrix is going to be a 30 by fifteen matrix. 30, because we have 30 training points, starting from r1A down to r10E, okay?

So basically, all the points here that are not red circled are our training data,

okay? So we have 30 equations and fifteen variables, corresponding to b1 up to bE. And this matrix A is 0-1 binary: it is zero everywhere except for the entries corresponding to that particular user and that particular movie. Say, in this row, the first equation out of 30, we're talking about the r1A training data, so we have to write a one where it corresponds to b1, the first position, and in the eleventh position, which corresponds to bA. And the c vector is simply the known training data, r1A down to r10E, subtracting the r bar.
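The construction of A and c and the least squares solve can be sketched like this. The training triples below are made-up stand-ins for the slide's 30 known entries, not the actual data:

```python
import numpy as np

# Hypothetical (user, movie, rating) training triples standing in for the
# slide's 30 known entries; users are 0..9, movies are 0..4 (A..E).
train = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (2, 2, 3),
         (3, 1, 2), (4, 4, 5), (8, 1, 3), (8, 2, 3)]
n_users, n_movies = 10, 5
r_bar = np.mean([r for _, _, r in train])

A = np.zeros((len(train), n_users + n_movies))
c = np.zeros(len(train))
for row, (u, i, r) in enumerate(train):
    A[row, u] = 1              # one in the column for bu
    A[row, n_users + i] = 1    # one in the column for bi
    c[row] = r - r_bar         # known rating minus the average

# Least squares: minimize ||Ab - c||^2, equivalently solve A^T A b* = A^T c
b_star, *_ = np.linalg.lstsq(A, c, rcond=None)
b_u, b_i = b_star[:n_users], b_star[n_users:]
```

Each row of A has exactly two ones, picking out one user bias and one movie bias, just as described above.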

So, now that we have defined A, b, and c, we can use this equation as the solution to the least squares problem, and the resulting solution, the bu vector star and bi vector star, are just partially shown here. This is ten entries; this is five entries. And then you can stick these back into the predictor here, with the optimized bu, bi from training, and you've got the baseline predictor, okay?

That's step number one. Now, as a sanity check, we can verify; in the textbook, you see all the entries. But here, you can see that indeed b1 is quite positive, and bA is also quite positive. That means user one tends to give high scores, and movie A is well-liked. And looking back at the data, that is indeed the case: user one tends to give high scores, and movie A is well-liked, okay? A lot of fours and fives.

So, the bias computation is indeed making some sense.

Step number two is prediction with the baseline. So, the baseline predictor, using the optimized bu, bi and the average r bar, will give you this prediction, okay? We call this r hat.
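In matrix form, this step is just a broadcast sum over all user-movie pairs. The r bar is the 3.83 from the example, but the bias vectors below are made-up stand-ins, not the actual optimized values:

```python
import numpy as np

r_bar = 3.83
# Hypothetical bias vectors standing in for the optimized bu*, bi*.
b_u = np.array([0.8, -0.3, 0.1, -0.5, 0.2, 0.0, -0.2, 0.4, -0.1, 0.3])  # users 1..10
b_i = np.array([0.7, -0.2, -0.6, 0.1, 0.0])                             # movies A..E

# r_hat_ui = r_bar + b_u + b_i, computed for every (u, i) pair at once
R_hat = r_bar + b_u[:, None] + b_i[None, :]
```

Every entry of `R_hat`, including the ten held-out test positions, now has a baseline prediction.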

If you look at this r hat, remember, some of these entries are training data and some are test data. So these, for example, are the test data, okay? And let's compare with the actual data, rui, for, say, these five out of the ten. What do we have here? We have five, three, two, two, five. Those are the actual rui's that we were hiding from ourselves. Now let's see what the baseline predictor gives us. Are the predictions close to five, three, two, two, five? Yeah. Reasonable, I guess.

4.62: if you round up, it will be five. But if you compute RMSE, there is a difference of 0.38. 3.49 versus three: this is getting a little off. Half a star off is quite off, actually. And depending on how you treat this 3.49, you may round it to four, and then you actually incur a significant error. Even without rounding, this difference of 0.49 is also quite big. This difference is huge: it's between two and 2.78. This difference is also big, and this difference is quite small. So, we pretty much get this quite right, this quite right, and this sort of reasonable. But these two predictions are kind of not that good, okay?
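As a quick sketch, the RMSE over the three prediction/actual pairs that are stated here (the other two test values are not given in this passage) works out as follows:

```python
import numpy as np

# The three (actual, predicted) pairs read off above:
# 5 vs 4.62, 3 vs 3.49, 2 vs 2.78. The full test set has ten entries.
actual = np.array([5.0, 3.0, 2.0])
predicted = np.array([4.62, 3.49, 2.78])

# Root mean squared error over these pairs
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(round(rmse, 3))  # about 0.575 on just these three pairs
```

On the full ten-entry test set the RMSE would of course include the other differences as well.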

One way you can deal with this is to use regularization to avoid overfitting, which we cover in the advanced material. But in general, the baseline predictor doesn't leverage any collaborative filtering techniques or the global structure, so we're not doing very well. But we're not done yet.

So, at the end of step two, we shift R by r hat, and we get the R tilde matrix. I'm just writing down two columns of the R tilde matrix out of five, corresponding to movies B and C respectively, because we're going to use these to illustrate our step number three, which is to construct the similarity matrix D, with entries dij.

Let's look at a particular example. Say we look at movies B and C, okay? This is movie B; this is movie C. Let's look at, out of the ten users, how many users already rated both movies B and C. Well, user one rated both. User two rated both. But then you've got other users who either have not rated one of the movies, or have rated it but we can't use those entries for training, because those are for testing. So we have to go all the way down to user nine, where we actually have data of user nine rating both movies B and C. So, there are three users, one, two, and nine, that rated both movies B and C, and therefore can be used to compute the similarity between movies B and C, the cosine similarity coefficient.
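This cosine similarity computation can be sketched as follows, using made-up shifted ratings (r tilde values, rating minus baseline) for the three users who rated both movies:

```python
import numpy as np

# Hypothetical shifted ratings r_tilde = r - r_hat for movies B and C,
# restricted to users 1, 2, and 9, the ones who rated both movies.
r_tilde_B = np.array([0.6, -0.4, 0.3])
r_tilde_C = np.array([0.5, -0.2, 0.1])

# Cosine similarity d_BC between the two movie vectors
d_BC = r_tilde_B @ r_tilde_C / (
    np.linalg.norm(r_tilde_B) * np.linalg.norm(r_tilde_C))
```

Only the commonly rated entries enter the computation; the missing and held-out entries for the other seven users are simply skipped.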