
So, we're going to go into the neighborhood method now.

But, before we can do that, we first need to go through what's called a baseline

predictor. And before we can go through a baseline

predictor, we first want to talk about the general notion of a parameterized model.

Now, we're going to see the parameterized model in a very different context in the

next chapter with Bayesian analysis. In general, we want to build a model that

is some mathematical representation of the problem we want to solve.

Sometimes, this model carries with it a bunch of parameters; we'll see a lot of parameters today. And these parameterized models will go into the so-called training phase first.

Where you have some observations, but you also have the ground truth.

And you take the observations, make predictions, and compare with the ground truth. The difference is then something you want to minimize by tuning the model parameters. After this step of training, you have

optimized model parameters. Then, you go into the real action, which is the prediction step, where you take these parameters in the model, optimized by the training phase with the known ground truth. And then you take new observations to make predictions.

Sometimes, other people have the ground truth, like Netflix had the ground truth

for the Netflix Prize, but you don't. So, training followed by prediction.

And the training phase trains your parameters, just like in a voice recognition system on your computer, phone, or car. You first train it by speaking certain known sentences.

These sentences are the ground truth, and then the computer will take in the

observation which is the recording of your voice, and then train the parameters

before going into the actual action of making predictions in the future for you.

All right. So, having said that, let's go into the

baseline predictor. This is step number one before we can afford to go to step number two, the neighborhood predictor.

So, the baseline predictor does not utilize the global structure of the user-movie ratings. We're going to start with the baseline of the baseline, the laziest predictor you can think of: the average predictor. Let's call that r bar.

r bar is simply the summation of all the rui's, across all (u, i) pairs where there is a rating, divided by the total number of ratings.

That's it, simply the average. It turns out that for the Netflix Prize training data, 100 million data points, r bar is 3.6. That is, over all these 100 million ratings, the average is 3.6.
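As a quick sketch in code (a minimal example with made-up ratings, not the actual Netflix data), the average predictor is just:

```python
# Toy ratings table: (user, movie) -> stars. Made-up numbers; the Netflix
# Prize training set had about 100 million such entries.
ratings = {(1, "A"): 5, (1, "B"): 3, (2, "A"): 4, (2, "B"): 2}

# The laziest predictor: r_bar, the average over all observed ratings.
r_bar = sum(ratings.values()) / len(ratings)
print(r_bar)  # 3.5 for this toy table
```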

And we can say, well, that is our prediction.

We predict that all the users and movies interactions will be 3.6 stars in the

future. And clearly, this is very lazy; we can do a lot better than this. For example, we know that certain movies

are better and certain reviewers are harsher.

So, let's say the predictor r hat is r bar plus some bias terms: a bias term bu with regard to the user, and a bias term bi with regard to the movie. So, r hat ui is the average predictor, the laziest one, plus a bias term for that user and a bias term for that movie. And now, we can look at the root mean

squared error. We can also drop the square root and just look at the mean squared error; that doesn't change our answer for the optimal parameters.

So, we look at the summation of the squared differences between rui and r hat ui. We can also delete the step of dividing by the total number of ratings, because that's just a constant; it doesn't change our answer. So, this is our objective function for the

minimization. Minimize over the set of bu's and bi's,

the summation of the square of the differences between the ground truth, rui,

which you have in the training phase, and the prediction r hat ui, where r hat ui

assumes this formula. Okay?

So, this is the function involving some squares of the bu and bi terms.

r bar is not a variable, because it is just the average, say, 3.6 for Netflix, and the bu, bi are your variables. That's an optimization problem, you've got

objective function, which is a square. Definitely not a linear program anymore, but we'll see that this is a nice nonlinear program.

And, there are no constraints, okay? The variables are the bu's and bi's, and the given constants are the known ground truths, the rui's, and this simple scalar r bar. That describes the baseline predictor training optimization.

So, let's take a look at a particular example of this.

Maybe we get rid of the root, since we are effectively dropping the square root of the error; it is a mean square error minimization. Or rather, a total square error, since we're not dividing by the number of ratings either.

Let's say we have just two movies, A and B, and just one user, user 1.

So, this is a ridiculously small example. In fact, it's so small that we have more

parameters than we have data points. We have three parameters, bA and bB for the bias terms of the two movies, and b1, which is the bias term of this single user here. We have three parameters and we have two known ratings from the ground truth, r1A and r1B.

And, what is our minimization problem? That is, looking at the square of the

difference between the prediction and the ground truth.

Since we have taken the square, it doesn't matter which way we write out the difference. Say we write out r hat minus r; that is b1 plus bA plus r bar, okay? This is our prediction.


r hat of 1A minus the actual r1A, and this thing squared, plus the other error, which is b1 plus bB plus the average r bar minus the actual r1B rating, squared. So, this is the r1B hat minus r1B, and this is the r1A hat minus the actual r1A, okay?

And we take the sum of the squares of the differences, and that is the objective function. We want to minimize this objective function, subject to no constraints, over the three optimization variables, namely our three model parameters, b1, bA, bB. So, our question is, is this an easy

optimization or not? It's clearly not linear but it is an

unconstrained quadratic minimization over the variables.

It turns out it is a very easy problem. It is the so-called least squares problem, okay? We are minimizing a so-called convex, and we'll come back to this in a minute, convex quadratic function over no constraints, and this is a very nice nonlinear optimization.

In general, we say a least squares problem is the minimization of the following vector's l2 norm. First of all, what is the l2 norm? Given a vector x, to measure its size, one standard way, the l2 norm, is the square root of the sum of the individual entries squared.

We often don't like the square root, so we take the square of the l2 norm. So, the square of the l2 norm is simply the sum of the xi squared over i.
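A tiny sketch of the l2 norm and its square, in plain Python with a made-up vector:

```python
import math

def l2_norm(x):
    # ||x||_2 = square root of the sum of the squared entries
    return math.sqrt(sum(xi * xi for xi in x))

x = [3.0, 4.0]
print(l2_norm(x))       # 5.0
print(l2_norm(x) ** 2)  # 25.0, the squared l2 norm we minimize in least squares
```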

It's a very standard way to measure the length of a vector, and we would like to minimize the square of the l2 norm, the length of the vector Ab minus c, where b is our variable, A is a known matrix, and c is a known vector. Say, if A is N by M, then b is M by one, a column vector, and c is N by one, another column vector.

another column vector. In the particular case we just saw, we can also write this problem, minimizing this expression, in this standard form. It looks a little weird, but you will sort of see it's alright.

The matrix is one, one, zero, one, zero, one.

The variable vector we have is b1, bA, bB. And the constant vector we have is r1A minus r bar, and r1B minus r bar. This is our A, this is our b vector, this

is our c vector. Just to conform to the standard expression

that we want to minimize in the least squares problem.
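Here is a minimal numerical sketch of this tiny example, assuming numpy is available; the two ground-truth ratings are made-up numbers:

```python
import numpy as np

r_bar = 3.6                      # the lazy average, as in the Netflix data
r_1A, r_1B = 5.0, 2.0            # made-up ground-truth ratings for user 1

# Standard least squares form ||A b - c||^2 with variables b = (b1, bA, bB)
A = np.array([[1.0, 1.0, 0.0],   # row for r_hat_1A = b1 + bA + r_bar
              [1.0, 0.0, 1.0]])  # row for r_hat_1B = b1 + bB + r_bar
c = np.array([r_1A - r_bar, r_1B - r_bar])

b, residual, rank, _ = np.linalg.lstsq(A, c, rcond=None)
print(A @ b)  # reproduces c exactly: more parameters than ratings, zero error
```

With three parameters and only two ratings, the training error can be driven all the way to zero, which is a first hint of the overfitting issue discussed later in the lecture.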

You can easily check that minimizing this l2 norm squared is the same as minimizing this expression, okay? For example, the first entry is just b1 + bA - r1A + r bar. And that's exactly this term, alright?

And then, you can write out the other term.

And then, just by definition of the square of the l2 norm, you see we are recovering exactly the same objective function. So, in the baseline predictor problem, we are solving a least squares problem. So, the question is, how do we solve this least squares problem in general? Okay?

Well, in general, we want to minimize Ab minus c l2 norm squared, which can also be written out as Ab minus c transposed, times Ab minus c. That is, the vector's inner product with itself.

You can write this out, if you're not comfortable with the matrix notation, in scalar format. We can write out a little two by two matrix A, say one, two, three, four, a little vector b1, b2, and a little vector c1, c2. And then, you can see that the derivation coming up is no mystery.

coming up is no mystery. But, to save time, let's use the forehead

matrix notation. It's almost similar to our scalar

operation, okay? This will be Ab transpose times Ab minus

Ab transpose c minus another Ab transpose c and plus c transpose c.

And, this can in turn be written simply as by transpose rule, b transpose A transpose

Ab minus two times b transpose A transpose c plus c transposed c.


And, you're minimizing that function over bu, bi. It turns out picking the right bu, bi, and therefore completing this baseline model, boils down to solving this kind of problem.

And this kind of problem is called least squares in this format. And the least squares solution is simply the solution to the linear equation A transpose A b equals A transpose c.

We'll go through a complete numeric example towards the end of the lecture,

but this is a quick summary of where we have been, okay?
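A small sketch of that recipe, assuming numpy; A and c here are generic made-up constants with full column rank, not from the Netflix data:

```python
import numpy as np

# min_b ||A b - c||^2: setting the gradient 2 A^T A b - 2 A^T c to zero
# gives the linear (normal) equation  A^T A b = A^T c.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
c = np.array([1.0, 2.0, 2.0])

b = np.linalg.solve(A.T @ A, A.T @ c)          # solve the linear equation
b_lib, *_ = np.linalg.lstsq(A, c, rcond=None)  # library least squares solver
print(b)  # the two approaches agree
```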

Now, you may wonder, this is kind of complicated. Why don't I just pick bu to be the average rating across all movies for this user, and bi the average rating across all users for this movie?

It turns out that's a reasonable guess, but it does not necessarily minimize our error metric, which is root mean squared error.

So, all we're doing is saying, if you believe in this model, stick it into our error term. Then you get to solve a least squares problem, which can be efficiently solved by solving a linear equation, just like Google PageRank solves a linear equation. But, over there, it's a very different set

of equations. And then, once you get the optimal bu

star, bi star, stick it back in here, you get your predictor that minimizes RMSE.


It turns out that's almost a good idea, except you need something else. We often encounter what's called overfitting. This is in the advanced material in the textbook. But basically, it says that you can add so many parameters, all these bu's and bi's, and fine-tune them so well based on the ground truth in the training data, that the model loses flexibility and predictive power for the actual prediction. So, we want to somehow minimize the impact

of the parameters. We don't want to rely too much on hindsight perfection; we want to leave some room for foresight, because it's ultimately foresight that matters in predicting or recommending.

One way to contain the weight of the parameters is to actually minimize the size, say the l2 norm, of these bu's formed as a vector and these bi's formed as a vector. That's called Regularized Least Squares. If we use the quadratic l2 norm, then it is still a least squares problem, just a bigger one.
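A sketch of the regularized least squares (ridge) version for the tiny one-user, two-movie example, assuming numpy; the regularization weight lam is a made-up value:

```python
import numpy as np

# min_b ||A b - c||^2 + lam * ||b||^2  is solved by the linear equation
#   (A^T A + lam * I) b = A^T c
A = np.array([[1.0, 1.0, 0.0],   # toy baseline model: b = (b1, bA, bB)
              [1.0, 0.0, 1.0]])
c = np.array([1.4, -1.6])        # r_1A - r_bar and r_1B - r_bar, made up
lam = 0.1                        # made-up regularization weight

n = A.shape[1]
b = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ c)
print(b)  # biases shrunk toward zero, containing reliance on hindsight
```

Note that the regularization also makes the linear system invertible here, even though the toy problem has more parameters than ratings.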

But, it does try to contain too much reliance on over-refined parameters based on hindsight. So, after the baseline predictor, what we

have is the following. We've got the actual rui, and we've got the baseline predictor r hat ui, which is r bar, the lazy average, plus the optimized bu and the optimized bi, based on solving the least squares.
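As a one-line sketch, with made-up numbers for the optimized biases:

```python
# Shifted rating: remove the average and the per-user and per-movie biases.
r_bar = 3.6
bu_star, bi_star = 0.4, -0.2   # made-up optimized biases from the least squares
r_ui = 4.0                     # actual rating, also made up

r_hat_ui = r_bar + bu_star + bi_star   # baseline prediction
r_tilde_ui = r_ui - r_hat_ui           # the shifted rating, about 0.2 here
print(r_tilde_ui)
```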

And we call this difference a shifted r, r tilde ui, just a shorthand notation. In other words, in this r tilde ui we have removed the impact of the average, the impact of the per-user bias, and the per-movie bias. And we're going to take this as the

starting point into our neighborhood predictor to look at similarities across

movies and across users. But before we do that, just a very quick

detour for five minutes on the notion of convex optimization.

Now, we've seen linear programming which is a special case of convex optimization

where we minimize linear objective functions over linear constraints, linear

equations and inequalities. So, pictorially, that means you use linear inequalities to cut out the space of the variables into something called a polytope, with sharp corners. And then you slide a bar, which is a straight line because it's a linear function, representing your objective function, across this constraint set. So, imagine an airplane flying

above this constraint set, okay? And we want to identify the minimal value

of this objective function over this kind of constraint set.

It turns out, what's really important is that you want your objective function to be convex, like a convex quadratic function such as x squared, over a constraint set that is also convex, like an oval or a circle.

It's convexity that determines the watershed between easy and hard

optimization problems. So, what is convex optimization?

It refers to minimizing a convex objective function, so we've got to define what a convex function is, subject to a convex constraint set. And, to be computationally useful, this convex constraint set must be one that can be easily represented by some upper bound inequalities on convex functions. So, we have to define what's a convex set

and what's a convex function. We will encounter other forms of convex

optimization in network economics, in internet pricing, in congestion control,

in routing, and so many other topics. This is our first brief, quick

introduction to this subject. And, we'll see in this and future lectures

that convex optimization is easy in both theory and in practice.

But, we just want to take three more minutes to introduce the basic notation

before we bring them back, just in time for the future applications in later

lectures. So, what's the convex set?

What's a convex function? We can actually have a whole course on nothing but convex analysis, and there are such courses in the Math Department.

But for today's purpose, we just need to specify with some pictures.

A convex set is basically a set where you can pick any two elements in the set and draw a line segment in between, and the entire line segment also lies in the set. Then, we call this a convex set.

You might say, well, this sounds like all sets are convex. Well, no. Some sets, for example, like this set: you

can pick two points in the set but part of their line segment, this part, does not

lie inside the set. And, this is a kidney-shaped thing that is

a non-convex set. So, generally speaking, we say a set C is convex if we pick any two points, a and b, in this set C, and then the entire line segment between a and b, which can be represented as theta a plus one minus theta times b, for theta between zero and one (for example, when theta is one half, it is the midpoint between a and b), also lies in C.

For all such theta, that's the key word, not for some theta but for all theta.

This is the representation mathematically of the line segment between point a and b.

And, by the way, a and b are in general vectors in n-dimensional space. So, pictorially, it just says that a convex set is a set where the whole line segment

lies inside. It turns out the most important geometric property of convex sets is that any two convex sets that don't touch each other can be separated by a straight line, so that one set lies on one side and the other lies on the other side. You may ask, will this be true for non-convex sets too? Not really.

For example, one set is convex, the other set is non-convex.

They don't touch each other and yet, you cannot separate them with a straight line.

Of course, you can separate them one way or another, because they don't touch. But, you can't separate them with a straight line.

It turns out that this is the most useful property of convex sets, but we don't have time to go through its implications in this lecture.

Now, having specified what's a convex set, we define what's a convex function.

A convex function intuitively is a function that curves upward, okay?


Now, we know that if you look at a function f, its first derivative tells us its monotonicity: is it an increasing or a decreasing function?

And the second derivative tells us whether it curves upward or downward.

Let's say the function is a single variable function, f of a scalar variable x. Let's say it is quadratic as well, since we're talking about quadratics today: x squared, okay? You can add a few other terms, say, plus 2x plus three, okay?

This, if you plot it, curves upward. Or, take another function, minus x squared plus 2x plus three. If you plot it, it curves downward.

Both of them are quadratic functions; one is convex, the other is not. In fact, for the other one, if you flip it, minus f of x is convex. We call a function whose negative is convex a concave function.

And, of course, it doesn't have to be quadratic, and it doesn't have to be univariate. It could be a multivariate function.

But, the idea is that if it curves upward, it's convex, downward it's concave.

Some functions are first convex and then concave, like the sigmoidal function we'll encounter in social network influence models later. It starts out convex; then, at some point, the so-called inflection point, it starts to bend downward.

positive. In this part, the second derivative is

negative. This curves up, this curves down, this

part is convex, this part is concave. The whole function, therefore, is neither

convex nor concave. Now, what kind of function is both convex

and concave? You say, wow.

That means you need the second derivative to be both non-negative and non-positive. The only possibility is that it is exactly zero, which means it has no curvature. That's the only way to make it both non-negative and non-positive. So, no curvature means it is a straight

line. It could be pointing up or pointing down

depending on the first derivative, but the second derivative is zero.

In other words, linear or affine function is the only function that's both convex

and concave. Now, these pictures are drawn in 2D; therefore, they're about a single variable. But for a multivariable function, we now have to talk about a slightly more involved second derivative. Recall your multivariable calculus course.

Assume that you still remember, or can quickly brush up your memory about that.

Say a function, of just two variables (once you get the hang of that, it's the same with many variables), is quadratic: x squared plus xy plus y squared, okay? x and y are each single variables, but f is a function of both variables.

Now, the second derivative of a multivariable function is not a simple expression anymore. It's not a number, it's a matrix.


It has the partial derivative of this function with respect to x twice, with respect to y twice, and with respect to both x and y, and with respect to both y and x. This is the matrix: you take partial derivatives with respect to x and y.

And in general, we say, given a function f of variables x1, x2, dot, dot, dot, xn, if we take the partial derivative with respect to xi and then with respect to xj, this is the ij-th entry of a matrix. That matrix is called the Hessian matrix, or in short, the Hessian of the function. So, in this particular case, it is very

simple. We take the derivative twice with respect to x, we get two. Twice with respect to y, we get two. With respect to x and then y, or y and then x, the same thing, we get one, okay? So, this two, one, one, two is the Hessian of this particular function, and this is the general definition right here. The ij-th entry of the matrix is partial squared f over partial xi, partial xj.


So, how do we make sense out of saying that a matrix is positive? It's not that the entries of the matrix are positive; it's about whether this matrix's eigenvalues are all non-negative. And, you can check that this matrix's eigenvalues are all non-negative. And therefore, this function is a convex function.
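A quick check of this in code, assuming numpy, for the two, one, one, two Hessian of x squared plus xy plus y squared:

```python
import numpy as np

# Hessian of f(x, y) = x^2 + x*y + y^2
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues = np.linalg.eigvalsh(H)  # eigvalsh: eigenvalues of a symmetric matrix
print(eigenvalues)                   # [1. 3.], both non-negative
print(np.all(eigenvalues >= 0))      # True: H is positive semidefinite, f is convex
```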

But in general, we say a function is convex if it satisfies the second derivative test, where the Hessian matrix just defined is so-called positive semidefinite. And, a matrix being positive semidefinite means all its eigenvalues are non-negative. Now, you may wonder what if my function

does not have a second derivative? If the Hessian matrix doesn't exist, because some functions are not smooth enough to take a derivative twice, or even once? Well, it turns out that's fine too. I can have a function with, say, a sharp corner, okay? At this point, we don't have a second

derivative. But, we can still define a function as convex or not, and this one would be a convex function. However, since all our functions throughout the entire course are smooth enough that the Hessian matrix exists, we will just follow this definition, okay?