All right. So now

that you understand what matrix factorization is,

the next question is how to actually compute it.

There are various ways to decompose a matrix into a product of several other matrices,

and we will cover only some of them in the course.

We will start from one of the most famous matrix decomposition methods,

namely singular value decomposition,

or SVD for short.

SVD has a long history in information retrieval.

Latent Semantic Indexing, LSI,

Latent Semantic Analysis, LSA,

Principal Component Analysis, PCA,

all of these methods have a straightforward connection to SVD.

Here is a very important fact from linear algebra.

Any real matrix A

can be exactly represented in the form of a product of three matrices: U,

sigma and V transpose.

Matrices U and V are orthogonal,

which means that their columns, called left and

right singular vectors, are mutually orthogonal and have unit length.

Or equivalently, the product of

each matrix's transpose with the matrix itself gives the identity matrix.

This property is very handy,

and soon you will see why.

In turn, sigma is a diagonal matrix.

Its diagonal elements are called singular values,

and by convention are sorted in descending order.
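All of the properties just described are easy to verify numerically. Here is a minimal sketch using NumPy; the matrix values are made up purely for illustration:

```python
import numpy as np

# A small ratings-like matrix; the numbers carry no special meaning.
A = np.array([[5., 3., 0.],
              [4., 0., 1.],
              [1., 1., 5.],
              [0., 2., 4.]])

# Full SVD: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Columns of U and V are orthonormal: U.T @ U = I, V.T @ V = I.
print(np.allclose(U.T @ U, np.eye(3)))      # True
print(np.allclose(Vt @ Vt.T, np.eye(3)))    # True

# Singular values come back sorted in descending order.
print(np.all(np.diff(s) <= 0))              # True

# The product reconstructs A exactly (up to floating-point error).
print(np.allclose(U @ np.diag(s) @ Vt, A))  # True
```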

Now you know what singular value decomposition looks like.

Recall that you do not actually aim to perfectly reconstruct the matrix A;

otherwise, the model would simply fail to generalize.

As we have previously discussed,

what you aim for is to find

just a good enough low-rank approximation that will help you generate recommendations.

For these purposes, you can use the so-called truncated SVD,

which contains only a small number of the largest singular values,

and corresponding singular vectors.

This number, denoted as r, is called the rank of the decomposition.

The largest singular values ensure that you get the most descriptive latent features.

It also turns out that among all the possible solutions,

truncated SVD gives the best rank-r approximation,

in terms of the Frobenius norm.

In other words, it gives an optimal solution to the following minimization task.
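This optimality result is known as the Eckart–Young theorem, and it is easy to check numerically. The sketch below (random data, made up for illustration) compares the Frobenius error of the truncated SVD against an arbitrary other rank-r matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 3  # rank of the truncated SVD
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# The Frobenius error of the truncated SVD equals the norm of the
# discarded singular values: ||A - A_r||_F = sqrt(s_{r+1}^2 + ...).
err = np.linalg.norm(A - A_r, 'fro')
print(np.isclose(err, np.sqrt(np.sum(s[r:] ** 2))))  # True

# Any other rank-r matrix (here, a random one) can only do worse.
B = rng.standard_normal((20, r)) @ rng.standard_normal((r, 10))
print(np.linalg.norm(A - B, 'fro') >= err)           # True
```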

The only problem that I haven't told you about yet,

is that SVD is undefined for incomplete matrices like the one in our case.

Luckily, you can do a little trick.

Simply impute unknown entries with some predefined values.

Possible choices are the row average,

the column average, some combination of these,

or, even simpler, just zeros.

In the latter case, you get the so-called pure SVD model.
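As a minimal sketch of the pure SVD model: impute missing ratings with zeros, truncate the decomposition, and rank unseen items by the reconstructed scores. The tiny ratings matrix below is made up for illustration:

```python
import numpy as np

# Toy ratings: 0 marks an unknown entry. In the pure SVD model we simply
# treat those zeros as actual values and factorize the dense matrix.
R = np.array([[5., 0., 3., 0.],
              [4., 0., 0., 1.],
              [0., 2., 0., 5.],
              [0., 0., 4., 4.]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)

r = 2  # rank of the model, the single tuning parameter
scores = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Recommend, for user 0, the unseen item with the highest predicted score.
unseen = np.where(R[0] == 0)[0]
best = unseen[np.argmax(scores[0, unseen])]
print(best)  # index of the recommended item
```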

Sure, this way the resulting values are going to be biased towards zero,

as you replace approximately 99% of the data or even more with zeros.

This is not good for rating prediction.

However, it's not a big problem for the top-n recommendations task.

In fact, in some cases,

it works quite well,

and can even beat other state-of-the-art methods based on matrix factorization,

or the neighborhood models.

Here, you can see such an example computed on the Netflix data set.

Different methods were evaluated in

two different setups corresponding to regular recommendations,

and recommendations from the long-tail.

In both setups, the authors were able to beat all the other methods with pure SVD,

by tuning just a single parameter,

the rank of decomposition.

What a remarkable result!

I have to make an important note here.

As you can see, there are

some other factorization methods with the letters SVD in their names.

Please do not be confused.

Strictly speaking, most of these methods, like Funk SVD,

SVD++, and other SVD-somethings,

are variants of the matrix factorization approach,

and have nothing in common with

the mathematical formulation of the singular value decomposition.

Unlike SVD, these methods do not build a space of singular vectors,

and do not compute singular values.

Most of them do not preserve the column orthogonality property.

These methods are designed to work with incomplete matrices,

often ignoring unknown entries.

They form a broader family of methods with different optimization objectives,

specifically tuned to provide better ranking or more accurate rating prediction.

But due to historical reasons,

they are still sometimes called SVD.

Speaking about the orthogonality,

do you remember I told you that this is a convenient property?

And in the introduction,

I promised that neighborhood-based models are not

the only ones that allow you to generate new recommendations very quickly.

It's time to connect these two propositions.

Once you have computed SVD,

it is easy to generate recommendations for the known users.

But what if you have a new or unrecognized user,

who is not present in the training data,

but has already provided at least some ratings?

Computing the SVD each time you have

a small update in your data would be prohibitively expensive.

Luckily, you can work around this with minimal effort.

Imagine that a new user's ratings are encoded into a new row of the matrix A.

This row update in A would most likely change all three matrices,

U, sigma and V. However,

if the number of users is sufficiently large,

you can assume that this change is approximately

accumulated just in a single row of matrix U,

while all the other values stay unchanged.

From here, you can approximately find the latent vector q,

for the new user.

That's where the column orthogonality of the factor matrices plays a crucial role.

This approach is known in the literature as folding-in.

It allows you to quickly find a new user embedding in the latent feature space.

As previously, you can use this latent vector to

find closest neighbors and do some other analytics.

However, you can go even further.

If you now perform the reverse operation and try to restore the actual matrix values,

after a few simple linear algebra operations,

you will get the following expression.

This formula is very convenient.

It states that we can compute our recommendations

just by a sequence of two vector multiplications.

Even more, if you look carefully,

you can see that in the case when our new user vector

is equal to a particular row of the original matrix,

which means that this user is similar to one of the known users,

then the prediction is not approximate but exact.
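The steps above can be sketched in NumPy under these assumptions (random made-up data, rank 3): the new user's latent vector is q = a·V_r·Σ_r⁻¹, predictions cost two vector multiplications, q·Σ_r·V_rᵀ = a·V_r·V_rᵀ, and for a known user's row this reproduces the rank-r model's prediction exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((50, 8))  # hypothetical ratings of 50 known users over 8 items

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 3
Ur, sr, Vr = U[:, :r], s[:r], Vt[:r, :].T   # Vr: items x r

# Folding-in: project a new user's ratings onto the latent space.
# Column orthogonality of V makes this a simple multiplication:
# q = a @ Vr @ diag(1/sr).
a = rng.random(8)        # ratings of a user unseen at training time
q = a @ Vr / sr          # latent vector of the new user

# Predictions: two vector multiplications, cost O(items * r).
pred = (q * sr) @ Vr.T   # equivalently a @ Vr @ Vr.T

# If the "new" user coincides with a known one, the prediction is exactly
# the rank-r model's reconstruction for that user.
known = A[0]
print(np.allclose(known @ Vr @ Vr.T, (Ur @ np.diag(sr) @ Vr.T)[0]))  # True
```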

In other words, we can always use this formula

for both known and new users and get correct results.

Those of you who are familiar with linear algebra,

have already recognized that this formula is just a projection of

a new user's preferences onto the latent feature space.

The complexity of this calculation is proportional to

the product of the number of items and the number of latent features,

so it can be used to generate recommendations almost instantly.

Note that once you have a lot of new users,

it might be necessary to recompute the SVD,

in order to build a more accurate model.

However, this can be done in a separate offline pipeline,

and doesn't affect generation of recommendations.

Let me summarize.

You are now familiar with the singular value decomposition,

and the features that make it so special.

You now know how to use it to build a recommender system based on the pure SVD model.

Finally, you've got some intuition behind the folding-in technique,

and know how it can be used to generate recommendations almost instantly,

without the need to recompute your model too often.