0:00

As you may remember from our course introduction, in October 2006 Netflix announced a competition with a million-dollar prize for an algorithm that could beat its existing Cinematch algorithm by at least 10%. I'm happy to be speaking with Yehuda Koren, a key member of the team that won the Netflix Prize in 2009. Dr. Koren is now a staff research scientist at Google in Israel. At the time of the Netflix Prize, he was a researcher at AT&T Labs and then at Yahoo. Yehuda, welcome to our course.

>> I'm glad to be here.

>> So let's start at the beginning. Could you remind us about the nature of the Netflix Prize challenge, and what your team's first effort was at solving it?

>> Yes, this brings back some good memories. It happened in October 2006: Netflix published a dataset with one hundred million ratings that real customers, half a million customers, gave to many thousands of movies over the six preceding years. At the time, it was considered a huge dataset. It was a game changer for many of us, and it attracted a lot of attention, as probably thousands of researchers and practitioners started to play with the data.

Â 1:30

The goal that Netflix set was to improve the accuracy of the Cinematch system by 10%. And that 10% was measured on an error metric known as RMSE, root mean squared error, where they want you to predict the score for each user and movie in the test set. The true rating is a star rating between one and five. So imagine the true rating is four stars and I predict 3.5; then the squared error is 0.5 squared. They average this over all ratings, and the requirement was that the result would be around 0.85. And this was very tough.
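The error computation he describes can be sketched in a few lines (a minimal illustration with made-up numbers, not Netflix's actual evaluation code):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between paired predictions and true ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# The example from the interview: the true rating is four stars and the
# prediction is 3.5, so this single pair contributes an error of 0.5 squared.
print(rmse([3.5], [4.0]))  # 0.5
```

The winning target of roughly 0.85 is this quantity averaged over the whole hidden test set.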

Â 2:24

I think no one realized how difficult it is to reduce the error to this level; otherwise, people wouldn't have participated. It took almost three years to achieve it. But as I say, I believe everyone was optimistic at the beginning that we could close this in a few months. And, by the way, we were total strangers to the field. But I think the nature of the dataset, the fact that it deals with movies, which everyone is familiar with, got us really excited.

Â 3:05

And we started crunching the data. What we did in the beginning was quite standard. I come from computer science; I was working on graph algorithms, so I said, okay, we have a big graph. This is not the best representation, of course, but it's a graph connecting users with movies, so I thought this was naturally related to what's known as neighborhood methods. So let's optimize them. What I wanted to do in the beginning was to find longer-range interactions between nodes in the graph. I ended up tuning the neighborhood methods so they optimized the RMSE metric. So when we tried to predict the rating for a movie from the ratings of similar movies [INAUDIBLE], the weights we give to other movies would not be arbitrary, but would be driven by some optimization method that is supposed to be good at optimizing the RMSE.
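The idea of fitting the interpolation weights to the error metric, rather than taking them from a fixed similarity heuristic, can be sketched roughly like this (the toy data, learning rate, and epoch count are invented for illustration; this is not the team's actual code):

```python
import numpy as np

# Toy (user, item, rating) triples, purely illustrative.
ratings = [(0, 0, 4.0), (0, 1, 3.0), (1, 0, 5.0), (1, 1, 4.0),
           (2, 0, 2.0), (2, 1, 1.0)]
n_items = 2

by_user = {}
for u, i, r in ratings:
    by_user.setdefault(u, {})[i] = r

W = np.zeros((n_items, n_items))  # learned item-to-item interpolation weights

def predict(u, i):
    """Weighted sum of the user's other ratings, as in a neighborhood model."""
    return sum(W[i, j] * rj for j, rj in by_user[u].items() if j != i)

def sse():
    """Sum of squared errors over the observed ratings."""
    return sum((r - predict(u, i)) ** 2 for u, i, r in ratings)

before = sse()
lr = 0.02
for _ in range(300):  # plain stochastic gradient descent on squared error
    for u, i, r in ratings:
        err = r - predict(u, i)
        for j, rj in by_user[u].items():
            if j != i:
                W[i, j] += lr * err * rj  # move weights to reduce the error
after = sse()
```

The point is that `W` is chosen by the same objective the competition scores, instead of by a Pearson-correlation-style similarity.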

I was working then at AT&T, as mentioned, and I was working mostly with Bob Bell.

Â 4:26

Bob is a statistician who was working with other approaches. He used PCA, which is very similar to singular value decomposition, and- >> PCA means Principal Component Analysis.

>> Yes, Principal Component Analysis. You can see that it's an early version of matrix factorization, which became really popular in this competition. And so, as you asked at the beginning, matrix factorization came up as a big factor in analyzing and cracking the Netflix Prize dataset.

Â 5:08

And a key to using matrix factorization was to address only the existing entries in the matrix. So we have a matrix with 500,000 rows corresponding to users and almost 20,000 columns corresponding to movies. 99% of the entries in the matrix are missing. The usual principal component analysis or singular value decomposition would address all entries, which is hard to scale, and is also not a good idea for this specific matrix we were competing on. We learned this gradually, starting with PCA, which was treating all entries in the matrix. Essentially, it trains on a movie-by-movie matrix, formed by multiplying the ratings matrix with its transpose. So this PCA that Bob did got combined with the neighborhood methods I was working on.
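Fitting factors only to the observed entries is usually done with stochastic gradient descent; here is a minimal sketch (toy data, dimensions, and constants are hypothetical; the real matrix was roughly 500,000 users by 20,000 movies):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observed (user, item, rating) triples; all other entries
# of the user-by-movie matrix are treated as missing, not as zeros.
ratings = [(0, 0, 5.0), (0, 2, 3.0), (1, 0, 4.0), (1, 1, 2.0),
           (2, 1, 1.0), (2, 2, 4.0)]
n_users, n_items, k = 3, 3, 2

P = rng.normal(scale=0.1, size=(n_users, k))  # user factor vectors
Q = rng.normal(scale=0.1, size=(n_items, k))  # item factor vectors

lr, reg = 0.02, 0.02
for _ in range(1000):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                   # loss only on observed entries
        pu = P[u].copy()
        P[u] += lr * (err * Q[i] - reg * P[u])  # gradient step with shrinkage
        Q[i] += lr * (err * pu - reg * Q[i])

train_rmse = float(np.sqrt(np.mean([(r - P[u] @ Q[i]) ** 2
                                    for u, i, r in ratings])))
```

A dense SVD would instead have to impute the 99% missing cells, which is both expensive and statistically dubious; the sketch above sidesteps that entirely.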

Â 6:23

So another major component was understanding the data, what we call the global effects residing in the data. Global effects are things like the average score that a movie receives. Some movies are just better than others, and you can say with confidence that a good movie will get a high rating without knowing anything about the rater. [CROSSTALK] >> Some people just like movies more than others.

>> Yes, exactly. That's the other side of the coin. And some people also are less critical than others and tend to give higher ratings. So even knowing nothing about the movie, you can say that for this rater the rating is going to be high. So this is just that; I call it data cleaning, because you say, okay, let's remove these effects, and try now to model the cleaner data. So this was a big part of what we did in the beginning. By the way, later on we stopped doing this, because it was better to have these effects baked into the model. So part of learning the neighborhood model or the matrix factorization model is that it also learns most of these biases. But in the beginning we didn't realize this, and we were cleaning the data. And cleaning the data is important; in general, I always advise it.
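The global-effects cleaning he describes (remove an overall mean, a per-movie effect, and a per-user effect, then model the residuals) can be sketched like this; the data and the shrinkage constant are made up for illustration:

```python
# Hypothetical (user, item, rating) triples.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 2.0), (2, 0, 4.0)]
n_users, n_items = 3, 2

mu = sum(r for _, _, r in ratings) / len(ratings)  # global average rating

item_bias = [0.0] * n_items
for i in range(n_items):
    rs = [r - mu for _, j, r in ratings if j == i]
    item_bias[i] = sum(rs) / (len(rs) + 5)  # shrink when a movie has few ratings

user_bias = [0.0] * n_users
for u in range(n_users):
    rs = [r - mu - item_bias[i] for v, i, r in ratings if v == u]
    user_bias[u] = sum(rs) / (len(rs) + 5)  # shrink toward 0 likewise

def baseline(u, i):
    """Predicted rating from the global effects alone; the residual
    r - baseline(u, i) is what the later models would be fit to."""
    return mu + item_bias[i] + user_bias[u]
```

The shrinkage denominator (`+ 5` here) is an arbitrary choice that keeps a movie or user with very few ratings from getting an extreme bias estimate.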

>> Great. >> So these were small steps.

>> Let's jump ahead for a minute.

>> Yeah. >> Because we could take you through all the steps, but over time, a bunch of different teams that were competing seemed to come to the same realization: that the techniques they were using were making progress, but they weren't going to get them to this magic 10%. And they seemed to figure out that the key to solving this challenge was to find a way to merge the different algorithms, each with different strengths and weaknesses, into some composite algorithm. How did that insight come about, and what was the general high-level approach that pulled everything together?

>> Yes, this is probably the one thing that people remember besides matrix factorization. I think, yes, people remember from the Netflix competition the insane number of predictors that we used, which is a shame. [LAUGH] No production system should use so many predictors. And this happened gradually.

Â 9:12

So, I mean, the alternative would be to really isolate what is unique in what Bob is doing and what's unique in what I'm doing, and then find a single predictor combining the merits of both. And this is very difficult, and far less effective than combining predictors, so this was the reality.

Â 9:33

And as you move on, you are generating more and more predictors. I think after a few months we had ten predictors; we thought that's a lot, so now we need to be systematic about how we combine them. So we used linear regression to find the best linear combination of all predictors; some of them could get negative weights. But I was fortunate to work with a statistician like Bob, who ensured that we were not overfitting the training set too much.
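The blending step can be sketched as ordinary least squares over the component predictors' outputs (all data here is fabricated; the real blend was fit on held-out ratings):

```python
import numpy as np

rng = np.random.default_rng(2)
true_ratings = rng.uniform(1, 5, size=200)  # stand-in for held-out ratings

# Three hypothetical predictors with different error characteristics.
preds = np.column_stack([
    true_ratings + rng.normal(0, 0.9, 200),  # noisy neighborhood-style model
    true_ratings + rng.normal(0, 1.1, 200),  # noisier factor-style model
    np.full(200, true_ratings.mean()),       # constant baseline
])

# Least-squares combination weights; some can come out negative.
weights, *_ = np.linalg.lstsq(preds, true_ratings, rcond=None)
blend = preds @ weights

def rmse(p, a):
    return float(np.sqrt(np.mean((p - a) ** 2)))
```

On the data it is fit to, the blend can never do worse than any single component, since each component is itself one of the admissible weight settings.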

Â 10:12

As time went on, we accumulated more and more predictors. And then, yes, at some point it became apparent that we are not looking for the romantic key insight or key algorithm that wins this competition; we are looking for a big combination of methods. Because no single elegant method can beat a combination of many methods that cancel out the noise inherent in each predictor.

Â 12:09

People have good rating days and bad rating days. And this is a very strong effect in the data; if you don't model it, no matter how many models you blend, you're not going to hit the required precision level. And then there are some subtler effects, like it's very important to know how many items the user rated in a given day. This is just something in the Netflix data that might be an artifact.

Â 12:38

Maybe it represents the fact that at some point Netflix asked users to rate many items; maybe some people tend to rate in batches. But this has a profound effect on certain movies: whether they were being rated alone or as part of a big batch. So all this was key to the progress that we made, slowly. That's why it took several years, because these effects were hard to find and unexpected to us.
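The rating-frequency signal he mentions can be sketched as a simple per-user, per-day count (hypothetical data; in the actual models this count fed into learned bias terms rather than being used directly):

```python
from collections import Counter

# Hypothetical (user, date, movie) rating events.
events = [("u1", "2005-03-01", "m1"), ("u1", "2005-03-01", "m2"),
          ("u1", "2005-03-01", "m3"), ("u2", "2005-03-02", "m1"),
          ("u2", "2005-04-09", "m4")]

# How many items each user rated on each day: the "frequency" feature.
per_day = Counter((u, d) for u, d, _ in events)

def frequency(user, date):
    """Number of ratings the user gave that day; large values indicate batch
    rating, which correlated with systematically different rating behavior."""
    return per_day[(user, date)]
```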

Â 13:55

And this was another surprise for us; it was integrated into the model and was also critical in improving the precision.

>> And certainly one of the things this gave was a way in for somebody who had a hypothesis. If somebody thinks, gee, people rating movies on a Sunday might be happier, and therefore give higher ratings, they could build a very simple predictor. And mixing it into this ensemble could show whether it had incremental value, without having to solve the whole problem themselves. If they could just find little things that added incremental value, there was a framework to build on that value.

>> Yes, yes, we sort of distributed the human computation this way. I mean, we could split the work among seven people, each of them working independently. And at the end,

Â 14:49

there is a single blending algorithm. It was very convenient and effective. It was certainly over-used.

>> But it was also sort of recursive in that, if you think about how recommender systems work, instead of finding the perfect data, we find lots of data and a way of combining it, in the hopes of giving you a good prediction. [LAUGH]

>> [LAUGH] That's a good point. Yes, so this is crowdsourcing at two levels, at least.

>> Yeah, absolutely. So the folks in this course, our learners, have heard me talk a lot about the Netflix Prize: both about the fact that it was over-optimizing the wrong thing, that error measures for things people rated are not actually the most important business question for a company like Netflix, but also the fact that it brought significant machine-learning research effort into recommender systems. From your perspective, looking back, how has this prize and challenge changed the field, both for better and, if you think it has at all, for worse?

Â 16:03

>> I mean, I did this because I was motivated by the competition, but maybe it matters a little bit to Netflix, because it shows actual predicted ratings to people [INAUDIBLE]. But in all the scenarios I saw, it doesn't add much value, and it's not even correlated with [LAUGH] the real metric. So first of all there is ranking, even if we talk only about accuracy. I mean, you hinted that accuracy metrics in general are not the most important, which might be true. But for us data people, that's what we know how to do. So product decisions might be much more important, but this is beyond us.

Â 17:36

But then later I published a paper that [INAUDIBLE]. A simple dense SVD on the dense matrix works much better for the real ranking problem. So yes, I think the [INAUDIBLE] metric is misguided in several ways, and the community, including me, was obsessed with it for several years. But this was fine. I think one year after the Netflix competition, people started publishing about other methods: about learning to rank, about diversification metrics, about all sorts of methods. I think there is still a lot of work to do in finding the right metric. And being obsessed about error for several years still brought a lot of attention to the field: many good researchers, tons of papers, many sessions in conferences, and it spawned books. It made the field popular, which is very good. And in my eyes, the competition was a big positive boost to the recommender systems field.

>> Well, it certainly grew this from a field where people would get together and you'd have 150, 200 people, and suddenly it was 400, 500 people.

Â 19:33

>> And then, I was a graph visualization researcher before. It totally changed my research direction towards data science, which I am finding fascinating. That's what I have been doing since then. And not only working on recommendation problems, but problems that involve modeling and analyzing big data. That's what I am doing, and it started with this Netflix competition.

Â 20:34

>> I'm sometimes frustrated about metrics; it's very hard to design a metric that totally captures things like the bias toward popular items. Sometimes it's good, sometimes bad. So I'm still looking for people to come up with good ways to measure the quality, the user impact, of recommender systems.

Â 21:43

>> And the challenge of saying, yeah, these are objectively better, but they still might not be better when the user gets them.

>> Yes. I see it again and again. The offline metrics are misleading us every time.

>> Wonderful. Thank you so much for joining us, and we will remind people that this has been an interview with Yehuda Koren of Google. See you next time.

>> Thank you. [BLANK AUDIO]
