
Named after the mathematician Bayes, whose work was then further extended by the famous mathematician Laplace. And Laplace raised a very interesting question: what's the chance that the sun will rise tomorrow, if you have observed that it has been rising every morning in the past, say, 100 million days?

Now, you may think this is a funny question. But for now, ignore the fact that you know something about the underlying physics of why the sun rises. Purely view this as a question about what you observe: something has been coming up every morning for the last 100 million days, and then what's the chance, you predict, that it will come up again tomorrow? Can you say, well, it must be exactly 100 percent mathematically? Again, we're not talking about the underlying physics, just about the fact that this is such an overwhelming number.

So, what should it be? This is an interesting thought experiment that we can simplify a little further. Suppose I give you a sequence of n experiments, and I say that in s out of the n experiments, for s smaller than n, I've observed a one. Okay? And I ask you the question: what's the chance the next experiment also returns one? Now, without going through the foundations of probability theory, intuitively the answer is just s over n. That's because in the past, s out of n runs gave us this result of one as our observation. So, call this question number one.

Suppose now I switch to another question and say that I also run n experiments. And the experiment is actually that there's a coin that's loaded, and I flip it. If it's heads, then I return one. If it's tails, then I return zero. And again, in s out of the last n runs, I observe one. I ask you the question: what's the chance that next time you'll also see one? You may say, hold on, isn't this question the same as the last question? Shouldn't the answer also be s over n?

Well, not according to the Bayesian view of probability.

The Bayesian view is a very powerful, and for some time controversial, view in the probability theory community. What it says is that, now that I've given you some prior information that the experiment consists of flipping a loaded coin, you'll be able to make some model about that coin based on the observations in the last n rounds. And therefore, your answer may be different. What is that answer? Let's derive it in the next five minutes, and then I will come back to answer this question: why is it different from s over n?

The essence of the Bayesian view, the philosophical underpinning, is captured in this picture. You've got an underlying model. Later in the course, we'll also see different kinds of latent factor models, we'll see hidden models, we'll see reverse engineering of network topologies as well as protocols. Philosophically, they follow a similar spirit. In this simple question, the underlying model is captured by a single parameter p: the chance that a single flip of this loaded coin will result in heads, and therefore in the observation one.

Now, different p's will clearly lead to different observations. That goes without saying, but the Bayesian view also says that different observations tell me something about p. And the more observations I have, the better I can build a model for p, from which I can then do forward engineering to predict the future outcome. You first reverse engineer p before you make the prediction.

So, in our case, we say that if you know the value of p, then we know the chance of seeing s heads out of n flips is simply a binomial distribution. It is p to the power s, because we observed s such cases, times one minus p to the power n minus s, because we observed n minus s of the other cases, and there are n choose s possible arrangements of that sequence with s out of n being one. So, this is just a binomial distribution, for a fixed p.
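As a quick sanity check, here's a minimal Python sketch of that binomial likelihood for a fixed p (the function name is just for illustration):

```python
from math import comb

def likelihood(s, n, p):
    # P(observe s heads in n flips | coin bias p): binomial distribution
    # C(n, s) * p^s * (1 - p)^(n - s)
    return comb(n, s) * p**s * (1 - p)**(n - s)

# For example, a fair coin (p = 0.5) showing 7 heads out of 10 flips:
print(likelihood(7, 10, 0.5))  # 120/1024, about 0.117
```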

What is now flipping our heads around and turning the tables is this: since that's the case, the probability distribution of p, and let's call that f of p, should be proportional to this observation frequency. If we count the frequency of heads in the observations, then that will tell us something about the underlying probability distribution of this value of p. And let's just pause for one second, because while it sounds intuitive, it's actually counter-intuitive at first sight.

Because we're now saying, before we go ahead and predict, let's build a model of p, and that model says that p's distribution should be proportional to our observation. This proportionality principle is what Laplace did to extend Bayes' understanding of the relationship between observation and model. And I say proportional because this by itself is not a distribution. We have to normalize it by integrating over all the possible p's, and p can range from zero to one. And now, this is indeed a probability distribution, okay? And as a function of p, that's the probability distribution f of p.

So, all we need to do now is to evaluate this integral and then do the division. Skipping those detailed steps, because that's not what we care about in this course, we get the following answer: it's n plus one factorial, over s factorial times n minus s factorial, times p to the s, times one minus p to the n minus s.


Okay? So, that is the probability distribution of p as a function of n and s: as a function of how many runs you have carried out and the number of ones that you have observed. And now we're basically done, because we know the conditional probability of observing another one, given a particular p value, is just p, right? Because we know what p is. So, conditioned on that, the chance that we're getting a coin flip of heads is just p. So, we have to uncondition it by integrating this answer, p, across the probability distribution f of p, for p from zero to one. And, substituting that complicated-looking expression f of p and doing the integral, again getting the steps from integral tables, the answer is remarkable, and remarkably simple: s plus one over n plus two. Let's recap.

We started by saying that you want to make a prediction, right? But hold on first: you know something about a model. So, based on your observation, you first build a model around p, and that's how we get to the distribution of p. Then you make the prediction, and the answer is s plus one over n plus two, which is clearly not s over n. So, mathematically, you know this is the right answer if you follow the Bayesian analysis. But, intuitively, why not s over n?

Well, here's one way to make yourself feel intuitively comfortable. The moment I tell you this prior information about a model, it's as if I've told you that I've run two trials for free. One trial shows up heads, one trial shows up tails. So, when you tally, you should say it's n plus two trials, and the number of heads I've observed is s plus one, okay? Because the impact of that prior knowledge about the underlying model is, roughly speaking, intuitively speaking, equivalent to two free runs: one showing heads, one showing tails.
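To make the derivation concrete, here's a small Python sketch (the names are my own) that numerically checks that averaging p over the distribution f of p indeed gives s plus one over n plus two:

```python
from math import comb

def f(p, s, n):
    # f(p) = (n+1)!/(s!(n-s)!) * p^s * (1-p)^(n-s)
    #      = (n+1) * C(n, s) * p^s * (1-p)^(n-s)
    return (n + 1) * comb(n, s) * p**s * (1 - p)**(n - s)

def rule_of_succession(s, n):
    # the Bayesian prediction: (s + 1) / (n + 2)
    return (s + 1) / (n + 2)

# Numerically integrate p * f(p) over [0, 1] with the midpoint rule
s, n, steps = 7, 10, 100_000
dp = 1.0 / steps
mean = sum((k + 0.5) * dp * f((k + 0.5) * dp, s, n) for k in range(steps)) * dp
print(mean, rule_of_succession(s, n))  # both about 0.667 (= 8/12)
```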


So now, we can take this philosophy to understand how to do ranking of competing products in the same category on Amazon. And we know that observing heads show up in 100 out of 103 runs is going to give us something different than observing ten heads out of thirteen runs. Similarly, on Amazon, we know that knowing there are a lot more reviewers of a product compared to its competitor should give us some information. So, suppose you've got a list of products

indexed by i. For each product i, there are ni reviews, and the simple naive average rating is ri, okay? Then we say: look across all of these products in the same category. You can add up the ni's; call that total number of reviews for all the products in the category N, okay? And then you can look at the reviews, the summation of niri, divided by N. That is the total naive average across the whole product category; call that R. So, if I know a certain product i got a lot of reviews, a large ni relative to N, I should put more trust in the corresponding ri. Whereas if there are very few reviews, a very small ni relative to N, then I would rather put my trust in the average review, okay? It's like saying that if this product got a lot of reviews, I should trust its own average. But if it got too few reviews relative to N, then I should pretend it is an average product across the entire category.


And it turns out that the Bayesian adjusted value, ri tilde, follows the following expression: it is NR, where NR really is the summation of the niri's, plus niri, all divided by N plus ni. So, we can view this as a sliding ruler.

On one end is your product i's own average review, ri; on the other end is the total average R across all products in the category. And we say that the actual ri tilde, the Bayesian adjusted rating average, is somewhere between the naive average of your own product and the total category average, depending on how much I can trust this ri. If this ri has a lot of reviews, a large ni, backing it up, I'll move closer to ri. If it's got very few of them, then I'll just say, you know what? I think I would rather not trust this ri that much, and move more towards R, the average rating for all the products. So, ri tilde is sliding somewhere in between, and it's sliding according to this formula. Let's take a look at two extreme cases.

In one extreme case, the ni for a certain product is so small that we can ignore it compared to N; say ni is near zero, for instance ni is one and N is ten trillion. Alright. So, we can pretty much ignore the niri term and the ni term. And in that case, ri tilde is just R. It moves all the way to that extreme end. Or let's say ni for a certain product is N over two, so half of all the reviews in this category go to this particular product. Then, relative to the other competing products, we can say this product i got a lot of reviews, so we should be able to trust its rating more. And if we substitute ni equals N over two into this expression, what we see is that ri tilde now equals two-thirds R plus one-third ri. So, it's moving much closer to ri now.
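The sliding-ruler formula and its two extremes can be sketched in a few lines of Python (the function name is my own):

```python
def adjusted_rating(r_i, n_i, R, N):
    # Bayesian adjusted average: slides between the category
    # average R and the product's own naive average r_i
    return (N * R + n_i * r_i) / (N + n_i)

R, N = 3.0, 10_000          # category average over N total reviews
r_i = 5.0                   # this product's own naive average

print(adjusted_rating(r_i, 1, R, N))       # ni tiny: essentially R (about 3.0002)
print(adjusted_rating(r_i, N // 2, R, N))  # ni = N/2: (2/3) R + (1/3) r_i, about 3.667
```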

And it turns out that this formula is used exactly in the Internet Movie Database, IMDb's, ranking of the top 250 movies of all time, based on all the ratings of these movies, okay? It doesn't use a simple naive average, because some movies got a lot more reviews and ratings than other movies. Instead, it uses this Bayesian adjustment: NR plus niri, over N plus ni. The relative sample size of the reviews, ni, relative to the total across all the movies, plays a role here. Here's another example: Beer Advocate.

This is a website that ranks all the different kinds of beers around the world. And it uses almost the same formula, okay? Each beer i's Bayesian adjusted rating, r tilde i, equals N min, and I'll explain what this is in a minute, times R, plus niri, over N min plus ni, where N min is the minimum number of ratings that you need in order to even show up in the top beer ranking on this website. So, why use N min instead of N, the total accumulated number of ratings? One reason is that, if you use this N, then as time goes by the total N would only increase, and the dynamic range of r tilde i would keep shrinking.

So, Beer Advocate decided to use a twist on this formula and say: instead of looking at the total number of ratings, just look at the minimum number of ratings that you need in order to show up on the website. So, in fact, you can pick this weight as N, as N min, or as some other number that somehow corresponds to a notion of the total review population.
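Beer Advocate's twist just replaces the total N with the fixed threshold N min; a minimal sketch, again with my own names:

```python
def beeradvocate_rating(r_i, n_i, R, n_min):
    # same sliding-ruler formula, but with a fixed prior weight n_min,
    # so the dynamic range doesn't shrink as total reviews accumulate
    return (n_min * R + n_i * r_i) / (n_min + n_i)

R, n_min = 3.5, 100
# a beer with few ratings stays near the site average R;
# one with many ratings approaches its own average r_i
print(beeradvocate_rating(4.8, 10, R, n_min))      # about 3.62
print(beeradvocate_rating(4.8, 10_000, R, n_min))  # about 4.79
```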

Does Amazon use Bayesian analysis in this ranking?

And, what else that it use? We'll come to this in the next and the

final segment of this lecture's video. But first, let me highlight to draw back

some limitations of this Bayesian ranking. Number one, this is useful for ranking.


Therefore, you must have a backdrop of the scale of all competing products' numbers of ratings. If you just want to adjust the number of stars of a single product, say 4.5 stars from 121 reviews, you cannot do that in isolation, okay? You must also know: what about the competing products? How many reviews, and what naive number of stars did they receive? That's what you need in order to do the adjustment, in order to arrive at a ranking of the products.

So, that's why on Amazon's home page, you can click through a number of links to look at the ranking, which actually incorporates Bayesian adjustment. But if you look at just the number of stars, they still show you the simple naive average number. And you just have to process it in your own head, by somehow incorporating the impact of the number of ratings according to your own notion of N. The second limitation is that this formula implicitly assumes a Gaussian distribution of the review scores. As we mentioned, they often have a bipolar, bimodal distribution, or in general a multimodal distribution, which would lead to a slightly different and more complicated formula. But we will not have time to go through that set of material.

So, what we're going to do now is go through two examples on Amazon, a small and a larger example, to figure out what Amazon does and how it uses the Bayesian viewpoint to do ranking.