Continuing our discussion of parameter estimation.

Previously we talked about maximum likelihood estimation, which tries to optimize

the likelihood of the data, given the parameters.

An alternative approach that offers some better properties is Bayesian

estimation, which is what we're going to talk about today.

So first let's understand why maximum likelihood estimation isn't perfect.

So consider two scenarios. In the first one, two teams play ten

times, and the first team wins seven out of the ten matches.

So if we use maximum likelihood estimation, the probability of the first

team winning is 0.7, which seems like a reasonable guess going forward.

On the other hand, we take a dime out of our pocket and we toss it ten times and

it comes out heads seven out of the ten tosses.

Maximum likelihood estimation is going to come out with exactly the same estimate,

which is that the probability of the next toss coming out heads is also 0.7.

In this case, that doesn't seem like quite as reasonable an inference based on the

results of these ten tosses. To elaborate the scenario still further,

let's imagine that we take that same dime and now we patiently sit and toss it

10,000 times. And sure enough it comes out heads 7,000

out of the 10,000 tosses. Now the probability of heads is still 0.7

but now it might be a more plausible inference for us to make than in the

previous case, where we only had ten tosses to draw on.

And so, maximum likelihood estimation has absolutely no ability to distinguish

between these three scenarios: between the case of a familiar setting

such as a coin versus an unfamiliar event such as the two teams playing, on the one

hand, and the case where we toss a coin ten times versus tossing a coin

ten thousand times. Neither of these distinctions is apparent

in the maximum likelihood estimate.
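The indifference just described is easy to see in code. This is a small sketch of my own, not from the lecture; the function name is mine, but the estimator itself is the standard maximum likelihood estimate for a Bernoulli parameter, which is just the empirical frequency.

```python
# Sketch: the MLE for P(heads) is the observed fraction of heads, so all
# three scenarios above collapse to the same number.

def mle_bernoulli(heads, total):
    """Maximum likelihood estimate of P(heads): the empirical frequency."""
    return heads / total

print(mle_bernoulli(7, 10))        # two teams, 7 wins out of 10 -> 0.7
print(mle_bernoulli(7, 10))        # a dime, 7 heads out of 10 tosses -> 0.7
print(mle_bernoulli(7000, 10000))  # same dime, 7,000 out of 10,000 -> 0.7
```

The estimator sees only the ratio, so neither the familiarity of the coin nor the amount of data leaves any trace in the answer.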

To provide an alternative formalism, we're going to go back to our view of

parameter estimation as a probabilistic graphical model, where we have the

parameter theta over here and we have the data being dependent on the parameter

theta. But unlike in the previous case, where we

were just trying to figure out the most likely value of theta,

now we're going to take a radically different approach.

We're going to view theta as, in fact, a random variable by itself.

It's a continuous-valued random variable, which in this case, the case of a coin

toss, takes on values in the interval [0, 1]. But in either case it is a random variable,

and therefore something over which we will maintain a probability distribution.

Now this is in fact at the heart of the Bayesian formalism.

Anything about which we are uncertain we should view as a random variable over

which we have a distribution that is updated over time as data is acquired.
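One standard way to maintain such a distribution over a coin's parameter is a Beta distribution, and the following sketch (my own illustration; the lecture has not yet introduced the Beta family, so treat the prior choice as an assumption) shows how the distribution over theta is updated as data arrives, and how it distinguishes the 10-toss case from the 10,000-toss case.

```python
import math

# Sketch under an assumed Beta(a, b) prior on theta: observing h heads and
# t tails yields the posterior Beta(a + h, b + t).  The mean barely moves
# between 10 and 10,000 tosses, but the spread shrinks dramatically --
# exactly the distinction maximum likelihood estimation cannot express.

def beta_posterior_stats(a, b, heads, tails):
    """Mean and standard deviation of the Beta(a + heads, b + tails) posterior."""
    a, b = a + heads, b + tails
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

print(beta_posterior_stats(1, 1, 7, 3))        # 10 tosses: wide posterior
print(beta_posterior_stats(1, 1, 7000, 3000))  # 10,000 tosses: sharp posterior
```

With 10 tosses the distribution over theta is broad; with 10,000 it is tightly concentrated near 0.7, even though both point estimates are essentially the same.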

Now let's understand the difference between this view and the maximum

likelihood estimation view. So certainly we have as before that given

theta, the tosses are independent. But now that we're explicitly viewing

theta as a random variable, we have that if theta is unknown, then the tosses are

not marginally independent. So for example, if we observe that X1 is

equal to heads, that's going to tell us something about the parameter; it's going to

increase our probability that the parameter favors heads over tails, and

therefore it's going to change our probability of other coin tosses.

So the coin tosses are dependent marginally: not given theta, but without

being given theta, they are marginally dependent.
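This marginal dependence can be checked numerically. The sketch below is my own illustration, and it assumes a uniform prior over theta (a choice the lecture has not specified): integrating theta out shows that observing the first toss changes the probability of the second.

```python
# Sketch under an assumed uniform prior on theta: marginalize theta out by
# numeric integration and compare P(X2 = heads) with P(X2 = heads | X1 = heads).

def integrate(f, n=100_000):
    """Midpoint-rule integral of f over [0, 1]."""
    h = 1.0 / n
    return sum(f((i + 0.5) * h) for i in range(n)) * h

p_x1_heads = integrate(lambda th: th)         # P(X1 = h) = 1/2
p_both_heads = integrate(lambda th: th * th)  # P(X1 = h, X2 = h) = 1/3
p_x2_given_x1 = p_both_heads / p_x1_heads     # 2/3, not 1/2

print(p_x1_heads, p_x2_given_x1)
```

Before observing anything, the next toss is heads with probability 1/2; after seeing one head, that probability rises to 2/3, so the tosses are indeed dependent once theta is unknown.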

So that really gives us a joint probabilistic model over all of the coin

tosses and the parameter together. So if we break down that probability

distribution using this PGM that we have over here, it breaks down using the chain

rule for that Bayesian network that we have drawn there.

So we have P of theta, which is the prior for the root of this network,

and then the probability of the X's given theta, which, because of the structure of

the network, are conditionally independent given theta, and

so we have P of theta times the product of the P of Xi given theta terms, where

this second term over here is just our good

friend from before, the likelihood function.
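The decomposition just described can be written down directly. This is a sketch of my own (a uniform prior is assumed purely for illustration): the joint density of theta and the data factors as the prior times the likelihood, P(theta, x1, ..., xn) = P(theta) * prod_i P(xi | theta).

```python
# Sketch of the chain-rule decomposition: prior times likelihood.

def likelihood(theta, tosses):
    """P(x_1, ..., x_n | theta): the familiar likelihood function."""
    p = 1.0
    for x in tosses:
        p *= theta if x == "h" else (1.0 - theta)
    return p

def joint_density(theta, tosses, prior=lambda th: 1.0):
    """Joint density of theta and the data, assuming a uniform prior by default."""
    return prior(theta) * likelihood(theta, tosses)

tosses = ["h", "t", "h", "h"]
print(joint_density(0.7, tosses))  # = 1.0 * 0.7 * (1 - 0.7) * 0.7 * 0.7
```

Everything downstream of the prior is just the likelihood function from the maximum likelihood setting; the Bayesian model simply multiplies it by a distribution over theta.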