Learn how probability, math, and statistics can be used to help baseball, football, and basketball teams improve player and lineup selection as well as in-game strategy.

A course from the University of Houston System

Math behind Moneyball

35 ratings

From this lesson

Module 3

You will learn how Monte Carlo simulation works and how it can be used to evaluate a baseball team’s offense and the famous DEFLATEGATE controversy.

- Professor Wayne Winston, Visiting Professor

Bauer College of Business

Okay, in this video we're going to show you how to use Monte Carlo simulation to

try and estimate how many runs a team would score in an inning,

based on their hitting statistics assuming all the players were identical.

You could really generalize this.

Let's say you have a lineup of nine different players, and of course,

you get the runs in an inning.

And you want to predict how many runs they score in a game.

We know there are 26.72 outs in a game, which is roughly 8.9

innings, so that translates to 8.9 times runs per inning.
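The scaling from one inning to one game can be written out as a short calculation. This is only a sketch: the 26.72 outs per game comes from the lecture, while the function name `runs_per_game` is a hypothetical helper introduced here.

```python
OUTS_PER_GAME = 26.72   # average outs per team per game, as quoted in the lecture
OUTS_PER_INNING = 3

def runs_per_game(runs_per_inning: float) -> float:
    """Scale an expected runs-per-inning figure up to a full game."""
    innings_per_game = OUTS_PER_GAME / OUTS_PER_INNING  # roughly 8.9
    return innings_per_game * runs_per_inning

print(round(OUTS_PER_GAME / OUTS_PER_INNING, 2))  # 8.91
```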

Okay, so let me show you the inputs we're going to use.

Here I've got the stats from baseball-reference.com, another great site.

I've got the hitting stats, at bats, just about every stat, but

all we're going to use in our simplified simulation, are these stats.

Okay, basically the singles, doubles, triples, homers, walks and hit by pitcher.

You can ignore other stuff.

Okay, our simulation is going to have some simplifying

assumptions.

Although I have built a simulation

that really does not have these simplifying assumptions,

in this course I just don't want to get that advanced.

So we're going to assume when a batter makes an out,

none of the runners advance, basically.

It's like every out is a strike out.

And we'll see that'll make

our estimate of runs scored less than what the team actually scores.

Okay, we're going to assume a single or a double always advances runners two bases.

So if you had a single and a guy's on first, he goes to third.

A guy on second always scores.

That doesn't always happen.

[COUGH] Sorry.

And when a guy hits a double,

we're assuming he always advances the runner two bases and in reality,

sometimes when somebody hits a double, you can score a runner from first base.

And you can look up these actual probabilities and build them in, but

I mean describing this simulation would take us hours.

So if you can understand how this basic simulation works with those assumptions,

you could build your own simulation

that puts in things like ground outs,

double plays, and fly balls that advance runners.

Also, so we have no double plays here.

We have no errors.

Okay, we're assuming there are no errors.

So basically, you're going to see our model assumes something that makes

more runs be scored, namely that singles always advance guys two bases.

But having no errors, and outs not advancing runners, sort of makes

it a little bit on the low side when we predict the runs.

I think after trying this out many times,

our estimates are about 3 to 4% less than what the team actually scores on average.

Okay, so we can put in the team stats here.

Actually let's put in the American League average for 2014.

So I've got at bats here, and if I go through here,

that's the American League average.

And that's the runs scored in 2014.

Okay, so I can just put those in here.

So, your inputs would be, let's put that in red,

a team's stats which you can easily get.

So, 2014 American League.

Okay, the outs are just simply computed as the number of at bats minus the singles,

minus the doubles, minus the triples, minus the home runs.

Okay, so now, we have to talk about how we model what event happens, and this

is like that football example where we had first, second, third, and fourth down.

We'll use VLOOKUP to model events with given probabilities.

See, what's the chance

that somebody gets a walk or hit by pitcher?

For that probability, you take walks plus hit by pitcher, divided by

at bats plus walks plus hit by pitcher.

So that's basically the probability they'd walk.

The probability, for instance, they hit a home run, pretty small here.

Take the number of home runs divided by at bats plus walks plus hit by pitchers.

And then outs is going to be just everything that's not a walk, single,

double, triple or home run.
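The outs formula and the event probabilities described above could be computed along these lines. This is a minimal sketch: the function name `event_probabilities` is hypothetical, and the counts passed in are made-up round numbers for illustration only, not the 2014 American League totals from the spreadsheet.

```python
def event_probabilities(at_bats, singles, doubles, triples, home_runs, walks_hbp):
    """Turn season counts into per-plate-appearance event probabilities."""
    # Outs are at bats that did not end in a hit.
    outs = at_bats - singles - doubles - triples - home_runs
    # The denominator used in the lecture: at bats plus walks plus hit by pitcher.
    plate_appearances = at_bats + walks_hbp
    counts = {
        "out": outs,
        "single": singles,
        "double": doubles,
        "triple": triples,
        "home run": home_runs,
        "walk/HBP": walks_hbp,
    }
    return {event: n / plate_appearances for event, n in counts.items()}

# Illustrative, made-up inputs (assumption, not real league data).
probs = event_probabilities(at_bats=5500, singles=950, doubles=270,
                            triples=30, home_runs=150, walks_hbp=520)
for event, p in probs.items():
    print(f"{event:9s} {p:.3f}")
```

Because outs are defined as whatever is left over, the six probabilities always add up to 1.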

So how can we assign random numbers from 0 to 1 to each event?

Well, let's start with the Outs.

So 0 to 0.68 is an out.

And then we'll add, I think what did I do next, I did a single.

I have a code for each event.

0 is an out, 1's a single, 2's a double, 3's a triple,

4's a home run, 5 is a walk or hit by pitcher.

So if I add that 0.15 to the 0.68, I get 0.83.

So anything between 0.68 and 0.83 gives us a single, and that gives the right probability.

Then I add in the chance of a double, that gives me .88, the chance of a triple and

the chance of a home run.

And then everything that's left over becomes walk.

So, anything above 0.913 becomes a walk which

gives you the right chance of a walk.
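The cumulative-cutoff lookup described above behaves like an approximate-match VLOOKUP, and Python's standard `bisect` module can play the same role. This is a sketch under stated assumptions: the cutoffs 0.68, 0.83, 0.88 and 0.913 are the ones quoted in the lecture, but the point where triples end and home runs begin is not given, so the 0.885 below is an assumed value. The name `event_code` is hypothetical.

```python
from bisect import bisect_right

# Lower edge of each event's slice of [0, 1). Codes follow the lecture:
# 0 out, 1 single, 2 double, 3 triple, 4 home run, 5 walk or hit by pitcher.
# ASSUMPTION: 0.885 (triple/home-run boundary) is not stated in the lecture.
CUTOFFS = [0.0, 0.68, 0.83, 0.88, 0.885, 0.913]
EVENT_NAMES = ["out", "single", "double", "triple", "home run", "walk/HBP"]

def event_code(u: float) -> int:
    """Return the code of the event whose interval contains u (0 <= u < 1)."""
    # Find the last cutoff that is <= u, like VLOOKUP with approximate match.
    return bisect_right(CUTOFFS, u) - 1

for u in (0.22, 0.75, 0.915):
    print(u, EVENT_NAMES[event_code(u)])
```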

Okay, so compared with that spreadsheet where everyone hit a home run or

made an out, we've just got a more complicated situation.

Well, now the trick is how do we model how things go during an inning.

Many more things can happen

than a home run or an out.

So we have to look at the concept of the state which will be very important for

future videos.

So in baseball during an inning, how many states are possible?

How many situations?

We could have zero, one, or two outs.

That's three situations.

And each base can be either empty or full.

So that's 2.

First base could be empty or full.

Second base could be empty or full, and the same goes for third base.

And so, that gives you 3 times 2 times 2 times 2, so there's 24 possible situations in baseball.

Sorry, there's 24, I just do it like this.

There's 24 possible situations.

And we can list them all.

And the way I've listed them, I use the letter A if nobody's on the base.

Okay, basically, we'll have zero, one, and two outs, but

there are 8 on-base situations.

An A means the base is empty, and a one means the base is occupied.
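The 24 states can be listed mechanically. The sketch below uses the lecture's notation (A for an empty base, 1 for an occupied one, read as first, second, third); the variable names are hypothetical.

```python
from itertools import product

# Every combination of empty ("A") or occupied ("1") for first, second, third.
base_states = ["".join(bases) for bases in product("A1", repeat=3)]

# Pair each on-base situation with 0, 1, or 2 outs.
states = [(outs, bases) for outs in range(3) for bases in base_states]

print(base_states)  # 8 on-base situations, from 'AAA' to '111'
print(len(states))  # 24
```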

So here we need to have what's called transition probabilities.

In our case, given where we started and

what event happens, which state do we go to?

So this state, triple A, means nobody's on base.

1AA means man on first.

A1A, man on second.

Here's a more complicated one.

1A1, man on first and third.

Okay, so now we can look and see, given the events, again zero's an out,

one through five you can see what's listed in orange.

Here, let's look at a situation here and see if we can figure out what goes on.

So let's look at

these six events that could happen when we have a man on second.

Okay, so what can they do?

Okay, so zero is an out, we'd still have a man on second, and no runs scored.

This column gives the number of runs that score.

Okay, so, if I got a single with a man on second, what's going to happen?

That guy's going to score, we get a runner on first.

Okay, now the next event is a double, so

the guy on second will be knocked in.

I get a run and I've got a man on second.

Okay, now I've got a man on second here.

Again, we have that man on second.

We've got a triple so the run scores and we now have a runner on third base.

Okay, now, if I've got that man on second and

hit a home run, well, I get two runs.

I knock in the guy from second base and the batter scores.

Okay, now if I walk I've got first and second.
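The man-on-second walkthrough above can be expressed as a small transition function. This is a sketch under the lecture's simplifying assumptions (outs move nobody, singles and doubles advance runners two bases). The walk rule, where only forced runners move, is inferred from the first-and-second result and ordinary baseball rules rather than read from the spreadsheet, and the name `transition` is hypothetical.

```python
def transition(bases: str, event: int):
    """Return (new_bases, runs) for a state like 'A1A' and an event code 0-5."""
    first, second, third = (c == "1" for c in bases)
    runs = 0
    if event == 1:    # single: runners move up two bases, batter to first
        runs = second + third
        first, second, third = True, False, first
    elif event == 2:  # double: runners move up two bases, batter to second
        runs = second + third
        first, second, third = False, True, first
    elif event == 3:  # triple: everyone scores, batter to third
        runs = first + second + third
        first, second, third = False, False, True
    elif event == 4:  # home run: everyone scores, including the batter
        runs = first + second + third + 1
        first, second, third = False, False, False
    elif event == 5:  # walk or hit by pitcher: only forced runners advance
        if first:
            if second:
                if third:
                    runs = 1
                third = True
            second = True
        first = True
    # event == 0 (out): assumed to leave every runner where he is
    new_bases = "".join("1" if occupied else "A" for occupied in (first, second, third))
    return new_bases, int(runs)

for event in range(6):
    print(event, transition("A1A", event))
```

Running it for the state A1A reproduces the six cases just described: the out leaves the man on second, the single, double, and triple each score one run, the home run scores two, and the walk gives first and second.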

We've got this for every on-base situation, okay,

and the outs will come in later.

We've got, based on the event that happens, where we go.

Okay, so how do we model the inning?

Okay, we start out with zero outs, and we start with nobody on base.

So, in column P, I'm going to assume a maximum of 27 batters,

25 batters in an inning.

Or sorry, I guess that's 26 batters in an inning.

I just randomly picked that.

That would mean you bat around three times.

I'm not worried about that happening in a typical game.

So you pick the random number and

then you use that VLOOKUP idea to pick out the event.

So a 0.915 will be a walk because that's greater than 0.913.

A .835 would be a single.

Okay?

And etc.

Now .22 would be an out.

So I figured out the results, and

now I need to figure out what they do to the state.

Okay.

Well that's where I use a lot of what we learned about Excel.

I used the INDEX and MATCH functions.

And so these formulas are fairly complex,

I don't want to go through them, but basically okay, if the guy walks, okay?

Then, okay, when the original state was AAA, okay?

So, basically, you look up in the first column, okay.

So, you look up in the first column basically, going from columns I,

if you look here where the results are what happens.

There's a range named index result and

index result refers to these columns where you know what happened after the event.

Okay, so from the first column, okay, the result was a walk, and I started in AAA.

This is going to pick it up.

So that was a walk, it'll put the guy on first.

It picks out the right row for this range, okay, based on the event.

See, it counts down to the on-base situation and

then moves from there based on the event.

Picks out the right situation, in the first column gives you who's on first.

The second column, who's on second?

That's why I look up in the second column.

The third column, who's on third?

I can sound like Abbott and Costello.

And then, how many runs score comes from the fourth column.

And then the total runs I just add up on each plate appearance,

how many runs were scored.

And of course we'll end the inning on the third out.

Okay, now what's the new state?

Now here's something using text functions,

which we'll learn a lot more about when we do football.

Okay, because they're very important in analyzing football play-by-play, okay, or

even basketball play-by-play, I suppose.

Okay, but now the & sign means combine things.

So I know who's on first.

I know the situation at first, second, and third.

I can concatenate them using the & sign, so after that single there's a man on first.

And then I have to track the number of outs, of course.

Okay, so, I take the number of outs coming into the plate appearance and basically

if the event is a zero then I add one out, because the zero means it's an out.

Okay, so there we had a walk, a single.

So now I've got first and third.

And then, okay, then I've got an out, so I've got first and third and one out.

I've got an out, first and third and two outs.

And then I've got the third out, and we didn't score any runs.

So how do I track the number of runs we score?

Well, what I gotta do is find the first row here that has three outs.

And then go one row above it, the way I've set this up,

because the first three occurs one row after the third out occurs.

So then I use the MATCH function again to find the first three, and I go one

row above that and from column B, I get the total runs which I have to accumulate.
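The "find the first three, then go one row up" step can be mimicked with list operations. This is a sketch with a hypothetical inning (walk, out, out, single, single, out, followed by leftover rows that are ignored); the per-batter runs are supplied by hand rather than computed, and all variable names are assumptions.

```python
from itertools import accumulate

# One row per plate appearance; event codes as in the lecture (0 = out).
events     = [5, 0, 0, 1, 1, 0, 0, 0]  # the last rows fall after the inning ends
runs_on_pa = [0, 0, 0, 0, 1, 0, 0, 0]  # runs scored on each plate appearance

# Outs coming INTO each plate appearance, so the first 3 shows up one row
# after the row on which the third out was actually made.
outs_after = list(accumulate(1 if e == 0 else 0 for e in events))
outs_before = [0] + outs_after[:-1]

total_runs = list(accumulate(runs_on_pa))  # running total, like the spreadsheet's total runs

first_three = outs_before.index(3)         # plays the role of MATCH(3, ..., 0)
inning_runs = total_runs[first_three - 1]  # read the total one row above

print(first_three, inning_runs)
```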

So let's get an inning where we score some runs.

This is somewhat slow because every time the data table has to recalculate, which

is slow.

So I'm hitting F9 to recalculate the data table.

Some time we should score some runs, I would hope.

Yeah here we go. We scored one run.

All right, so how did that happen?

Well a home run.

Okay, we made an out.

We hit a, sorry, no, we walked.

We made an out.

We walked. We got a man on first.

We made an out.

We got two outs.

Then, okay, I got a single.

I got first and third.

And then the next single, basically, I scored a run and

now I have first and third.

So that's one.

So you want to play this out 10,000 times, let's say.

These are data table structures.

Right there there's 10,000 rows.

And then here we've got the output cell,

which is going to be how many runs were scored.

Okay, and then basically I could average that and multiply by 8.9.

And that's basically my prediction for how many runs would be scored there.
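Outside a spreadsheet, the whole procedure (draw an event, update the state, stop at three outs, repeat 10,000 times, multiply the average by 8.9) might look like the sketch below. It is not the lecture's workbook: it tracks runners as a set of occupied base numbers instead of the A/1 strings, it has no 26-batter cap, the 0.885 cutoff is an assumed value the lecture does not give, and every function name is hypothetical.

```python
import random
from bisect import bisect_right

# Cutoffs quoted in the lecture, except 0.885 (triple/home-run boundary),
# which is an ASSUMED value for illustration.
CUTOFFS = [0.0, 0.68, 0.83, 0.88, 0.885, 0.913]
# Hit code -> (bases every runner advances, base the batter ends on).
ADVANCE = {1: (2, 1), 2: (2, 2), 3: (3, 3), 4: (4, 4)}
INNINGS_PER_GAME = 8.9

def play(runners, event):
    """Apply one event to the set of occupied bases; return (runners, runs)."""
    if event == 0:                      # out: nobody moves
        return runners, 0
    if event == 5:                      # walk: only forced runners move up
        first_empty = 1
        while first_empty in runners:
            first_empty += 1            # 4 means the bases were loaded
        occupied = runners | set(range(1, first_empty + 1))
        return occupied - {4}, int(first_empty == 4)
    advance, batter_base = ADVANCE[event]
    moved = [r + advance for r in runners] + [batter_base]
    return {m for m in moved if m <= 3}, sum(m >= 4 for m in moved)

def simulate_inning(rng):
    """Play plate appearances until the third out; return runs scored."""
    outs, runs, runners = 0, 0, set()
    while outs < 3:
        event = bisect_right(CUTOFFS, rng.random()) - 1
        if event == 0:
            outs += 1
        runners, scored = play(runners, event)
        runs += scored
    return runs

def runs_per_game(n_innings=10_000, seed=0):
    """Average many simulated innings and scale to a game."""
    rng = random.Random(seed)
    total = sum(simulate_inning(rng) for _ in range(n_innings))
    return INNINGS_PER_GAME * total / n_innings

print(runs_per_game())
```

Changing `seed` plays the role of pressing F9: each run gives a slightly different estimate.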

Okay, so if I hit F9 and recalculate this,

I get 3.9; doing it again, 3.89,

3.86, so I mean, I come in a fair amount low.

Let me just make sure I've got the league average is 4.18,

and I've got, let's see 55, 42.

Got the right numbers there.

So my simulation is coming in around 3.9 runs it appears.

That's about 6% low.

There I'm getting 3.8.

Okay.

Now, if I put in the team that scored the most runs in

the American League, who would that be?

I guess it's the Angels, and

the simulation should predict more runs, and again there are no errors here.

And so clearly, it's going to come in low, since no outs advance the runners either.

You'd expect this to come in a little low, but that's not hard to fix if you

really take the time and look up what's the probability of a double play,

what's the probability a ground out advances a guy from second to third,

what's the probability a fly ball lets the guy tag up from third, and stuff like that.

But here's the Angels, who did a bit better.

Okay, so I hit F9 there.

They'd score 4.11 runs.

4.06 runs.

4.3 runs there.

See there's a lot of variability even if I run this ten times.

But of course they score more runs,

basically 4.33, 4.26 runs, etc.

But the point is you can

make this really very exact when you put in more events,

like you put in errors, you put in groundballs that result in double plays,

groundballs that would advance all runners.

You can put in stolen bases, you can put in the chance of an error, or

an error that usually advances all runners one base.

And if you put that in, you can get something that really is very, very accurate,

and teams can use this to see, if they traded player A for player B

and put player B in the lineup,

how they would do.