Learn how probability, math, and statistics can be used to help baseball, football, and basketball teams improve player and lineup selection as well as in-game strategy.

A course from University of Houston System

Math behind Moneyball

36 ratings

From this lesson

Module 5

You will learn basic concepts involving random variables (specifically the normal random variable, expected value, variance, and standard deviation). You will learn how regression can be used to analyze what makes NFL teams win and decode the NFL QB rating system. You will also learn that momentum and the “hot hand” are mostly a myth. Finally, you will use Excel text functions and the concept of Expected Points per play to analyze the effectiveness of a football team’s play calling.

- Professor Wayne Winston, Visiting Professor

Bauer College of Business

Okay, we're going to begin our study of football.

We're trying to understand what makes NFL teams win.

Now the data's a bit old and it takes a bunch of time to put together this data so

I don't think we could have time to update it.

But okay, so you want to know what makes NFL teams win?

So we're going to use regression.

So you've got to have a dependent variable.

So y equals a team's scoring margin during the season.

Scoring differential for the whole season.

So for example the 2003 Chiefs, I think they had

that great running back Priest Holmes, and my son watching [INAUDIBLE] too.

They won by 152 points for the 2003 season.

That's about nine points a game.

Now what variables could you use to predict that?

The margin.

So I came up with eight of them.

So we've got eight independent variables.

They're outlined in the little handout.

So let's go through what they are.

Okay, return touchdown differential.

In other words, how many return touchdowns you had minus your opponent's?

The Bears at this time had Devin Hester, and that was worth something,

I mean because he ran back all those kicks for touchdowns.

Penalty differential,

how many yards of penalties on you minus yards of penalties on your opponent?

Now that probably is going to hurt your scoring margin because

a positive means you have more penalties than your opponent.

Passing yards per attempt.

This is the most important variable.

And this goes back to a great statistician, Bud Goode,

who I think nearly 50 years ago in the early 1960s said,

yards per pass attempt is the key to winning in the NFL.

He was right then, and he's probably even more right now.

Unfortunately he passed away in 2010.

So passing yards per attempt, you would take the total yards gained, and

you have to include sacks in there, because a sack counts as a pass attempt.

So if they gained 5,000 yards on passes and were sacked for 400 yards,

they had 4,600 yards net, and each sack counts as an attempt.
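The arithmetic can be sketched in Python as follows. This is a minimal illustration, not the course's spreadsheet: the function name is hypothetical, and the attempt and sack counts are made up. Only the 5,000 and 400 yard figures come from the example.

```python
def net_passing_yards_per_attempt(pass_yards, sack_yards, pass_attempts, sacks):
    """Net passing yards per attempt, counting each sack as a pass attempt."""
    net_yards = pass_yards - sack_yards      # yards lost on sacks are subtracted
    total_attempts = pass_attempts + sacks   # every sack counts as an attempt
    return net_yards / total_attempts

# 5,000 yards gained and 400 yards lost on sacks come from the example;
# 600 pass attempts and 40 sacks are made-up numbers for illustration.
ypa = net_passing_yards_per_attempt(5000, 400, 600, 40)
print(round(ypa, 2))
```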

And you need running yards per attempt too, turnover differential, or

sorry, turnovers that your team committed, and defensive turnovers.

And then the same stats for the defense.

Running yards per attempt also I forgot.

Defensive passing yards per attempt, what you yielded to your opponent.

Defensive running yards per attempt, okay, what you yielded, and defensive turnovers.

So the idea is, if you know these eight variables, can you

predict how much a team will outscore the other team by?

There's not much here on special teams, okay?

But I guess I better figure out some way to put that in.

But this does pretty well.

So we're going to run a regression and we'll see the NFL was all about passing.

And this data goes through 2006, which is almost ten years ago and we know

the NFL is more about passing because teams pass more than they did then.

But if we go to the Data Analysis add-in,

we talked about regression earlier in the course.

Okay, so now I already think I have this entered in here.

Wide range, let's go Group NFL, we'll call this.

Let's just reset the settings I should put this in from scratch because

there are a couple of unusual settings here in gold to use.

I'm going to use the constant term to be zero here just to show you could do it.

See, if all these independent variables are zero, it would mean your team is exactly

as good as your opponents and you would predict the margin should be zero.

If, when all your independent variables are zero, you have a reason to predict

the dependent variable is equal to zero, then I would set the constant equal to zero.
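Outside Excel, setting the constant to zero amounts to fitting least squares without a column of ones. The sketch below assumes NumPy and uses synthetic data with made-up coefficients. It is not the NFL data set from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 made-up team-seasons with two predictors.
X = rng.normal(size=(200, 2))
true_coefs = np.array([60.0, -65.0])   # hypothetical values for illustration
y = X @ true_coefs + rng.normal(scale=5.0, size=200)

# No column of ones is appended, so the fit is forced through the origin.
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

# With every predictor at zero, the prediction is exactly zero.
zero_prediction = np.zeros(2) @ coefs
print(coefs, zero_prediction)
```

Adding a column of ones to `X` would bring the constant term back.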

Okay, so the y range is going to be margin.

That's the [INAUDIBLE].

The x range

is going to be these columns.

With the little button, let's do labels, constant of 0.

This is in the regression sheet last range but

I mean we can put it, down here is probably okay.

We can check the Residual box to look for outliers,

I don't think we're going to really need that here.

Okay, so I think I get an R-squared of 87%.

So we explained 87% of the variation in score margin.

Okay, so again what we should look for is which P values are less than 0.1.

And everything is less than 0.1 by a mile.

So for example here,

the P value for passing yards per attempt is 10 to the minus three.

So that's 0.29 0 and a 2.

So passing yards are extremely important.

Now let's try and interpret these coefficients.

And again the biggest coefficient here is the 62.

You get one more yard per pass attempt on offense, adjusting for

everything else, you'll win by 62 more points, which is four points a game.

One more yard on running attempts and going from four to five yards on

a run seems more important than six to seven yards on a pass, but

it's only worth 26 points.

An offensive turnover will cost you three points, around 2.8 points.

A defensive turnover will help you by 3.5 points.

Giving up one more yard on defense passing per attempt, will cost you 68 points.

Having a weak pass defense seems to be a little bit more costly than having

a good pass offense is helpful.

And giving up one more yard of rushing will cost you 23 points, etc.

An extra 100 yards of penalties will cost you 5.6 points.

All right, six points.

Okay, a return touchdown's worth about three points.

You might say it should be worth more than that but don't forget that's

after adjusting for turnovers, and a lot of those are pick sixes.

So part of the effect of that return is in the defensive turnover.
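To make the interpretation concrete, here is a small Python sketch that applies the rounded coefficients quoted above to a change in one or more variables. The dictionary keys and the function name are hypothetical labels, and the signs follow the "cost you" and "help you" wording in the lecture.

```python
# Rounded coefficients as quoted in the lecture (points of season scoring
# margin per one-unit change). Key names are hypothetical labels.
COEFS = {
    "off_pass_ypa": 62.0,
    "off_rush_ypa": 26.0,
    "off_turnovers": -2.8,
    "def_turnovers": 3.5,
    "def_pass_ypa": -68.0,
    "def_rush_ypa": -23.0,
    "penalty_yard_diff": -0.056,   # 5.6 points per 100 extra penalty yards
    "return_td_diff": 3.0,
}
GAMES_PER_SEASON = 16  # regular-season length in the years covered by the data

def margin_change(deltas):
    """Predicted change in season margin for the given changes in predictors."""
    return sum(COEFS[name] * change for name, change in deltas.items())

season = margin_change({"off_pass_ypa": 1.0})
print(season, season / GAMES_PER_SEASON)
```

One more yard per pass attempt gives 62 points over the season, or a little under four points a game.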

Okay, so we explained 87% of the variation, and the standard error is 35 points.

So 95% of our forecasts

are accurate within 70 points, or about four points a game, and

there's a lot missing here like red zone defense.

I don't have any stats on that.

But basically this is treating every yard you give up passing and

rushing basically as the same no matter where it occurred and

we all know red zone defense is really crucial.

Punting is important, good field goal kickers are becoming more important,

too important in my opinion.
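The 70-point figure quoted earlier can be reproduced with a quick calculation. This assumes the usual rule of thumb that about 95% of forecasts fall within two standard errors, and a 16-game regular season.

```python
standard_error = 35.0   # points, from the regression output
games = 16              # regular-season games in the years covered by the data

# Rule of thumb (an assumption here): about 95% of forecasts land within
# two standard errors of the actual value.
band = 2 * standard_error
print(band, band / games)
```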

So now how can we sort of show in some different ways, that really,

passing offense and defense are the key?

Okay, well, you can get what's called a correlation between

each of your independent variables and your dependent variable.

Okay, so for instance I'm going to look at, you got it right here.

So correlations are measures of linear association, okay?

And they're explained in the Data Analysis and Business Modeling book,

I'll give you a chapter reference.

Let me just check that.

So the chapter reference on correlations is chapter 56.

And so a correlation is a unit-free measure of linear association.

So it's a correlation between two columns of data.

It's between minus 1 and plus 1.

And the closer it is to plus 1 means

the stronger the positive linear relationship between the two variables.

The closer to minus 1 the stronger the negative linear relationship.

And I'll give you a more concrete interpretation in a second.

But if we look at the scoring margin,

how does it correlate to passing yards per attempt?

Now there's a function called CORREL in Excel.

So I want to correlate the scoring margin with passing yards per attempt.

That's this column.

Oops.

Got that messed up there.

We want column

C with column G.

Sorry, it's not typing very well here, we want

column C and then I need a comma with column G.

So there's our comma.

Then I'll come up here to column G.

And that's .66 correlation, fairly close to plus 1.
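For readers working outside Excel, the same correlation can be sketched in Python. The six data points below are made up for illustration and are not the lecture's columns C and G. The helper name `pearson_r` is hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: a unit-free measure of linear association."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum()))

# Made-up passing yards per attempt and season scoring margins.
ypa = [5.1, 5.8, 6.2, 6.9, 7.4, 6.0]
margin = [-90, -20, 10, 80, 150, -35]

r = pearson_r(ypa, margin)
# np.corrcoef computes the same quantity.
print(round(r, 3), round(float(np.corrcoef(ypa, margin)[0, 1]), 3))
```

Because the measure is unit free, rescaling either column (say, yards to feet) leaves the result unchanged.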

Now since I screwed up typing that I'm going to try a little trick here.

I'm going to copy that formula because it's just really simple to change it here.

So I want defensive passing yards per attempt.

All I gotta do is make the Gs into Js because that's the right column for that.

And that's minus 0.5.

You expect a negative correlation: give up more yards passing,

and you won't do as well.

Running yards per attempt is column H.

And you'll see the running yards per attempt just isn't as strongly

correlated or related to the margin as the passing.

.20.

And if I go right here, defensive running yards is column K.

And that's just not correlated really, your rushing defense is just not that

correlated with your scoring margin, but your passing is.

Now what does a correlation of .66 here mean?

It means basically after adjusting this correlation.

So this is without adjusting for

other variables, but it's in terms of standard deviations.

If you raised your passing yards per attempt by one standard deviation,

your scoring margin would increase by 0.66 standard deviations.

And similarly if you looked at improving your rushing yards per attempt by one

standard deviation,

you'd only increase your scoring margin by 0.2 standard deviations.
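The standard-deviation reading of a correlation can be checked numerically: if both columns are converted to z-scores, the slope of a one-variable regression equals the correlation. The sketch below assumes NumPy and uses synthetic data rather than the NFL data.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)               # synthetic predictor
y = 0.7 * x + rng.normal(size=500)     # synthetic response

r = np.corrcoef(x, y)[0, 1]

# Convert both columns to z-scores (mean 0, standard deviation 1).
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# Least-squares slope of zy on zx: predicted change in y, in standard
# deviations, per one-standard-deviation change in x.
slope = (zx * zy).sum() / (zx**2).sum()
print(round(r, 4), round(slope, 4))
```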

Okay, now one standard deviation better than average on something means

you're in the 84th percentile.

We learned that from the normal random variable a couple of videos ago.
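The 84th percentile figure comes from the standard normal distribution and can be checked with Python's standard library.

```python
from statistics import NormalDist

# Share of a normal population below one standard deviation above the mean.
pct = NormalDist().cdf(1.0)
print(round(pct, 4))  # 0.8413
```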

So going from the 50th to the 84th percentile in passing yards per attempt,

has basically triple the effect of going from the 50th percentile

to the 84th percentile on running yards per attempt, okay?

And so basically,

that again confirms our view that passing is more important than running in the NFL.

Now there's another way to look at this,

just run a regression using the rushing data to predict the scoring margin.

So I did that in the worksheet rushing.

And the R-squared is like 6%, okay?

The p values are significant, but basically your rushing offense and

defense explains 6% of your scoring margin.

Now if all I use are the two passing numbers, yards per attempt on offense passing and

defensive net yards per attempt on passing.

I can explain 70% of the variation in scoring margin.

That's almost everything; the whole regression explained 87%.

So basically how you do on yards per pass attempt on offense and how you do on yards

per pass attempt given up on defense, that just about explains everything in the NFL.
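Comparing R-squared across subsets of the variables can be sketched as follows. The data are synthetic, built so that the "passing" columns matter more than the "rushing" columns, and the fit includes an intercept, unlike the zero-constant regression used earlier. All names and coefficients here are assumptions for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an ordinary least-squares fit of y on X, with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coefs
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

rng = np.random.default_rng(2)
n = 300
passing = rng.normal(size=(n, 2))   # stand-ins for offensive and defensive passing
rushing = rng.normal(size=(n, 2))   # stand-ins for offensive and defensive rushing
noise = rng.normal(scale=30.0, size=n)
y = passing @ np.array([60.0, -65.0]) + rushing @ np.array([20.0, -20.0]) + noise

r2_pass = r_squared(passing, y)
r2_rush = r_squared(rushing, y)
r2_all = r_squared(np.column_stack([passing, rushing]), y)
print(round(r2_pass, 2), round(r2_rush, 2), round(r2_all, 2))
```

The full model can never have a lower R-squared than either subset, so the useful comparison is how close the passing-only fit gets to it.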

Now I'll give you a homework problem where I give you some college football data and

you'll see that there rushing is much more important than it is in the NFL and

I think that makes sense, the teams run more and

basically I think the runs are used to set up the pass more.

But there's really not much evidence here that you need to run to set up the pass.

That may be true, but I think even in today's NFL,

it would be less true than it was.

But this is a very powerful use of regression to try and

understand what makes a team win.

When we get to basketball, we'll talk about Dean Oliver,

who's the analytics coordinator for the Sacramento Kings now and

was the head of ESPN statistics for a while.

He wrote the book Basketball on Paper, which is a book you should look at.

And basically, Dean came up with the four factors model,

four factors that explain success in basketball, and we'll get to that later.

In our next video, we'll talk about the famous NFL quarterback rating and

how regression can sort of unravel what really matters

when you're computing the NFL quarterback rating.

And we'll talk about some of its flaws and

we'll try to come up with a better way to rate and

compare quarterbacks using Brian Burke's great site,

advancedfootballanalytics.com.