In this module, we're going to discuss Survivorship Bias and Data Snooping.

We've already discussed survivorship bias in an earlier module and we'll talk a

little bit about it again here. Consider the following investment.

We're going to purchase an equi-weighted portfolio of the top 20 stocks in the S&P

500. Note that the stocks are chosen and fixed

today. In order to get an idea of how this

portfolio would perform, we decided to back-test it using historical data as

follows. We're going to get the last 20 years of

return data for each of the 20 stocks.

On the first day, that is the first day 20 years ago, we're going to set up the

initial equi-weighted portfolio. If the stock didn't exist back then, for

example Google, then omit it from the portfolio and just form an equi-weighted

portfolio of the stocks that did exist that day.

Every month we're going to rebalance the portfolio so that it remains equi-weighted

and we're going to take transaction costs into account.
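The rebalancing scheme just described can be sketched in code. This is a minimal illustration, not the lecture's actual back-test: the function name, the 10 basis point cost rate, and the 21-trading-day month are all assumptions.

```python
import numpy as np

def backtest_equal_weight(returns, rebalance_every=21, tc_rate=0.001):
    """Back-test an equally weighted portfolio.

    returns:         (n_days, n_assets) array of daily simple returns
    rebalance_every: trading days between rebalances (~21 per month)
    tc_rate:         proportional transaction cost (hypothetical 10 bps)
    Returns an array of daily portfolio returns net of costs.
    """
    n_days, n_assets = returns.shape
    weights = np.full(n_assets, 1.0 / n_assets)   # start equally weighted
    port = np.zeros(n_days)
    for t in range(n_days):
        r = returns[t]
        port[t] = weights @ r                     # gross return for day t
        # weights drift with relative performance between rebalances
        weights = weights * (1.0 + r) / (1.0 + port[t])
        if (t + 1) % rebalance_every == 0:
            target = np.full(n_assets, 1.0 / n_assets)
            turnover = np.abs(target - weights).sum()
            port[t] -= tc_rate * turnover         # charge rebalancing costs
            weights = target
    return port
```

Note that if every stock earns the same return, the weights never drift and no costs are charged; with dispersed returns, the rebalancing cost shows up as a small drag on performance.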

What we will then do, finally, is plot the annual net returns, that is, rt against t, where rt is the net return at time t realised over the previous year.

So we have the following question. Do you think the plot will be

representative of the future performance of the investment?

Well, we're going to get a plot like the following: time runs along the horizontal axis, from 20 years ago up to today, with the 0% return level marked, and the one-year return rt plotted at each point in time. The value at any date t corresponds to the realized return at day t over the previous year, that is, from t minus 1 year back to t.

The question is, do you think that this plot will be representative of the future

performance of the investment? Well the answer is certainly not.

Why not? Well, what we've actually done is we've

introduced survivorship bias into this problem, and we've introduced an enormous

amount of it because what we've done is we've picked our portfolio today here, and

we picked the 20 best stocks in the S&P 500.

Now we go back 20 years ago, and what we've done is we're back-testing the

performance of an equi-weighted portfolio of these 20 stocks, but we've actually

chosen those stocks by implicitly looking into the future.

We've gone forward 20 years to today and picked the best-performing stocks.

After all, the top 20 stocks in the S&P 500 have surely performed very well over

the last 20 years. And so we've actually introduced an

enormous amount of bias into our

This is equivalent to going 20 years into the future today, picking the 20 best stocks in the S&P 500 at that point, and actually trading those stocks today; that's an example of survivorship bias.
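To see how much bias this look-ahead selection injects, here is a small simulation (an illustrative sketch with made-up parameters, not data from the lecture): every simulated stock has zero true mean return, yet a back-test of today's top 20 performers looks strongly profitable.

```python
import numpy as np

# Hypothetical universe: 500 stocks, 240 months of i.i.d. zero-mean
# returns, so no stock has any true edge.
rng = np.random.default_rng(42)
rets = rng.normal(0.0, 0.05, size=(240, 500))

total_growth = (1.0 + rets).prod(axis=0)   # cumulative growth of each stock
top20 = np.argsort(total_growth)[-20:]     # the 20 "best" stocks as of today

backtest_mean = rets[:, top20].mean()      # back-tested average monthly return
universe_mean = rets.mean()                # true average, close to zero
print(backtest_mean, universe_mean)        # the back-test looks great anyway
```

The back-tested mean is clearly positive even though the true mean of every stock is zero: the entire "performance" comes from having selected the winners after the fact.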

It's an extreme example of survivorship bias, and it should be clear to everybody

what we have done and why it is wrong. But actually survivorship bias crops up an

awful lot in finance and not always as obviously as we've seen in this example

here. So it needs to be borne in mind by investors, risk managers and so on. People always need to be on the lookout for it. So here's another example of survivorship

bias in action. It's called the football game scam.

Sometimes it's called the horse racing scam, when the context is changed to horse

racing. We're going to stick with the football

version. On each of 10 consecutive Wednesdays you

receive a letter predicting the winner of a big football game the following Sunday.

We're going to assume here that a football game is either won or lost, and that there

are no ties in a football game. Each week the prediction was correct.

In week 11 however, a letter arrives but this time it seeks payment of $10,000

before revealing the prediction for the next game.

The question is what should you do? Should you pay the $10,000 or should you

ignore it? Well, this is an example of a scam; the answer is you should ignore it, and the reason is as follows.

So what's going on here is the following. The scam artist, the person perpetrating the scam, has been playing the following game. In week 1, the scam artist had a total

population of 2 to the power of 10 people. To half of this group he sent a letter

saying team A would win, and to the other half he sent a letter saying team B would

win. For week 2, let's suppose that team A won;

this means 2 to the power of 9 people saw a prediction in week 1 that was correct.

He now splits this 2 to the power of 9 people again into two groups, half of them

get a prediction for team A winning, the other half get a prediction

for team B winning. By week 3, there are 2 to the power of 8 people remaining who have received correct predictions in both weeks 1 and 2. You keep going like this until week 10, when there are 2 to the power of 1 people remaining, which is equal to 2, and you were one of them.
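The repeated halving can be simulated directly. This is a toy sketch of the scam's mechanics: whatever the actual game outcomes, exactly one recipient is left after 10 weeks.

```python
import random

# 2**10 = 1024 initial recipients; each week the half that received the
# correct prediction is kept and the other half is dropped.
recipients = list(range(2 ** 10))
for week in range(10):
    half = len(recipients) // 2
    group_a, group_b = recipients[:half], recipients[half:]
    team_a_won = random.random() < 0.5               # the actual game outcome
    recipients = group_a if team_a_won else group_b  # keep the correct group

print(len(recipients))  # -> 1: one person saw 10 correct predictions
```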

You received a letter saying team A would win, another person received a letter

saying team B would win. Presumably team A won, which means that after week 10 there is just one person remaining, and that is you. So you're the ultimate survivor here,

you're the one survivor out of 2 to the power of 10 people who sees the correct

prediction 10 weeks in a row. There is no skill here, no skill

whatsoever. And so this is an example of extreme

survivorship bias, where you see the track record of only one person and that track

record is perfect, and it's your track record 10 weeks in a row of perfect

predictions. We're now going to discuss an example of

data snooping. Data snooping arises in many contexts, not just in finance but beyond it, and very often the problems with data snooping are quite subtle and hard to spot. So we're going to see an example here.

A bank has 4 years' worth of daily historical returns data on the Dollar/British Pound exchange rate. It employs the following mechanism for

generating a trading strategy. It first normalizes the entire return data

so that it has mean zero, and variance one.

Now just to point out here, normalizing data is a standard and well justified

statistical technique in general. We're going to use 75% of the data for training; this corresponds to approximately 750 returns, because once you take out holidays and weekends, a typical year has about 250 trading days, so 75% of 4 years gives approximately 750 trading days, or 750 returns.

The remaining 25% of the data set, which corresponds to one year, so approximately

250 days, is kept as what's called a hold-out test set.

We're going to use this test set to evaluate whatever strategy is yielded by

the training data. The trading strategy appears to be a great

success. On any given day it uses the returns of

the previous 20 days to forecast the direction of the next day's return.

However, the trading strategy performs very poorly in practice.

The question is, why? So let's think about this for a moment.

What we have done is we've split up our data set into training data and test data. So let the following be our training data: about 750 observations, running from day 1 up to day 750. Within the training data, consider some day t and the 20 days from day t minus 20 up to day t. Now recall that

the trading strategy is based on the previous 20 days of returns.

So suppose we're at day t, and we want to know the return from t to t plus 1; call its mean mu t to t plus 1. We have also seen the mean return over the previous 20 days, mu t minus 20 to t, and this quantity is known at day t. Suppose for example that it is greater than 0, maybe much greater than 0.

What does that tell you about mu t to t plus 1, the return you expect between dates t and t plus 1?

Well if you think about it you should see that this implies that mu t to t plus 1

will be less than 0 on average, why is that?

Well the reason is because of this normalization we did.

We actually normalized the 1000 data points to have mean zero. So if all 1000 data points have mean zero, and the 20 data points here have a strictly positive mean, in fact much greater than zero, then the remaining data points must have a mean return that's less than zero; in particular, the return from t to t plus 1 must, on average, also be less than zero.
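This mechanical consequence of normalizing the full data set can be checked numerically. The sketch below (with made-up returns) verifies that the mean outside any 20-day window exactly offsets the mean inside it.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0005, 0.01, size=1000)   # 4 years of made-up daily returns
r = (x - x.mean()) / x.std()              # normalize: mean 0, variance 1

window = r[100:120]                       # some 20-day window
rest = np.delete(r, np.arange(100, 120))  # the remaining 980 returns

# All 1000 normalized returns sum to zero, so the rest must offset
# the window exactly:  mean(rest) = -20 * mean(window) / 980
print(np.allclose(rest.mean(), -20 * window.mean() / 980))  # True
```

So whenever the 20-day mean is positive, the mean of the other 980 normalized returns, including the next day's, is forced to be negative, which is exactly the bias described above.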

And so in fact the trading strategy we've determined may tell us to sell on day t, expecting to make money from t to t plus 1, because the mean return between t and t plus 1 will be negative.

So, this might seem to be a small bias. But it is a bias, nonetheless, and it can

actually mess things up. Now, this is what's happened with the

training data. Because of our normalization process,

we've actually introduced a bias into how we determine this trading strategy.

So now, what about the test set? Well, the test set is meant to be an

independent set of data. It has 250 returns, and the idea behind

the test set is that it should be completely unpolluted, if you like.

It should not have been polluted by the act of coming up with a trading strategy.

However, it's subtle, but it has been polluted.

And the reason is as follows. Suppose I test my trading strategy, which

I've determined up here using the training data.

Suppose I test that strategy on the test set.

Again let's say I'm at date t, I go back to date t minus 20.

My trading strategy says, look at what's happened on the previous 20 days.

So I'm just summarizing the previous 20 days by the mean return over those 20

days. There are other features of these 20 days as well that could also be part of the strategy.

But let's suppose that on this particular day t, this mean over the previous 20 days is less than 0.

Well, what does that say for days t to t plus 1?

Well, it says in this case that mu t to t plus 1, the return you expect from t to t plus 1, must be greater than 0 on average. Why is that?

Well, the reason is because the test set was part of the overall normalization

scheme. We actually included the test set and

training set, combined them together and normalized that data, so that the mean of

all the training and test data is zero. That means, that if this mean over these

20 days is negative, even though it's only 20 days out of the total of 1,000 days, it

does mean that the rest of the data must have mean greater than 0.

In particular, the return from t to t plus 1, we would expect to be positive, and

that's why we have that. And so the trading strategy will be, if you like, a mean-reverting strategy: if the mean return over the previous 20 days is negative we buy; if it is positive we sell. And so this test set will actually justify

the use of the trading strategy. We will see that we'll make money on this

test set. And the big problem here is the following.

We normalized the entire return data set, and this was a mistake.

A test set should have had nothing to do with the trading strategy. It should have been kept entirely separate from the entire process of finding the trading strategy. Only when we have found the trading strategy do we bring the test set in and use it to evaluate the trading strategy.

But we didn't do that here. We made the mistake of actually including

the test set when we normalized the data. It might seem like a very small issue, but

it is a real issue and it will introduce a real bias.

It will make the strategy look better than it really is.

And the test set will have failed here because it would not have been an

independent test set. It will have been used as part of the

learning process. The learning process being the process

used to generate a learned, good trading strategy.

What we should have done is normalize just the training set. If we had done that, then the test set would have been completely uncorrupted or unpolluted; the behavior we identified above would not have occurred, and the test set would presumably have revealed that there was indeed a problem with the trading strategy determined on the training set. These examples crop up an awful lot in

finance. Many banks and funds over the years have searched for trading strategies in similar ways and have introduced small but still significant biases into their trading strategies and into the learning processes they use to develop them.

The conclusion is, one always needs to be aware of introducing these biases, and

when you're keeping a test set, the test set must be completely independent of the

process that was used to generate the trading strategy.

There are many other examples of statistical biases and difficulties that

arise in finance. Survivorship bias and data snooping are

everywhere and one does need to be aware of this.

There are many other examples that we could discuss, but we don't have a great deal of time to do so in this course. Let me just mention a couple of other examples where biases or statistical difficulties arise.

Here's an interesting question, just how likely is a 25 standard deviation move?

Now the reason I bring this question up here is that at the beginning of the financial crisis, in August 2007, there was a very large move in the market, and some funds lost an awful lot of money. And some participants reported that they

actually saw a 25 standard deviation move and they used the size of this move to

justify the size of their losses. In fact they did more than that: they said they saw a 25 standard deviation move not just once, but several days in a row. So, a quick question: how likely is a 25

standard deviation move? Well, let me tell you.

You can easily estimate this using Monte Carlo together with a variance-reduction technique such as importance sampling, or simply compute it directly. Assume a normal distribution, say N(0, 1); it could be N(mu, sigma squared), but it doesn't matter that I'm assuming N(0, 1). The probability that a standard normal is greater than or equal to 25, which is 25 standard deviations, is approximately equal to 3.05 times 10 to the power of minus 138.

So just to emphasize, written out in decimal this is 0.000...000305, where there are 137 zeros after the decimal point. So the probability of a 25 standard

deviation move is absolutely infinitesimal.

To give you some sort of comparison, the number of particles in the observable universe is, depending on who you ask, something on the order of 10 to the power of 78, or maybe 79 or 80. That is an enormous number, but it is in fact dwarfed by 10 to the power of 138, the reciprocal of our probability. So you can see just how extremely small this probability is.
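If you want to check these magnitudes yourself, the one-day tail probability is tiny but still representable as a double, so it can be computed directly from the complementary error function; the three-day joint probability underflows, so we take logarithms for that.

```python
import math

# P(Z >= 25) for a standard normal: P(Z >= x) = erfc(x / sqrt(2)) / 2.
# The one-day probability (~1e-138) fits in a double, but the three-day
# joint probability (~1e-413) does not, so use log10 for the latter.
p = 0.5 * math.erfc(25.0 / math.sqrt(2.0))
print(p)                        # roughly 3.06e-138

log10_p3 = 3.0 * math.log10(p)  # three independent 25-sigma days in a row
print(log10_p3)                 # roughly -412.5, i.e. on the order of 1e-413
```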

So if somebody reports a 25 standard deviation move, you have to ask the question: were they just exceedingly unlucky, or

perhaps their model was wrong? I think the answer is self-evident. How likely is a 25 standard deviation move

several days in a row? Well, making the heroic assumption that these moves are independent, you would end up with powers of 3.05 times 10 to the minus 138. If it's 3 days in a row, say, you have to cube that, which gives a number on the order of 10 to the minus 413. So I think it is fair to say that in August 2007, it was inaccurate to say you saw a 25 standard deviation move several

days in a row. It is much more likely that your model was wrong, and very wrong at that. Alright, another example where these kinds of problems arise is in the area of retail structured products. I'd like to be able to say more about structured products in this course, but we don't have time.

So I'll just state the following: retail structured products are, if you like, exotic

securities that are sold to retail investors.

So these could be investors with just $10,000 or 10,000 Euro to invest.

Their bank manager or their financial advisor suggests a structured note.

And a structured note works like a bond: there's usually a payoff after 3 or

5 years where you redeem your principal. Maybe you spend $10,000 on day 1, 5 years

later you get your $10,000 back and in-between you get a coupon.

And this coupon is tied to the performance of another asset, often the equity

markets. Why do I say there are biases in

statistical problems here? Well these structured products are often

designed to look better than they are. They tend invariably to back-test very well. So for example, any structured product

that is long Apple. In other words, maybe what I'm saying here

is that the coupon that you get increases as a function of the returns on Apple.

Well, in that case, any structured product that is long Apple will presumably back

test very well from 2000 onwards. Why is this?

Well, consider a plot of the price of Apple, from 1985 up until today. Around 2000 it was, I think, around $7, and you see it has gone up to almost $700 at one point; now it's below $500. But either way, there's been a massive run

up in Apple over the last 10 years. And so, if you hold Apple in your structured product, or rather, if the coupon that you receive from your structured product depends on the performance of Apple, then it's going to look very good when you back-test it.

There are also hidden risks that investors in structured products are often exposed to. One is volatility risk; we haven't said much about volatility yet. It arose, of course, in the area of pricing options in the binomial model, but we didn't explicitly talk about volatility risk. We'll talk about that later in the course.

Investors in structured products are often

You're relying on the credit of the issuer of the structured note.

If you think this is insignificant, then there are two words for you.

Those two words are Lehman Brothers. Lehman Brothers also issued structured

notes, and sold them on to retail investors.

Well, many people lost money when Lehman Brothers went under.

So there is, there can be significant credit risk here as well.

Another problem with structured products is that there is no secondary market

available for them. So if you purchase a structured product or

invest in a structured product. Then, you better be able to hold onto it

until maturity. Because if you can't, and you need to sell it before maturity, there will be no secondary market. The only person you'd be able to sell it

to is the bank that issued it to you in the first place.

They'll know you're looking to sell and you will not get a good price.

So, there are a lot of problems with structured products. Maybe they've got one or two positive aspects, but overall I think investors should be very careful about investing in them, for all of the reasons I mentioned here.

Finally, we're going to end with a toy example, a play example if you like. It's called the Monty Hall problem. It's got nothing to do with finance; however, even just a couple of years ago it was discussed quite a lot in the Financial Times.

There were some articles and letters written to the Financial Times about this

problem. It often raises a great deal of confusion.

And so we'll discuss it here too, because it provides a great example of how a

seemingly simple problem can confuse people.

So it should serve to highlight the fact that while statistics and issues with averages and biases don't require advanced mathematics, not advanced mathematics at all, they can be very confusing, and one does need to be aware of these issues in practice. So what is the Monty Hall problem? The Monty Hall problem is as follows.

There are three closed doors. A goat lies behind two of the doors.

And one million dollars lies behind the other door.

You don't know which door has the one million dollars and so you have to guess

the door. If you guess correctly, you actually are

going to get the one million dollars. If you guess incorrectly and you open the

door with the goat behind it, then you're going to get the goat.

Before your chosen door is opened, Monty Hall opens a different door. So what happens here is maybe you guess

At this point, Monty Hall comes along and opens one of the other two doors.

And the door he opens will have a goat behind it.

And this is always possible, because two of the doors have goats, so even if you've

guessed incorrectly, then one of these two doors will have the goat.

And so Monty Hall will open that door. Okay, so Monty Hall opens a different

door. The door always has a goat behind it.

And now he gives you the option to change your mind and pick another door.

The question is, should you change your mind?

So, just to be clear here, suppose you start off, you guess door one.

Then Monty Hall opens door three and shows you a goat behind it.

You now have the option of changing your mind.

You can stick with door one or you can change and go with door two.

The question is, should you change? Well, I'm not going to give you the answer

to that question. I'll let you think about that.

There is a definite answer. If you're not sure, let me give you a

hint. Consider the situation where there are 100

doors. One, two, up to 100 doors, and that

there's a goat behind 99 of these doors. The game is played as follows. You pick a door, and then Monty Hall opens

98 doors, all of which have goats behind them.

Should you change your mind then? The answer to that question is the same as the answer in the three-door case.
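If you'd like to check your reasoning empirically, here is a quick simulation of the three-door game that plays both the "stick" and "switch" policies many times; this is a sketch you can run yourself rather than take my word for the answer.

```python
import random

def play(switch, trials=100_000):
    """Win frequency of the stick or switch policy over many games."""
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)   # door hiding the million dollars
        guess = random.randrange(3)   # your initial guess
        # Monty opens a door that is neither your guess nor the prize
        opened = next(d for d in range(3) if d != guess and d != prize)
        if switch:
            # move to the one remaining closed door
            guess = next(d for d in range(3) if d != guess and d != opened)
        wins += (guess == prize)
    return wins / trials

print("stick: ", play(switch=False))
print("switch:", play(switch=True))
```

Compare the two frequencies the simulation prints, and see whether they match the answer you arrived at by reasoning through the 100-door version.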