In this module, we're going to discuss survivorship bias and data snooping. We've already discussed survivorship bias in an earlier module and we'll talk a little bit about it again here. Consider the following investment: we're going to purchase an equi-weighted portfolio of the top 20 stocks in the S&P 500. Note that the stocks are chosen and fixed today. In order to get an idea of how this portfolio would perform, we decide to back-test it using historical data as follows. We get the last 20 years of return data for each of the 20 stocks. On the first day, that is the first day 20 years ago, we set up the initial equi-weighted portfolio. If a stock didn't exist back then, for example Google, then we omit it from the portfolio and form an equi-weighted portfolio of the stocks that did exist that day. Every month we rebalance the portfolio so that it remains equi-weighted, and we take transaction costs into account. Finally, we plot the annual net returns, that is r_t against t, where r_t is the net return at time t realised over the previous year. So we have the following question: do you think the plot will be representative of the future performance of the investment? We'll get a plot that looks something like this. Time runs along the horizontal axis, from 20 years ago up to today, and the vertical axis measures the one-year return, with the 0% level marked. Any point on the plot, say at date t, shows the realized return at date t over the previous year, that is, from t minus 1 year up to t. So, do you think this plot will be representative of the future performance of the investment? Well, the answer is certainly not. Why not?
Well, what we've actually done is introduce survivorship bias into this problem, and an enormous amount of it. We picked our portfolio today, choosing the 20 best stocks in the S&P 500. We then go back 20 years and back-test the performance of an equi-weighted portfolio of these 20 stocks, but we've actually chosen those stocks by implicitly looking into the future. We've gone forward 20 years to today and picked the best performing stocks; after all, the top 20 stocks in the S&P 500 have surely performed very well over the last 20 years. And so we've introduced an enormous amount of bias into our back-test. This is equivalent to going 20 years into the future today, picking the 20 best stocks in the S&P 500 at that point, and trading those stocks today. It's an extreme example of survivorship bias, and it should be clear to everybody what we have done and why it is wrong. But survivorship bias crops up an awful lot in finance, and not always as obviously as in this example, so it needs to be borne in mind by investors, risk managers and so on. People always need to be on the lookout for it. So here's another example of survivorship bias in action. It's called the football game scam; sometimes it's called the horse racing scam when the context is changed to horse racing, but we're going to stick with the football version. On each of 10 consecutive Wednesdays you receive a letter predicting the winner of a big football game the following Sunday. We're going to assume here that a football game is either won or lost, with no ties. Each week the prediction is correct. In week 11, however, a letter arrives, but this time it seeks payment of $10,000 before revealing the prediction for the next game. The question is: what should you do?
Should you pay the $10,000 or should you ignore it? Well, this is an example of a scam; the answer is that you should ignore it, and the reason is as follows. The scam artist, the person perpetrating the scam, is playing the following game. In week 1, the scam artist had a total population of 2 to the power of 10 people. To half of this group he sent a letter saying team A would win, and to the other half he sent a letter saying team B would win. Suppose team A won; this means 2 to the power of 9 people saw a correct prediction in week 1. In week 2 he splits these 2 to the power of 9 people into two groups again: half of them get a prediction that team A will win, the other half that team B will win. By week 3 there are 2 to the power of 8 people remaining who have seen correct predictions in both of the first two weeks. He keeps going like this until week 10, when there are 2 to the power of 1, that is 2, people remaining. You were one of them: you received a letter saying team A would win, and another person received a letter saying team B would win. Team A won, which means that after week 10 there is just one person remaining, and that is you. You're the ultimate survivor: the one survivor out of 2 to the power of 10 people who has seen a correct prediction 10 weeks in a row. There is no skill here, no skill whatsoever. And so this is an example of extreme survivorship bias, where you see the track record of only one person, that track record is perfect, and it happens to be yours: 10 weeks in a row of perfect predictions. We're now going to discuss an example of data snooping.
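Before we do, the halving mechanics of the scam are easy to verify with a few lines of code. This is just a sketch; the function name, population size and seed are illustrative, and the game outcomes are drawn at random since they don't matter to the trick:

```python
import random

def run_scam(weeks=10, population=2 ** 10, seed=1):
    """Split the remaining recipients in half each week; keep only those
    whose letter happened to predict the actual winner."""
    rng = random.Random(seed)
    recipients = list(range(population))   # 2^10 = 1024 initial letters
    for _ in range(weeks):
        half = len(recipients) // 2
        told_a, told_b = recipients[:half], recipients[half:]
        a_wins = rng.random() < 0.5        # the game's outcome is pure chance
        recipients = told_a if a_wins else told_b
    return recipients

survivors = run_scam()
print(len(survivors))  # 1: exactly one person has seen 10 correct predictions
```

Whatever the actual outcomes, halving 1,024 people 10 times always leaves exactly one "perfect" forecaster, and that is the whole trick.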
Data snooping arises in many contexts, not just in finance, and very often the problems with data snooping are quite subtle and hard to spot. So we're going to see an example here. A bank has 4 years' worth of daily historical returns data on the Dollar/British Pound exchange rate. It employs the following mechanism for generating a trading strategy. It first normalizes the entire return data set so that it has mean zero and variance one. Just to point out here: normalizing data is a standard and well-justified statistical technique in general. We're going to use 75% of the data for training, which corresponds to approximately 750 returns. That's because there are approximately 250 trading days in a year: if you take out holidays and weekends, the typical year has about 250 trading days, so 75% of 4 years gives approximately 3 years of data, or 750 returns. The remaining 25% of the data set, which corresponds to one year, approximately 250 days, is kept as what's called a hold-out test set. We're going to use this test set to evaluate whatever strategy is yielded by the training data. On any given day the trading strategy uses the returns of the previous 20 days to forecast the direction of the next day's return, and on the test set it appears to be a great success. However, the trading strategy performs very poorly in practice. The question is, why? So let's think about this for a moment. We've split our data set into training data and test data, and the training data has about 750 observations, from day 1 up to day 750. Now, the trading strategy is based on the previous 20 days of returns. So suppose we're at day t, the next day is t plus 1, and we want to know the return from t to t plus 1.
Let's call this the mean return from t to t plus 1. At day t we've already seen the mean return over days t minus 20 up to t, so that quantity is known at day t. Suppose, for example, that it is greater than 0, maybe much greater than 0. What does that tell you about the return you expect between dates t and t plus 1? If you think about it, you should see that this implies the mean return from t to t plus 1 will be less than 0 on average. Why is that? Because of the normalization we did. We normalized all 1,000 data points to have mean zero. So if all 1,000 data points have mean zero, and 20 of them have a strictly positive mean, in fact a mean much greater than zero, then the remaining data points must have a mean return that's less than zero; in particular, the return from t to t plus 1 will be less than zero on average. And so the trading strategy we've determined will tell us to sell on day t, expecting to make money from t to t plus 1, because the mean return between t and t plus 1 will be negative. This might seem to be a small bias, but it is a bias nonetheless, and it can mess things up. Now, that is what happens with the training data: because of our normalization process, we've introduced a bias into how we determine the trading strategy. So what about the test set? Well, the test set is meant to be an independent set of data. It has 250 returns, and the idea behind the test set is that it should be completely unpolluted, if you like; it should not have been polluted by the act of coming up with a trading strategy. However, and this is subtle, it has been polluted. The reason is as follows. Suppose I test the trading strategy, which I determined using the training data, on the test set. Again, say I'm at date t and I look back to date t minus 20.
My trading strategy says: look at what happened over the previous 20 days. Here I'm summarizing the previous 20 days by the mean return over those 20 days; there are other features of those 20 days that could also be part of the strategy, but the mean will do. Suppose then that on this particular day t, the mean over the previous 20 days is less than 0. What does that say about the return from t to t plus 1? Well, it says that the return you expect from t to t plus 1 must be greater than 0. Why is that? Because the test set was part of the overall normalization scheme. We combined the test set and the training set and normalized that data together, so that the mean of all the training and test data is zero. That means that if the mean over these 20 days is negative, even though it's only 20 days out of the total of 1,000, the rest of the data must have mean greater than 0; in particular, we would expect the return from t to t plus 1 to be positive. And so the trading strategy will be, if you like, a mean-reverting strategy: if the previous 20 days are negative, we buy; if the previous 20 days are positive, we sell. This test set will therefore appear to justify the use of the trading strategy; we will see that we make money on the test set. And the big problem here is the following. We normalized the entire return data set, and this was a mistake. The test set should have had nothing to do with the trading strategy. It should have been kept entirely separate from the process of finding the trading strategy; only once we've found the trading strategy do we bring the test set in and use it to evaluate the strategy. But we didn't do that here. We made the mistake of including the test set when we normalized the data. It might seem like a very small issue, but it is a real issue, and it will introduce a real bias.
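The size of this effect can be checked with a small simulation. The sketch below uses purely random, i.i.d. returns, so there is genuinely nothing to predict; to make the bias visible with a modest number of trials it uses a shorter sample (100 days, with the last 25 days playing the role of the test set) rather than the 1,000 days in our example, but the mechanism is identical. All parameter values are illustrative:

```python
import random

def mean_profit(joint_normalize, trials=4000, n=100, w=20, test_start=75, seed=0):
    """Average daily profit of a 'bet against the last w-day mean' rule,
    evaluated only on the held-out final days of each simulated sample."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # i.i.d. returns: nothing to predict
        if joint_normalize:
            mu = sum(x) / n                          # the mistake: mean and variance are
            sd = (sum((v - mu) ** 2 for v in x) / n) ** 0.5  # estimated on ALL the data,
            x = [(v - mu) / sd for v in x]           # test set included
        for t in range(test_start, n - 1):
            m = sum(x[t - w + 1 : t + 1]) / w        # mean of previous w days, known at day t
            total += (x[t + 1] if m < 0 else -x[t + 1])  # buy after a down-run, sell after an up-run
            count += 1
    return total / count

print(mean_profit(joint_normalize=True))   # clearly positive: the leak at work
print(mean_profit(joint_normalize=False))  # indistinguishable from zero
```

With the joint normalization, the mean-reverting rule shows a solidly positive average "profit" on the held-out days even though the returns are independent noise; without the joint normalization, the edge disappears.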
This bias will make the strategy look better than it really is, and the test set will have failed here because it was not an independent test set: it was used as part of the learning process, the process used to generate a good trading strategy. What we should have done is normalize the training set alone. If we had normalized just the training set, then the test set would have been completely uncorrupted, or unpolluted; the behaviour we identified would not have been present; and presumably the test set would have revealed that there was indeed a problem with the trading strategy determined on the training set. Examples like this crop up an awful lot in finance. Many banks and funds over the years have looked for trading strategies in similar ways and introduced small but still significant biases into their trading strategies, and into the learning procedures they use to develop trading strategies. The conclusion is that one always needs to be aware of introducing these biases, and when you keep a test set, the test set must be completely independent of the process used to generate the trading strategy. There are many other examples of statistical biases and difficulties that arise in finance. Survivorship bias and data snooping are everywhere, and one does need to be aware of this. There are many other examples we could discuss, but we don't have a great deal of time in this course, so let me just mention a couple of other places where biases or statistical difficulties arise. Here's an interesting question: just how likely is a 25 standard deviation move? The reason I bring this question up is that at the beginning of the financial crisis, in August 2007, there was a very large move in the market, and some funds lost an awful lot of money.
And some participants reported that they actually saw a 25 standard deviation move, and they used the size of this move to justify the size of their losses. In fact they said more than that: they said they saw a 25 standard deviation move not just once, but several days in a row. So, a quick question: how likely is a 25 standard deviation move? Well, let me tell you. You could estimate this using Monte Carlo with variance reduction techniques such as importance sampling, but it can also be computed directly. Assume a normal distribution, say N(0, 1); it could equally be N(mu, sigma squared), so it doesn't matter that I'm assuming N(0, 1). The probability that this is greater than or equal to 25, which is 25 standard deviations, is approximately 3.05 times 10 to the power of minus 138. Just to emphasize: this is 0.00...0305, and yes, there are 137 zeros after the decimal point. So the probability of a 25 standard deviation move is absolutely infinitesimal. To give you some sort of comparison, the number of particles in the observable universe, depending on who you ask, is something on the order of 10 to the power of 78, or maybe 79 or 80. That is an enormous number, but it is in fact dwarfed by the reciprocal of our probability, which is roughly 3 times 10 to the power of 137. So you can see that this probability is extremely small. So if somebody reports a 25 standard deviation move, you have to ask the question: were they just exceedingly unlucky, or perhaps was their model wrong? I think the answer is self-evident. How likely is a 25 standard deviation move several days in a row? Making the heroic assumption that these moves are independent, you multiply the daily probabilities together: for 3 days in a row, say, you cube 3.05 times 10 to the minus 138, which gives roughly 2.8 times 10 to the minus 413.
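The 3.05 times 10 to the minus 138 figure is easy to reproduce: no Monte Carlo is needed, because the standard normal tail probability is available through the complementary error function. A minimal sketch (the function name is ours):

```python
import math

def normal_tail(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

p = normal_tail(25.0)
print(p)                      # ~3.06e-138: one 25-sigma day under normality

# p cubed underflows double precision entirely, so work in log10
# for the probability of several such days in a row
log10_p_3days = 3 * math.log10(p)
print(log10_p_3days)          # ~ -412.5, i.e. about 2.8e-413 for 3 days in a row
```

Note that the three-day probability is too small even to represent as an ordinary floating-point number, which is itself a hint of how absurd the claim is.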
So I think it is fair to say that in August 2007, it was inaccurate to say you saw a 25 standard deviation move several days in a row. It is much more likely that your model was wrong, and very wrong at that. Alright, another example where these statistical problems arise is in the area of retail structured products. I'd like to be able to say more about structured products in the course, but we don't have time, so I'll just state the following. Retail structured products are, if you like, exotic securities that are sold to retail investors. These could be investors with just $10,000, or 10,000 euro, to invest, whose bank manager or financial advisor suggests a structured note. A structured note works like a bond: there's usually a payoff after 3 or 5 years where you redeem your principal. Maybe you spend $10,000 on day 1, and 5 years later you get your $10,000 back; in between you receive a coupon, and this coupon is tied to the performance of another asset, often the equity markets. Why do I say there are biases and statistical problems here? Well, these structured products are often designed to look better than they are. They tend invariably to back-test very well. For example, consider any structured product that is long Apple; in other words, the coupon that you receive increases as a function of the returns on Apple. Well, in that case, any structured product that is long Apple will presumably back-test very well from 2000 onwards. Why is this? Here is a plot of the price of Apple, actually from 1985 up to today. Around 2000 the stock was about $7; since then it has risen to almost $700 at one point, and it is now below $500. Either way, there has been a massive run-up in Apple over the last 10 years. And so, if you hold Apple in your structured product, or rather, if the coupon that you receive from your structured product depends on the performance of Apple,
well then, it's going to look very good when you back-test it. There are also hidden risks in structured products. Investors in structured products are often exposed to volatility risk; we haven't said much about volatility yet in the course. Of course, we have covered the pricing of options, which was based on the binomial model, but we didn't explicitly talk about volatility risk there. We'll talk about that later in the course. Investors in structured products are often exposed to credit risk as well: you're relying on the credit of the issuer of the structured note. If you think this is insignificant, then I have two words for you, and those two words are Lehman Brothers. Lehman Brothers issued structured notes and sold them on to retail investors, and many people lost money when Lehman Brothers went under. So there can be significant credit risk here as well. Another problem with structured products is that there is no secondary market for them. So if you purchase or invest in a structured product, you had better be able to hold onto it until maturity, because if you can't, and you need to sell it before maturity, there will be no secondary market. The only party you'd be able to sell it to is the bank that issued it to you in the first place; they'll know you're looking to sell, and you will not get a good price. So there are a lot of problems with structured products. Maybe they have one or two positive aspects, but overall I think investors should be very careful about investing in them, for all of the reasons I've mentioned here. Finally, we're going to end with a toy example, a play example, called the Monty Hall problem. I should mention that it has nothing to do with finance; however, even just a couple of years ago it was discussed quite a lot in the Financial Times. There were some articles and letters written to the Financial Times about this problem.
It often causes a great deal of confusion, and so we'll discuss it here too, because it provides a great example of how a seemingly simple problem can confuse people. It should serve to highlight the fact that while statistics, and issues with averages and biases, don't require advanced mathematics at all, they can be very confusing, and one does need to be aware of these issues in practice. So what is the Monty Hall problem? The Monty Hall problem is as follows. There are three closed doors. A goat lies behind two of the doors, and one million dollars lies behind the other door. You don't know which door has the one million dollars, so you have to guess. If you guess correctly, you get the one million dollars; if you guess incorrectly and open a door with a goat behind it, you get the goat. Before your door is opened, however, Monty Hall opens a different door. So maybe you guess door number 1. At this point, Monty Hall comes along and opens one of the other two doors, and the door he opens will have a goat behind it. This is always possible, because two of the doors have goats: even if you've guessed incorrectly, at least one of the other two doors has a goat behind it, and Monty Hall opens that door. Okay, so Monty Hall opens a different door, which always has a goat behind it, and now he gives you the option to change your mind and pick the other door. The question is, should you change your mind? Just to be clear: suppose you start off by guessing door one, and Monty Hall opens door three and shows you a goat behind it. You now have the option of changing your mind: you can stick with door one or you can switch to door two. The question is, should you switch? Well, I'm not going to give you the answer to that question; I'll let you think about it. There is a definite answer. If you're not sure, let me give you a hint.
Consider the situation where there are 100 doors, with a goat behind 99 of them. The game is played as follows: you pick a door, and then Monty Hall opens 98 of the other doors, all of which have goats behind them. Should you change your mind then? The answer to that question is the same as the answer in the three-door case.
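If you'd like to check your answer empirically once you've committed to one, a short simulation settles the question. This sketch (the function name and parameters are ours) covers both the 3-door and the 100-door versions, using the fact that Monty opens every unchosen door except one and always reveals goats, so switching wins exactly when the first guess was wrong:

```python
import random

def win_rate(switch, doors=3, trials=100_000, seed=0):
    """Fraction of games won when the player always switches (or never does).
    Monty opens all but one of the unchosen doors, revealing only goats,
    so a switcher wins precisely when the initial guess missed the prize."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        prize = rng.randrange(doors)   # door hiding the million dollars
        guess = rng.randrange(doors)   # the player's initial pick
        wins += (guess != prize) if switch else (guess == prize)
    return wins / trials

print(win_rate(switch=False))            # sticking with the first door
print(win_rate(switch=True))             # switching after the reveal
print(win_rate(switch=True, doors=100))  # the 100-door hint, taken to the extreme
```

Running it makes the hint above quantitative: the 100-door case shows the same effect as the 3-door case, only far more dramatically.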