This video is about two stage least squares. Two stage least squares is a method for estimating causal effects when you have an instrumental variable. So we'll look at what two stage least squares is and discuss why it works. Before we get to two stage least squares, let's quickly review ordinary least squares, and also think about why it wouldn't work when there's confounding, or especially when there's unmeasured confounding. So suppose we have treatment A and outcome Y, and then you have this simple model here, where Y is equal to beta 0 plus A times beta 1, plus some random error. Let's imagine that we were going to apply ordinary least squares to this. Well, the usual assumption that you make is that the covariate, here A, and the error term are independent. So typically, you assume that the random variable A is independent from the error term, this epsilon. This is the kind of assumption that, I think, in linear regression classes doesn't usually get a lot of attention. They say the errors are independent, and possibly normally distributed with constant variance. But the independence part a lot of times doesn't get a lot of attention, and it's actually critical if you're interested in causal effects. So we need this error term to be independent from treatment in this case. If there's confounding, what would happen is that they would be correlated. In other words, people who get treatment would tend to be, possibly, people who have a higher value of the error term, for example. And so this would cause a problem. So if there's confounding, A and this error term would be correlated, and ordinary least squares would fail, in the sense that if you were to estimate beta 1 using least squares, it would not represent a causal effect, because there's confounding here.
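To make the failure of ordinary least squares concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration and not something from the lecture: the unmeasured confounder `u`, the true slope of 2, the sample size, and the way treatment is generated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_beta1 = 2.0  # assumed causal effect of A on Y (illustrative value)

# u is a hypothetical unmeasured confounder: it raises both the chance of
# treatment and the outcome.
u = rng.normal(size=n)
a = (u + rng.normal(size=n) > 0).astype(float)

# The error term contains u, so it is correlated with A.
eps = u + rng.normal(size=n)
y = 1.0 + true_beta1 * a + eps

# With a single regressor, the least squares slope is cov(A, Y) / var(A).
ols_beta1 = np.cov(a, y)[0, 1] / np.var(a, ddof=1)
corr_a_eps = np.corrcoef(a, eps)[0, 1]

print(f"correlation between A and the error term: {corr_a_eps:.3f}")
print(f"OLS slope: {ols_beta1:.3f} (true effect: {true_beta1})")
```

In this setup the treated people tend to have larger error terms, so the least squares slope lands well above the true effect of 2 and does not represent a causal effect.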
Confounding means that A and the error term are correlated, and in economics you would say A is an endogenous variable. So endogeneity, in this case, you could think of as meaning that there is confounding, or that A and the error term are correlated. So we can't just apply ordinary least squares. And in fact, even if we included confounders in this model, if there's unmeasured confounding, we would still have this same problem. So now we'll think about what two stage least squares is, and why it might work. Two stage least squares is a method for estimating a causal effect in an instrumental variables setting. So first, we'll assume that Z is a valid instrumental variable, so it affects treatment and the exclusion restriction is met. And so, what is stage 1? Two stage least squares is well named, because there are two stages. In stage 1, what we'll do is regress the treatment received, A, on the instrumental variable, Z. That is, A is equal to alpha 0 plus Z times alpha 1, plus an error term. And here, our error term is assumed to be independent, with mean 0 and constant variance. And we've assumed Z is randomized; it's an instrument, so we've assumed it's randomized. Because of that, Z and the error term should be independent. So we have what looks like a standard kind of linear model, with Z related to A. We have two parameters, alpha 0 and alpha 1, and we can estimate those using least squares. We'll call those alpha-0-hat and alpha-1-hat, and then we can obtain a predicted value of A for each person. We'll call that A-hat. So A-hat i is the predicted value of treatment for subject i, and that can be written as alpha-0-hat plus Z i times alpha-1-hat, where Z i is the value of Z for person i. So in other words, what we're doing here is getting a predicted value of A given Z. So person i, for example, has a particular value of Z.
And then what we're calling A-hat is what we would predict their treatment would be based only on Z, right? So this isn't treatment received; this is just based on the value of the instrument. What did we think they would get? That's what A-hat is. So in stage one, you just carry out a standard regression, standard least squares, and then you get a predicted value of the outcome, where the outcome here is treatment received. So stage one is very standard; you can just use regression software and get a predicted value for each person. In stage two, we're now going to regress the outcome on the fitted value that we obtained from stage one. So you'll notice here we have this A-hat. This looks like a standard regression model, Y equal to beta 0 plus A-hat times beta 1 plus an error term, except the regressor here is actually A-hat and not the original treatment variable; it's what we predict treatment would be based on Z. Okay, so then we have this error term, which we assume has mean 0 and constant variance. And because of the exclusion restriction, Z should be independent of Y given A, because Z only affects Y through its effect on treatment. So if you condition on treatment, Z should not be affecting Y. And technically speaking, A-hat is a projection of A onto the space spanned by Z. That's more of a technical point from least squares in general, but the point I'm making here is that A-hat functions a lot like Z, in the sense that it should be independent of the error term in the model for Y. Remember, with A and Y, of course, there's confounding of the A, Y relationship. But that shouldn't be the case with A-hat anymore, because we're using A-hat and not the original A. A-hat is really like A given Z; it's a projection. And so A-hat should be unconfounded when it comes to its relationship with Y.
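The two stages described above can be sketched in a few lines of NumPy. This is a sketch under stated assumptions, not the lecture's own code: the helper names `ols` and `two_stage_least_squares`, the simulated data, and the true effect of 2 are all hypothetical.

```python
import numpy as np

def ols(design, target):
    """Least squares coefficients for target regressed on the columns of design."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def two_stage_least_squares(y, a, z):
    """Return (alpha_hat, beta_hat) from the two stages described above."""
    ones = np.ones(len(z))
    # Stage 1: regress treatment received A on the instrument Z.
    stage1_design = np.column_stack([ones, z])
    alpha_hat = ols(stage1_design, a)
    a_hat = stage1_design @ alpha_hat  # predicted treatment based only on Z
    # Stage 2: regress the outcome Y on A-hat (not on the original A).
    beta_hat = ols(np.column_stack([ones, a_hat]), y)
    return alpha_hat, beta_hat

# Hypothetical simulated data: Z is randomized and affects Y only through A,
# while u is an unmeasured confounder of the A-Y relationship.
rng = np.random.default_rng(1)
n = 200_000
true_effect = 2.0
z = rng.integers(0, 2, size=n).astype(float)
u = rng.normal(size=n)
a = (1.5 * z + u + rng.normal(size=n) > 0.75).astype(float)
y = 1.0 + true_effect * a + u + rng.normal(size=n)

alpha_hat, beta_hat = two_stage_least_squares(y, a, z)
naive_ols = ols(np.column_stack([np.ones(n), a]), y)[1]

print(f"stage 1 slope (alpha-1-hat): {alpha_hat[1]:.3f}")
print(f"2SLS estimate of beta 1:     {beta_hat[1]:.3f}")
print(f"naive OLS slope of Y on A:   {naive_ols:.3f}")
```

In this simulation the stage 2 coefficient should land close to the assumed effect of 2, while the naive regression of Y on A is pulled away from it by the confounder. One caveat: the point estimate from running the two regressions by hand is the 2SLS estimate, but the standard errors printed by an ordinary second stage regression are not the correct 2SLS standard errors, so dedicated IV routines are usually preferred for inference.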
So hopefully that'll make more sense on other slides as well. We have confounding with A itself, between A and Y. But A-hat is just determined from Z, and Z is randomly assigned and satisfies the exclusion restriction. So we should be okay there. And it turns out that if you were to estimate beta 1 using least squares from this second stage model, that would represent a causal effect. We're going to look in a little more detail at why that's the case. But in practice, this is exactly what you would do. In the first stage you would regress A on Z and get a predicted value out of it. You would use that predicted value in a second stage model, where you regress Y on the predicted value. And then the coefficient of that predicted value would be a causal effect. So in practice, how do you actually implement this? It's pretty straightforward if you've fit regression models before. And this is known as the two stage least squares estimator, because you have two stages of least squares. So we're going to try to motivate a little more why you end up getting a causal effect out of this. Let's just think about the simple case where Z and A are both binary. So Z is encouraged, yes or no, and A is treatment, yes or no. In stage one, A-hat is just an estimate of the expected value of A given Z. That's what regression is, essentially. So you could also think of it as representing the probability of treatment given a particular value of Z. So A-hat i is the estimated probability of treatment given person i's value of Z. And in stage 2, we have this regression model. So then what is the interpretation of beta 1? Well, if you remember, the slope in a regression model has to do with a one unit change in the value of the predictor.
Right, so if I were to set A-hat equal to 1 and take the expected value of Y, this would just be beta 0 plus beta 1. And if I were to set A-hat equal to 0 and take the expected value of Y, that would just be beta 0. If I take the difference, I just get beta 1. So beta 1 is this contrast between the expected value of Y given A-hat equal to 1 and the expected value of Y given A-hat equal to 0. But the question is, what does that actually mean? Remember what A-hat is from the first stage model. A-hat is a predicted value, and in the case where Z is binary, by the time we get to the second stage model there are only two possible values of A-hat that you can have. If somebody had Z equal to 0, then their value of A-hat would be alpha-0-hat. If somebody had Z equal to 1, their value of A-hat would be alpha-0-hat plus alpha-1-hat. So there are only two values of A-hat that you could observe, and they're determined by the coefficients from the first stage model. If you have some noncompliance, those are not going to be 0 and 1; they're two values between 0 and 1. So in practice, what do we observe? Say we focus on the second stage model, and we treat A-hat as just a variable. Well, we observe some people with A-hat equal to alpha-0-hat, and some people with A-hat equal to alpha-0-hat plus alpha-1-hat. So we observe two unique values of A-hat, and how much A-hat varies is just by this amount, alpha-1-hat. And that corresponds to what happens when we go from Z equal to 0 to Z equal to 1: we're going from A-hat equal to alpha-0-hat to A-hat equal to alpha-0-hat plus alpha-1-hat. So now think about a mean difference, the contrast between Z equal to 1 and Z equal to 0. That's the same thing as A-hat changing by alpha-1-hat units.
Right, so remember, Z equal to 1 corresponds to alpha-0-hat plus alpha-1-hat, and Z equal to 0 corresponds to alpha-0-hat. If you take the difference, you get alpha-1-hat. So if Z changes from 0 to 1, A-hat changes by alpha-1-hat units. Okay, so this is really critical. The intention to treat effect, the expected value of Y given Z equal to 1 minus the expected value of Y given Z equal to 0, is really all about a change in A-hat of alpha-1-hat units. But going back to the other slide, what is beta 1? Beta 1 is a change in A-hat of 1 full unit. So beta 1 is essentially blowing up this intention to treat effect by a factor of 1 over alpha-1-hat. The thing in the numerator here, the intention to treat effect, corresponds to a change in A-hat of alpha-1-hat units. So to make that a change of one unit, which is what we want, we would have to divide by alpha-1-hat. So what is that, then? Well, we've seen before that the complier average causal effect is this ratio: the intention to treat effect divided by the causal effect of the instrument on treatment received. Well, that turns out to be exactly [INAUDIBLE] beta 1, because this denominator here is exactly alpha-1-hat; it's the slope of the stage 1 model. Okay, so I know that was a lot of stuff to follow. It might be worth going back through the video and pausing in places to walk through it. But I think if you walk through those steps, it'll be clear. So beta 1, which is what you get out of the two stage least squares estimator, in this case is equal to the complier average causal effect. So we just looked at the situation where we have a binary treatment and a binary instrument, and we didn't have any covariates. But two stage least squares works more generally, which is why we're talking about it. You could have covariates that you want to control for. You might also have a continuous instrument, for example.
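The binary-case argument, that beta 1 from stage 2 equals the intention to treat effect divided by alpha-1-hat, can be checked numerically. The sketch below is illustrative only: the mix of compliers, always-takers, and never-takers, their baselines, and the complier effect of 2 are assumptions invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical population: 50% compliers, 20% always-takers, 30% never-takers
# (no defiers, so monotonicity holds).
kind = rng.choice(["complier", "always", "never"], size=n, p=[0.5, 0.2, 0.3])
z = rng.integers(0, 2, size=n).astype(float)
a = np.where(kind == "complier", z, np.where(kind == "always", 1.0, 0.0))

# Baseline outcomes differ by compliance type, which confounds A and Y.
baseline = np.where(kind == "always", 3.0, np.where(kind == "never", -1.0, 0.0))
# Assumed treatment effects: 2.0 for compliers, 5.0 for always-takers.
# Never-takers are never treated, so their value here does not matter.
effect = np.where(kind == "always", 5.0, 2.0)
y = baseline + effect * a + rng.normal(size=n)

# Ratio form: intention to treat effect divided by the stage 1 slope.
itt = y[z == 1].mean() - y[z == 0].mean()
alpha1_hat = a[z == 1].mean() - a[z == 0].mean()
ratio_estimate = itt / alpha1_hat

# Two stage least squares form.
design1 = np.column_stack([np.ones(n), z])
alpha_hat, *_ = np.linalg.lstsq(design1, a, rcond=None)
a_hat = design1 @ alpha_hat
beta_hat, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), a_hat]), y, rcond=None)

print(f"ITT / alpha-1-hat: {ratio_estimate:.4f}")
print(f"2SLS beta-1-hat:   {beta_hat[1]:.4f}")
```

The two numbers agree up to floating point error, and in this simulation both sit near the assumed complier effect of 2 rather than the always-taker effect of 5, which is what "local" means here.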
You can still use two stage least squares. In stage 1 in this case, you would regress treatment on the instrument and covariates. So you would do the same kind of thing, but you would also include covariates now. And then you would obtain the fitted value, A-hat. In stage 2, you would regress Y on the fitted value you got from stage 1 and on X. So if you had covariates you wanted to control for, they would go in both of these models. And then the coefficient of A-hat, in this case, would still represent a causal effect. It would represent a local causal effect again, so a causal effect among compliers. So two stage least squares works more generally: you can use it if you have covariates that you want to control for, and you can also use it if you have a continuous instrument. So that's what two stage least squares is. You could fit the two stage least squares model and get an estimate of the causal effect, and you could get a confidence interval from that. But of course, people are going to be very concerned about whether your assumptions are met, especially the exclusion restriction. Is your exclusion restriction violated? Another assumption that people worry about is monotonicity. And so if you're going to do a careful analysis and publish a paper, or even for your own peace of mind, you probably want to do a sensitivity analysis, and methods for this have been developed around the instrumental variable assumptions. In general, a sensitivity analysis deals with the idea of, what if my assumption was violated? What if it was violated just barely, by a small amount? Would I change my conclusions? What if it was violated by a large amount? Would I change my conclusions? We want to try to answer questions like that. So for the exclusion restriction, the idea would be, what if Z does directly affect Y?
Maybe by some amount, [INAUDIBLE]. That could be like a correlation kind of parameter or something. Would that change my conclusion? So you'd want to vary this kind of parameter, this rho that quantifies how much Z directly affects Y, and ask at what point my conclusion would change. There are methods that have been developed to do that, but they're beyond the scope of this video. It's the same kind of thing with monotonicity. Remember that monotonicity assumed that bigger doses of encouragement meant you were more likely to get treatment. But what if that's not true? What if there are some defiers? Maybe the proportion of defiers in the population is pi, and maybe pi is greater than zero. We had assumed it was zero; maybe it's a little greater than zero. So what if pi was 1%, 0.01? Would my conclusions change? What if it was 5%? Would my conclusions change? So that's the kind of thing you can do. And there are methods that have been developed to do that as well, with some standard software. But again, that's beyond the scope of this presentation. I did also mention this tutorial in biostatistics, which describes some of these methods in detail.
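To give a feel for the "vary rho and see when the conclusion changes" idea, here is one very simple sketch for the exclusion restriction with a binary instrument. It is not the method from the tutorial or from any particular software package. It assumes, purely for illustration, that rho is an additive direct effect of Z on Y on the outcome scale, subtracts rho times Z from the outcome, and recomputes the instrumental variable estimate; the function name `iv_estimate`, the grid of rho values, and the simulated data are all hypothetical.

```python
import numpy as np

def iv_estimate(y, a, z, rho=0.0):
    """IV ratio estimate after removing an assumed direct effect rho * Z from Y."""
    y_adj = y - rho * z
    itt = y_adj[z == 1].mean() - y_adj[z == 0].mean()
    alpha1_hat = a[z == 1].mean() - a[z == 0].mean()
    return itt / alpha1_hat

# Hypothetical simulated data in which the exclusion restriction actually holds
# and the true effect of A on Y is 2.
rng = np.random.default_rng(3)
n = 200_000
z = rng.integers(0, 2, size=n).astype(float)
u = rng.normal(size=n)
a = (1.5 * z + u + rng.normal(size=n) > 0.75).astype(float)
y = 1.0 + 2.0 * a + u + rng.normal(size=n)

# Assumed grid of direct effects of Z on Y; rho = 0 is the exclusion restriction.
rho_grid = np.linspace(0.0, 1.0, 6)
estimates = np.array([iv_estimate(y, a, z, rho) for rho in rho_grid])

for rho, est in zip(rho_grid, estimates):
    print(f"assumed direct effect rho = {rho:.1f} -> estimate = {est:.3f}")
```

Under this parametrization the estimate falls linearly as rho grows and changes sign once rho exceeds the intention to treat effect, which is one way to read off how large a violation would have to be before the conclusion changes.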