[SOUND] In this example I'm going to steps we need to take in order to run regression using data analysis. This is an example from our PowerPoint where the production line is believed to be influenced by the downtime that the productivity is influenced by how much the line goes down and so the update for many days we are going to analyze it. The first thing we want to do is visually check whether or not this is going to be a linear regression. A linear model that will fit. I'm going to pick my variables, go to insert, scatter plot, and plot these out. So what I have done in the power point is that I formatted these so that it will be a little easier to read. So what you can see is that we don't have productive diesel down below. So one of the things I can do is I can just change the access, that it doesn't go from zero. So maybe it goes from 60 to something like 105. And it automatically looks a little bit more spread out you can see it a little bit better. So what you can see is that is indeed looks more or less like a scattered plot that is linear in its form. Now, by the way, you can go ahead and have it insert the equation but I really don't think that that's the best way for you to come up with the linear equation and then analyze it because it doesn't give you all the elements we want. So I really want to do this only for visual check. Now that I feel like I can apply linear regression, I'm going to go to data, data analysis, when the window comes up, go down, pick regression. Regression comes up. So the first thing it asks for is the input of Y range, this is what you're trying to predict so that's my productivity. I like to pick the labels also so I'm going to click here, control shift down, pick the entire data set then go to input of X range so that's the downtime, control shift down again. I have already told it that it has labels, and a new worksheet is what I wanted to have. So here's my output. And as I've mentioned on a monthly returns, three different tables. So I'm going to go over this with you. Now one of the things you can see is that the columns are not large enough to show all of this so one of the best things you can do is just click here in between the columns, and double-click and it will expand to what you need. That's the quickest way that you can do it. One other thing that you will notice in Excel is that it will give you the upper bound and the lower bound of the, confidence interval, and if you don't mention that you want 95%, the default is 95%, and it will give it to you twice. So, if you say 99%, it's going to show you lower 99%, upper 99%, but also it will show you lower and upper 95%, so I consider that to be a bug, so I usually just go ahead and delete these because it's redundant. So what you would see is that there is an intercept and there is a down time. So this is really my B 0 and my B 1. So based on this, my regression equation is 98.01- 1.648X, X being the downtown. How many hours the line has been down. So, one of the things that you want to look at is P value for downtown is extremely small. One of the things that in regression does is that the null hypothesis is that X and Y has no relationship and if you reject that that means that you have found a variable that has a relationship with other variable and the p is so small will reject a null hypothesis the null hypothesis being they are not related at all. Then one of the things that we want to look at is R square. And R Square is the percent of the variation that you can explain in the productivity because of down time so 88.4% of productivity differences that we have collected in our data seems to be because of downtime. So let me explain R Square a little bit in more detail. For that I'm going to go to this table ANOVA table. What we see here is this is the sum of squares and this is the sum of squares due to regression variable and this is the on square due to total. The solo square of the total shows the total variations that we see in our data and some of this is coming from the regression variable and some of this is coming from the errors, things that we have not counted for, we don't know what they are at this point. R Square R squared is the ratio of SSR to SST, and the higher this value is, the more explanation we have found for the variations that we see, which means that we have a good predictive model. So if I take that ratio right here, this is SSR divided by SST, and what you see here is exactly what you see here. That's why r squared represents the percentage of explained variations due to the regression variable. Now multiple r is your correlation. That is your correlation, and correlation is just simply square root of R Square, if I take the square root of this value, you would ger R. As you can see these values are exactly the same. So R is the square root of R Square. Correlation is a value that goes from zero to one. Or zero to negative one. The closer you are to one, the higher is the degree of correlation between the two variables. And the closer you are to zero, there is less correlation. The sign is simply telling you, if they are directly related, directly related means that if one goes up the other one also goes up, or negatively correlated though, inversely correlated. Which is if one goes up, the other one goes down. So look at our downtime. As the number of hours of downtime goes up, what is happening to our productivity? It goes down. So this is a negative correlation. A positive correlation would have had your points going this way. So this is a negative correlation. How would you have known it by looking at this output? You would know this by looking at the sign. So the sign of the coefficient. Which you also see here is telling you if this a positive correlation or a negative correlations. So, what does this mean now? That means, that for every hour that you're down, our productivity goes down by another 1.6%. That's basically what it means. So, if I told you that X=8 and I ask you to predict what would be the productivity. You simply will come and use this equation to answer that. And you can answer that, and I like just taking this values right here and copying it elsewhere. And now I'm going to say for X of eight hours of down time, what is my prediction? It's simply this value plus this value multiplied by eight. And that will give you 84.8. So if you're down in a given month for eight hours total you expect your average productivity for that month to be 84.82 now let me make a point here. The coefficients that you see here are the middle of a confidence interval. So, what you can see here is that you are also doing your confidence interval here. So regression is bringing everything we have learned so far together. They bring hypothesis testing when we do key value to check whether or not there is. Significant relationship between variable that you identify, is going to give you a confidence interval for every variable, so then I come up with this value. I have used the middle values, but this is just one estimation. We actually have to come up with a confidence interval for our prediction as well. The reason I am not showing that to you is because Excel really does not do a very good job of coming up with a predictive variable. We need to have another set of equations. So I'm ignoring it, so what I'm showing you is a point estimation. This is a point estimation of if you had a month for eight hours. Of down time. What would be the point estimates of its predicted productivity?