[SOUND] A national newspaper publisher wants to model the relationship between the salaries of its field reporters and their years of experience. They have collected data for 34 randomly selected reporters. Are they correct in thinking that a relationship exists between one's salary and years of experience? Here the independent variable is the years of experience; that would be x. The response variable is the salary; this would be y. We will first begin by doing a scatter plot to see if we can detect a linear relationship between the two variables. From the scatter plot, we get a sense of a strong positive correlation between years of experience and one's salary. So we can go ahead and run the data through the regression analysis, but before doing that, let me alert you to a common mistake: mixing up the two variables. This scatter plot shows a positive correlation too, but there's something majorly wrong with it. Can you detect it? If not, pay close attention to what variable is on the x-axis and what is on the y-axis. Salary, the dependent variable, is being displayed on the x-axis. That's a no-no. Just like I have said before, Excel is automating tasks for you. It will not be checking your model. No software will. You will have to be sure that you model the problem correctly. Otherwise, you will be looking at an invalid analysis. Now, we can go ahead and do it correctly and run the analysis. So, here is the output. First, let's see what percent of the variation we see in someone's salary can be explained by their number of years of experience. The answer to this is in R Square. We can now see what the regression equation is, and that is found by looking here. So then, the equation is y = 60.699 + 2.169x. Now, let's understand this better. The intercept of this equation is 60.699.
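As a rough sketch of what Excel's regression tool computes behind the scenes, here is a minimal least-squares fit in Python. The (years, salary) pairs below are hypothetical, made up for illustration; the actual 34-reporter data set is not reproduced in this lecture, so the fitted numbers will differ from the output discussed here.

```python
# Minimal simple-linear-regression fit: slope = Sxy / Sxx, intercept = y_bar - slope * x_bar.
# This mirrors what Excel's Regression tool computes for one independent variable.
def fit_line(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical illustration data: years of experience and salary in $1,000s.
years = [1, 3, 5, 8, 12, 20]
salary = [63, 67, 72, 78, 87, 104]

b0, b1 = fit_line(years, salary)
print(f"salary = {b0:.3f} + {b1:.3f} * years")
```

The same two formulas are what any simple-regression routine uses; the software only automates the arithmetic, which is why it cannot catch a swapped x and y for you.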
This means that if you put x equal to 0, this is the value of y. So based on our data, this is the predicted starting salary for a reporter in this organization. Then we have the coefficient of the independent variable, number of years of experience. This means that for every year, the salary has gone up by 2.169, which is really $2,169, since the data was scaled in thousands. This 2.169 represents the average raise that reporters get per year. Now if I asked you what you would expect the average salary to be for a sample of reporters with an average of 15 years of experience, the answer would be 60.699 + 2.169 times 15, which is 93.24. I will multiply this by 1,000, again because the data was entered in thousands of dollars, and that would be $93,240. Since we have such a high value of R square, it is not surprising that we get a very small p-value for years, highlighted in yellow. We have rejected the null hypothesis that one's years of experience has nothing to do with the person's salary. We also see in red font that salary and years of experience are highly correlated at .91. This is a positive correlation, as we see the sign of the coefficient for years is positive. This means that as a reporter's length of stay has increased, so has their salary; again, on average, about $2,169 per year. Please be mindful that the coefficient we have used is a point estimate. It is the midpoint of the confidence interval for the coefficient, which means the true average salary increase can be somewhere between $1,825 and $2,514. To do better, we should calculate the confidence interval for the prediction. This is not done very easily in Excel, though. To reduce production interruptions for the robotic painting line due to breakdowns, the plant manager would like to institute a preventive maintenance program for the robots. This initiative will require support from upper management.
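The prediction above, and the point-estimate-as-midpoint idea, can be checked in a few lines of Python. The coefficients and confidence bounds below are the rounded values quoted in this lecture, so the result lands at about 93.23 thousand; the $93,240 figure comes from rounding 93.234 to 93.24 before scaling.

```python
# Point prediction from the fitted equation quoted in the lecture (data in $1,000s).
intercept, slope = 60.699, 2.169  # rounded coefficients from the Excel output

def predict_salary(years_of_experience):
    """Predicted average salary, in thousands of dollars."""
    return intercept + slope * years_of_experience

avg_salary = predict_salary(15)
print(f"Predicted average salary: ${avg_salary * 1000:,.0f}")

# The slope point estimate sits at the midpoint of its confidence interval:
ci_low, ci_high = 1.825, 2.514        # CI bounds quoted in the lecture, in $1,000s
midpoint = (ci_low + ci_high) / 2     # about 2.1695, i.e. the point estimate
print(f"CI midpoint: {midpoint:.4f}")
```

Note this is only a point prediction; as the lecture says, a proper confidence interval for the prediction itself would need extra work that Excel does not automate.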
So, he has collected data on hours of total downtime per month and the line's productivity. He believes he can use regression for this analysis. In this study, the average number of hours the line is down is the independent variable x, and the response variable is the productivity measure of the line. The scatter plot for the data is shown here. It shows a strong negative correlation between the two variables: as the number of hours the line is down increases, the productivity measure drops in a linear fashion. So, we can apply simple linear regression to explore this relationship further. We have used Excel to analyze the data. Average hours of downtime in a month and productivity are negatively correlated, and r, the correlation, is -0.94. Again, the sign comes from the sign of the downtime coefficient. R square shows that 88.4% of the variation in productivity we see in our data can be explained by the regression variable, downtime. The remaining 11.6% is due to other sources we call error. So given that we have a strong relationship, we can go ahead and develop the regression equation. To develop the regression equation, we find the coefficients in the third part of the output. So, the equation is y = 98.0143 - 1.648x. Looking at this equation, which is based on our data, the baseline for productivity is about 98%, and for every hour the line is down, the productivity goes down by an additional 1.648%. The manager believes that on average, the line is down for about eight hours per month. If this is indeed the case, then what is the average productivity expected from the production line for months with this much downtime? To find the predicted productivity, we put the value of 8 in for variable x, and that gives us an average productivity of 84.82%. So, making the line more reliable by instituting a preventive maintenance program will reduce breakdowns, and that will improve productivity.
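The downtime prediction works the same way as the salary example. With the rounded coefficients quoted above, the arithmetic gives about 84.83%; the 84.82% in the Excel output reflects the unrounded coefficients, so the tiny gap is just rounding.

```python
# Predicted productivity from the downtime regression quoted above.
intercept, slope = 98.0143, -1.648  # rounded coefficients from the Excel output

def predict_productivity(hours_down):
    """Predicted average productivity (percent) for a month with this much downtime."""
    return intercept + slope * hours_down

print(f"Productivity at 8 hours of downtime: {predict_productivity(8):.2f}%")
```

The negative slope is the whole managerial argument in one number: each hour of downtime avoided buys back roughly 1.65 percentage points of productivity.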
Now, the plant manager can use this insight and do some economic analysis comparing the cost of the maintenance program versus the cost associated with lost productivity. As you can see, with regression we not only have the ability to infer about the population, but we can also predict the value of the response variable for a given independent variable. Let's practice. An advisor in the university would like to motivate her students to study hard and graduate with a high GPA. To stress the importance of GPA, she's wondering if she can show data on the relationship between a graduate's starting salary and the student's GPA. She has collected data on 100 recent graduates. What are the dependent and independent variables in this study? The independent variable is the student's GPA, which the advisor believes influences the starting salary; that would be x. The starting salary is the dependent variable, y, which is the response variable. Now, based on this scatter plot, how would you characterize the relationship between GPA and starting salary? There appears to be a strong and positive relationship between GPA and starting salary. What percent of the variability seen in a student's starting salary can be explained by the student's GPA? Based on our analysis, R square is about 0.3365. So, GPA explains 33.65% of the variation we see in the starting salaries in our data, which means that the remaining variation you see is due to other sources, unexplained by our model. Is this advisor wrong? Does GPA appear to be not that important? Actually, the answer to that is here in the p-value, and this value here is practically zero. Recall that regression is running a hypothesis test on the independent variable. The null hypothesis is that no relationship exists. You can think of regression as a cynic: you have to have data to change its mind. In this case, GPA has such a small p-value that you would reject the null hypothesis at any level of significance. So then, why is R square not higher?
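To make the R-square idea concrete, here is a small sketch of how it is computed: one minus the unexplained variation (SSE) over the total variation (SST). The GPA and salary numbers below are hypothetical scattered data, invented for illustration; they will not reproduce the 0.3365 from the advisor's 100-graduate data set.

```python
# R-square = 1 - SSE/SST: the share of total variation explained by the fitted line.
def r_squared(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))  # unexplained
    sst = sum((y - y_bar) ** 2 for y in ys)                                # total
    return 1 - sse / sst

# Hypothetical (GPA, starting salary in $1,000s) pairs with plenty of scatter:
gpa = [2.5, 2.8, 3.0, 3.2, 3.5, 3.8, 4.0]
grad_salary = [48, 60, 52, 66, 58, 72, 65]
print(f"R-square = {r_squared(gpa, grad_salary):.4f}")
```

A small R-square with a tiny p-value is not a contradiction: the p-value says the slope is real, while R-square says how much of the scatter that one variable accounts for.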
What R square is saying is that while about a third of the differences observed in salary is explained by GPA, two-thirds of the differences are due to other sources. Based on the p-value, you have identified one very significant source, and that's good. But your regression equation is not very good, because most of the variability is due to other factors. For example, could one of those factors be a student's major? You would expect that to be a factor, wouldn't you? So, the advisor needs to expand her data collection and include more independent variables. But then you will no longer be doing a simple linear regression, but a multiple regression. That is the topic of the last module in this class, and I will come back to it later on.