So let me take you directly to Rattle this time and walk you through this model. I'm going to start Rattle: I'll say library(rattle), and then type rattle(). Now remember, the dataset needs to be opened from wherever you stored it, so you'll have to find out where that is on your machine. Mine is stored deep inside various directories, so it takes me a while to get there, I'm sorry. There you go, there it is. I promised you we'd get there; it's called auction.csv. So depending on where you have stored that data file, just say Open. It says auction.csv, and as usual, you execute. But one thing I'm going to do is partition this data 80/20. Because the dataset is small, I'm going to partition at 80/20/0, and you already know what that means. The other thing is that bid is identified as an ID variable, because it's just a bid number: 1, 2, 3. I'm also not going to use the baby model. Either model phi 28 is one, or model phi 26 is one, or the baby is one; the sum of the three is always equal to one, so these are completely correlated. So I can safely ignore the baby model to avoid collinearity. Okay. The target variable is whether you won the auction or not, which is here. I'm also going to ignore MSRP, and I'm going to ignore price. The reason is this: the third variable is the difference between MSRP and price, and that is the explanatory variable I want. I don't want to include all three because, first of all, you get multicollinearity. Secondly, I believe the thing that decides whether I win the auction or not is how far I am from the MSRP. The farther away I am, the less likely I am to win the auction; the closer I am, the more likely I am to win. Finally, one other point I want you to notice: for the Won Auction variable, it doesn't actually know the data type.
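For anyone who prefers script to GUI, the same 80/20/0 partition can be sketched in a few lines of plain R. This is only an illustration: the course's auction.csv isn't reproduced here, so a small made-up data frame with placeholder column names stands in for it.

```r
# Hypothetical stand-in for auction.csv (substitute the real file and columns).
set.seed(42)                                    # so the split is reproducible
auction <- data.frame(bid  = 1:50,
                      diff = runif(50),         # placeholder for MSRP - price
                      won  = rbinom(50, 1, 0.5))# placeholder target variable

# 80/20/0 split: 80% training, 20% validation, 0% testing.
train_idx <- sample(nrow(auction), size = round(0.8 * nrow(auction)))
train <- auction[train_idx, ]
valid <- auction[-train_idx, ]

nrow(train)   # 40 rows
nrow(valid)   # 10 rows
```

Rattle does essentially this behind its Partition checkbox, using the seed shown in its Data tab.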
It says "auto" for the target data type; I'm going to make it categoric, just to be safe, otherwise it may treat it as a numeric variable. So these are the changes I have made: I marked bid as an identifier variable; I have ignored MSRP and price; I have retained the difference between the two; I have ignored the baby model, because if it is not one of the other two, it is the baby model; and finally, I've changed the variable type of Won Auction to categoric. I hit Execute only after making all of these changes, that's it. So let's run the model. To run the model you just go Model, then Linear. Immediately it gives you only two options: logit or probit. If you read carefully, it says that this falls into a category of what we call generalized linear models, and that's a whole subject in itself; whole books have been written about it. But there are two ways of estimating this model, the logit model or the probit model, and it tells you that a probit regression gives similar results to the logistic regression, but often with smaller coefficients. You hit Execute, and it gives an answer. Now, let's go and see what these answers mean. Right here is the answer it gives. If you look at it, it reports each of the coefficients and tells you whether each coefficient is significant or not. If you really look at it, there are only two coefficients that are significant. One is, as we suspected, the difference between MSRP and price. The other is whether the clock has been serviced or not, right? All other variables seem to be just hanging around; maybe they're helping, maybe they're not. The second thing it tells you is what I look for: it gives you a pseudo R-squared, which here is about 49 percent. Okay. There is one other point I want to make. You see the null deviance, which is for a model with no explanatory variables, and the residual deviance, which is for the model that uses these variables.
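Under the GUI, Rattle's logit option is essentially R's glm() with a binomial family and a logit link. Here is a minimal, self-contained sketch on simulated data, since the real auction.csv isn't included with these notes: diff stands in for the MSRP-minus-price variable and serviced for the clock-serviced dummy. It shows where the coefficients, the pseudo R-squared (one common definition, McFadden's), and the null and residual deviances come from.

```r
set.seed(1)
n        <- 200
diff     <- runif(n, 0, 500)        # hypothetical distance from MSRP
serviced <- rbinom(n, 1, 0.5)       # hypothetical serviced-clock dummy
p_true   <- plogis(2 - 0.01 * diff + serviced)
won      <- rbinom(n, 1, p_true)    # 1 = won the auction

# Logit model; probit would use binomial(link = "probit") instead.
fit <- glm(won ~ diff + serviced, family = binomial(link = "logit"))
summary(fit)  # coefficients with significance codes, null and residual deviance

# Pseudo R-squared (McFadden): 1 minus residual over null deviance.
pseudo_r2 <- 1 - fit$deviance / fit$null.deviance

# Chi-square test on the drop in deviance (null model vs fitted model).
dev_drop <- fit$null.deviance - fit$deviance
p_value  <- pchisq(dev_drop, df = fit$df.null - fit$df.residual,
                   lower.tail = FALSE)
```

On the real auction data the same quantities appear in Rattle's model output: the deviance drop there is 59.566 on six degrees of freedom.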
It then tells you whether the improvement from using these variables is significant or not. If you look at it, the difference between the null and residual deviance is 59.566 with six degrees of freedom, and the p-value is 0.0000; that means it is highly significant. So far so good. But you're not really interested in that; you're still interested in how we are doing in this auction, right? So go back to Rattle: we want to evaluate this model. All of that is fine, all of that is statistics, but I really want to evaluate it. Now, the moment you hit Evaluate, it gives a new option called Error Matrix. In the last module of this course, I'm going to talk a lot more about model selection, but at this moment we just want to know how well we are doing in this auction, so just look at the error matrix. Okay. What the error matrix gives you is how well you would have done if you had used this model for bidding. The error matrix can be produced, as usual, on either the training or the validation data; as you notice, the validation data has been checked. So what does it do? You will see that it has two sets of numbers. The one on the top gives you the number of items with each prediction versus each actual outcome, and the bottom matrix gives you the same information in terms of percentages. Let's ignore the bottom one for now. Okay. On the top, if you look at the first row, there were actually 36 auctions that you did not win with the price put in 12 hours before. Of these 36, the error matrix says the model would have got 32 right and four wrong. Similarly, there were 19 auctions you actually won with the price you put in, but if you believed the logistic regression's predictions, you would have predicted only 10 of them right. If you look at the (1, 1) cell, you will see the number 10 there, with nine wrong. So the question really is, how does R, or Rattle, compute these numbers?
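The arithmetic behind the error matrix just described can be reconstructed directly from the counts quoted: 36 auctions lost (32 predicted correctly, 4 not) and 19 auctions won (10 predicted correctly, 9 not). This sketch rebuilds both tables Rattle shows, counts on top and percentages below.

```r
# Rows = actual outcome, columns = model prediction (at the default 0.5 cutoff).
cm <- matrix(c(32,  4,    # actually lost: 32 predicted lost, 4 predicted won
                9, 10),   # actually won:   9 predicted lost, 10 predicted won
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("lost", "won"),
                             predicted = c("lost", "won")))
cm                               # the count matrix (top table)
round(100 * cm / sum(cm), 1)     # the percentage matrix (bottom table)
```

In Rattle itself these counts come from cross-tabulating the validation data's actual outcomes against predict(fit, type = "response") thresholded at 0.5.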
We'll see that we have some control over this, but by default what it does is run the logistic model with all the data you have on the clock, the MSRP-price difference, and everything else you chose to put in, and come up with a probability that you will win the auction. The way it scores is this: if the probability is more than half, it predicts that you will win the auction, and if the probability is less than half, it predicts that you will lose the auction. Why half? You will learn later that half, in some sense, minimizes the overall error rate, but you can choose a different value. For now, let's leave this value at half. Just remember what the logistic regression analysis is doing based on the regression output: if the predicted probability of winning is half or more, it says you will win the auction; if it is less, it says you will lose the auction. Based on that, it says you are making an 11.1 percent error on the auctions you did not win, but a 47.4 percent error on the auctions you won. The overall error rate is, in some sense, an average of the two: the weighted average is 23.6 percent, whereas the simple average of 11.1 and 47.4 is 29.25 percent.
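The error rates quoted above follow directly from the counts in the error matrix; here is a quick check in R using the same numbers:

```r
err_lost <- 4 / 36            # error on the 36 auctions you did not win: 11.1%
err_won  <- 9 / 19            # error on the 19 auctions you won: 47.4%
overall  <- (4 + 9) / 55      # weighted by class sizes: 23.6%
simple   <- (err_lost + err_won) / 2   # unweighted average of the two rates

round(100 * c(err_lost, err_won, overall, simple), 1)
# roughly: 11.1  47.4  23.6  29.2
```

The lecture's 29.25 comes from averaging the already-rounded figures 11.1 and 47.4; computed from the raw counts, the simple average is about 29.2 percent. The gap between the weighted 23.6 and the unweighted 29.2 reflects the class imbalance: there are almost twice as many lost auctions as won ones.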