Welcome. In this video, you will learn how to use XLMiner to perform logistic regression. Throughout the video, I will use the medical appointment data set. I will first demonstrate how to build a logistic regression model with a single predictor variable, then proceed to show you how to build a multiple logistic regression model with multiple predictor variables. Finally, I will show how to use XLMiner to partition the data set and perform cross validation.

Here is the medical appointment data we discussed before. To perform a logistic regression with a single predictor variable, we are going to use two columns: Status is used as the target variable and Lag is used as the predictor variable. If you have XLMiner properly installed, you should see the XLMiner ribbon when you bring up Excel. To perform logistic regression, click Classify > Logistic Regression. Note that all variables are listed here. Choose Lag and move it to Selected Variables, and set Status as the Output Variable. At the bottom of the window we need to specify the success class. Since we are interested in predicting appointment cancellation, you should check the box and select Canceled in the drop-down menu. Also note that the default value for the cutoff probability is 0.5. If you click through the remaining screens, you will see a number of additional options; we will skip them for now and click Finish. This creates three new output sheets in the Excel workbook.

Let's take a closer look at the output sheet named LR_Output. At the top of the sheet is the Output Navigator, which can take you to the different sections of the output. Let's scroll down to the Regression Model section. As we can see here, the coefficient estimates are -1.7431 and 0.01658. This table also gives us the p-values, which are close to zero for both coefficients, indicating that the model is statistically significant. There are some additional summary statistics in the table on the right. In particular, the Multiple R-squared is 0.03179.
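To see what the fitted model implies, we can evaluate the logistic equation outside XLMiner. This is a minimal sketch using the coefficient estimates reported above (intercept -1.7431, Lag slope 0.01658); the function name is my own, not part of XLMiner:

```python
import math

def cancel_probability(lag_days, b0=-1.7431, b1=0.01658):
    """Logistic model: P(Canceled) = 1 / (1 + exp(-(b0 + b1 * lag)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * lag_days)))

# Predicted cancellation probability for an appointment booked
# 0 days ahead versus 60 days ahead.
p0 = cancel_probability(0)
p60 = cancel_probability(60)
print(round(p0, 3), round(p60, 3))
```

Because the Lag coefficient is positive, the predicted probability of cancellation rises as the lead time grows, which is the direction of the effect the model estimates.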
Note that the Multiple R-squared here is a pseudo R-squared value and does not share the same interpretation as the R-squared from a linear regression model. The output also reports a residual deviance of 7,663. Both the pseudo R-squared and the residual deviance can be used to compare different models: larger values of pseudo R-squared and smaller values of residual deviance are preferred. In the lower portion of the worksheet, a summary report on predictive performance is given. Since we did not partition the data, this result is based on applying the model to the whole data set, which is used as the training data.

Building the multiple logistic regression model follows almost exactly the same steps. First, return to the worksheet with the data. We would like to add the gender variable to the model. Before building the model, we first create a dummy variable for gender. Click Transform > Transform Categorical Data > Create Dummies. In the pop-up window, move Gender to Variables to be Factored, and click OK. This creates a new sheet called Create Dummies. Note that in the Data section, two additional columns have been added. The second-to-last column is Gender_F, whose value is 1 if gender is F and 0 otherwise. The last column is Gender_M, whose value is 1 if gender is M and 0 otherwise.

Click Classify > Logistic Regression. Move Lag and Gender_M to Selected Variables and set Status as the Output Variable. Note that by choosing the variable Gender_M, we choose to code male as 1 and female as 0. Clicking Finish creates three additional sheets. Going to the Regression Model section of the output, we see that we now have three coefficient estimates. The coefficient for the Gender_M variable is -0.3572. Since Gender_M equals 1 for male and 0 for female, this suggests that the log-odds, and consequently the probability, of a cancellation are smaller for male patients. Also, the Multiple R-squared value is 0.03586 and the residual deviance is 7,631.
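The dummy-coding step and the interpretation of the Gender_M coefficient can be sketched in a few lines of Python. This is only an illustration of what Create Dummies produces, not XLMiner's internal code, and the two example rows are made up:

```python
import math

def make_gender_dummies(rows):
    """Mimic XLMiner's Create Dummies: add Gender_F / Gender_M columns."""
    out = []
    for r in rows:
        r = dict(r)
        r["Gender_F"] = 1 if r["Gender"] == "F" else 0
        r["Gender_M"] = 1 if r["Gender"] == "M" else 0
        out.append(r)
    return out

# Hypothetical sample rows, just to show the two new columns.
data = [{"Gender": "F", "Lag": 10}, {"Gender": "M", "Lag": 25}]
coded = make_gender_dummies(data)

# Interpreting the reported Gender_M coefficient of -0.3572:
# exp(coefficient) is the multiplicative change in the odds of
# cancellation for male patients relative to female patients.
odds_ratio = math.exp(-0.3572)
print(coded[1]["Gender_M"], round(odds_ratio, 2))
```

An odds ratio below 1 is just another way of stating the point made above: the odds of cancellation are estimated to be lower for male patients, holding Lag fixed.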
Hence, by including the gender variable, we increased the Multiple R-squared value and decreased the residual deviance.

Our last topic in this video is cross validation. The first step in cross validation is to divide the data into training and validation sets. Go back to the Create Dummies sheet, which contains the updated data set with the dummy variables. Click Partition and then Standard Partition. In the pop-up window, we can choose which variables to include; we choose to include all variables by moving them to Selected Variables. At the bottom we can set the percentages for the training and validation sets. We accept the default 60-40 split. Note that you can also set your own percentages, in which case you can divide the data into three sets instead of two, where the third set is called the test set. In order to replicate the results you see on the screen, make sure that you check the Set Seed checkbox and enter 12345 in the input box. Since the partition is randomly selected, setting the seed of the random number generator ensures that your partition is exactly the same as mine. Otherwise, your results will differ slightly from mine because you would be using a different data split for training and validation. Clicking OK brings us to a new worksheet with the partitioned data. As you can see, there are 4,478 rows in the training set and 2,985 rows in the validation set.

Now we can bring up the logistic regression window as before. Observe that the worksheet field is filled in with the data partition worksheet we just created. Again, we build a logistic regression model with Status as the output variable and Lag and Gender_M as the predictor variables. Compared to the case without cross validation, the output now contains a summary report for validation data scoring, which is different from the training data scoring results. The coefficient estimates are also a little different from before.
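The partitioning step can be sketched as a seeded random split. This is not XLMiner's actual partitioning algorithm, so the rows selected will differ from XLMiner's, but with the same row count a 60-40 split produces the same set sizes as in the video:

```python
import random

def standard_partition(rows, train_frac=0.6, seed=12345):
    """Seeded random split into training and validation sets,
    analogous in spirit to Standard Partition with Set Seed."""
    rng = random.Random(seed)        # fixed seed -> reproducible split
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_train = round(len(rows) * train_frac)
    train = [rows[i] for i in idx[:n_train]]
    valid = [rows[i] for i in idx[n_train:]]
    return train, valid

rows = list(range(7463))             # 4,478 + 2,985 rows, as in the video
train, valid = standard_partition(rows)
print(len(train), len(valid))
```

Running the function twice with the same seed returns the identical split, which is exactly why setting the seed makes the results reproducible.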
This is because when we perform cross validation, only the training data is used to build the model. Note that, in the confusion matrix, the class Canceled is listed first; this is slightly different from the convention used in other lectures. The last worksheet, the one with ValidationLiftChart in its name, contains both the lift chart and the ROC curve. As we discussed before, these can be used to compare different models.
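The confusion matrix XLMiner reports can be reproduced by applying the 0.5 cutoff to the predicted probabilities. The sketch below is a generic implementation; the label "Not canceled" and the four example records are hypothetical stand-ins, since the video does not name the non-success class:

```python
def confusion_matrix(actual, prob, cutoff=0.5,
                     success="Canceled", failure="Not canceled"):
    """2x2 confusion matrix keyed by (actual, predicted) class,
    with the success class (Canceled) listed first."""
    counts = {(a, p): 0 for a in (success, failure)
                        for p in (success, failure)}
    for y, pr in zip(actual, prob):
        pred = success if pr >= cutoff else failure
        counts[(y, pred)] += 1
    return counts

# Tiny made-up example: two correct and two incorrect classifications.
actual = ["Canceled", "Not canceled", "Canceled", "Not canceled"]
prob   = [0.7, 0.4, 0.3, 0.6]
cm = confusion_matrix(actual, prob)
print(cm)
```

Lowering the cutoff below 0.5 would classify more appointments as Canceled, trading false negatives for false positives; sweeping the cutoff through all values is what traces out the ROC curve mentioned above.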