Welcome.

In this video you will learn how to use XLMiner to perform logistic regression.

Throughout the video, I will use a medical appointment dataset.

I will first demonstrate how to build a logistic regression model

with one predictor variable,

and then proceed to show you how to build a multiple logistic regression model

with multiple predictor variables.

At the end, I will show how to use XLMiner to partition the dataset

and perform cross-validation.

Here is the medical appointment data we discussed before.

To perform a logistic regression with one predictor variable,

we're going to use two columns: Status is used as the target variable

and Lag is used as the predictor variable.

If you have XLMiner properly installed,

you should see the XLMiner ribbon when you bring up Excel.

To perform logistic regression, click on Classify and Logistic Regression.

Note that all variables are listed here.

Choose Lag and move it to Selected Variables and set Status as Output Variable.

At the bottom of the window, we need to specify the Success class.

Since you are interested in predicting appointment cancellation,

you should check the box and select Cancelled in the drop-down menu.

Also note that the default value for cut-off probability is 0.5. Click Next.
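The cut-off rule itself is simple: an observation is assigned to the Success class whenever its predicted probability meets the cut-off. Here is a minimal sketch in Python (the probabilities and class labels are illustrative, not taken from the dataset):

```python
# Sketch of the cut-off rule: assign the Success class ("Cancelled")
# when the predicted probability meets the cut-off (default 0.5).
def classify(prob, cutoff=0.5, success="Cancelled", failure="Kept"):
    """Turn a predicted probability into a class label."""
    return success if prob >= cutoff else failure

# Illustrative predicted probabilities for four appointments.
probs = [0.12, 0.55, 0.50, 0.31]
labels = [classify(p) for p in probs]
```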

You will see a number of options.

We skip them for now and click Finish.

This creates three new output sheets in the Excel workbook.

Let's take a closer look at the output sheet named LR_Output.

At the top of the window is the Output Navigator,

which can lead you to different sections of the output.

Let's go down to the Regression Model section.

As we can see here, the coefficient estimates are -1.7431 and 0.01658.

This table also gives us the p-values, which are close to zero for both coefficients,

indicating that both coefficients are statistically significant.
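To see what these estimates mean, you can convert the fitted log-odds into a probability with the logistic function. A short sketch, using the coefficient values reported above:

```python
import math

# Fitted single-predictor model from the output:
# log-odds = -1.7431 + 0.01658 * Lag
b0, b1 = -1.7431, 0.01658

def cancel_prob(lag_days):
    """Predicted probability that an appointment is cancelled."""
    log_odds = b0 + b1 * lag_days
    return 1.0 / (1.0 + math.exp(-log_odds))

# The positive Lag coefficient means a longer lag between scheduling and
# the appointment raises the predicted cancellation probability.
```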

There are some additional summary statistics in the table on the right.

In particular, the multiple R-squared is 0.03179.

Note that the multiple R-squared is a pseudo R-squared value

and does not share the same interpretation as R-squared from the linear regression model.

It also reports a residual deviance of 7,663.

Both the pseudo R-squared

and the residual deviance can be used to compare different models.

Larger values of pseudo R-squared and smaller values of residual deviance are preferred.

In the lower portion of the worksheet,

a summary report on the predictive performance is given.

Since we did not partition the data,

the result is based on applying the model to the whole dataset,

which is used as the training data.

Building the multiple logistic regression model follows almost the exact same steps.

Let's first return to the worksheet with the data.

We would like to add the Gender variable to the model.

Before building the model, we first create a dummy variable for Gender.

Click Transform, then Categorical Data, then Create Dummies.

In the pop-out window, move Gender to Variables to be Factored, and click OK.

This creates a new sheet called CreateDummies.

Note that in the Data section, two additional columns are added.

The second to last column is Gender_F,

where the value is one if Gender is F and zero otherwise.

The last column is called Gender_M,

where the value is one if Gender equals M and zero otherwise.
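If you want to reproduce this step outside XLMiner, pandas offers the same transformation. A minimal sketch (the Gender values are illustrative stand-ins for the dataset):

```python
import pandas as pd

# A few illustrative rows standing in for the Gender column of the dataset.
df = pd.DataFrame({"Gender": ["F", "M", "F", "M"]})

# Create Gender_F and Gender_M dummy columns, as XLMiner's Create Dummies does.
dummies = pd.get_dummies(df["Gender"], prefix="Gender").astype(int)
df = pd.concat([df, dummies], axis=1)
```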

Click Classify and Logistic Regression.

Move Lag and Gender_M to Selected Variables and set Status as Output Variable.

Note that by choosing the variable Gender_M,

we set male to one and female to zero.

Clicking Finish creates three additional sheets.

Go to the Regression Model section in the output.

We see that now we have three coefficient estimates.

The coefficient for Gender_M variable is -0.3572.

Since Gender_M equals one for male and zero for female, this suggests that the log-odds,

and consequently the probability of a cancellation, is smaller for male patients.
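One common way to quantify this effect is to exponentiate the coefficient, which gives an odds ratio. A quick sketch using the estimate above:

```python
import math

# Gender_M coefficient from the output; exponentiating a logistic
# regression coefficient gives an odds ratio.
coef_gender_m = -0.3572
odds_ratio = math.exp(coef_gender_m)
# odds_ratio is about 0.70: holding Lag fixed, a male patient's odds of
# cancelling are roughly 70% of a female patient's odds.
```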

Also, the multiple R-squared value is 0.03586 and the residual deviance is 7,631.

Hence, by including the Gender variable, we increase the multiple R-squared value

and decrease the residual deviance.
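This comparison can be sketched directly, using the values reported in the XLMiner output: larger pseudo R-squared and smaller residual deviance are preferred.

```python
# The two fitted models, with the pseudo R-squared and residual
# deviance values reported in the XLMiner output.
lag_only = {"pseudo_r2": 0.03179, "deviance": 7663}
lag_and_gender = {"pseudo_r2": 0.03586, "deviance": 7631}

def prefer(model_a, model_b):
    """Return the model with higher pseudo R-squared and lower deviance,
    or None if neither model dominates on both criteria."""
    if (model_a["pseudo_r2"] > model_b["pseudo_r2"]
            and model_a["deviance"] < model_b["deviance"]):
        return model_a
    if (model_b["pseudo_r2"] > model_a["pseudo_r2"]
            and model_b["deviance"] < model_a["deviance"]):
        return model_b
    return None

better = prefer(lag_only, lag_and_gender)
```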

Our last topic in this video is cross-validation.

The first step in cross-validation is to divide the data into training

and validation sets.

Go back to the CreateDummies sheet,

which contains the updated data set with dummy variables.

Click Partition and then Standard Partition.

In the pop-out window, we can choose which variable to include.

We choose to include all variables by moving them to Selected Variables.

On the bottom, we can set the percentage for the training and validation set.

We accept the default 60-40 split.

Note that you can also choose to set your own percentage,

in which case you can divide the data into three sets instead of two sets,

where the last set is called the test set.

In order to replicate the results you see on the screen,

make sure that you check the Set seed checkbox and put 12345 in the input box.

Since the partition is randomly selected,

by setting the seed of the random number generator, you ensure

that the partition you have is exactly the same as mine.

Otherwise, your results will differ slightly from mine

because you are using a different data split for training and validation.

Clicking OK brings us to a new worksheet with the partitioned data.

As you can see, there are 4,478 rows in the training data

and 2,985 rows in the validation set.
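The same kind of seeded split can be sketched in pandas; the dataframe below is a small stand-in for the CreateDummies sheet, and `random_state` plays the role of XLMiner's seed:

```python
import pandas as pd

# A small stand-in for the CreateDummies sheet (100 illustrative rows).
df = pd.DataFrame({"Lag": range(100), "Gender_M": [i % 2 for i in range(100)]})

# 60-40 split with a fixed seed, so the partition is reproducible.
train = df.sample(frac=0.6, random_state=12345)
valid = df.drop(train.index)
```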

Now, we can bring out the logistic regression window, as before.

Observe that the Worksheet field is filled with Data_Partition,

which is the worksheet we just created.

Again, we're building a logistic regression model with Status as Output Variable

and Lag and Gender_M as predictor variables.

Compared to the case without cross-validation,

now the output contains a summary report for Validation Data Scoring

which is different from the Training Data Scoring results.

The coefficient estimates are also a little different than before.

This is because when we perform cross-validation,

only the training data is used to build the model.

Note that in the confusion matrix, the class Cancelled is listed first.

This is slightly different from the ones used in other lectures.

The next worksheet, LR_ValidationLiftChart, contains both the lift chart

and the ROC curve.

As we discussed before, they can be used to compare different models.