In this video, we use an example to illustrate how to fit a logistic regression model and interpret the results. Missed appointments cost the US healthcare system over $150 billion a year. They directly cause loss of revenue and under-utilization of precious medical resources. They also lead to long patient waiting times and, in the long run, to higher medical costs.
We obtained a data set with 7,463 medical appointments over a three-year period at a specialized care clinic. In this data set, each row corresponds to an appointment and indicates whether it was cancelled or not. In total, 1,662 out of 7,463 appointments were cancelled, or about 22%. Since we are interested in appointment cancellations, the target variable is whether an appointment is cancelled or not, and success in this particular context means that an appointment is cancelled.
The data set has many columns, which can be roughly divided into two groups. One group captures appointment information. One particularly important variable here is lag, which is the time between the date an appointment was made and the appointment date. For example, if an appointment is made one week in advance, then the lag is seven, representing a difference of seven days between the time the appointment was made and the appointment date. The other group is patient information, including personal information such as gender, age, marital status, and employment status, and medical-related information, such as the time since the patient registered with the clinic and the insurance policies the patient carries.
We will only use a few columns of the data set. We would like to explore the relationship between appointment lag and appointment status. We'll first explore the data using different graphs. Note that here the appointment lag is a numerical variable, but the appointment status is a categorical variable. Therefore, a scatter plot cannot be used. Here is a histogram of the appointment lag.
Observe that the lag ranges between 0 and 130 and its distribution is right skewed. The majority of appointments are made within 60 days of the appointment date. The distribution is bi-modal, with one mode between 10 and 20 and another mode between 90 and 100.
Another way to explore the relationship is to use a bar graph. This bar graph shows the number of cancelled appointments against appointment lag. Not surprisingly, most cancelled appointments have a lag of less than 60. This is simply because the majority of appointments have a lag of less than 60. In fact, the pattern of the bars is similar to that in the histogram of appointment lags.
Perhaps a more relevant statistic here is the cancellation rate, which is the proportion of appointments that are cancelled. This rate can be used as an estimate of the cancellation probability. This line graph shows the cancellation rate against lag. The cancellation rate fluctuates widely as a function of lag. However, the line graph seems to have a positive trend, where the cancellation rate appears to be higher for larger lag values. Our logistic regression model will try to capture this trend.
Here is our first logistic regression model, where we model the log odds of appointment cancellation as a linear function of the lag variable. Beta 0 and beta 1 here are the intercept and slope, respectively. After fitting the model, we obtain beta 0 equal to minus 1.7431 and beta 1 equal to 0.01658. Let me emphasize here that the coefficient estimates, beta 0 and beta 1, are related to the cancellation probabilities in a non-linear manner. These coefficients are related to the log odds in a linear manner. For example, for a one-day increase in lag, the log odds increase by beta 1. By exponentiating both sides, we obtain the odds. After some algebra, we obtain the cancellation probability as a logistic function of lag. It can be verified that since beta 1 is a positive number, the predicted probability is increasing in lag. That is, appointments with longer lags are more likely to be cancelled.
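The algebra just described can be written out step by step. This is a sketch of the standard derivation, where p stands for the cancellation probability of an appointment with a given lag; the symbol p is notation introduced here for illustration and may differ from what is shown on the slides.

```latex
\begin{align}
\log\frac{p}{1-p} &= \beta_0 + \beta_1 \cdot \text{lag}
  && \text{(log odds, linear in lag)} \\
\frac{p}{1-p} &= e^{\beta_0 + \beta_1 \cdot \text{lag}}
  && \text{(odds, after exponentiating both sides)} \\
p &= \frac{e^{\beta_0 + \beta_1 \cdot \text{lag}}}{1 + e^{\beta_0 + \beta_1 \cdot \text{lag}}}
   = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot \text{lag})}}
  && \text{(probability, solving for } p\text{)} \\
\frac{dp}{d\,\text{lag}} &= \beta_1 \, p \,(1-p)
  && \text{(positive whenever } \beta_1 > 0\text{)}
\end{align}
```

The last line is why a positive beta 1 means the predicted probability increases with lag: p times 1 minus p is always positive, so the sign of the slope is the sign of beta 1.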
The equation can be used to make predictions of cancellation probabilities. What is the predicted cancellation probability for an appointment with a 10-day lag? Taking lag equal to 10 and plugging it into the equation, we obtain 0.1712. Therefore, the predicted cancellation probability is 17.12%.
How do we assess model fit for logistic regression? For linear regression, as we discussed before, we can use R squared. However, R squared is not applicable to logistic regression. There are several commonly used measures, including pseudo R squared, deviance, and AIC. These measures are used for comparing models. In general, a larger pseudo R squared, a smaller deviance, and a smaller AIC are preferred. Note that there are many different variants of pseudo R squared. All of them try to mimic R squared for linear regression and have properties similar to those of R squared. Software tools typically report a subset of these measures. For example, rx minor reports both pseudo R squared and deviance.
Another question is how reliable the coefficient estimates are. As in linear regression, this question can be assessed using statistical significance, indicated by p-values. For our model, the p-values for both coefficients are close to zero, indicating that both coefficients are statistically significant. The p-value can also be used as an imperfect way to determine whether to include a variable.
Coefficients in a logistic regression model represent the change in the log odds of the outcome for a one-unit increase in the predictor variable. In our model, the coefficient estimate beta 1 gives the increase in log odds for a one-day increase in the value of the lag variable. It is common to exponentiate the coefficients and interpret them as odds ratios, which might be more intuitive to understand.
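The 10-day prediction and the odds interpretation can be checked numerically. Here is a minimal Python sketch that assumes only the two coefficient estimates quoted in this video; the function names are hypothetical and chosen for illustration, and they are not taken from any particular software tool.

```python
import math

# Coefficient estimates quoted in the video.
BETA_0 = -1.7431   # intercept
BETA_1 = 0.01658   # slope: change in log odds per extra day of lag


def log_odds(lag, beta_0=BETA_0, beta_1=BETA_1):
    """Log odds of cancellation, a linear function of lag."""
    return beta_0 + beta_1 * lag


def cancellation_probability(lag, beta_0=BETA_0, beta_1=BETA_1):
    """Predicted cancellation probability: logistic function of the log odds."""
    return 1.0 / (1.0 + math.exp(-log_odds(lag, beta_0, beta_1)))


def odds_ratio(days=1, beta_1=BETA_1):
    """Factor by which the odds are multiplied when lag increases by `days`."""
    return math.exp(beta_1 * days)


if __name__ == "__main__":
    p10 = cancellation_probability(10)
    print(f"P(cancel | lag = 10) = {p10:.4f}")  # 0.1712, matching the video
    print(f"Odds ratio per extra day of lag = {odds_ratio():.4f}")
```

Under these estimates, each extra day of lag multiplies the odds of cancellation by the exponential of 0.01658, which is roughly 1.017. In other words, the odds of cancellation rise by about 1.7% per additional day of lag.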