0:00

Welcome back.

Let's now turn to the second example explored in class this week,

namely the HR example.

As a quick reminder, our goal is to understand what distinguishes

employees who stay in the company, and those who leave the company.

To do so we're going to explore a data set very similar to the one we

had in module two.

However, this time, we've added more employees, and more importantly, we've

added an outcome variable which tells us if the employee indeed left or not.

All right, let's turn to R and analyze our data set.

Once you've set your working directory to the folder where you have downloaded the HR

data set, you run this line to clean up the current memory of your R session.

We then load our data set with the read.table function and call it data.

Now let's have a look at our variables with the str function and

see some summary statistics with the summary function.
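The setup steps above can be sketched as follows. Since the real HR data set cannot ship with this transcript, the snippet first writes a tiny stand-in CSV; with the real file you would only need the read.table, str and summary calls, and the file name here is an assumption.

```r
# Write a tiny stand-in for the HR data set (hypothetical file name and values);
# with the real download you would point read.table at that file instead.
writeLines(c("S,LPE,NP,ANH,TIC,Newborn,left",
             "0.80,0.86,5,262,6,0,1",
             "0.41,0.50,2,153,3,0,1",
             "0.92,0.88,5,259,5,1,0"),
           "HR_toy.csv")
data <- read.table("HR_toy.csv", header = TRUE, sep = ",")
str(data)      # variable names and types: 3 obs. of 7 variables in this toy file
summary(data)  # summary statistics for each variable
```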

The str function shows us that we have 12,000 observations in this data set,

and seven variables.

First, we have the satisfaction, then LPE,

which stands for last project evaluation, then NP,

which is the number of projects worked on by the employee in the last 12 months.

Then the ANH, which is the average number of hours worked

weekly by the employee over the last 12 months.

Then the TIC variable which is the time spent in the company.

And then the newborn variable, which tells us whether or

not the employee had a newborn within the last 12 months, and

finally, our outcome variable, which is called left,

which will take the value one if the employee indeed left and zero otherwise.

To have a better understanding of our data set, we run the summary function.

1:45

We see that nearly 17% of our employees have left the company.

We can also see that 15.4% of the employees

had a newborn within the last 12 months and also that, on average,

employees have been in the company for 3.2 years.

Remember that using the summary function and

looking at its output allows you to get more familiar with your data set.

You can often find out a lot of information with the summary function,

which will save you a lot of time and drive your research.

Another thing we could do in order to get familiar with our data set is to use

the table function in order to obtain the frequencies of the left variable.

Here we see that 10,000 observations are employees who stayed in the company.

While 2000 of them are employees who left.

If we wanted to have the same information but in percentages, we would need to

divide by the output of the nrow function on our data set, which counts the number of rows.

Let's run the line, and now we can see that we obtained the same

2:52

0.1667 (16.67%) that we obtained with the summary function.

Alternatively, we could also plot a histogram with the hist function

on the left variable and we would see the same information, but visually.
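The frequency, proportion, and histogram steps described above can be sketched as below; the toy left column here mimics the real 10,000 / 2,000 split at a much smaller scale.

```r
# Toy stand-in for the outcome column: 10 stayers (0) and 2 leavers (1).
data <- data.frame(left = c(rep(0, 10), rep(1, 2)))
table(data$left)               # absolute frequencies: 10 stayed, 2 left
table(data$left) / nrow(data)  # proportions: 0.8333 stayed, 0.1667 left
hist(data$left)                # the same information, visually
```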

Now that we're more comfortable with our data set, let's turn back to our focus for

this tutorial, which is to try to understand

what is different between employees who leave the company and those who stay.

We can check out the correlation between our variables.

As we said in the previous video,

we call the cor function on our data set and run it.

As a quick reminder, correlation only gives us information about the strength

and the direction of the linear relationship between two variables.
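A correlation check of this kind can be sketched on simulated data; the variable names mirror the transcript, but the numbers are made up purely for illustration, with satisfaction (S) driving attrition so that the S-left correlation comes out negative.

```r
# Simulated data: employees with low satisfaction tend to be the leavers.
set.seed(1)
data <- data.frame(S = runif(200), TIC = sample(2:6, 200, replace = TRUE))
data$left <- as.numeric(data$S < 0.3)  # leavers are the unsatisfied here
round(cor(data), 2)  # pairwise correlations; only linear, pairwise information
```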

3:33

If we look at the correlation of all our variables with their left variable.

We see that satisfaction is negatively correlated with it.

TIC is very moderately positively correlated with it.

While a newborn is fairly weakly negatively correlated with it.

Now, this does not tell us much, because we're looking at the relationship

between the left variable and all the others, but separately.

What we want is to understand how they interact with each other.

For instance, looking at correlation only, the relationship between

the employee leaving the firm and

the number of projects worked on appears very, very weakly positive.

However, you saw in class that, all else equal, the number of projects

worked on actually has a significant negative relation to attrition.

Looking only at correlation, this is something we would have missed.

Which is why we need to add another tool to our current toolkit and

learn how to build a logistic regression model in R.

To build a logistic regression model in R, we use the glm function.

Our first argument is the outcome variable, for us it's left.

You then insert a tilde followed by the variables you want

to use to build your model.

In our case, we'll use all the other variables, so we type a dot.

We then set the family argument equal to binomial(logit) and

then the data argument equal to our data set, which is data.

Let's run the line.
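The model call described above can be sketched on simulated data; the variable names mirror the transcript, but the data-generating numbers are invented for illustration.

```r
# Simulated HR-style data, then the glm call as described in the tutorial.
set.seed(42)
n <- 500
data <- data.frame(S   = runif(n),
                   TIC = sample(2:6, n, replace = TRUE),
                   NP  = sample(2:7, n, replace = TRUE))
# invented true model: satisfaction lowers attrition, tenure raises it
data$left <- rbinom(n, 1, plogis(1 - 4 * data$S + 0.3 * data$TIC - 0.2 * data$NP))
logreg <- glm(left ~ ., family = binomial(logit), data = data)
head(logreg$fitted.values)  # in-sample outputs: probabilities in (0, 1)
```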

First, let's note that our fitted values are the output of the model

on the in-sample data.

In the case of a logistic regression model, the output is a probability.

Let's look at the attrition probabilities produced by the model

by building a histogram of the fitted values.

To do so we use the hist function that we already know and the fitted values.

The frequency is on the vertical axis while the fitted values,

which are probabilities, are on the horizontal axis.

Now let's assess how the model is performing.

We can start by assessing the correlation between the attrition estimated by

the model and the actual one.

At 0.41, the linear relationship between the two is clearly positive.

That's a pretty good sign.

Now, since the output of a logistic regression is a probability,

that is, a continuous variable that can take any value between zero and one,

we need to set a value that tells us what we consider to be a leaver and

what we consider a person who stays.

This is called the cut off value.

Let's set our cut off to 0.5.

This means that any prediction that is above 0.5 will be a leaver,

and any prediction below will be a stayer.

Let's evaluate the predictions based on our cutoff value.

To obtain predictions we could run these three lines and we will in a minute.

But for now, let's try to build the confusion matrix in the command line.

What we want is to know when the fitted value is above 0.5.

This will return true.

And if it's below, it will return false.

And then we want to compare to the actual values.

So we type table(logreg$fitted.values >= cutoff, data$left),

comparing the predictions with the actual values.

Let's run it. Now, the rows here are the fitted values

and the columns are the actual values.

So here, this means that these people were predicted to stay and, indeed, stayed.

While 536 people were predicted by the model to leave and, in fact, they stayed.

Now based on our matrix,

let's compute the percentage of correctly classified employees who stayed.

Stayed means that the left variable is zero and that the model predicted FALSE.

8:34

And we get 82% accuracy.

Again if we run this line, we get the same thing.
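The confusion matrix and the accuracy computations above can be sketched in one self-contained snippet on simulated data; the exact numbers will of course differ from the ones in the video.

```r
# Simulated data, model, confusion matrix, and the three accuracy figures.
set.seed(1)
n <- 500
data <- data.frame(S = runif(n))
data$left <- rbinom(n, 1, plogis(2 - 6 * data$S))  # invented true model
logreg <- glm(left ~ S, family = binomial(logit), data = data)
cutoff <- 0.5
pred <- logreg$fitted.values >= cutoff
table(pred, data$left)                             # confusion matrix
sum(!pred & data$left == 0) / sum(data$left == 0)  # correctly classified stayers
sum(pred & data$left == 1) / sum(data$left == 1)   # correctly classified leavers
mean(pred == (data$left == 1))                     # overall accuracy
```

Raising or lowering cutoff in this snippet reproduces the trade-off discussed next: a high cutoff predicts "leaver" rarely but precisely, a low cutoff flags more potential leavers at the cost of more false alarms.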

One question you may have is: how do we choose a cutoff value?

Well, the answer is not easy.

It depends on the circumstances and the objective of your analysis.

Let me invite you to play with the cutoff value,

in order to understand how it works.

For now, let's set it at .7.

Let's compute the percentage of correctly classified employees who stayed.

And here you can see that we get a number that is higher, 0.9805,

whereas it was 0.9464 with the lower cutoff value.

Let's compute the percentage of correctly classified employees who left.

9:17

Now we get a much lower value.

We get 0.012.

Well, we had 0.1905 before.

Now let's compute the overall percentage of correctly classified employees,

and we get a slightly lower value, around 0.8190, while we had 0.8204 before.

What we find out here is that when the cutoff value is large,

we rarely predict the outcome.

In our case, we rarely predict that the employee is leaving.

But this allows us to identify the employees most likely to leave,

and we can take actions designed specifically for these people.

Now, let's set our cutoff value to 0.3.

First, let's compute the percentage of correctly classified employees who've

stayed.

Now we get 0.8888, which is slightly lower than

the 0.9464 that we got with the 0.5 as a cutoff value.

We then compute the percentage of correctly classified employees who left,

and we get 0.473, which is much higher than the 0.1905 that we got before.

What we understand here is that,

if the cutoff value is low, we rarely predict that the employee is staying.

And we make more errors where we predict that the employee will leave

when, in fact, he or she does not.

This allows us to identify the employees who might leave.

Here it's really good because we can take preventative actions to try to

retain those people.

Now that we know how our model is performing, let's use the output of

the summary function to understand what distinguishes our two categories.

To do so, we call the summary function on our model.

If we look at the p-values, all our predictors are statistically significant.

To look at how important they are, we can look at the absolute value of the z value.

It tells us that the satisfaction level is the most important,

followed by TIC and the number of projects worked on.

We then look at the coefficient to see the direction of the effect.

The effect of satisfaction is negative on attrition, so

it is positive for the business.

Same for the number of projects worked on:

the more projects worked on, the smaller the attrition.

Inversely, the effect of the time spent in the company is positive on attrition,

so it is negative for the business.
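Reading the summary table programmatically can be sketched as below, again on simulated data; coef(summary(...)) exposes the Estimate, z value, and p-value columns discussed above.

```r
# Simulated model, then the coefficient table from its summary.
set.seed(1)
n <- 500
data <- data.frame(S = runif(n), TIC = sample(2:6, n, replace = TRUE))
data$left <- rbinom(n, 1, plogis(-4 * data$S + 0.5 * data$TIC - 1))
logreg <- glm(left ~ ., family = binomial(logit), data = data)
ctab <- coef(summary(logreg))
ctab  # columns: Estimate, Std. Error, z value, Pr(>|z|)
# importance: sort predictors by absolute z value (intercept excluded)
sort(abs(ctab[-1, "z value"]), decreasing = TRUE)
```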

Now that we know that the time spent in the company, or TIC,

is significant, we want to further explore it.

Maybe we can try to plot attrition as a function of time spent in the company.

To do so, I use the plot function and my first argument is the TIC variable,

which I want to have on the horizontal axis.

While my second argument is our outcome variable, which is our left variable, and

which I want on the vertical axis.

I then added a title with the main argument and

axis labels with the Y lab and the X lab arguments.

Let's now run this line.

12:20

Huh.

The output is not really what we intended to have, right?

Well, the reason is that our left variable is binary,

meaning that it only takes two values, namely 0 or 1.

Can you think of a way to circumvent that issue?

Well, what I thought about is to compute the proportion of leavers

by years spent in the company.

That means that we can calculate the proportion of attrition among all

those who have spent one year in the company, then repeat for

those who have spent two years in the company, and so on, up to six years.

How do we do that in R?

First, let's make a copy of our data set and store it in tdata.
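The aggregation step itself (run between these two points in the video) can be sketched as below on simulated data; tdata and aggbTimeRank are my guesses at the course script's variable names.

```r
# Mean attrition per year spent in the company, via aggregate().
set.seed(1)
tdata <- data.frame(TIC  = rep(2:6, each = 20),   # toy tenure values
                    left = rbinom(100, 1, 0.2))   # toy 0/1 outcome
aggbTimeRank <- aggregate(left ~ TIC, data = tdata, FUN = mean)
aggbTimeRank  # column 1: TIC value; column 2: mean attrition in that group
```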

13:32

The output is composed of two columns:

first the TIC variable, and second the mean of the left variable.

The TIC variable takes the value 2, and for the people who have spent

two years in the company, the mean attrition level is 0.01.

Then 3 for the people who have spent three years in the company,

whose mean attrition is 0.16, and so on.

Up to people who have spent six years in the company,

which have a mean attrition of 0.21.

Let's plot the left column of aggbTimeRank as a function of

the TIC column of aggbTimeRank.

I've added a title and labels that you can explore on your own.

So here we find out that attrition increases with the years spent

in the company up to year five, then, it decreases sharply.

This is an interesting fact, so let's investigate further.

We computed mean values, but

we do not know how many employees each dot represents.

To compute the number of leavers in each group, we rerun the aggregate function.

But we change the FUN argument from mean to length,

since we want to do some counting.

And we store the result in cntbTimeRank.

Let's check out the output of cntbTimeRank.
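The counting variant can be sketched as below; the only change from the mean-attrition call is the FUN argument, and the names tdata and cntbTimeRank are again my guesses at the script's names.

```r
# Number of employees per TIC group: same aggregate() call, FUN = length.
set.seed(1)
tdata <- data.frame(TIC  = rep(2:6, times = c(30, 50, 20, 10, 5)),
                    left = rbinom(115, 1, 0.2))
cntbTimeRank <- aggregate(left ~ TIC, data = tdata, FUN = length)
cntbTimeRank  # column 2 now holds the group sizes instead of mean attrition
```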

There are 3,021 employees who've spent two years at the company.

5,322 who spent three years at the company, and so on, up to our point

of interest, which is the employees who spent six years in the company,

and there are 512 of them, so that's still a lot of people.

And it's worth reporting to decision makers.

If we want to build a report on it,

we can build a nice visualization by running the following line.

15:22

What we're doing here is that we build the same plot as we did before, and

then we ask R to represent the dots by circles of sizes

proportional to the number of employees in each group.

I won't go into more detail, but

there are a few arguments in this line that you should explore and play with.
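One base-R way to get such proportional circles is the symbols() function; this is an assumption on my part, since the course script's exact call is not shown in the transcript.

```r
# Dots sized by group head count: circles proportional to sqrt(count),
# so circle area tracks the number of employees in each group.
set.seed(1)
tdata <- data.frame(TIC  = rep(2:6, times = c(30, 50, 20, 10, 5)),
                    left = rbinom(115, 1, 0.2))
aggbTimeRank <- aggregate(left ~ TIC, data = tdata, FUN = mean)    # mean attrition
cntbTimeRank <- aggregate(left ~ TIC, data = tdata, FUN = length)  # group sizes
symbols(aggbTimeRank$TIC, aggbTimeRank$left,
        circles = sqrt(cntbTimeRank$left), inches = 0.2,
        main = "Attrition by time spent in the company",
        xlab = "Time in company (years)", ylab = "Average attrition")
```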

Let's now shift our focus to our most important driver of attrition,

namely the satisfaction level and

see if we can get some insights that will allow us to retain more employees.

First, as we did for the TIC variable, we need to prepare our data and

find a way to have a limited number of values that can be taken by satisfaction.

What we can do is to rank the satisfaction variable from the largest to the smallest.

Meaning from the most satisfied employee to the least satisfied one.

As always, first let's make a copy of our data set.

We then add a variable to our tdata data set called rankSat.

We use the rank function; because we are ranking from the most satisfied to

the least satisfied, we add a minus sign before the variable we want to rank on,

here -tdata$S.

We then divide the output by 600 and round the results,

to have 21 groups of similar satisfaction.

Let's run this line.
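The ranking step can be sketched on toy data: 1,200 simulated employees with a divisor of 60, so that round(rank / 60) falls in 0..20 just as 12,000 employees divided by 600 do in the course; tdata and rankSat are my guesses at the script's names.

```r
# Bin satisfaction into 21 groups by rank, most satisfied first.
set.seed(1)
tdata <- data.frame(S = runif(1200))             # toy satisfaction scores
tdata$rankSat <- round(rank(-tdata$S) / 60)      # group 0 = most satisfied
table(tdata$rankSat)                             # 21 groups of similar size
```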

Let's check out the output by typing tdata$rankSat.

What we see is that we obtain, instead of the satisfaction level that we had before,

numbers between 0 and 20, which allows us to have 21 groups of similar satisfaction.

Like we did before, we compute the average attrition rate for each group and

we count the number of employees in each group.

Then like we did before, we plot the attrition level.

Let's see what we got.

17:16

Here we have the very happy people that want to stay.

Here we have the people that Professor Grady called the "it's okay" people.

They're not really happy but they don't really want to leave.

What's way more interesting are those in between.

They're happy but they still want to leave.

This is definitely something you should investigate further if you

were on an HR analytics team.

Maybe they were just burned out.

Maybe they were hired away by clients.

In any case, these are likely to be the people you want to target

when undertaking retaining actions.

And then to the right you've got the unhappy people that indeed want to leave,

which is much less surprising.

So that's it for this tutorial.

I hope you learned a lot and had some fun playing with the data.

I will see you in the next module.