0:00

One of the most important components of building a machine learning algorithm or prediction model is understanding how the data actually look and how the data interact with each other. The best way to do that is by plotting the data, in particular plotting the predictors.

For this example we're going to be using the Wage data. The Wage data is from the ISLR package, which you can find at this link, and it comes from the book An Introduction to Statistical Learning. We're going to look at this data and see how we can use it for prediction.

So I load the ISLR package; I have to install it first, and then I can load it. I also load the ggplot2 package, because we're going to use it for plotting, and the caret package, because we're going to use it for model building. The Wage data is in the ISLR package, and I can load it with data(Wage), with a capital W. Then I can look at a summary of the Wage data to see all the different variables that are in there.

We have the year the data was collected and the age of the person the data was collected on. In this case, the data set contains only men. We also have their marital status, race, and education; the region, which here is just the Mid-Atlantic region; and then the different kinds of job classes and their health. So this gives you a little bit of information about the type of data we're going to be looking at. And you can already see that we've detected a few interesting characteristics of this data just by looking at a summary: we know that they're all men, and we know that they're all in the Mid-Atlantic region, for example.
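The loading steps described above can be sketched in R roughly as follows. This is a minimal sketch, assuming the ISLR, ggplot2, and caret packages are already installed; the install line is left commented out.

```r
# One-time install if the packages are missing (commented out on purpose):
# install.packages(c("ISLR", "ggplot2", "caret"))
library(ISLR)     # provides the Wage data set
library(ggplot2)  # used for plotting
library(caret)    # used for model building and featurePlot

data(Wage)        # note the capital W
summary(Wage)     # prints one summary per variable
```

Reading the summary output is how you notice things like a variable that takes only one value across the whole data set.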

1:29

So then what we do, just like always, is build a training set and a test set. Even before we do any exploration, we set aside the testing set, and we don't use it for anything until the very end of the model building process, when we apply the model to it just one time. So we're going to do all our plotting in the training set.
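One way to set aside the test set before any exploration is caret's createDataPartition. The sketch below assumes a 70/30 split and an arbitrary seed; neither value is stated in the lecture, so treat both as placeholders.

```r
library(ISLR)
library(caret)
data(Wage)

set.seed(1234)  # arbitrary seed (an assumption), only for reproducibility
# p = 0.7 is an assumed split fraction, not a value from the lecture
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]   # used for all exploration and plotting
testing  <- Wage[-inTrain, ]  # set aside until the very end

dim(training)
dim(testing)
```

Everything after this point would use only the training data frame.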

One example is to use the featurePlot function that comes from the caret package. This plot shows basically all of the features plotted against each other. It looks a little bit confusing for this data, so I thought I'd go through it. Here I'm saying the outcome is the wage, and then I'm going to look at three variables: age, education, and job class.

First of all, I have this y variable, which is the outcome that we care about, and here are the different variables: age, education (you can't read it very clearly in this plot, but you can if you make it yourself), and job class. So this panel here is the plot of the outcome y versus job class. You can do that for every box: if you want to see the plot of job class versus education, that's this panel here, and it's the same as this panel here, only with the axes reversed, and so forth. So it shows you all of the variables plotted against each other.

In particular, here are all the plots of each variable versus the y variable, and you're looking for any variable that seems to show a relationship with y. For example, you can see that there seems to be a trend between education and wage. So this is one way that you can look at all the data.
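A pairs plot like the one described can be sketched with featurePlot as below. The split fraction and seed are assumptions carried along only so the block runs on its own.

```r
library(ISLR)
library(caret)
data(Wage)

set.seed(1234)  # arbitrary seed (an assumption)
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]

# Outcome is wage; predictors shown are age, education, and job class
fp <- featurePlot(x = training[, c("age", "education", "jobclass")],
                  y = training$wage,
                  plot = "pairs")
print(fp)  # draws the scatterplot matrix
```

The panels involving y are the ones to scan first when looking for predictors related to the outcome.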

Another thing that you're going to use is either the qplot function in the ggplot2 package or just the plot function in base R.

3:19

So here I'm plotting age versus wage, and you can see again that there appears to be some kind of trend between age and wage. But you also see something that you notice frequently when making plots like this: some very strange patterns. There's this big chunk of observations up here that appears to be very different from the relationship down here for the rest of the observations. So one thing we might want to do is try to figure out why there's that strange relationship between age and wage before we build our model.

One thing that we could do, for example, using the ggplot2 package, is color that plot by different variables. So again, I've plotted age versus wage: age is on the x axis and wage is on the y axis. But now I've colored it by job class. You can do that by passing the color parameter to the qplot function. And what you see now is that most of the individuals in this upper chunk come from the information-based jobs as opposed to the industrial jobs. So that might explain a lot of the difference between these two big classes of observations. This gives you a way to detect variables that might be important in your model because they explain variation in the data.
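The two scatterplots just discussed might look like this in code. The sketch uses the ggplot() interface, since recent ggplot2 releases deprecate qplot; the qplot shorthand from the lecture is shown in a comment. The seed and split fraction are assumptions.

```r
library(ISLR)
library(caret)
library(ggplot2)
data(Wage)

set.seed(1234)  # arbitrary seed (an assumption)
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]

# Plain scatterplot: age on the x axis, wage on the y axis
p_plain <- ggplot(training, aes(x = age, y = wage)) + geom_point()

# Same plot colored by job class.
# qplot shorthand: qplot(age, wage, colour = jobclass, data = training)
p_job <- ggplot(training, aes(x = age, y = wage, colour = jobclass)) +
  geom_point()

print(p_plain)
print(p_job)
```

Swapping jobclass for another factor is a quick way to check which variables separate the unusual groups of points.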

You can also add regression smoothers. For example, I've again made a plot of age versus wage, but now I've colored it by education. Then I can use the geom_smooth function to apply a linear smoother to the data. You would have learned about this in Exploratory Data Analysis, but if not, you can see the function right here. What it does is fit a linear model for every education class. So you can see there's a purple line here that corresponds to people with advanced degrees, a green line here that corresponds to people with some college, and so forth. And so you can see whether there's a different relationship between age and wage for the different education groups.
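A per-group linear smoother of the kind described can be sketched as follows. The choice of ggplot() over qplot, the seed, and the split fraction are assumptions made so the block runs on its own.

```r
library(ISLR)
library(caret)
library(ggplot2)
data(Wage)

set.seed(1234)  # arbitrary seed (an assumption)
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]

# Color by education; geom_smooth then fits one linear model per color group
p_smooth <- ggplot(training, aes(x = age, y = wage, colour = education)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

print(p_smooth)
```

Because the colour aesthetic defines the groups, each education level gets its own fitted line.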

5:17

The other thing that is often very useful is to break up things like the wage variable into different categories, because sometimes it's clear that specific categories have different relationships. The way I tend to do that is with the cut2 function in the Hmisc package. If you load the Hmisc package and then use cut2, you can tell it with the g parameter how many groups to break the data into, and it'll break the data up into a factor based on quantile groups. So all of the observations that land between 20.1 and 91.7 on the wage variable get assigned to this factor level, all the values between 91.7 and 118.9 get assigned to this group, and so forth.
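Cutting wage into quantile groups with cut2 might look like the sketch below. The number of groups, g = 3, is an assumption, as are the seed and split fraction, so the exact cut points you get may differ from the ones read out above.

```r
library(ISLR)
library(caret)
library(Hmisc)
data(Wage)

set.seed(1234)  # arbitrary seed (an assumption)
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]

# g = 3 (assumed) asks for three quantile-based groups
cutWage <- cut2(training$wage, g = 3)
table(cutWage)  # counts per wage group; labels show the interval endpoints
```

The result is an ordinary factor, so it can be used anywhere a categorical variable is expected.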

What you can do now is use that factor to make different kinds of plots. Suppose I wanted to plot wage groups versus age. I can use qplot again, but now I pass it the boxplot geometry. Then I can see the plot of these different wage groups versus age, and sometimes that makes it easier to see trends that are emerging. For example, you can see here a little bit more clearly the relationship between age and wage.
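A box plot of age within each wage group can be sketched like this. It uses ggplot() with geom_boxplot rather than qplot, and g = 3, the seed, and the split fraction are all assumptions.

```r
library(ISLR)
library(caret)
library(ggplot2)
library(Hmisc)
data(Wage)

set.seed(1234)  # arbitrary seed (an assumption)
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]
training$cutWage <- cut2(training$wage, g = 3)  # g = 3 is assumed

# One box per wage group, showing the distribution of age in that group
# qplot shorthand: qplot(cutWage, age, data = training, fill = cutWage, geom = "boxplot")
p1 <- ggplot(training, aes(x = cutWage, y = age, fill = cutWage)) +
  geom_boxplot()

print(p1)
```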

6:31

The other thing you might want to do is add the points themselves on top of the box plots. The reason is that box plots can sometimes obscure how many points are being shown. So one thing you can do is pass qplot both boxplot and jitter, and then arrange the plots so you can see both the box plot itself and the box plot with points overlaid. That's what grid.arrange is doing: p1 is the plot that we made on the previous slide, p2 is the plot we made here with points overlaid, and grid.arrange puts the two plots side by side. You can see from the dots that there's a large number of points in each of the boxes, which suggests that any trend you observe may actually be real. If you observed just one or a few dots in a box, it would suggest that that particular box isn't very representative of what the data actually look like.
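The side-by-side comparison described above could be sketched as follows, assuming the gridExtra package is installed for grid.arrange. The jitter width and transparency, g = 3, the seed, and the split fraction are illustrative assumptions.

```r
library(ISLR)
library(caret)
library(ggplot2)
library(Hmisc)
library(gridExtra)
data(Wage)

set.seed(1234)  # arbitrary seed (an assumption)
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]
training$cutWage <- cut2(training$wage, g = 3)  # g = 3 is assumed

# p1: box plot only
p1 <- ggplot(training, aes(x = cutWage, y = age, fill = cutWage)) +
  geom_boxplot()

# p2: the same box plot with the individual points jittered on top
p2 <- p1 + geom_jitter(width = 0.2, alpha = 0.3)

grid.arrange(p1, p2, ncol = 2)  # two plots side by side
```

Sparse dots inside a box are the cue that the box summarizes very few observations.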

Another thing that's very useful is to use the cut variable, the factorized version of the continuous variable, to look at tables of data. Here I'm making a table comparing the factor version of wage to job class. I can see, for example, that there are more industrial jobs than information jobs in the lower wage group, and that trend reverses itself for the high-wage jobs: there are fewer industrial people and more information people. You can also use prop.table to get the proportions in each group. By passing it 1, I say I want the proportion in each row; if I passed it 2, it would give me the proportion in each column. So here I see that about 62% of the low-wage jobs correspond to industrial and about 38% correspond to information. You can use that to get an idea of how those proportions change across different wage levels.
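The tables just described can be sketched with base R's table and prop.table. As before, g = 3, the seed, and the split fraction are assumptions, so the exact percentages will depend on the split you draw.

```r
library(ISLR)
library(caret)
library(Hmisc)
data(Wage)

set.seed(1234)  # arbitrary seed (an assumption)
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]
cutWage  <- cut2(training$wage, g = 3)  # g = 3 is assumed

# Counts of job class within each wage group
t1 <- table(cutWage, training$jobclass)
t1

row_props <- prop.table(t1, 1)  # margin 1: proportions within each row
col_props <- prop.table(t1, 2)  # margin 2: proportions within each column
row_props
```

Each row of row_props sums to one, so it reads directly as the job class mix within a wage group.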

8:24

Finally, you can use density plots to plot the values of continuous predictors. Here again I'm using the qplot function: I'm plotting the wage variable as a density plot, colored by education. This basically shows where the bulk of the data is. On the x axis is the wage, and on the y axis is roughly the proportion of the variable that falls into that bin of the x axis. You can see, for example, that the high school grads tend to have more values down here in the lower part of the range, and the advanced degree folks tend to be a little bit higher. There's also an outlying group over here that tends to be very high, for both the advanced degree folks and the college grads, which are shown in blue. So sometimes density plots can show things that box plots can't. It's also easier to overlay multiple distributions when you're using density plots. In other words, if you break things up into a bunch of different groups and you want to see how the distributions change by group, density plots can be very useful.
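An overlaid density plot of this kind can be sketched as below, again using ggplot() in place of qplot and with an assumed seed and split fraction.

```r
library(ISLR)
library(caret)
library(ggplot2)
data(Wage)

set.seed(1234)  # arbitrary seed (an assumption)
inTrain  <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)
training <- Wage[inTrain, ]

# One density curve of wage per education level, overlaid on the same axes
# qplot shorthand: qplot(wage, colour = education, data = training, geom = "density")
p_dens <- ggplot(training, aes(x = wage, colour = education)) +
  geom_density()

print(p_dens)
```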

One thing to keep in mind is to make your plots only in the training data. The test set, again, can't be used for exploration. That would be similar to training your model on the test set, which, as we talked about previously, will lead to overfitting.

One thing you should be looking for in these plots is imbalance in the outcomes or the predictors. If you see that a predictor tends to take one value in one outcome group and not in another outcome group, that's a good predictor. But if you see that you only have three of one outcome and 150 of the other, it's going to be very hard to build an accurate classifier between those two classes. You're also looking for outliers or weird groups in the data that might suggest there are variables you're missing, and for groups of points that are not explained by any of the predictors. Finally, look for skewed variables, which you're going to want to transform so they look more nicely normally distributed if you're using things like regression models, although that may not matter as much if you're using more general machine learning methods.

For more information on plotting in general, you can look at the ggplot2 tutorial. You could also take the Exploratory Data Analysis class in this data science specialization, or you can look at the caret visualization tutorial, which gives you a little bit more information about prediction-specific plots that might be useful.
