0:00

This lecture is about covariate creation.

Covariates are sometimes called predictors and sometimes called features.

They're the variables that you will actually

include in your model and

combine to predict whatever outcome you care about.

There are two levels of covariate creation, or feature creation.

The first level is taking the raw data that you

have and turning it into a predictor that you can use.

So the raw data often takes the form of an image, or a text file, or a website.

That kind of information is very hard to build a predictive model around when you

haven't summarized the information in some useful

way into either a quantitative or qualitative variable.

0:40

So what we want to do is take that raw

data and turn it into features or covariates which are variables

that describe the data as much as possible while giving

some compression and making it easier to fit standard machine-learning algorithms.

So the idea here is, suppose you have an email,

like the example email here on the left.

And so it's very hard to plug the email itself into

a prediction function, because most prediction functions are based on the idea

of taking a small number of variables and building a quantitative model

around them, and that doesn't work for free text, for example.

1:13

So the first thing that you need to do is create some

features, and those features are just variables that describe the raw data.

So in this case, in the case of an email, we

might think of different ways that we could describe this email.

For example, you might calculate the average number of

capitals that are in the email; in this case 100%

of the letters in the email are capital letters.

You might also ask how frequently a particular word appears.

So for example, you might ask, how often does "you" appear?

And "you" appears twice in this email, so we record a value of two for this email.

That's a feature.

You might also calculate the number of dollar signs.

This might be a really good predictor of whether an email is spam or not.

And so here you can see there are a large number of dollar signs,

there are eight of them, so we calculated another feature of that data set.
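
As a rough illustration of this first level, here is a minimal Python sketch that turns a raw email string into the three features just described. The function name `email_features` and the example text are hypothetical (the text is not the email shown on the slide), and the lecture itself works in R.

```python
import re


def email_features(text: str) -> dict:
    """Summarize a raw email as a few numeric covariates."""
    letters = [ch for ch in text if ch.isalpha()]
    capitals = [ch for ch in letters if ch.isupper()]
    # Fraction of letters that are capitals (0.0 if there are no letters).
    capital_fraction = len(capitals) / len(letters) if letters else 0.0
    # Whole-word, case-insensitive count of the word "you".
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "capital_fraction": capital_fraction,
        "you_count": words.count("you"),
        "dollar_count": text.count("$"),
    }


# Made-up email text, used only to exercise the function.
example = "SEND $$$ NOW! YOU WIN $$$$$, YOU REALLY DO"
features = email_features(example)
```

Each key in the returned dictionary is one covariate, so a whole mailbox would become one row of numbers per email.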

So this step, raw data to covariate, usually involves a

lot of thinking about the structure of the data that you have

and what is the right way to extract the most useful

information in the fewest number of

variables that capture everything that you want.

The next stage is transforming tidy covariates.

In other words, we calculated this number, say capital

average, the average number of capitals in the data set.

But it might not be the average number that's

related very well to the outcome that we care about,

it might be the average number of capitals squared or

cubed, or it might be some other function of that.

And so the next stage is transforming

the variables into sort of more useful variables.

So for example, if we load the kernlab package

and the spam data set, we can take the

capital average, which is basically this variable

right here, the fraction of letters that are capitals.

And we could square that number, and assign it to a new

variable, capital average squared, that might

be useful later in our prediction algorithm.
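
A minimal Python sketch of this second level, assuming the tidy covariates are already stored as rows. The rows, the column names `capital_ave` and `capital_ave_sq`, and the helper `add_squared` are all made up for illustration; they are not taken from the spam data set.

```python
# Hypothetical tidy rows; the values are invented for the example.
emails = [
    {"capital_ave": 1.5, "is_spam": False},
    {"capital_ave": 4.0, "is_spam": True},
    {"capital_ave": 9.25, "is_spam": True},
]


def add_squared(rows, source, target):
    """Return copies of the rows with target = source squared added."""
    return [{**row, target: row[source] ** 2} for row in rows]


emails = add_squared(emails, "capital_ave", "capital_ave_sq")
```

The original column is kept, so a later model-building step can decide whether the raw value, the squared value, or both are useful.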

So those are the two steps in creating covariates.

So the first step, raw data to

covariate, really depends heavily on the application.

So like I showed you on the previous slide, in an email case, it

might be extracting the fraction of times a word appears or something like that.

In the case of voice, it might be knowing something about

the frequency or timbre ranges in which voices typically fall.

In the case of images, it might be identifying features of the images.

So if it's faces, where the noses or the ears or the eyes are.

And it will depend greatly on what your application is.

And the balancing act here is definitely summarization versus information loss.

In other words, the best features

are features that capture only the relevant information in,

say, the image or the email, and throw out

all the information that's not really useful at all.

And so the idea is that you have to think very carefully about how to

pick the right features that explain most of what's happening in your raw data.

3:55

So some examples here, for text files, it might

be the frequency of words or frequency of phrases.

There's this cool site, Google ngrams, which tells you about the

frequency of different phrases that appear in books going back in time.

For images, it might be edges and corners, blobs and ridges for example.

These are all ideas about how do you identify different structures in an image.

For websites it might be the number and

type of images, where buttons are, colors and videos.

This is a huge area of importance in web development, where it's called

A/B testing, and which is called randomized trials

in statistics. It basically means showing different

versions of a website with different

values of these different features and predicting

which one will produce more clicks or get more people to buy products.

For people, you can imagine the features

being their height, weight, hair color, etc.

Basically any summary of the raw data that you can make is a potential feature.

And often this involves quite a bit of scientific thinking and business

acumen to know what the right covariates are for a particular problem.

So the more knowledge you have of a system,

the better job you'll do at feature extraction in general.

In general it's a good idea to have a really clear understanding of why

this set of data is useful for predicting the outcome you care about.

5:12

So there's this balance between summarization and information loss, and in

general, it's better to err on the side of creating more features.

You lose less information, and you can then filter some

of those features out during your model-building process.

This can all be automated and has

been automated in various different ways, but you

generally have to use a lot of

caution when using that approach because sometimes a

particular feature will be very useful in

the training set that you created but won't

be very useful in a new set of data, so the model won't generalize well to the test set.

5:41

So the second level is taking tidy covariates, so these are features you've

already created on the data set, and then creating new covariates out of them.

Usually these are transformations or functions of the covariates

that might be useful when building a prediction model.

This can sometimes be more necessary

for methods like regression or support vector

machines, which might depend a little bit more on what the distribution of

the data are, and a little bit less for things like classification trees,

where you

don't necessarily have as much model-based prediction.

In other words, you don't depend quite so much on the data looking a particular way.

6:20

On the other hand, in general, it's a good idea to spend

some time making sure you have the right covariates in your model.

So when you create these functions or decide on these

functions, you have to do it only in the training set.

This is a common theme of machine learning.

Building features can only happen in the training

set, it can't happen in the test set.

Later, when you apply your prediction function to the test set, you

will compute that same function of the covariates so you can apply your predictor.

But the original creation or thinking about what covariates to build has

to happen only in the training set, otherwise it will lead to overfitting.

And the best approach I've found is through exploratory analysis,

so basically making plots and making tables of the data, and

trying to understand what are the patterns of variation in

your data set, and how they might relate to the outcome.

When you're using the caret package or

doing this analysis in R, the new covariates

need to be added to data frames so that they can be used in downstream prediction.

And it's important to make sure that the names of the new variables are

recognizable so that you can use the same name on your testing data set.

7:45

So one idea that's very common when building machine learning algorithms is to

turn covariates that are qualitative, or factor

variables, into what are called dummy variables.

So you probably learned a little bit about this in your

regression modeling class if you've taken

it through this data science specialization.

But the basic idea is suppose we have a variable, in this

case let's look in the training set at the variable called job class.

So that job class has two different

levels, it's either industrial, or it's information.

So one thing that we could try to do is try to plug that variable directly into

a prediction model, but the values of that

variable will actually be a set of characters.

It'll either be industrial, or it'll be information.

And it's sometimes hard for prediction algorithms to use those

qualitative variables in order to actually do the prediction.

So one thing we might want to do is turn it into a quantitative variable, and the

way that you can do that with the

caret package is with the dummyVars function.

So basically we're going to pass in a model formula where the outcome is wage,

job class is the predictor variable, and the training

set is the set where we're going to be building those dummy variables.

And then if you use the

predict function with this dummies object and a new data set,

in this case we're just going to apply it to

the training data set, you get two new variables out.

So the first is an indicator that you are

industrial, and the second is an indicator that you're information.

If the indicator that you're industrial is one, it

means that that person had an industrial job.

If it's zero, it means that that person did not have an industrial job.

So the same thing is true for information.

If it's zero, that means they did not have an information

job, and if it's one, they have an information job.

So, in this case, where there are only two different levels

of this variable, there's only industrial and information, then whenever

you're one for industrial, you're zero for information, and whenever

you're zero for industrial, you're one for information and so forth.

But if you had three levels here,

every row would have two zeros, because

those are the two classes you don't belong to,

and a one for the class that you belong to.

So this is taking these factor or

qualitative variables and turning them into quantitative variables.
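
To make the indicator idea concrete, here is a small Python sketch, not the caret code from the slide. The class `DummyEncoder` and the simplified level labels are hypothetical. The factor levels are learned from the training values only and then reused on any new data.

```python
class DummyEncoder:
    """Learn factor levels on training data, then emit 0/1 indicator columns."""

    def fit(self, values):
        # Levels are fixed here, from the training values only.
        self.levels = sorted(set(values))
        return self

    def transform(self, values):
        rows = []
        for value in values:
            if value not in self.levels:
                raise ValueError(f"level not seen in training: {value!r}")
            # One column per level: 1 for the matching level, 0 otherwise.
            rows.append([1 if value == level else 0 for level in self.levels])
        return rows


# Simplified, made-up job class labels.
train_jobclass = ["industrial", "information", "industrial"]
encoder = DummyEncoder().fit(train_jobclass)
train_dummies = encoder.transform(train_jobclass)
test_dummies = encoder.transform(["information"])
```

Raising an error on an unseen level is one possible design choice; it makes a mismatch between training and test data visible instead of silently producing a row of zeros.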

9:57

Another thing that happens is that some of

the variables basically have no variability in them.

For example, suppose you create a

feature that says, for emails, does it have any letters in it at all?

Almost every single email will have at least one

letter in it, so that variable will always be equal to true.

It's always got letters in it, so it has no

variability and it's probably not going to be a useful covariate.

So one thing that you can use is the nearZeroVar function in caret

to identify those variables that have very little

variability and will likely not be good predictors.

So you apply it to a data frame that's the training data set.

And here I'm telling it to save the metrics so

that we can see how it's calculating what the variables are.

So, for example, here we can see that it tells us the percentage of unique

values for a particular variable. In this case the variable has

about 0.33% unique values, and it's not a near zero variance

variable. But, for example, the variable

sex is basically only males, and so

it has a very low frequency ratio.

In other words, it's basically all one category, and so this ends up being

a near zero variance variable. You could use this column of the matrix

to throw out all those variables like sex and, in this case, region, which are

variables that don't really have any variability in

them and shouldn't be used in prediction algorithms.

So this is a nice way to throw

those sort of less meaningful predictors out right away.
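
The kind of check described here can be sketched in plain Python. This is an illustration of the idea, not caret's implementation: the function name `near_zero_var` is hypothetical, the cutoffs 95/5 and 10 are assumptions meant to resemble caret's usual defaults, the data are made up, and reporting an infinite frequency ratio for a constant column is a choice made only for this sketch.

```python
from collections import Counter


def near_zero_var(values, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a column whose values are dominated by a single category."""
    counts = Counter(values).most_common()
    percent_unique = 100.0 * len(counts) / len(values)
    zero_var = len(counts) == 1
    # Ratio of the most common value's count to the second most common's.
    freq_ratio = float("inf") if zero_var else counts[0][1] / counts[1][1]
    nzv = zero_var or (freq_ratio > freq_cut and percent_unique < unique_cut)
    return {
        "freq_ratio": freq_ratio,
        "percent_unique": percent_unique,
        "zero_var": zero_var,
        "nzv": nzv,
    }


# Made-up columns of 100 observations each.
sex = ["male"] * 100
region = ["mid_atlantic"] * 98 + ["other"] * 2
age = list(range(20, 70)) * 2

metrics = {
    "sex": near_zero_var(sex),
    "region": near_zero_var(region),
    "age": near_zero_var(age),
}
keep = [name for name, m in metrics.items() if not m["nzv"]]
```

The `keep` list plays the role of the filtering step: only columns that are not flagged would go on to the model.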

11:44

The other thing that you might do is this: if you

use linear regression or generalized linear regression as

your prediction algorithm, which we'll talk about in a future lecture,

the idea will be to fit basically straight lines through the data.

Sometimes, you want to be able to fit curvy lines, and one way

to do that is with basis functions. You can find

those, for example, in the splines package, where

the bs function will create a polynomial variable.

So in this case, we pass it a single variable: from the training set,

we take the age variable, and we say

we want a third degree polynomial for this variable.

So when you do that, you'll get a three-column matrix out.

So this is now three new variables.

The first column corresponds to age, the actual age values.

They're scaled for computational purposes.

12:35

The second column will correspond to something like age squared.

So, in other words, you're allowing it to

fit a quadratic relationship between age and the outcome.

And the third column will correspond to age cubed, so

you allow a cubic relationship between age and the outcome.

So if you include these covariates

in the model instead of just the

age variable when you're fitting a linear

regression, you allow for curvy model fitting.

So just to show you an example of that, here I

fit a linear model, you'll remember that from your linear modeling class.

So, the wage is the outcome.

Again the tilde tells you what we're predicting it with.

Here we pass it that bs basis, in other words, we

pass it all the predictors that we generated from the polynomial model.

So in this case, it's age, age squared and age cubed.

13:20

And then we can plot the age data versus the wage data.

So that's age on the x axis, wage on the y axis.

And you can see that there's, kind

of, a curvilinear relationship between these two variables.

And so we can plot age and the predicted values

from our linear model, including the curvy polynomial

terms, and you see you get a curve fit through

the data set as opposed to just a straight line.

So one way that you can generate new variables is

by allowing more flexibility in the way that you model specific variables.
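
Here is a self-contained Python sketch of the same idea: expand one variable into polynomial columns and fit an ordinary least-squares model on them. It uses plain powers of x and not the B-spline basis that the bs function produces, the data are made up and noise-free, and all function names are hypothetical.

```python
def poly_features(x, degree=3):
    """Columns [1, x, x^2, ..., x^degree] for each value of x."""
    return [[xi ** p for p in range(degree + 1)] for xi in x]


def solve(a, b):
    """Solve the square system a * coef = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in reversed(range(n)):
        rest = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - rest) / m[r][r]
    return coef


def fit_least_squares(design, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data that follows a known cubic exactly.
x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [1 + 2 * xi - 0.5 * xi ** 2 + 0.25 * xi ** 3 for xi in x]

design = poly_features(x)
coef = fit_least_squares(design, y)
fitted = [sum(c * f for c, f in zip(coef, row)) for row in design]
```

The model is still linear in its coefficients; the curve comes entirely from the extra covariates, which is the point being made in the lecture.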

13:53

So then on the test set, you'll have to predict those same variables.

So this is the idea that's incredibly critical

for machine learning when you create new covariates.

You have to create the covariates on the test data set

using the exact same procedure that you used on the training set.

So you can do that by saying I'm going to predict from this

variable that I created using the bs function, a new set of values.

This is the testing set age values.

So these are the values that I'm going to actually plug in to

my prediction model when I'm testing it out on the test set.

This is as opposed to creating a new set of predictors based on

just applying the bs function directly to this age variable, which would be creating

a new set of variables on the test set that isn't related to

the variables that you created on the training set and may introduce some bias.
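
The difference between the two approaches can be seen in a small Python sketch. It assumes a scaled cubic expansion in place of the spline basis; the class `ScaledCubic` and the age values are hypothetical. The centering and scaling constants are estimated once on the training ages and then reused on the test ages.

```python
from statistics import mean, stdev


class ScaledCubic:
    """Cubic expansion whose centering and scaling come from the data it was fit on."""

    def fit(self, x_train):
        self.center = mean(x_train)
        self.scale = stdev(x_train)
        return self

    def transform(self, x):
        z = [(xi - self.center) / self.scale for xi in x]
        return [[zi, zi ** 2, zi ** 3] for zi in z]


# Made-up ages.
train_age = [20.0, 30.0, 40.0, 50.0, 60.0]
test_age = [40.0, 70.0]

# Right: parameters learned on the training set, applied to the test set.
basis = ScaledCubic().fit(train_age)
test_right = basis.transform(test_age)

# Wrong: parameters re-estimated on the test set itself.
test_wrong = ScaledCubic().fit(test_age).transform(test_age)
```

With the refit version, an age of 40 gets different covariate values in the test set than it would have had in the training set, so a model trained on the training covariates is being fed inputs on a different scale.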

14:43

So a little bit about these ideas

and some further reading for you. Level 1

feature creation is basically all about science or application

specific knowledge. I've found that the best way to

find things for a specific application that I

haven't talked about here, or that you don't know

about, or that you're new to, is Googling feature extraction

for the type of data that you're trying to analyze.

Feature extraction for images.

Feature extraction for voice.

Things like that.

You can also just look up that particular data

type and find as much information as you can about it.

In particular you're looking for what are the salient

characteristics that are likely to be different between individual samples.

In general you want to err on the side of overcreation of features

because you can always filter them out

later in the machine learning algorithm process.

In some applications like images and voices,

it's often both possible and pretty much necessary

to create features that aren't necessarily just

things that you imagine out of your mind.

It's very hard to know exactly what the right components of

an image are to include as features in a model, and so there

are things like deep learning, which you may have heard of, which is

basically a way of creating features for things like images and voice.

And this is a nice tutorial I've linked to here, that

kind of explains how that feature creation process works for those things.

But in general, automatic feature creation requires an equal level of thinking to

make sure that the features being generated

by your feature creation process make sense.

16:12

Level 2 feature creation, covariates to new covariates, can often be

done with the preProcess components of the caret package.

You can create new covariates using basically any of

the functions in R, if they make sense to you.

The key is, again, making lots of plots and doing exploratory analysis

to see where the connections between the predictors and the outcome are.

You can create new covariates if you think they will improve fit.

Again, you can kind of err on the side of overcreation of features,

but sometimes features are just

nonsensical and you shouldn't create them.

16:46

Be careful about overfitting in the sense that

if you create lots of features that are

particularly good for just your training set, they

may not work well in the test set.

And so a good idea, if you overcreate lots of features, is

to do some filtering before you actually apply your machine learning algorithm.

This tutorial on preprocessing with caret is very good.

It's a good place to start for really basic preprocessing.

And if you want to fit spline models like the

ones I talked about with flexible curves, you can use

the gam method in the caret package, which allows smoothing

of multiple variables using a different smooth for every variable.