So the first thing that you need to do is create some

features, and those features are just variables that describe the raw data.

So in this case, in the case of an email, we

might think of different ways that we could describe this email.

For example, we might calculate the average number of

capitals that are in the email; in this case 100%

of the letters in the email are capital letters.

You might also ask how frequently a particular word appears.

So for example, you might ask, how often does the word "you" appear?

"You" appears twice in this email, so we calculate a value of two for this email.

That's a feature.

You might also calculate the number of dollar signs.

This might be a really good predictor of whether an email is spam or not.

And so here you can see there are a large number of dollar signs,

there are eight of them, so we've calculated another feature from that data set.
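The three features just described can be sketched in a few lines of code. The lecture works in R with the kernlab spam data, but the idea is language-independent; here is a minimal Python sketch, where the example email text and the helper names are made up for illustration:

```python
# Sketch of extracting features from a raw email.
# The example email text below is invented for illustration.

def capital_fraction(text):
    """Fraction of alphabetic characters that are capitals."""
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

def word_count(text, word):
    """How many times a word appears (rough whitespace tokenization)."""
    return text.lower().split().count(word.lower())

def dollar_count(text):
    """Number of dollar signs in the email."""
    return text.count("$")

email = "YOU WON $$$! Send $5 to claim. you qualify $$ now $$"
features = {
    "capital_frac": capital_fraction(email),  # fraction of capital letters
    "freq_you": word_count(email, "you"),     # "you" appears twice
    "num_dollar": dollar_count(email),        # eight dollar signs
}
```

Each entry in `features` is one covariate describing the raw email, ready to be fed into a prediction algorithm.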

So this step, going from the raw data to the covariates, usually involves a

lot of thinking about the structure of the data that you have

and what is the right way to extract the most useful

information in the fewest number of

variables that capture everything that you want.

The next stage is transforming tidy covariates.

In other words, we calculated this number, say capital

average, the average number of capitals in the data set.

But it might not be the raw average that's

best related to the outcome that we care about,

it might be the average number of capitals squared or

cubed, or it might be some other function of that.

And so the next stage is transforming

the variables into sort of more useful variables.

So for example, if we load the kernlab data

and the spam data set, we can take the

capital average, which is basically this variable right here, the fraction of letters that are capitals.

And we could square that number, and assign it to a new

variable, capital average squared, that might

be useful later in our prediction algorithm.
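In R, which this lesson uses, the step just described amounts to loading kernlab's spam data and squaring its capital-average column. To keep one language for the examples here, the same transformation is sketched in Python, with made-up stand-in values for that column:

```python
# Sketch of transforming a tidy covariate: squaring the capital-average
# feature so a model can pick up a nonlinear relationship.
# The values below are invented stand-ins for the kernlab spam data's
# capital-average column; in R this would be roughly
#   spam$capitalAveSq <- spam$capitalAve^2

capital_ave = [2.0, 5.0, 1.0, 10.0]  # made-up example values

# New covariate: capital average squared.
capital_ave_sq = [x ** 2 for x in capital_ave]
```

The new squared column sits alongside the original and may turn out to be the version that predicts the outcome best.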

So those are the two steps in creating covariates.

So the first step, going from the raw data to

the covariates, really depends heavily on the application.

So like I showed you on the previous slide, in an email case, it

might be extracting the fraction of times a word appears or something like that.

In the case of voice, it might be knowing something about

the frequency or the timbre at which voices typically fall.

In the case of images, it might be identifying features of the images.

So if it's faces, where are the noses or the ears or the eyes?

And it will depend greatly on what your application is.

And the balancing act here is definitely summarization versus information loss.

In other words, the best features

are features that capture only the relevant information in,

say, the image or the email, and throw out

all the information that's not really useful at all.

And so the idea is that you have to think very carefully about how to

pick the right features that explain most of what's happening in your raw data.