0:04

Next up on our course of H2O supervised

machine learning algorithms is GLM, Generalized Linear Model.

I'm sure everyone is familiar with normal linear models,

you almost certainly did it at school,

plot some points on graph paper,

and try to find the best straight line through them.

That line's angle, its slope,

is what we're going to call the coefficient.

Of course, once you move into the computer,

you can do a lot more dimensions.

The generalized linear model

just takes that linear model idea and extends it in a couple of directions.

First and most importantly,

you can specify a statistical family,

where gaussian, the normal distribution,

is the default, and gives you your typical linear model.

You can also specify a link function,

which is strongly related to the family.

We can also specify regularization.

So, alpha is the balance of L1 and L2 regularization.

If you don't know what L1 and L2 are,

I'm going to leave that as theory that you need to go study yourself.

Please do, because it's very important and it will come up again in Deep Learning.

Lambda is how much regularization we want,

and the H2O GLM implementation comes with a lambda search.

So, we can try and find the optimum value of lambda.

We're going to look at that in a later video.
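As a rough sketch of what alpha and lambda control (plain Python, not H2O code; the formula follows the usual elastic-net definition, which I believe matches H2O's parameterization):

```python
# Sketch of the elastic-net penalty a GLM adds to its loss:
# alpha balances L1 vs L2, lambda scales how much total
# regularization is applied.

def elastic_net_penalty(coefs, alpha, lam):
    l1 = sum(abs(b) for b in coefs)        # L1: sum of absolute values
    l2 = sum(b * b for b in coefs) / 2.0   # L2: half the sum of squares
    return lam * (alpha * l1 + (1 - alpha) * l2)

coefs = [0.5, -1.5, 2.0]
print(elastic_net_penalty(coefs, alpha=1.0, lam=0.1))  # pure L1 (lasso)
print(elastic_net_penalty(coefs, alpha=0.0, lam=0.1))  # pure L2 (ridge)
print(elastic_net_penalty(coefs, alpha=0.5, lam=0.1))  # a mix of both
```

Lambda search, conceptually, just refits the model over a descending grid of lambda values and keeps the best one.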

What I want to look at in this video though

is using H2O not for predictive machine learning,

but for exploratory data analysis.

So, I've got this nice data set or perhaps very unnice data set,

it's talking about deaths from lung cancer and trying to associate them with smoking.

It actually comes from a 1964 Canadian data set.

First column is their age,

you'll notice this is a factor, a category.

Second column, also a category:

whether they smoke or not; no,

just cigar and pipe,

cigarettes with cigar and pipe,

or just cigarettes.

Our data set only has 36 rows.

What we have next are counts.

This column, the third column, is the total population.

In this case, nonsmokers who are 40-44.

This, I believe, is in hundreds or in thousands.

The next column is the number of deaths from lung cancer in that year, on record.

The fifth column I added: it's basically the deaths column divided by

the population column, scaled to a 0-1,000 range.

I scaled it and made it an integer because of what I want to do with the data next.
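One plausible reading of that derived column, as a deaths-per-1,000 rate (the numbers here are made up for illustration, not taken from the real data set):

```python
# Reconstructing the added fifth column: deaths divided by population,
# scaled to a 0-1,000 range and stored as an integer.
# Illustrative numbers only, not the real Canadian data.

def death_rate_per_1000(deaths, population):
    return int(round(deaths / population * 1000))

print(death_rate_per_1000(18, 656))   # e.g. 18 deaths in a group of 656
```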

3:34

Going over to a session in R.

I've already started H2O.

I've already loaded the data set,

which you'll find here and you can find more about the data set at this Princeton URL.

So, we can see we have six,

actually seven age categories, going up to 80 or more.

Yeah, sorry, that's wrong. We have nine age categories,

four smoking or non-smoking categories, and then three numeric columns.

We are only going to be interested in a couple of those.

We're going to try and predict the proportion of lung cancer deaths,

based on the two factor columns.

Just wanted to show you this which shows,

how we could sum a column,

this would work even if you were dealing with billions of rows spread across a cluster.

H2O would just take care of it for you.
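Behind the scenes, that scaling works like a map-reduce: each node sums its own chunk of rows, then the partial sums are combined. Here's the idea in plain Python (an illustration only, not H2O's actual code):

```python
# Each "chunk" stands in for the rows held by one node of a cluster.
# Map: each node computes its own partial sum; reduce: combine them.

def distributed_sum(chunks):
    partial_sums = [sum(chunk) for chunk in chunks]  # map step, per node
    return sum(partial_sums)                         # reduce step

chunks = [[656, 359, 249], [632, 1067], [1204]]      # rows split across nodes
print(distributed_sum(chunks))
```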

I believe the population of

men over 40 in Canada,

in about 1960, was 5.6 million.

So, this column is in hundreds; it doesn't matter for our purposes though,

the numbers are big enough to be useful.

So we'll define X and Y,

and then go make a model.

I'm dealing with counts,

so that's why I've set the family to poisson.

Other than that, almost everything default.
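To give a feel for what the poisson family means (a toy sketch, not H2O's implementation): the model fits the log of the expected count, so predictions come back through exp(). With a single yes/no predictor, the maximum-likelihood fit simply reproduces each group's mean count. The counts below are invented for illustration:

```python
import math

# family="poisson" in miniature: log link, so coefficients are
# additive on the log scale. Toy counts, not the real data set.

smoker_deaths     = [12, 9, 15, 10]   # counts in the "smokes" group
non_smoker_deaths = [3, 5, 2, 4]      # counts in the "doesn't smoke" group

mean_smoker = sum(smoker_deaths) / len(smoker_deaths)
mean_non    = sum(non_smoker_deaths) / len(non_smoker_deaths)

intercept = math.log(mean_non)                 # baseline: non-smokers
coef      = math.log(mean_smoker) - intercept  # smoking effect, log scale

# Predictions go back through the inverse link, exp():
print(math.exp(intercept))         # expected count, non-smokers
print(math.exp(intercept + coef))  # expected count, smokers
```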

If you wanted to see how it would be as a predictive model,

you could use cross validation.

With such a small data set,

I recommend you use the modulo fold assignment.

It's not ideal, it would be better if your data set was sorted first,

but if you leave it at the default of random,

you're going to get folds of different size.

And if one of those is zero, you're going to get an error.

So this just avoids the error.
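Here is why modulo assignment helps on tiny data, sketched in plain Python: row i goes to fold i % nfolds, so fold sizes differ by at most one, while random assignment on 36 rows can easily give badly unbalanced folds:

```python
import random

# Modulo fold assignment: deterministic, near-equal fold sizes.
def modulo_folds(n_rows, nfolds):
    return [i % nfolds for i in range(n_rows)]

# Random fold assignment: sizes vary, and can even hit zero.
def random_folds(n_rows, nfolds, seed=1):
    rng = random.Random(seed)
    return [rng.randrange(nfolds) for _ in range(n_rows)]

def fold_sizes(folds, nfolds):
    return [folds.count(f) for f in range(nfolds)]

print(fold_sizes(modulo_folds(36, 5), 5))   # sizes differ by at most one
print(fold_sizes(random_folds(36, 5), 5))   # sizes can vary a lot
```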

But anyway, let's go look at the model we've made.

6:05

So, we could use mean absolute error or mean squared error,

our range on that column,

remind me, was 13-557 with a mean of 204,

and we're able to predict it to an accuracy of about 44 or 50.
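For reference, here is what those two error metrics measure, on toy numbers (not the model's real predictions):

```python
import math

# Mean absolute error and root mean squared error side by side,
# on invented actual/predicted pairs.

actual    = [13, 120, 204, 350, 557]
predicted = [40, 100, 250, 300, 500]

errors = [p - a for p, a in zip(actual, predicted)]
mae  = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(mae, rmse)
```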

Not too bad. But let's look at

the coefficients, which are the thing we're interested in here.

That's a lot of numbers. I've got a better idea.

7:31

So, these are our coefficients.

When we have a factor column, it creates a yes-or-no column for each level;

this is one-hot encoding.

It did the same with the models we saw before.

you can see the biggest decider for whether you're going to

die from lung cancer or not is if you're 80 or over,

with 75-79 also being a very big factor;

being young, 40 to 44, is the most negative decider.

So the top seven or eight coefficients

say the age matters more than if you smoke or not,

as to whether you're likely to die from lung cancer.

But if we wanted to prove something here,

the biggest group of people who died from

lung cancer were those who only smoked cigarettes.

The non-smokers are the biggest negative group,

and then only smoking cigar or pipe is the next negative group.

It's a bit confusing because we're seeing positive and negative muddled together here.

So the biggest indicator, as far as smoking goes, was cigarettes only,

then cigarettes with cigar or pipe, at about zero.

Then cigar or pipe, and then those who don't smoke at all.

I feel I'm starting to lose you there.

So, what we're going to do is just make another model,

but this time just looking at if they smoke or not, ignoring their age.

9:37

And the coefficients are a bit clearer,

you can see them without having to go to Flow.

The biggest coefficient is if they only smoke cigarettes.

If they smoke cigarettes and cigar or pipe, that's next.

Cigar or pipe comes next,

and cigar and pipe is almost as good for you as not smoking at all. Very interesting.

So to recap, the key point here was to be able to set the family to poisson.

Do try this again using

a gaussian or some of the other distributions, and see what results you get.

Let's just jump over and try this in Python,

again I've imported my data,

we can see some statistics on it.

This is how you do the sum of a column,

then when we want to use GLM,

we import it from h2o.estimators.glm.

I set the two columns I'm interested in,

I'm using numeric column identifiers here.

Remember Python counts from zero;

R counts from one, Python counts from zero.
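To make the off-by-one point concrete (the column names here are just for illustration, not necessarily the real ones):

```python
# R counts columns from 1, Python from 0: the same two predictor
# columns are 1 and 2 in R but 0 and 1 in Python.

columns = ["age", "smoking", "pop", "dead", "per1000"]

x_python = [0, 1]   # age and smoking, 0-based
y_python = 4        # the per-1,000 rate column

# The equivalent 1-based R indices are one higher:
x_r = [i + 1 for i in x_python]
print([columns[i] for i in x_python], columns[y_python], x_r)
```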

And then, it's just like the other examples we've seen,

creating the estimator object and then calling train.