Okay, let's continue with the airlines data set and deep learning.

We're still predicting whether the arrival is delayed or not,

so, we're still doing a binomial classification.

But, let's look at a grid to see if we can narrow down the best set of parameters.

So, I'm going to do this in R and I've started H2O, loaded the data,

split it into train, valid, and test,

set up the fields and I've excluded tail number,

as explained in the previous video,

because that was slowing us down a lot.

So, these are the 19 fields we are going to learn from.

Now, I'm just going to make a mini grid first,

to test that I have everything correct,

that I have all the correct syntax.

So, I'm just using epochs of 0.01,

and just four models.

Let's start that going.

And you can see it's nice and quick. So, what am I doing?

We're using the random discrete mode of grid search, which means,

the combination of hyperparameters can make a lot of models,

I'll calculate that in a minute.

But we're only going to choose 12 of them.
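
As a rough sketch of what random discrete search is doing here, the snippet below counts every combination in a grid and then picks 12 of them at random. Only the four input dropout values and the 1e-5 upper value for l2 come from this walkthrough; the other value lists, the seed, and the variable names are illustrative placeholders, and this is plain Python rather than the H2O grid call itself.

```python
import itertools
import random

# Hypothetical value lists. Only the input dropout values and the 1e-5
# upper value for l2 are taken from the narration; the rest are placeholders.
hyper_params = {
    "l1": [0, 1e-6, 1e-5],
    "l2": [0, 1e-6, 1e-5],
    "input_dropout_ratio": [0, 0.1, 0.2, 0.3],
    "hidden_dropout_ratios": [(0, 0), (0.2, 0.2), (0.4, 0.4), (0.6, 0.6)],
}

# Full Cartesian grid: one dict per combination of hyperparameter values.
names = list(hyper_params)
all_combos = [
    dict(zip(names, values))
    for values in itertools.product(*hyper_params.values())
]
total = len(all_combos)

# Random discrete search: sample a fixed number of distinct combinations.
rng = random.Random(123)  # arbitrary seed, so the sample is repeatable
chosen = rng.sample(all_combos, 12)

print(f"{total} possible models, building {len(chosen)} of them")
```

With these placeholder lists the full grid would be 3 x 3 x 4 x 4 combinations, which is why sampling a dozen of them is so much cheaper than building them all.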

Let's just jump over here and see what the grid has produced.

Interestingly, we get log loss ranging from 0.7 through to 2.43,

so, we've got quite a range.

Okay, let's jump back to looking at the model.

Seed is a placeholder,

which I'm going to explain later.

l1 and l2 are regularization terms.

Generally, they make models that are more robust against noise.

And there will be links in the following material,

where you can go and learn the mathematical theories behind l1 and l2.

Then the other type of regularization we're using is dropout ratios.

So, we're going to try four different values of input dropout.

So, if you remember, we had, I think,

300 input neurons. The default is an input dropout ratio of zero,

which means we use all 300.

But if we set it to 0.1,

it will randomly drop 10 percent of those.

0.2 means 20 percent,

0.3 means 30 percent.

So, if we've set it to 0.3,

and we have 300 inputs,

it will be dropping around 90 of them and only using the remaining 210 or so each time.
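
The arithmetic above can be sketched in a few lines. This assumes the usual formulation of dropout, where each input is dropped independently with the given probability, so the number dropped on any one pass varies around the expected figure. The function names are made up for this illustration and are not H2O functions.

```python
import random

def expected_dropout_split(n_inputs, ratio):
    """Expected (dropped, kept) counts for a given input dropout ratio."""
    dropped = round(n_inputs * ratio)
    return dropped, n_inputs - dropped

def sample_dropout_mask(n_inputs, ratio, rng):
    """One random mask: True means the input is kept on this pass."""
    return [rng.random() >= ratio for _ in range(n_inputs)]

for ratio in (0, 0.1, 0.2, 0.3):
    dropped, kept = expected_dropout_split(300, ratio)
    print(f"ratio {ratio}: drop about {dropped}, keep about {kept}")

# A single simulated pass; the kept count will be near 210, not exactly 210.
mask = sample_dropout_mask(300, 0.3, random.Random(0))
print("kept on this pass:", sum(mask))
```

A different mask is drawn each time, which is what stops the network from leaning on any one input.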

And then, hidden dropout ratios is the same idea,

but for the next two layers.

Now, the thinking behind dropout

is that it stops any one neuron

getting a lot of importance.

So, it forces the neurons to act together as a team and learn things jointly.

And again, this will make it more robust against noisy data.

So, those are the four things we're going to experiment with, in this first grid.

So, I'm going to make 12 models.

Most of our models were starting to overfit

at around 10 epochs, and most of them had finished building by 50 epochs.

So, I'm going to set it explicitly to 40, just to keep it moving.

I'm setting the training frame to our train data, and the validation frame to our valid data.

This is the activation function;

Rectifier is the default.

If you want to use hidden dropout ratios,

you need to change this to be RectifierWithDropout.
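
To make that rule concrete, here is a small hypothetical checker. It is not part of the H2O API; it only mirrors the constraint just described, plus an assumed rule that there is one hidden dropout ratio per hidden layer.

```python
# Hypothetical helper, not an H2O function: it restates the rule from the
# narration so that a bad combination is easy to spot before training.
def check_dropout_config(activation, hidden, hidden_dropout_ratios=None):
    """Return a list of problems found in a deep learning configuration."""
    problems = []
    if hidden_dropout_ratios is None:
        return problems  # no hidden dropout requested, nothing to check
    if not activation.endswith("WithDropout"):
        problems.append(
            "hidden dropout ratios need a ...WithDropout activation, got "
            + activation
        )
    # Assumption: one ratio is expected for each hidden layer.
    if len(hidden_dropout_ratios) != len(hidden):
        problems.append("expected one hidden dropout ratio per hidden layer")
    if any(not 0 <= r < 1 for r in hidden_dropout_ratios):
        problems.append("each hidden dropout ratio should be in [0, 1)")
    return problems

print(check_dropout_config("Rectifier", [400, 400], [0.4, 0.4]))
print(check_dropout_config("RectifierWithDropout", [400, 400], [0.4, 0.4]))
```

The first call reports the activation mismatch; the second returns an empty list.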

And, if you remember from the previous videos,

there wasn't much of a difference, but 400 neurons in two layers

was the best performing model,

so, that's what we're using here.

Okay, and I'll just give it a name,

I'm calling it DLB for no particular good reason. Let's try building that.

And that's going to take a while,

so I'll come back and show you the results in a moment.

Okay, 17 and a half minutes later,

welcome back to my sad little world of watching

a progress bar inch slowly across the screen.

Let's see how it did. Took a while to build those 12 models.

What we get when we look at the grid output,

let's move this over a bit actually, is,

we get to see what values of each parameter gave us what log loss.

So, just skimming through,

we can see the zeros,

for hidden dropout ratio, over in the middle.

We have a 0.4 near the bottom and 0.4s near the top, not very helpful.

Input dropout ratio, on the other hand, is very interesting.

Our best four models all use zero,

our worst three models

all use the highest value, 0.3.

It's really pointing to the idea that,

we want to use all of our 300 input columns.

What about the regularization parameters?

It's the same value at the top and the bottom.

As for the higher values, one times ten to the minus five

was the highest value we used for l2.

That's coming out down in the bottom half.

So, maybe a smaller value for l2,

but no real ideas for l1.

What we do notice

is that quite similar sets of parameters, say, this one,

where they only differ by l1

and are otherwise identical, give us a large difference, between 0.58 and 0.35.

This one has the same l1 and

l2 as this one,

so, we've got 0.36 against 0.551.

What I'm sensing, is there's quite a bit of noise in

this log loss. But, let's push ahead.

So, I'm going to drop 0.6 from the hidden dropout ratios,

and drop 0.2 and 0.3 from input.

I'm going to make eight more models.

It's important I keep the same grid ID, which I've given as DLB.

And as the comment says,

I've changed the seed,

so that if the random selection,

the random discrete search, does give us exactly the same set of parameters again,

we'll be able to compare and judge just how big that random noise element is.
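
One way to judge that noise, sketched in plain Python: group the grid results by their hyperparameters, ignoring the seed, and look at the log loss spread wherever the same combination was built more than once. The rows below are made-up placeholder numbers, not results from this grid, and the function name is hypothetical.

```python
from collections import defaultdict

# Made-up placeholder rows, NOT real results:
# (seed, l1, l2, input_dropout, hidden_dropout, logloss)
results = [
    (101, 1e-6, 0.0, 0.0, 0.0, 0.36),
    (202, 1e-6, 0.0, 0.0, 0.0, 0.41),
    (101, 0.0, 1e-6, 0.1, 0.4, 0.52),
    (202, 0.0, 1e-5, 0.0, 0.4, 0.47),
]

def logloss_spread_by_params(rows):
    """Max minus min log loss for each parameter set seen more than once."""
    groups = defaultdict(list)
    for seed, *params, logloss in rows:
        groups[tuple(params)].append(logloss)  # the seed is ignored on purpose
    return {
        params: max(losses) - min(losses)
        for params, losses in groups.items()
        if len(losses) > 1
    }

spread = logloss_spread_by_params(results)
for params, gap in spread.items():
    print(params, "differs by", round(gap, 3), "between seeds")
```

If the gap between seeds is as large as the gap between different parameter sets, the ranking in the grid is mostly noise.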

But yes, all I've done on the second grid,

is remove some of the choices for hyper parameters,

I haven't added anything new.

So, let's give that one a run.

I'm estimating, I'll see you in about 12 minutes.

In case you ever see this happen,

the grid is telling me it's 100 percent,

but it's still building the last model.

So, if you see 100 percent and it seems to have hung,

just be patient for a couple of minutes.

Here are the results of a previous run of one of these grids,

and the best one came out with zero hidden dropout ratio.

We've already established that 0.2 on the dropout ratios seems to be good,

with a smidgen of l1 and a smidgen of l2.

And there she goes.

So, yeah, I can't do that in my head,

but it took quite a while, again.

So, we made 12 models initially and then we made eight more models,

trying to narrow in.

You can see, of the additional eight, these were the best three.

We can tell that by the seed, or we can look at the model number,

because the model ID is the grid ID,

followed by underscore, model, and then a sequential number.
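
A small sketch of reading that number back out of a model ID. It assumes the ID has the shape `<grid_id>_model_<number>`; the helper name and the example ID are made up for illustration.

```python
def split_model_id(model_id):
    """Split an ID of the assumed form '<grid_id>_model_<number>'."""
    grid_id, sep, number = model_id.rpartition("_model_")
    if not sep or not number.isdigit():
        raise ValueError("unexpected model ID format: " + model_id)
    return grid_id, int(number)

# Hypothetical ID, used only to show the shape of the result.
grid_id, number = split_model_id("DLB_model_14")
print(grid_id, number)
```

Sorting by that number is one way to see which models came from the first batch and which from the second, without relying on the seed column.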

So, comparing this to the previous run,

we have actually got the same l1,

l2 values at the top of the grid.

Anyway, this is our best model,

if you want to extract it,

let's move this back over, okay.

We're going to run this command to extract it.

So, our grid contains the model IDs as a list,

it's already sorted, so I grab the first one.

And then I pass that to h2o.getModel.

And then I can save that model,

and it's been saved with that filename.

We can also evaluate it, I mean,

on the valid data set and then on the test data set.

16.6 percent error on test,

16.5 percent on validation.

Validation and test errors are roughly the same, which is a good sign.