0:04

In this video, I'm going to show two things.

The first, data manipulation and the second,

I'm going to introduce you to the airlines data

set that we're going to be using for the next few videos.

This is the help page for the python data manipulation functions,

and I find it by googling for H2O python docs.

I think it was the third or fourth link there.

So it's the H2O-py docs, and you want to be in H2OFrame.

I mention that because I'm not going to go exhaustively through all the functions.

I'm just going to pick out a few.

So let's start H2O,

just as we normally do.

0:58

And this is the data set.

Airlines, allyears2k_headers, a zip file.

This is why we love you, H2O.

It will take care of that and find the CSV file inside.

It's described as small data but I think it's got,

oh well, let's find out how big it is.

Give it a moment. There we go.

What have we got? Lots of

columns and it doesn't tell me how many rows.

Go ask data.nrow.

1:47

Okay, 44,000 rows.

So a couple of orders of magnitude more than our iris.

Dates, departure time, arrival time,

these are absolute numbers, not so useful.

What we're interested in, unique carrier.

I will just mention because we're going to come back to that in a moment.

Flight numbers. Elapsed time is interesting.

This is what it should have been and this is how long that flight actually took.

How long it was in the air?

How late it was arriving?

I believe this is in minutes.

So, a negative number would mean it arrived early. And how late it was departing.

Some airport names. Where it came from?

Where it went to? How far it was flying, that kind of stuff.

And these are if we do a binomial classification,

this is most likely what we will be trying to learn.

Was it late? Is the arrival delayed? Yes or no.

If we do a regression,

chances are we will be trying to predict either the arrival delay or departure delay.

Come down a bit more.

First useful function.

H2O generally does a good job of detecting your data column types.

For instance, this one, full of yes and nos,

it detected it was an enum,

also called a factor, also called a categorical.

A lot of the other columns,

it's detected as integer.

3:50

There are no floating point numbers in this particular database, they're all integers.

Unique carrier is detected as enum.

So I am commenting these out because we don't actually have to do anything,

but if it gets it wrong and you want to convert

a numeric column to a factor, you use this command.

And if it's the other way, which is less likely,

but if it's made a column a factor,

when it should have been numeric, you use this function.

Let's just hop over to R and take a look,

get the summary, a different layout to the way Python shows it.

Sometimes I find the Python way easier to understand, sometimes the other way.

4:51

When you're converting data with the R API,

it's almost the same.

You're going to need this comma there and the comma there.

And the as.factor is a function.

It's a global function, I should say.

And the data column you want to convert is the argument to it.

Sticking with R, we'll just run this line,

trying to get the mean of the airtime column.

I already know it from the summary, of course.

It is, drum roll please, airtime.

There we go, 114.3.

But there are 16,649 NA's, missing data.

That's why we got not a number.

To get around that, we say we want to ignore the NA's.

This is the mean of the remaining,

whatever it is, 35,000 rows.
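
The effect of missing values on a mean can be sketched in plain Python. This is only an illustration of the idea, not the H2O API: the `air_time` values are made up, and the `mean` helper with its `na_rm` flag is a hypothetical stand-in for the "ignore the NAs" option used in the video.

```python
import math

# Hypothetical air-time values in minutes; NaN marks a missing value.
air_time = [110.0, float("nan"), 95.0, 140.0, float("nan")]


def mean(values, na_rm=False):
    """Arithmetic mean; with na_rm=True, missing values are dropped first."""
    if na_rm:
        values = [v for v in values if not math.isnan(v)]
    return sum(values) / len(values)


naive_mean = mean(air_time)              # NaN: a single missing value poisons the sum
clean_mean = mean(air_time, na_rm=True)  # mean of the three values that are present
print(naive_mean, clean_mean)
```

Note that dropping the missing values also changes the denominator: the second result is an average over three rows, not five.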

We can use the function mean,

it's a synonym for h2o.mean.

So, this function does exactly the same.

We've got the range function.

I should say these calculations are happening on your H2O server in the cluster.

If your data is really big,

it's not being downloaded to the R client and the calculations done there.

They're all happening remotely.

Let's just jump back to Python and see those commands.

More object oriented, so we select our column and run the mean function on it.

And I believe that's identical.

I couldn't find the range function,

but you can use summary and get the min and the max.

7:12

We can see most flights are a little bit late,

with a very long tail.

They wouldn't have drawn the histogram over here if there wasn't something here.

We'll jump back and see that in R. There we go.

Oh, this is a different field, airtime.

Here we've got two bumps telling us most flights are short, some are long flights.

7:55

We can do more than one column at a time.

So, arrival delay and departure delay.

Ignore this error message, that's just RStudio complaining about the plot.

So the arrival delay was 9.3,

the mean of it, and the average departure was 10 minutes late.

8:30

We can also do some logical questions.

This creates a logical vector,

the same length, the same number of rows of data.

It will be a one,

if the flight was delayed more than six hours, 360 minutes.

It will be a zero if less.

And then we ask any,

are any of the flights,

were any of the flights delayed more than six hours? A one means yes.

Or, rephrasing: were all the flights delayed no more than eight hours?

And we get false. Read the comment.

The problem is we have NA's in this.

If we get rid of the NA's, we get true.

None of the flights were delayed more than eight hours.
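
The any and all questions can be sketched in plain Python. This is not the H2O API: the delays are made up (apart from the 475-minute maximum mentioned in the video), the helper names are hypothetical, and the sketch assumes that a missing value simply counts as "not true", which is consistent with the false result seen here.

```python
# Hypothetical arrival delays in minutes; None marks a missing value (NA).
arr_delay = [23, 14, None, 475, -5]


def no_more_than(delay, limit):
    """True/False for a known delay, None when the delay is missing."""
    return None if delay is None else delay <= limit


def all_true(flags, na_rm=False):
    """Like all(), but a missing flag counts as 'not true' unless na_rm drops it."""
    if na_rm:
        flags = [f for f in flags if f is not None]
    return all(f is True for f in flags)


# One flag per row: was this flight delayed no more than eight hours (480 minutes)?
flags = [no_more_than(d, 480) for d in arr_delay]

with_na = all_true(flags)                 # the missing row makes this False
without_na = all_true(flags, na_rm=True)  # True once the missing row is dropped
any_over_six_hours = any(d is not None and d > 360 for d in arr_delay)
print(with_na, without_na, any_over_six_hours)
```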

9:37

Always bear in mind your NA's because this can give you the wrong result.

I was nearly tricked, but I remember seeing 475 arrival delay,

the maximum was 475.

So I knew I was expecting a true.

Well, let's just see what this does.

Cumulative sum, it's adding the numbers up.

The first row was 23 minutes late,

the next one must have been 14,

23 plus 14, 37,

and so on. Let's keep this moving.
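
The running total can be reproduced in plain Python with `itertools.accumulate`. This is a sketch of the arithmetic only, not the H2O call: the first two delays are the ones read out in the video, and the third is made up to show one more step.

```python
from itertools import accumulate

# 23 and 14 come from the video; 29 is a hypothetical third row.
arr_delay = [23, 14, 29]

# Each entry is the sum of all delays up to and including that row.
running_total = list(accumulate(arr_delay))
print(running_total)  # [23, 37, 66]
```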

You'll find this file somewhere.

Come and play with it yourself afterwards.

This is how to do a correlation between two columns.

10:37

And again, we need to get rid of the NA's. Arrival delay,

departure delay, highly correlated.
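
Why the missing values have to go before correlating can be sketched in plain Python. This is an illustration under assumptions, not the H2O implementation: the delay values are made up, the helper names are hypothetical, and the sketch drops every row where either column is missing before computing a Pearson correlation.

```python
import math

# Hypothetical delays in minutes; None marks a missing value (NA).
arr_delay = [23, 14, None, 29, 5]
dep_delay = [20, 10, 3, None, 2]


def complete_pairs(xs, ys):
    """Keep only the rows where both columns have a value."""
    return [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


pairs = complete_pairs(arr_delay, dep_delay)
r = pearson(pairs)
print(len(pairs), round(r, 3))
```

With these made-up numbers the two columns move together almost perfectly, so the coefficient comes out close to 1.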

Come back to Python. That's how you do the same example there.

And this time, I'm doing three columns,

rather than specifying two arguments,

I'm specifying three columns of the same data frame.

It gives me this nice table.

11:11

We can see arrival delay, departure delay,

correlated but there's a very low positive correlation with the length of a flight.

Okay. I think that's enough. Study the manuals,

get familiar with the functions

or just go hunting for a function when you find you need something.