0:10

Hello. This lesson is going to

introduce some more advanced features in the Pandas module.

Specifically, we're going to look at how to select rows based on features and columns.

We do that by masking.

We're going to look at how to take

different columns from a data frame or even a single column,

and separate the analysis of the rows in the data frame based on those groups.

We're also going to quickly look at how do you combine or

stack data frames together either by extending,

or adding new columns to an existing data frame,

or by adding new rows to an existing data frame.

And lastly, we're going to look at a powerful concept known as "pivot

tables" which are prevalent in the spreadsheet world.

We're going to look at how to do those within Pandas.

This is all contained in the advanced Pandas notebook.

The start of this notebook repeats

some previous work in the introduction to Panda's notebook.

In this case, we're going to start by grabbing data from this website,

and we're going to specify an index column out of the data set itself,

it is the airport code.

We can then take a look at this data frame.

And as you can see, our index column is now the airport code.

And this data set has several columns.

And we can select based on those.

We can also look at what the data types for each column are.

The first four are objects.

They're holding string values which are variable length.

The last two are actual numerical columns.

We can also provide a summary of the numerical information in the data frame.

This data frame was created with a labeled row index,

so we can actually slice the data frame

by selecting the rows corresponding to those labels.

And, of course, we can use implicit labels,

and we can then select columns and rows by using

either explicit or implicit row and column indices.

Now, the first thing we want to do is talk about selecting rows based on column values.

And the way we do, this is with masking.

This is exactly the same way we did it with NumPy

when we were selecting rows based on values of columns.

And we call it a "mask, " because we're saying for

all rows where this case the state is equal to DE,

which is the abbreviation for the "State of Delaware,

" select those rows.

And we could see that when we execute this code, that's what we get.

And we can build more complex mask's.

Here we're saying if the latitude column is greater than 48 and

the longitude is less than minus 170, select the rows.

There are other functions,

one of them is very useful called the "Sample," which just randomly selects and rows.

And we can sort a data frame based on the index or by a particular column.

And these cells show how to do that.

Now, the next thing that I want to look at is actually applying

computations to a data frame based on the value in a particular column.

This is called "Group By."

Because what we're going to do is take a data frame and effectively divide it

into two or more chunks based on the value of a column.

To demonstrate this, we're going to use the tips data set again.

And the first thing we're going to do is let's group by time.

If you recall, time can take two values, dinner and lunch.

So, we create a group by object simply by specifying the column that we want to use,

and then we operate on this new data frame group by

object just the same way we did as if it was a data frame.

So, when we take the first two rows we get the first two rows for each of the groups.

If we take the last three rows,

we get the last three rows for each of the two groups.

We can also operate on that.

We can take statistical measurements such as the median or the standard deviation.

And as you can see, we get the summary representation for both of them.

Now, notice that it's actually applying it to all the columns,

all of the numerical columns not just one.

That's because we have applied it to the entire data frame group by object.

We can also apply a list of functions in this manner where we specify

a list of the function names and the aggregate

applies these two together to the data frame.

So, we have total bill, the mean standard,

and then for the tip,

the mean and standard and then for the size, the mean and standard.

We can also just call the described method,

and of course, it breaks it down,

the statistics by either the lunch or the dinner.

So, that's a powerful way to analyze a data set and see

the differences between the rows based on the values of the column.

And so, there's some other things that you can do

that are talked about in that particular section.

The next one is stacking,

where we might want to extend the data frame in

the column direction or in the road direction.

And this section talks about how to do that.

That's somewhat straightforward.

Here, what we do is we take our tips data set and we split it into two chunks,

what we call t r one and t r two,

and all we've done is said the first 200 rows go into the

first one and the remaining rows go into the second one.

And we could see that here with these two tail and heads.

And then we can call concat to do a vertical stacking,

where we're saying "take

these two data frames which are identical in terms of columns and stack them,

" and you can see that result.

We can do the same thing in the horizontal section.

In this case, we split based on the column into two new column data sets.

And we can see how they look and then we can concat those.

The only difference is we specify this axis equals one to specify.

Now, we're going to extend it in the column direction not the row.

X is equal zero would be row.

The last thing that I want to talk about is pivot tables.

They're a popular technique for generating summary information.

The simple thing that you need to do here to create

a pivot table is to call the pivot table method.

The first thing is the column that you're going to be summarizing.

The next one is index,

this is what becomes the index of your pivot table,

and then what are the columns you're going to include.

So, here you can see that our columns are

the four different types of days that are in our data set.

And the rows are the index column.

And it only takes two values male or female, because it's gender.

And we get a summary representation in here.

We can also reorder those,

so that gender is across these columns,

and day is now rows.

We can also do aggregate information,

so we can have the same data as before.

But now, we have what are called "margins" where we have the value

for the entire day of the week not broken out by gender,

the average of all of them, or again,

what's the average of all the males not broken out by day of the week.

The pivot table can include multiple columns.

So, we can do this on either the index where we now have Thursday.

And Thursday is now broken out by is the person a smoker or not.

We can also do it in the column by doing this similar thing,

but now we specify multiple columns,

such that we have the gender and then we have is a person a smoker or not.

We can also apply aggregate functions.

So, here we're going to take the total bill column,

and we're going to say we want the max, median,

and mean for the total bill displayed and you can see that's what we have.

We have the total bill column where the max median and mean.

We also then specify the gender and then the rows are day,

because that's what the index was.

You see that there's a very powerful thing going on here where we've built

this summary information or visualization,

if you will, of our full Pandas data frame.

Is the data changed?

The information in the pivot table will also change and provide that summary information.

Hopefully, you've seen the power of

this data frame and of the more advanced features in Pandas.

It really is something you want to learn and be able to apply,

because it can simplify a lot of task that you might want to be doing.

If you have any questions,

let us know in the course forums. And good luck.