0:10

Hello. In this lesson,

we're going to look at using the NumPy module

within Python to explore higher dimensional data.

This will also include

the analytical quantification of correlations between different dimensions of a dataset.

So a good way to think about this is,

if you have a DataFrame and you want to understand whether two columns are correlated,

one technique is of course,

to make a scatter plot and look at it visually.

A second technique is to actually compute correlation measures analytically.

Now, this notebook is very similar to a previous notebook where we introduced NumPy;

however, now we will be looking at NumPy in a multi-dimensional setting.

Most of the time, we will be focusing on NumPy

two-dimensional arrays, which we often think of as matrices.

So, all of this will be contained in the advanced NumPy notebook.

As I said before, this is going to focus on multi-dimensional arrays.

And in general, we're going to stay focused on a two-dimensional array.

But whatever we do,

will be easily extendable to higher dimensions.

One of the standard ways to create

a multi-dimensional or two-dimensional array is to start with

a one-dimensional array and then to reshape

it as we need for whatever analysis we're doing.

So here's an example. We first create a one-dimensional, 100-element array and then,

we reshape it into a 10 by 10 array.

This also has 100 elements.

And what we do is, we take the first 10 elements and make that the first row, then

the second 10 elements and make that the second row, etc.,

until we reach the end of the array.

This is demonstrated in this example.

You can see we've printed out the 100-element array and then,

when we turn it into a two-dimensional array and we print that out,

you can see how this is nicely organized.
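As a rough sketch of what that cell might look like (the variable names here are my own, not necessarily the notebook's):

```python
import numpy as np

# Create a one-dimensional, 100-element array
data = np.arange(100)

# Reshape into a 10 x 10 two-dimensional array: the first 10
# elements become the first row, the next 10 the second row, etc.
mat = data.reshape(10, 10)

print(mat.shape)   # (10, 10)
print(mat[1])      # second row: [10 11 12 13 14 15 16 17 18 19]
```

Note that `reshape` requires the total number of elements to stay the same; a mismatched shape raises an error.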

There are also convenience functions for creating special matrices.

So if you want to create an identity matrix where

the diagonal elements are all one and the off diagonal elements are zero,

we do that with np.eye.

The reason it is called eye

is that the identity matrix is typically represented by the capital letter I.

We can also use NumPy methods to create

diagonal matrices with different diagonal elements,

and we can also create matrices with values placed on the off-diagonals.
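For instance, something along these lines (the exact values are just for illustration):

```python
import numpy as np

# 3 x 3 identity matrix: ones on the diagonal, zeros off it
i = np.eye(3)

# Diagonal matrix with chosen diagonal elements
d = np.diag([1, 2, 3])

# Values placed on an off-diagonal: k=1 is one above the main diagonal
off = np.diag([4, 5], k=1)

print(i)
print(d)
print(off)
```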

Now, one of the more challenging things when dealing with multi-dimensional arrays,

is when you're trying to index or slice them.

And the important thing is

that we specify

the first dimension, followed by a comma, and then the second dimension.

If there are additional dimensions,

we'll just be using a comma again.

So you can literally keep doing this until you run out of dimensions.

If only one index is specified,

by default it refers to the first dimension.

So we demonstrate this in the following code cell where we

first build a two-dimensional array.

That's three by three, and we print it out;

here you can see it: zero, one, two, three, four, five, six, seven, eight.

And then, we start slicing in the first dimension,

in the second dimension, and in both dimensions.

So you can see we slice out the first row,

then we can also slice out the first column.

And how are we doing this?

Here, we just grab the first index.

That's of course, going to refer to that first row.

Here, we put the colon, which says

select all rows, but now we're only selecting the second column.

That gives us just the second column.

And of course, we can slice out individual elements in different ways.
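Concretely, for a three by three array like the one described above, the slices might look like this (a sketch, not necessarily the notebook's exact cell):

```python
import numpy as np

a = np.arange(9).reshape(3, 3)   # [[0 1 2], [3 4 5], [6 7 8]]

print(a[0])       # first row: [0 1 2]
print(a[:, 1])    # all rows, second column: [1 4 7]
print(a[1, 2])    # a single element: 5
```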

We can also do more complex slicing.

In this case, we have a three by three by three array,

so it's a three-dimensional array.

And this example shows how to slice that.
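A minimal sketch of three-dimensional slicing, with each index position separated by a comma as before:

```python
import numpy as np

b = np.arange(27).reshape(3, 3, 3)   # three-dimensional array

print(b[0])         # the first 3 x 3 "slab"
print(b[:, 0, :])   # the first row taken from each slab
print(b[1, 2, 0])   # a single element: 15
```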

We also can do standard things that we've typically done in NumPy, including

boolean masking, where we might want to grab out

certain elements, say where the element is greater than four,

or it's evenly divisible by two.

This shows how we do that.

We can also perform arithmetic on these masked selections.
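A sketch of those two conditions in code (using the same illustrative three by three array as before):

```python
import numpy as np

a = np.arange(9).reshape(3, 3)

# Boolean masks select the elements satisfying a condition
print(a[a > 4])        # [5 6 7 8]
print(a[a % 2 == 0])   # [0 2 4 6 8]

# Arithmetic applies to just the selected elements
print(a[a > 4] * 10)   # [50 60 70 80]
```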

We can also perform the basic operations that we've done before,

where we're adding to each element,

or multiplying by each element.

And of course, we also can apply the summary functions that we've seen before: mean,

median, variance, and standard deviation, as well as

other universal functions like sine, cosine, etc.

This makes it very easy to perform

complex analysis on these two or higher dimensional arrays.
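For example, a quick sketch of element-wise arithmetic, summary statistics, and a universal function on a small array:

```python
import numpy as np

a = np.arange(9).reshape(3, 3)

# Element-wise arithmetic
print(a + 10)           # adds 10 to every element
print(a * 2)            # multiplies every element by 2

# Summary functions, over the whole array or along an axis
print(a.mean())         # 4.0
print(a.sum(axis=0))    # column sums: [ 9 12 15]

# Universal functions apply element-wise
print(np.sin(a))
```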

The next thing that we're going to look at is masked arrays.

We saw this in the one-dimensional case, where we

can indicate that an array should be masked such that

operations that might trigger an error condition will instead

simply produce a not-a-number value in our output array.
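A minimal sketch with NumPy's `np.ma` module (the sample values are my own, chosen so that one entry would be invalid for a square root):

```python
import numpy as np

a = np.array([1.0, -1.0, 4.0, 9.0])

# Mask the negative entry so the square root doesn't hit
# an invalid-value condition
m = np.ma.masked_less(a, 0.0)

result = np.sqrt(m)
print(result)   # the masked entry is displayed as --
```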

Now, the last thing I want to mention is actual correlation measurements.

So in a scatter plot notebook,

we saw visually how to interpret relationships between data

but we can also make an analytic quantification of this relationship.

The two main techniques for doing this are

the Pearson correlation coefficient and the Spearman correlation coefficient.

And this notebook shows you how to do that.

First, this is a figure taken from Wikipedia that shows

the Pearson correlation coefficient which is often

written with lowercase r for two different datasets.

You can see that these straight, linear relationships,

whatever their slope, are perfectly correlated: when the correlation is positive,

they have a value of r equal to one,

and when it's a perfectly negative correlation,

you can see it's minus one.

And as the relationship weakens,

the value moves from one down toward zero, and then,

as a negative relationship strengthens, it moves toward negative one.

And then of course, if there's just no real relationship at all,

the correlation coefficient is zero.

The Spearman correlation coefficient is similar.

However, what it measures is whether, as x increases,

y consistently moves in the same direction.

So it implies a monotonic relationship, as opposed to the strictly linear relationship of the Pearson correlation.

We typically write this as a rho value (the Greek letter ρ),

and we will use it in other tests, particularly hypothesis testing.
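A sketch of how those two coefficients might be computed with scipy.stats (the data here is synthetic, generated just for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=100)

# A noisy but monotonic (cubic) relationship between x and y
y = x**3 + rng.normal(scale=0.5, size=100)

r, _ = stats.pearsonr(x, y)       # linear correlation
rho, _ = stats.spearmanr(x, y)    # rank (monotonic) correlation

# Spearman's rho will typically exceed Pearson's r here,
# since the relationship is monotonic but not linear
print(r, rho)
```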

Now, one of the other things that goes along with correlations

is covariance, where we measure

the relationship between two datasets: when one variable is increasing,

what happens to the other variable?

So we've measured mean values before,

and we've measured variances before,

but when we have two-dimensional datasets,

we now have to measure the relationship between those two dimensions.

So how does one variable affect the other?

And we represent that with the covariance.

We can easily calculate that with the scipy.stats module.

We can calculate both the Pearson and

the Spearman coefficients, as well as the covariance matrix, from within NumPy.

And that's what this last code cell here does.
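As a sketch of the NumPy side of that computation (again with synthetic data, generated just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.1, size=100)

# Covariance matrix: variances on the diagonal,
# covariances off the diagonal
cov = np.cov(x, y)
print(cov)

# Correlation matrix: normalized covariance, values in [-1, 1]
corr = np.corrcoef(x, y)
print(corr)
```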

Hopefully, this has given you an introduction to two-dimensional,

or higher-dimensional, arrays.

They are a little more complicated than one-dimensional arrays,

but they enable much more detailed analysis of datasets,

and provide you a richer toolset with which to attack your data problems.

If you have any questions with this material,

please let us know in the course forum. And good luck.