0:03

In this case study, I'm going to talk more about exploratory data analysis techniques and how to use them on a data set that involves using smartphones to predict human activities.

Remember, with any exploratory data analysis, you have to have a sense of what you're looking for and what the key priorities are that you want to get out of your data set. That will help guide what you look at and how you approach it. Remember too that the basic idea of exploratory data analysis is to produce a rough cut of the kind of analysis that you ultimately want to do. It isn't going to be perfect, and it's not going to have all the right bells and whistles. But it's going to give you a rough idea of what kind of information you'll be able to extract from your data set, what kinds of questions you'll feasibly be able to answer, and what questions might not really be possible to answer with the given data set.

So exploratory data analysis is really important because it rules out certain questions and pushes you along other directions. It gives you that rough-cut analysis that can take you to the next step. So let's take a look at the Samsung data set in this example and see what we can find.

1:19

The data set here comes from the University of California, Irvine (UCI) machine learning archive, and it's based on predicting people's movements from Samsung Galaxy phones. Here's a picture of the Samsung Galaxy S3. The actual data set was produced using the Galaxy S2, but the idea is basically the same. In each of these phones there's an accelerometer and a gyroscope, which help you understand the three-dimensional position and acceleration of a person, assuming that they are holding their phone.

1:57

So this is where the data set comes from: the UCI Machine Learning Repository. You can go to the link to learn a little bit more about the data set, how it was collected, and what is available on the website. I've downloaded a subset of the data, which is just the training data set, for the purposes of this lecture.

2:18

The data have been processed a little bit to make them easier to use. Basically you get a matrix that has the observations on the rows and the various features on the columns. You can see at the bottom here that I've got the activity label, which tells you, for each row, what the person was doing at that time. There are six possible activities: laying, sitting, standing, walking, walking down, and walking up.

2:49

The idea is that you want to be able to separate out these six activities based on the many features that are collected by the accelerometer and the gyroscope. I've listed the first 12 features here. You can see that they are body acceleration features: the mean, the standard deviation, the mean absolute deviation, and the maximum.

3:15

So one thing we can do really quickly is just look at the average acceleration for the first subject. The first thing I'm going to do is convert the activity variable into a factor variable using the transform function. Then I'm going to subset out the first subject, so subject equals one, and for the rest of this presentation I'm going to ignore the other subjects.

If I plot the data for the first subject, I can look at the first column, which is the mean body acceleration in the x direction. The acceleration is divided into three dimensions: x, y, and z.
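
The two preparation steps just described can be sketched roughly as follows. This is only an illustration on made-up data: the data frame `toy`, the column name `accMeanX`, and the exact activity labels are assumptions, not the real Samsung data.

```r
# Toy stand-in for the lecture's data frame (synthetic, NOT the Samsung data).
set.seed(1)
actLabels <- c("laying", "sitting", "standing",
               "walking", "walking_down", "walking_up")
toy <- data.frame(
  accMeanX = rnorm(120),                  # hypothetical feature column
  subject  = rep(1:2, each = 60),         # two made-up subjects
  activity = rep(actLabels, times = 20),  # plain character labels for now
  stringsAsFactors = FALSE
)

# Step 1: turn the activity labels into a factor with transform().
toy <- transform(toy, activity = factor(activity))

# Step 2: keep only the rows for the first subject.
sub1 <- subset(toy, subject == 1)

print(table(sub1$activity))  # counts per activity for subject 1
```

Working with a factor matters later because the integer codes behind the factor levels can be reused as plotting colors.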

3:52

The second plot here is the mean body acceleration in the y direction, and I've color-coded each of the activities. You can see, for example, on the left-hand plot that there's green, red, black, blue, and some other activities. Part of the problem with the left-hand plot is that you can't tell which activity is which, so on the right-hand plot I added a legend, using the legend function, so you can figure out which activities correspond to which color. You can see that green is standing, red is sitting, black is laying, and so on.

You can also see that the mean body acceleration is relatively uninteresting for things like standing, sitting, and laying. But for things like walking, walking down, and walking up, there's much more variability in the mean body acceleration for the x direction.
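
A minimal sketch of a color-coded plot with a legend, again on synthetic data. The feature `accMeanX` and its values are invented so that the moving activities are noisier than the still ones; the numbers say nothing about the real data set.

```r
set.seed(2)
acts <- factor(rep(c("laying", "sitting", "standing",
                     "walking", "walking_down", "walking_up"), each = 30))
moving <- grepl("^walking", as.character(acts))

# Hypothetical feature: nearly constant when still, noisy when moving.
accMeanX <- rnorm(length(acts), mean = 0.28,
                  sd = ifelse(moving, 0.08, 0.005))

pdf(NULL)  # null graphics device, so the sketch runs without a display
plot(accMeanX, col = as.integer(acts), ylab = "accMeanX (synthetic)")
# The legend maps each factor level to the color code used in the plot.
legend("bottomright", legend = levels(acts),
       col = seq_along(levels(acts)), pch = 1)
invisible(dev.off())

# Numeric summary of what the plot shows: spread of the feature per activity.
spread <- tapply(accMeanX, acts, sd)
print(round(spread, 3))
```

The per-activity standard deviations give the same message as the picture: the still activities are flat and the walking activities vary much more.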

4:50

We can try to cluster the data based just on the average acceleration. I've taken the first three columns of this matrix and calculated a distance matrix using the dist function; I'm using Euclidean distance, which is the default. Then I call the hclust function to do a hierarchical clustering of these data, and I call my plclust function just to visualize it. You can see that the clustering is a little bit messy and there isn't any clear pattern going on. All the colors are jumbled together at the bottom, so we might need to look a little bit further to try to extract more information here.
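
Here is a rough sketch of the distance-matrix and hierarchical-clustering steps. The three columns are random noise standing in for the mean-acceleration columns, and because the lecture's own plotting helper isn't shown, the base `plot` method for `hclust` objects is used instead; both are assumptions for illustration.

```r
set.seed(3)
n <- 60
# Synthetic stand-in for the three mean-acceleration columns (x, y, z).
meanAcc <- matrix(rnorm(n * 3, sd = 0.05), ncol = 3,
                  dimnames = list(NULL, c("meanX", "meanY", "meanZ")))

distanceMatrix <- dist(meanAcc)           # Euclidean distance by default
hclustering    <- hclust(distanceMatrix)  # complete linkage by default

pdf(NULL)  # null graphics device, so the sketch runs without a display
plot(hclustering, labels = FALSE)         # draw the dendrogram
invisible(dev.off())

print(hclustering)
```

With features that carry no group structure, as in this toy matrix, the dendrogram has no clean splits, which is the kind of messy picture described above.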

5:26

Another thing we can look at is the maximum acceleration for the first subject. Here I'm plotting columns 10 and 11. Column 10 is the maximum body acceleration in the x direction, and column 11 is the maximum body acceleration in the y direction. You can see that, again, for things like laying, standing, and sitting there's not a lot of interesting things going on, but for walking, walking up, and walking down the maximum acceleration shows a lot of variability. So that may be a predictor of those kinds of activities. But maybe it's really just separating not moving from moving, which might be kind of obvious in retrospect.

If you cluster the data based on maximum acceleration, you can see that there are two very clear clusters. On the left-hand side you've got the various walking activities, and on the right-hand side you've got the various non-moving activities: laying, standing, and sitting. Beyond that, things are a little bit jumbled together. There's a lot of turquoise on the left, so that's clearly one activity, but the blue and the magenta are mixed together.

6:35

So clustering based on maximum acceleration seems to separate out moving from non-moving. But once you get within those clusters, for example within the moving cluster or within the non-moving cluster, it's a little bit hard to tell what is what based just on maximum acceleration.
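
To make the "two clear clusters, then a jumble" pattern concrete, here is a small sketch on synthetic data. The three `maxAcc` columns are invented so that they differ between moving and still activities but not within either group; cutting the tree with `cutree` is an addition for illustration and is not something the lecture shows.

```r
set.seed(4)
acts   <- rep(c("laying", "sitting", "standing",
                "walking", "walking_down", "walking_up"), each = 20)
moving <- grepl("^walking", acts)

# Synthetic "maximum acceleration" columns: low when still, high when moving.
maxAcc <- matrix(rnorm(length(acts) * 3,
                       mean = ifelse(moving, 0.5, -0.9), sd = 0.1),
                 ncol = 3)

hc <- hclust(dist(maxAcc))

# Cutting the tree into two groups recovers moving vs. not moving.
top2 <- cutree(hc, k = 2)
print(table(top2, moving))

# Cutting into six groups does not line up with the six activities,
# because nothing in these columns distinguishes them within a group.
print(table(cutree(hc, k = 6), acts))
```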

6:54

We can try a singular value decomposition on these data, just to explore what's going on. I'm going to do the SVD on the entire matrix, which has 560-something columns. Before I do that, I'm going to remove the last two columns, which are just the activity identifier and the subject identifier and are not really interesting data. So I get rid of columns 562 and 563, and then I run the SVD on the data.

7:19

I'll take a look at the first and second left singular vectors and color-code them by activity. Again, you can see a similar type of pattern. The first singular vector really seems to separate out the moving from the non-moving activities: you can see green, red, and black on the bottom, and blue, turquoise, and magenta on the top.

7:41

The second singular vector is a little bit more vague in terms of what it's looking at. It seems to be separating out the magenta color from all the other clusters, and I think this is walking down or walking up, one of those two. So it's not clear what is different about that activity that gets highlighted on the second singular vector here.
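
The SVD step can be sketched like this, assuming a data frame whose last two columns are identifiers. The toy data has 20 invented features, 10 of which shift when the person is moving; centering and scaling the columns with `scale` before the SVD is a choice made for this sketch.

```r
set.seed(5)
acts   <- factor(rep(c("laying", "sitting", "standing",
                       "walking", "walking_down", "walking_up"), each = 20))
moving <- grepl("^walking", as.character(acts))
n <- length(acts)
p <- 20

X <- matrix(rnorm(n * p), nrow = n)
X[moving, 1:10] <- X[moving, 1:10] + 4   # features that shift when moving
toy <- data.frame(X, subject = 1L, activity = acts)

# Drop the identifier columns, then decompose the scaled feature matrix.
idCols <- match(c("subject", "activity"), names(toy))
svd1   <- svd(scale(toy[, -idCols]))

u1 <- svd1$u[, 1]  # first left singular vector: one value per observation
pdf(NULL)          # null graphics device, so the sketch runs without a display
par(mfrow = c(1, 2))
plot(u1, col = as.integer(acts), pch = 19)
plot(svd1$u[, 2], col = as.integer(acts), pch = 19)
invisible(dev.off())

# Range of the first singular vector within the still and moving groups.
print(tapply(u1, moving, range))
```

The sign of a singular vector is arbitrary, so which group ends up "on top" can flip between runs or data sets; only the separation itself is meaningful.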

8:19

[...] is producing the most variation, or is contributing most to the variation between the different observations. We can use the which.max function to figure out which of the 500 or so features contributes most to the variation across observations, and I save that to an object called maxContrib. Then I'll cluster based on the maximum acceleration plus this extra feature: I'll calculate the distance matrix and run the hclust function. You can see now that the various activities seem to be separating out a little bit more. At least the three movement activities have clearly been separated: we've got the magenta, the dark blue, and the turquoise all separated out. The various non-moving activities still seem to be mixed together, so whatever this maximum contributor happened to be, it didn't really help to separate out the non-moving activities. But it seemed to help a lot in terms of separating out the movement activities.
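
A hedged sketch of the "maximum contributor" idea on synthetic data. Here the search runs over the second right singular vector and takes absolute values before `which.max`; both choices are assumptions of this sketch (singular vector signs are arbitrary). The column names `f1` to `f12` and the shifts are invented.

```r
set.seed(6)
acts   <- rep(c("laying", "sitting", "standing",
                "walking", "walking_down", "walking_up"), each = 20)
moving <- grepl("^walking", acts)
n <- length(acts)

X <- matrix(rnorm(n * 12), nrow = n)
colnames(X) <- paste0("f", 1:12)
X[moving, 1:6] <- X[moving, 1:6] + 20   # f1..f6: moving vs. still
up   <- acts == "walking_up"
down <- acts == "walking_down"
X[up,   7:9] <- X[up,   7:9] + 20       # f7..f9: walking up vs. walking down
X[down, 7:9] <- X[down, 7:9] - 20

svd1 <- svd(scale(X))

# Feature with the largest weight in the second right singular vector.
maxContrib <- which.max(abs(svd1$v[, 2]))
print(colnames(X)[maxContrib])

# Cluster on three "moving" columns plus the extra feature.
hc       <- hclust(dist(X[, c(1:3, maxContrib)]))
clusters <- cutree(hc, k = 4)
print(table(clusters, acts))
```

In this toy setup the extra feature pulls the three walking activities apart, while the three still activities stay in one cluster because no column distinguishes them.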

9:25

This maximum contributor was the mean body acceleration in the frequency domain for the z direction. So this was the body acceleration for the z direction, where they applied a Fourier transform and give you the frequency components from that. So that's kind of interesting.
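
As a small aside on what "frequency components" means, here is a sketch using R's `fft` on a made-up signal. It only illustrates the transform itself and is not how the data set's frequency-domain features were actually computed.

```r
n        <- 128
timeGrid <- (0:(n - 1)) / n
signal   <- sin(2 * pi * 5 * timeGrid)  # exactly 5 cycles over the window

spectrum <- Mod(fft(signal))            # magnitude of each frequency component

# Only the first half of the spectrum is needed for a real-valued signal.
half     <- spectrum[1:(n / 2)]
peakBin  <- which.max(half)
peakFreq <- peakBin - 1                 # bin 1 is the zero-frequency term
print(peakFreq)                         # 5: the number of cycles in the window
```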

We can try another clustering technique here, which is k-means clustering. One of the things about k-means clustering that you have to be a little bit careful about is that you can get different answers depending on how many starting values you've tried and how often you run it. When you start k-means, it has to choose a starting point for where the cluster centers are, and most algorithms will choose a random starting point. If you choose a random starting point, you may get to a solution that is suboptimal, and if you choose a different starting point, you may get to an even better solution. So it's usually good to set the nstart argument to be more than one so that you start at many different starting points and can get the optimal, or at least a more optimal, solution.
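
The effect of `nstart` can be sketched on a toy problem. The three well-separated blobs below are invented, and the seed values are arbitrary; the point is only that `kmeans` keeps the best of the `nstart` runs, as measured by the total within-cluster sum of squares.

```r
set.seed(7)
truth <- rep(1:3, each = 30)
blobCenters <- rbind(c(0, 0), c(10, 0), c(5, 10))
x <- blobCenters[truth, ] + matrix(rnorm(180, sd = 0.5), ncol = 2)

set.seed(42)
single <- kmeans(x, centers = 3, nstart = 1)    # one random start

set.seed(42)
multi  <- kmeans(x, centers = 3, nstart = 100)  # best of 100 random starts

# Lower total within-cluster sum of squares means a tighter solution.
print(c(single = single$tot.withinss, multi = multi$tot.withinss))
print(table(multi$cluster, truth))
```

The cluster numbers themselves are arbitrary labels, so two runs can give the same partition with the numbers permuted.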

So here is one clustering that we've done with k-means. I've specified six centers: I know that there are six clusters, so I'll just specify them right away. You can see that some of the clusters are jumbled together. Cluster three is a combination of laying, sitting, and standing, whereas cluster one is clearly walking. Cluster two is walking down, cluster four is walking up, and cluster five is just walking. And again, cluster six is a mixture of laying, sitting, and standing. So you can see that k-means here also had some trouble separating out laying, sitting, and standing in the clusters.
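
Cross-tabulating cluster assignments against the known activity labels, as described above, might look like the following sketch. The five-column matrix is synthetic: each walking activity gets its own shifted column, while the still activities are deliberately left indistinguishable.

```r
set.seed(8)
acts <- factor(rep(c("laying", "sitting", "standing",
                     "walking", "walking_down", "walking_up"), each = 20))
n <- length(acts)

X <- matrix(rnorm(n * 5, sd = 0.3), nrow = n)
X[acts == "walking",      1] <- X[acts == "walking",      1] + 5
X[acts == "walking_down", 2] <- X[acts == "walking_down", 2] + 5
X[acts == "walking_up",   3] <- X[acts == "walking_up",   3] + 5

# Six centers, because six activities are known to be present.
kClust <- kmeans(X, centers = 6, nstart = 10)

# Rows are clusters, columns are the true activities.
tab <- table(cluster = kClust$cluster, activity = acts)
print(tab)
```

A row with counts spread over several columns is a cluster that mixes activities; a row concentrated in one column is a cluster that matches a single activity.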

11:13

If you try it again, you can see the arrangement's a little bit different. But again, cluster two, for example, is a mixture of laying, sitting, and standing, and cluster five is similarly a mixture of sitting and standing, but the other activities seem to cluster out very easily.

11:38

You see that things seem to separate out a little bit better, though not much better than last time. Cluster one is again a mixture of laying, sitting, and standing. Cluster two is clearly laying. Cluster three is clearly walking, and cluster four is walking down, so you can see how these things cluster together.

I'll do a second try with 100 starting values, and this is probably going to be our best effort. Cluster six is still a mixture of three activities, and cluster five is a mixture of two.

You can also see where the cluster centers are. The idea is that each of the clusters has a mean value, or center, in this 500-dimensional space. So we can see which of these 500 features seem to drive the location of the center for a given cluster, and that will give us some idea of which features seem to be important for classifying observations in that cluster. In the first cluster here, which seems to correspond to laying, you can see that the center has relatively high, or positive, values for [...]

13:05

[...] corresponds a little bit more, has some more interesting values for other features. There's mean body acceleration, and there's also max acceleration that seems to have some interesting values. So one of the things you can do by looking at the cluster centers is to see which features have interesting values that drive the location of that center, which could give you a hint as to which features will be most useful for predicting that activity.
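
Inspecting a cluster center can be sketched as follows. The data are synthetic, with one group of rows given high values on three invented features (`f1` to `f3`); ranking features by the absolute value of the center is just one simple way to read "interesting values" and is an assumption of this sketch.

```r
set.seed(9)
p <- 10
X <- matrix(rnorm(80 * p), ncol = p,
            dimnames = list(NULL, paste0("f", 1:p)))
X[1:40, 1:3] <- X[1:40, 1:3] + 5   # first 40 rows: high values on f1..f3

km <- kmeans(X, centers = 2, nstart = 10)

# Pick the cluster whose center is larger on f1 and look at its center.
shifted <- unname(which.max(km$centers[, "f1"]))
center  <- km$centers[shifted, ]

pdf(NULL)  # null graphics device, so the sketch runs without a display
plot(center, pch = 19, ylab = "Cluster center", xlab = "Feature index")
invisible(dev.off())

# Features with the largest absolute center values for this cluster.
top3 <- names(sort(abs(center), decreasing = TRUE))[1:3]
print(top3)
```

Each row of `km$centers` is the column-wise mean of the observations assigned to that cluster, so a feature that stands out in a center is one whose typical value is unusual for that cluster.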

So this is just a short demonstration to show how you can take a large data set with lots of features and lots of observations and start to explore it a little bit with various clustering techniques. We used hierarchical clustering, k-means clustering, and the singular value decomposition to look at various features of this data set.

Given what we've learned here, we may be interested in following up on what separates out the various non-movement activities. In terms of laying, sitting, and standing, we seem to have some difficulty, at least at first glance, separating those three activities out. The movement activities, walking, walking up, and walking down, we seem to be able to separate into distinct clusters using just a few variables, most of them maximum acceleration variables. But the non-movement activities seem to be harder to separate out.

So the nice thing about exploratory data analysis is that it gives you this rough cut that tells you where to spend your energy. You may not have to spend too much energy on the movement activities, but maybe you need to dig a little bit deeper into the non-movement activities.

I hope you find this useful in terms of how to get started using clustering techniques, how to get a look at the data, and how to further your analysis and get you going toward a more formal analysis.