0:00

In this lecture, we're going to continue the data analysis example that we started in part one. If you recall, we laid out a list of steps that one might generally take when doing a data analysis, and previously we talked about roughly the first half of these steps. In this lecture, we're going to talk about the remaining half. This includes exploratory data analysis, statistical prediction and modeling, interpretation, challenging your results, synthesizing and writing up the results, and creating reproducible code.

0:30

So if you recall, the basic question was: can I automatically detect emails that are spam or not? A slightly more concrete version of this question, one that can be translated into a statistical problem, was: can I use quantitative characteristics of the emails to classify them as spam or ham?

0:50

So our data set, again, came from the UCI Machine Learning Repository. It had already been cleaned up, and it is available in the kernlab package as a data set. This data set has 4,601 observations, or emails, that have been characterized along 58 different variables.

1:12

So the first thing we need to do with this data set, if we want to build a model to classify emails as spam or not, is split it into a test set and a training set. The idea is that we're going to use one part of the data set to build our model, and then use another part, independent of the first, to determine how good our model is at making predictions. Here I'm taking a random half of the data set: I'm flipping a coin with the rbinom function, with probability one half, and that will separate the data set into two pieces. You can see that roughly half, 2,314 emails, end up in one piece and 2,287 in the other. The training set will be one piece and the test set will be the other.
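The split described above might look something like this in R; this is a sketch, assuming the spam data set from the kernlab package, and the seed value and object names are illustrative rather than the lecture's exact code.

```r
## Split the spam data into training and test sets with a coin flip
library(kernlab)
data(spam)

set.seed(3435)  # arbitrary seed, just for reproducibility
trainIndicator <- rbinom(nrow(spam), size = 1, prob = 0.5)
table(trainIndicator)  # roughly half 0s and half 1s

trainSpam <- spam[trainIndicator == 1, ]
testSpam  <- spam[trainIndicator == 0, ]
```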

2:08

So the first thing we're going to want to do is a little bit of exploratory data analysis. We haven't looked at this data set yet, so it would be useful to see what the data look like: what's the distribution of the data, what are the relationships between the variables, things like that. We want to look at basic one-dimensional and two-dimensional summaries of the data, check whether there are any missing data (and, if there are, why), create some exploratory plots, and do a few exploratory analyses. We're going to focus on the training data set for now; the exploratory analysis and the model building will all be done on the training data. If you look at the column names of the data set, you can see that they're essentially all words, and if you look at the first five rows, you can see that the entries are the frequencies at which those words occur in a given email. For example, the word "make" does not appear in the first email, and the word "mail" does not appear either. So these are basically frequency counts, or frequencies of words, within each of the emails.
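A quick way to take this first look, assuming the trainSpam data frame from the split described earlier (the object name is an assumption):

```r
## Peek at the training data: word-frequency columns and the outcome
names(trainSpam)          # column names are mostly words
head(trainSpam[, 1:5])    # first rows: per-email word frequencies
table(trainSpam$type)     # counts of spam vs. nonspam emails
```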

3:21

If we look at the outcome in the training data set, we see that 906 of the emails are classified as spam, and the other 1,381 are classified as non-spam. This is what we're going to use to build our model for predicting spam emails.

3:39

We can make some plots to compare the frequencies of certain characteristics between the spam and the non-spam emails. Here we're looking at a variable called capitalAve, the average number of capital letters. You can see that it's difficult to read this picture, because the data are highly skewed. In these kinds of situations it's often useful to look at a log transformation of the variable. So here I'm going to take the base-10 log of the variable and compare it between spam and nonspam. Since there are a lot of zeros in this particular variable, and taking the log of zero doesn't make sense, we'll just add 1 to the variable so we can take the log and get a rough sense of what the data look like. Typically you wouldn't want to add 1 to a variable just because; but since we're only exploring the data and making exploratory plots, it's okay in this case. Here you can see, rather obviously, that the spam emails have a much higher rate of capital letters than the non-spam emails, and if you've ever seen spam emails, you're probably familiar with that phenomenon. So that's one useful relationship to see.
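The comparison described above can be sketched with a pair of boxplots; this assumes the trainSpam data frame from earlier in the lecture.

```r
## Compare capitalAve between spam and nonspam emails
boxplot(capitalAve ~ type, data = trainSpam)  # hard to read: highly skewed

## Adding 1 avoids log10(0) -- fine for an exploratory plot
boxplot(log10(capitalAve + 1) ~ type, data = trainSpam)
```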

4:53

We can look at pairwise relationships between the different variables with plots. Here I've got a pairs plot of a few of the variables, on the log scale, and you can see that some of them are correlated and some are not particularly correlated. That's useful to know.
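A pairs plot like the one described can be produced in one line; the choice of the first four columns here is illustrative, and trainSpam is assumed from earlier.

```r
## Pairwise scatterplots of a few log-transformed predictors
pairs(log10(trainSpam[, 1:4] + 1))
```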

5:12

We can explore the predictor space a little more by doing a hierarchical cluster analysis, and this is a first cut at that with the hclust function in R. I've plotted the dendrogram just to see which predictors, words, or characteristics tend to cluster together. It's not particularly helpful at this point, although it does separate out the one variable capitalTotal. But recall that clustering algorithms can be sensitive to skewness in the distribution of the individual variables, so it may be useful to redo the clustering analysis after a transformation of the predictor space.

5:49

Here I've taken a base-10 log transformation of the predictors in the training data set, and again I've added one to each, to avoid taking the log of zero. Now the dendrogram is a little more interesting: it has separated out a few clusters. This capitalAve variable is one cluster all by itself; there's another cluster that includes "you", "will", and "your"; and then there are a bunch of other words that lump together more ambiguously. This may be worth exploring a little further if you see particular characteristics that are interesting.
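The two clusterings described above might be sketched as follows; this assumes the trainSpam data frame, with the numeric predictors in the leading columns (the column ranges are assumptions, not the lecture's exact code).

```r
## Cluster the predictors (columns), so transpose before dist()
hCluster <- hclust(dist(t(trainSpam[, 1:57])))
plot(hCluster)  # dominated by skewed variables like capitalTotal

## Redo after a log transformation of the predictor space
hClusterUpdated <- hclust(dist(t(log10(trainSpam[, 1:55] + 1))))
plot(hClusterUpdated)  # more interpretable clusters emerge
```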

6:27

Once we've done the exploratory data analysis, looked at some univariate and bivariate plots, and done a little cluster analysis, we can move on to more sophisticated statistical modeling and some prediction modeling. Any statistical modeling that you engage in should be informed by the question that you're interested in, of course, and by the results of any exploratory analysis. The exact methods that you employ will depend on the question of interest.

6:55

When you do statistical modeling, you should account for the fact that the data have been processed or transformed, if in fact they have been. And as you do the modeling, you should always think about the measures of uncertainty and the sources of uncertainty in your data set.

7:13

Here we're going to do a very basic statistical model. We're going to go through each of the variables in the data set and fit a generalized linear model, in this case a logistic regression, to see if we can predict whether an email is spam by using just a single variable. We use the reformulate function to create a formula that includes the response, which is just the type of email, and one of the variables of the data set, and we cycle through all the variables in a for loop, building a logistic regression model for each and then calculating the cross-validated error rate of predicting spam from that single variable. If you run this loop in R, it may take a little while, because it has to loop through all the variables and fit all the models. Once we've done this, we try to figure out which of the individual variables has the minimum cross-validated error rate. We can take the vector of results, the CV errors, and just figure out which one is the minimum. It turns out that the predictor with the minimum cross-validated error rate is the variable charDollar, an indicator of the number of dollar signs in the email.
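The loop described above might look like this; it's a sketch that uses cv.glm from the boot package for the cross-validated error, and assumes the trainSpam data frame from earlier (the cost function and the 0/1 numType column are illustrative choices).

```r
## Fit a one-predictor logistic regression for each variable and
## record its cross-validated classification error
library(boot)

trainSpam$numType <- as.numeric(trainSpam$type) - 1  # 0 = nonspam, 1 = spam
costFunction <- function(x, y) sum(x != (y > 0.5))   # misclassification cost
cvError <- rep(NA, 55)

for (i in 1:55) {
  lmFormula <- reformulate(names(trainSpam)[i], response = "numType")
  glmFit <- glm(lmFormula, family = "binomial", data = trainSpam)
  cvError[i] <- cv.glm(trainSpam, glmFit, costFunction, 2)$delta[2]
}

## Which single predictor has the minimum cross-validated error?
names(trainSpam)[which.min(cvError)]
```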

8:29

Just keep in mind that this is a very simple model: each of the models we fit has only a single predictor in it. Of course we could think of something more complicated, but this may be an interesting place to start.

8:42

So we take the best model from this set of 55 predictors, the charDollar variable, and re-fit the model, a logistic regression. We can now make predictions from the model on the test data. Recall that we split the data set into two parts and built the model on the training data set; now we're going to predict the outcome on the test data set to see how well we do. In a logistic regression we don't get specific 0/1 classifications of each of the messages; we get a probability that a message is spam. So we have to take this continuous probability, which ranges between 0 and 1, and determine at what cutoff we think the email is spam. We're just going to draw the cutoff here at 0.5: if the probability is above 50%, we'll call it a spam email.
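The refit and the 0.5 cutoff described above can be sketched as follows, assuming the trainSpam/testSpam split and the 0/1 numType outcome from earlier in the lecture (object names are assumptions).

```r
## Refit the best single-predictor model on the training data
predictionModel <- glm(numType ~ charDollar, family = "binomial",
                       data = trainSpam)

## type = "response" gives probabilities in [0, 1], not 0/1 labels
predictionTest <- predict(predictionModel, testSpam, type = "response")

## Apply the cutoff: probability above 0.5 means "spam"
predictedSpam <- rep("nonspam", nrow(testSpam))
predictedSpam[predictionTest > 0.5] <- "spam"
```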

9:43

Once we've created our classification, we can take a look at the predicted values from our model and compare them with the actual values from the test data set, because we know which emails were spam and which were not. Here's the classification table that we get from the predicted and the real values, and from it we can calculate the error rate. The mistakes we made are the off-diagonal elements of this table: 61 emails were classified as spam that were not actually spam, and 458 were classified as non-spam but actually were spam. That works out to an error rate of about 22%. So now we've done the analysis and calculated some results: we've chosen our best model and looked at the error rate it produces.
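The table and error rate can be computed directly, assuming the predictedSpam vector and testSpam data frame described above:

```r
## Classification table: predicted labels vs. actual labels
table(predictedSpam, testSpam$type)

## Error rate: fraction of misclassified (off-diagonal) emails
mean(predictedSpam != testSpam$type)  # about 0.22 in this example
```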

10:35

So now we need to interpret our findings, and when you interpret your findings it's important to use appropriate language, and not language that goes beyond the analysis you actually did. In this type of application, where we're just looking at some data and building a predictive model, you want to use words like "predicts", "correlates with", or "may be associated with the outcome", or say that the analysis is descriptive. So think carefully about the language you use to interpret your results. It's good to give an explanation: if you can think of why certain models predict better than others, it's useful to explain what you think the reason is. If there are coefficients in the model that you need to interpret, you can do that here. And in particular it's useful to bring in measures of uncertainty, to calibrate your interpretation of the final results.

11:32

So in this example, we might state that the fraction of characters that are dollar signs can be used to predict whether an email is spam.

11:42

Maybe we decide that anything with more than 6.6% dollar signs is classified as spam; more dollar signs always means more spam under our prediction model. And for our model in the test data set, the error rate was 22.4%. Once you've done your analysis and developed your interpretation, it's important that you yourself challenge all the results that you've found. If you don't do it, someone else will once they see your analysis, so you might as well get a step ahead of everyone by doing it yourself first. It's good to challenge everything, the whole process by which you've gone through this problem: the question itself (is it even a valid question to ask?), where the data came from, how you got the data, how you processed it, how you did the analysis, and any conclusions you drew.

12:57

It's also useful to think about potential alternative analyses that might be useful. That doesn't mean you have to do those alternative analyses; you might stick with your original one for other reasons. But it may be useful to try alternatives, in case they're useful in different ways or produce better predictions.

13:20

Once you've done the analysis, interpreted your results, drawn some conclusions, and challenged all your findings, you're going to need to synthesize the results and write them up. Synthesis is very important, because in any data analysis there are typically many, many things that you did, and when you present them to another person or to a group, you'll have to winnow them down to the most important aspects and tell a coherent story. Typically you want to lead with the question you were trying to address. If people understand the question, they can build a context in their minds and better understand the framework in which you're operating. That leads to what kinds of data are necessary

14:03

and appropriate for the question, and what kinds of analyses would be appropriate. You can summarize the analyses as you're telling the story. It's important not to include every analysis you ever did, but only what's needed to tell a coherent story.

14:19

It's sometimes useful to keep those other analyses in your back pocket, though, even if you don't talk about them, because someone may challenge what you've done, and then you can say, "Well, we did do that analysis, but it was problematic," for whatever the reason may be.

14:34

It's important to order the analyses according to the story you're telling, and often that order is not the same as the order in which you actually did them. It's usually not that useful to talk about the analyses chronologically, because the order in which you did them is often scattered and doesn't make sense in retrospect. So talk about the analyses of your data set in the order that's appropriate for the story you're trying to tell.

15:04

And when you're telling the story, whether you're presenting to one person or to a group, it's useful to include very well-done figures, so that people can understand what you're trying to say in one picture or two.

15:17

So in our example, the basic question was: can we use quantitative characteristics of the emails to classify them as spam or ham? Our approach was, rather than trying to get the ideal data set from, say, all of Google's servers, to collect some data from the UCI Machine Learning Repository and create training and test sets from that data set. We explored some relationships between the various predictors. We decided to use a logistic regression model on the training set and chose our single-variable predictor using cross-validation. When we applied this model to the test set, it was 78% accurate.

15:54

The interpretation of our results was that, basically, more dollar signs seemed to indicate that an email was more likely to be spam, and this seems reasonable. We've all seen emails with lots of dollar signs in them trying to sell you something, so this is both reasonable and understandable.

16:15

Of course, the results were not particularly great: 78% test-set accuracy is not that good for most prediction algorithms. We probably could do much better if we included more variables or used a more sophisticated model, maybe a non-linear one. For example, why did we use logistic regression? We could have used a much more sophisticated modeling approach. But these are the kinds of things you want to outline to people as you go through a data analysis and present it to others. Finally, the thing you want to make sure of is that you document your analysis as you go. You can use tools like R Markdown, knitr, and RStudio to document your analyses as you do them, preserving the R code as well as a written summary of your analysis in a single document. That way, everything you do is reproducible, either by yourself or by other people, because ultimately that's the standard by which most data analysis will be judged. If someone cannot reproduce your analysis, the conclusions you draw will not be as worthy as those of an analysis whose results are reproducible. So try to stay organized; try to use the tools of reproducible research to keep things organized and reproducible. That will make the evidence for your conclusions much more powerful.
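As a minimal sketch of what such a document might look like (the title, file layout, and chunk contents here are illustrative, not from the lecture), an R Markdown file interleaves prose with executable R chunks:

````markdown
---
title: "Spam Classification Analysis"
output: html_document
---

We split the spam data into training and test sets:

```{r split}
library(kernlab)
data(spam)
set.seed(3435)
trainIndicator <- rbinom(nrow(spam), size = 1, prob = 0.5)
table(trainIndicator)
```
````

Rendering this file, for example with knitr or rmarkdown::render, produces a single report containing both the code and its results.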
