Hello and welcome to the introduction to machine learning lesson.

This lesson will start by introducing four types of

data analytics that are commonly used in business communities.

Next, we'll actually get into the basic tasks of machine learning.

This includes things such as data cleaning and pre-processing.

While they aren't the most exciting aspects of analytics or machine learning,

they're incredibly important because you can't make sense of

data if the data isn't clean and ready to be analyzed.

Finally, we'll introduce the four main categories of machine learning.

These include classification, regression,

clustering, and dimensional reduction.

And we'll also talk about how to persist machine learning models.

This last aspect is important because if you

spend a lot of time building a model you want to be able to

save it so that you can reuse it or deploy it on different computational hardware.

So key things you should be able to do by the end of this lesson are: explain

the difference between supervised and unsupervised learning,

explain the differences between regression and classification,

understand and articulate the basic concepts of clustering and dimensional reduction,

and be able to use the scikit-learn library to perform these basic tasks.

Now, a key aspect here is I don't expect you to be experts after this lesson.

You're going to have many modules to learn these things in more detail.

This is simply to give you

a high level overview of these concepts so that you'll be able to

understand them and talk about them and be

ready to dig into them more deeply in future lessons.

There's going to be a reading on the four types of

data analytics that are typically encountered in data science as well as a notebook.

So first, let me go to the readings and I'm

going to expand this so it's a little easier to see.

You've probably seen this example in other cases.

This is not something that's unique to this particular website.

The idea is that there are really four things that you can do with data.

The first is descriptive, and that is essentially: what's happening right now?

And the idea is you have some data and you're trying to make sense of what's going on.

The second is diagnostic, and this asks: why are the things that I'm seeing

actually happening?

And so the idea is to be able to go beyond

the merely descriptive aspect and to try to understand why that's happening.

The third is predictive and that's a little bit more future looking where you say,

"I think I understand what's going on right now,

what's likely to happen in the future?"

And lastly is the prescriptive, which is,

what do I need to do to be able to capitalize on

the things that I'm seeing as potential future outcomes?

So this article will go through these in more detail,

I encourage you to look at this,

it's an important idea and there's a lot of good information in here.

The main part of the lesson though is going

to be the Introduction to Machine Learning notebook.

In this notebook, we're going to introduce the scikit-learn library,

which is shortened to sklearn.

And we're going to show how to do some basic steps

in the machine learning process and that includes some data pre-processing,

data scaling and then specifically give examples of classification, regression,

dimensional reduction and clustering before

ending with a demonstration of model persistence.

Now in the interest of time,

I'm not going to step through every single part of this notebook.

I simply want to highlight a few things.

First, you're going to see this particular code cell

or something very similar to it at the start of every notebook.

This of course is our standard setup,

where we import most of the modules we're going to use.

We configure warnings so that some are ignored;

in this case, these are specific to pandas.

We sometimes get warnings that aren't important.

And then lastly we set some options for the visualization.

The first step is data exploration and there actually is

a standard technique that's used called

Cross Industry Standard Process for Data Mining or CRISP-DM.

And the very first steps are business understanding,

data understanding, and data preparation,

so before you try to apply any fancy machine learning technique,

you need to make sure you understand what it is you're actually trying to do.

This may seem obvious but often people get

excited and want to just run in and try some algorithm

and make some fancy prediction without

understanding the exact details of what they're trying to solve,

and you don't want to solve the wrong problem.

If you think about it in terms of a homework assignment,

you always want to solve what you're being asked for not something else.

So in this notebook, we're going to use the standard Iris data set.

It's not a fancy new data set but it allows us to

show how the algorithms are working without having to constantly change data sets.

We will use several different data sets

throughout the machine learning aspect of this course.

The Iris one is one we'll use frequently because it easily demonstrates lots of

the concepts that we want to cover in machine learning.

So first, we load the data set,

we then use the sample method.

I like sample because it just picks random rows.

And in this case, we've said five, so that's why there are

five, and it's useful because

every time you run it, you get something slightly different.

We can also group by the target label,

in this case species, and say how many do we have.

You could see we have 50 of each.

So in this particular case there are 150 rows or instances in our data set.

There are four features or columns, and the classes are balanced.

The other nice thing is we're not missing any data here.

If we were,

we might for instance see a 49 here or a 48 here.

We could also compute some descriptive statistics with the describe function.

And here again, we see that there's 150 instances in all columns, that's good.
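
The exploration steps described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual cell: it assumes the Iris data is loaded through scikit-learn's `load_iris` (the notebook may load it from another source), and the names `df`, `counts`, and `summary` are chosen here for illustration.

```python
from sklearn.datasets import load_iris

# Assumption: load Iris via scikit-learn as a pandas DataFrame.
iris = load_iris(as_frame=True)
df = iris.data.copy()

# Add a readable species column built from the integer targets.
df["species"] = iris.target_names[iris.target.to_numpy()]

# Five random rows; the selection changes on every run.
print(df.sample(5))

# Number of instances per species (a balanced data set shows equal counts).
counts = df.groupby("species").size()
print(counts)

# Descriptive statistics for the numeric feature columns.
summary = df.describe()
print(summary)
```

A count below 150 in the `describe` output, or a class count below 50, would be a quick hint that some values are missing.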

We can then look at making some simple visualizations,

in this case, we're going to use the pair plot or pair grid,

which makes a very fast visualization showing the scatter plots between

different pairs of features and

the diagonal shows a histogram of that particular feature.

We've color coded the points by the particular species.

So looking at this you can see some things very fast.

First of all, there's a really strong relationship between petal length and petal width.

That's interesting and something we're going to want to come back to.

Secondly, we notice that if we look at sepal width

versus petal width, there is a natural clustering

of points here, and there's a natural cluster of

green here and red here with just a slight bit of overlap.

These are important things, and one reason that I like to make these sorts of

pair plots is that they allow you to quickly

visualize the relationships between different features.

And the visualizations are very good at giving you that insight very quickly.

Next, we do a few things here. First, we're pulling out our columns from

our data frame into an array that

will allow us to apply machine learning, and then secondly,

we're creating a new array called labels, which is going to

be numerical with the values zero, one, and two.

And the way those are ordered is that setosa will be zero,

versicolor will be one, and virginica will be two.

This here is simply integer division.

So we take i for i in range, and the range here is going to be 150.

So we're going to have an array of 150 elements, and

the first 50 elements will all be zero because it's integer division.

And then once we get to index 50,

50 divided by 50 is one, so the value is going to be

one until we get to index 100, at which point it will turn to two.

So this is just a quick way of building up an array

where we have an integer that corresponds to each class.
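
The integer-division trick can be sketched as below. This assumes, as in the standard Iris data, that the rows are sorted by species with 50 rows each; the variable names are illustrative.

```python
import numpy as np

# Integer division maps indices 0-49 to 0, 50-99 to 1, and 100-149 to 2.
labels = np.array([i // 50 for i in range(150)])

# Class names in the same order as the integer codes.
names = ["setosa", "versicolor", "virginica"]

# Look at the values on either side of each class boundary.
print(labels[48:52])    # around index 50
print(labels[98:102])   # around index 100
print(names[labels[75]])
```

This only works because the rows are ordered by class; for shuffled data you would encode the species column directly instead.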

We can make a quick plot showing our data;

this is now showing you that clustering that we saw before.

And then, we can start getting into machine learning.

So three key aspects here I want to highlight,

first is the difference between supervised and unsupervised learning:

basically, supervised techniques use labeled training data to make predictions,

while unsupervised techniques don't need labels.

The second is the idea of dimensional reduction: when you have lots of

features, you often want to truncate them or use a smaller subset.

Ideally, that smaller subset still encodes a lot of the information,

that's a useful technique.

And then lastly is clustering,

which is really a powerful technique that allows you to find data that

are grouped together somehow and to be able to treat them then as a single entity.

A good use of clustering in business is when you're trying

to do customer segmentation and you're trying to say,

"Here's a bunch of customers that are all similar that are using our particular product."

You may want to find for instance,

high income customers and this might be a way to do that or

customers that you need to worry about losing to competitors.

The rest of the notebook then steps through these.

We talk about a few things like parameters and hyperparameters,

it's important to understand these.

We'll be doing a lot with hyperparameters.

Hyperparameters are values that we have to set ourselves, and we generally

can't know the best value ahead of time.

And so you need to run a model multiple times and see what's the best value.
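
One way to picture "running a model multiple times" is a small loop over candidate values. The sketch below is an assumption about how this could look, not the notebook's code: it tries a handful of values for the `n_neighbors` hyperparameter of a k-nearest neighbors classifier and compares them with 5-fold cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values (chosen arbitrarily for illustration).
scores = {}
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds for this value of k.
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores)
print("best n_neighbors:", best_k)
```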

And then we're going to step into scikit-learn.

In particular, we're going to see things such

as splitting our data set into a training set and a testing set.

If you train your model on data,

and then you measure its accuracy on that same data,

you're going to be susceptible to problems such as overfitting.

You always want to have a set of data that has not been seen by

the model to use for your prediction accuracy; that estimate will be more realistic.

And so we want to do this, where we split into a train, test split.

We often also will use a random state here.

The reason I do this is because this allows our notebook to be reproducible.

If I didn't set this,

we'd use a random state that changed on every run, and

the results would change each time.

And that would make it hard to write a notebook and to be able to

convince others of what you're doing because the results constantly change.

Sometimes that's okay, but in general in these notebooks,

I will specify a random state.
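
A minimal sketch of the split, assuming a 60/40 division and an arbitrary seed of 23 (neither value is taken from the notebook). Calling `train_test_split` twice with the same `random_state` shows the reproducibility point.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 40% of the rows for testing; the seed fixes which rows are chosen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=23
)

# Repeating the call with the same seed gives exactly the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.4, random_state=23
)

print(X_train.shape, X_test.shape)
print("same split:", np.array_equal(X_test, X_test2))
```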

We also might want to scale the data.

This is important if you're going to be using certain algorithms.

If you have a feature that, say,

goes from zero to one and another feature that goes from zero to 1000,

it's hard for a machine learning algorithm to treat those the same

way because one has a much larger range.

So you can scale data in different ways,

and we show some of these here.

You can normalize them to have zero mean and unit variance or maybe to have a certain range.

And you typically want to do this,

you want to do it to both the training and testing data,

and so we mentioned that here.

So here, we're implementing the StandardScaler,

which gives us a zero mean and standard deviation of one.

So here's the original data and here's the scaled data,

and you can see how it's changed.
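
A rough sketch of this scaling step, under the assumption that the scaler is fit on the training data and then applied to both the training and testing data (the split parameters are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=23
)

# Learn each feature's mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_train)

# Apply the same transformation to both sets.
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print("original first row:", X_train[0])
print("scaled first row:  ", np.round(X_train_s[0], 3))
```

The scaled test data will not have exactly zero mean and unit variance, because the statistics come from the training data; that is expected.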

Next, we're going to step into classification.

We simply make a model and we say, "How accurate are we?"

First thing I want you to notice, look how simple it was to do this.

We have our data already taken care of.

We import our particular estimator,

we say, we want five neighbors,

we run it, we fit our model,

and then we get our accuracy on our test data.

And you can see the accuracy is pretty high; that's pretty impressive.

And this is the beauty of scikit-learn as a library,

it's very simple to apply things.

This is only a few lines of code,

and we create a model,

we train it, and we score it.
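
The create, train, and score pattern looks roughly like this. The five neighbors come from the lesson; the k-nearest neighbors estimator is inferred from that, and the split parameters are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=23
)

# Create the estimator with five neighbors.
knn = KNeighborsClassifier(n_neighbors=5)

# Train on the training data only.
knn.fit(X_train, y_train)

# Mean accuracy on data the model has not seen.
accuracy = knn.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```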

For regression, we use a different model,

in this case, a decision tree.

And don't worry about the details of these models.

We're going to have entire lessons devoted to them,

so you'll have plenty of time to dip into that.

Here again, we create our training and test data.

We then apply it, in this case,

to a regression model,

where we are trying to predict a continuous value.

Classification was putting things into bins.

Regression is trying to predict a continuous value.

So for instance, if you try to predict future income for a company,

you would want that as a regression, not a classification.
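
A regression counterpart might look like the sketch below. The target is an assumption made here for illustration: it predicts petal width (a continuous value) from the other three Iris features with a decision tree; the notebook may set up the problem differently.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, _ = load_iris(return_X_y=True)

# Assumption: use the first three features to predict the fourth (petal width).
features = X[:, :3]
target = X[:, 3]

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.4, random_state=23
)

# Same create, fit, score pattern as classification.
tree = DecisionTreeRegressor(random_state=23)
tree.fit(f_train, t_train)

# For regressors, score returns the R^2 coefficient of determination.
r2 = tree.score(f_test, t_test)
print(f"R^2 on test data: {r2:.3f}")
print("first predictions:", tree.predict(f_test[:3]))
```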

Then we look through dimensional reduction,

which is trying to reduce the number of features.

We apply this, in this case, to the Iris data.

Then we say, "How many features do we need?"

It turns out we don't need a lot.

And that makes sense if you think about the pair plots.

There are a few combinations of features where the data was highly separated.
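
The lesson does not name the technique here, so the following sketch assumes principal component analysis (PCA), a common choice for dimensional reduction. It shows how much of the variance in the Iris data each component explains.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Fit PCA with all four components to see how the variance is distributed.
pca = PCA(n_components=4)
pca.fit(X)

ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)
print("explained variance ratio:", np.round(ratios, 3))
print("cumulative:", np.round(cumulative, 3))

# Keep only the first two components as a reduced feature set.
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)
```

If the first one or two components account for most of the variance, a much smaller feature set still encodes most of the information.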

And then lastly, we have our clustering.

And here we apply clustering to the data and we get nice little centers.

And that's what this last visualization shows.

What are our computed clusters?

The purple stars show this.

Here's one cluster center,

here's another, and here's a third.
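
The clustering algorithm is not named in the lesson; the sketch below assumes k-means with three clusters, which produces the kind of cluster centers described above. The seed and `n_init` value are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)

# Unsupervised: only the features are used, never the species labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=23)
cluster_ids = kmeans.fit_predict(X)

# One center per cluster, with one coordinate per feature.
centers = kmeans.cluster_centers_
print(centers)
print("points per cluster:", [int((cluster_ids == c).sum()) for c in range(3)])
```

The cluster numbers are arbitrary, so cluster 0 does not necessarily correspond to class 0.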

And then the last thing was the model persistence,

which shows how to actually save a model.

We can do this by using Python's pickling capability, in particular

with the joblib library.

Very simple, these four lines save the model,

we then show the model right there.

And then we can recreate the model into our notebook and apply it.

And you can see that we still get the same results.
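
A minimal persistence sketch with joblib, assuming a k-nearest neighbors model and a hypothetical file name `knn_model.pkl` written to a temporary directory (the notebook's model and path may differ):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=23
)
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, used only for this illustration.
    path = os.path.join(tmp, "knn_model.pkl")
    joblib.dump(model, path)        # save the fitted model to disk
    restored = joblib.load(path)    # recreate it from the file

# The restored model should make the same predictions as the original.
same = np.array_equal(model.predict(X_test), restored.predict(X_test))
print("identical predictions:", same)
```

As with any pickle-based format, only load model files from sources you trust.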

So with that I'm going to go ahead and end.

I realize this was a very long video,

but there were a lot of things that I wanted to get through.

Hopefully, I've given you a taste of the importance of machine learning,

the different types of data analytics,

and in particular, supervised and

unsupervised learning, classification, regression,

dimensional reduction, and clustering.

We'll be going through all of these in much more detail in future lessons.

If you have any questions,

let us know, and good luck.