0:00

In this video I am going to define what is probably the most common type of machine

learning problem, which is supervised learning. I'll define supervised learning

more formally later, but it's probably best to explain or start with an example

of what it is and we'll do the formal definition later. Let's say you want to

predict housing prices. A while back, a student collected data sets from the

Institute of Portland Oregon. And let's say you plot a data set and it looks like

this. Here on the horizontal axis, the size of different houses in square feet,

and on the vertical axis, the price of different houses in thousands of dollars.

So. Given this data, let's say you have a friend who owns a house that is, say 750

square feet and hoping to sell the house and they want to know how much they can

get for the house. So how can the learning algorithm help you? One thing a learning

algorithm might be able to do is put a straight line through the data or to fit a

straight line to the data and, based on that, it looks like maybe the house can be

sold for maybe about $150,000. But maybe this isn't the only learning algorithm you can

use. There might be a better one. For example, instead of sending a straight

line to the data, we might decide that it's better to fit a quadratic

function or a second-order polynomial to this data. And if you do that, and make a

prediction here, then it looks like, well, maybe we can sell the house for closer to

$200,000. One of the things we'll talk about later is how to choose and how to

decide do you want to fit a straight line to the data or do you want to fit the

quadratic function to the data and there's no fair picking whichever one gives your

friend the better house to sell. But each of these would be a fine example of a

learning algorithm. So this is an example of a supervised learning algorithm. And

the term supervised learning refers to the fact that we gave the algorithm a data set

in which the "right answers" were given. That is, we gave it a data set of

houses in which for every example in this data set, we told it what is the right

price so what is the actual price that, that house sold for and the toss of the

algorithm was to just produce more of these right answers such as for this new

house, you know, that your friend may be trying to sell. To define with a bit more

terminology this is also called a regression problem and by regression

problem I mean we're trying to predict a continuous value output. Namely the price.

So technically I guess prices can be rounded off to the nearest cent. So maybe

prices are actually discrete values, but usually we think of the price of a house

as a real number, as a scalar value, as a continuous value number and the term

regression refers to the fact that we're trying to predict the sort of continuous

values attribute. Here's another supervised learning example, some friends

and I were actually working on this earlier. Let's see you want to look at

medical records and try to predict of a breast cancer as malignant or benign. If

someone discovers a breast tumor, a lump in their breast, a malignant tumor is a

tumor that is harmful and dangerous and a benign tumor is a tumor that is harmless.

So obviously people care a lot about this. Let's see a collected data set and suppose

in your data set you have on your horizontal axis the size of the tumor and

on the vertical axis I'm going to plot one or zero, yes or no, whether or not these are

examples of tumors we've seen before are malignantâwhich is oneâor zero if not malignant

or benign. So let's say our data set looks like this where we saw a tumor of this

size that turned out to be benign. One of this size, one of this size. And so on.

And sadly we also saw a few malignant tumors, one of that size, one of that

size, one of that size... So on. So this example... I have five examples of benign

tumors shown down here, and five examples of malignant tumors shown with a vertical

axis value of one. And let's say we have a friend who tragically has a breast

tumor, and let's say her breast tumor size is maybe somewhere around this value. The

machine learning question is, can you estimate what is the probability, what is

the chance that a tumor is malignant versus benign? To introduce a bit more

terminology this is an example of a classification problem. The term

classification refers to the fact that here we're trying to predict a discrete

value output: zero or one, malignant or benign. And it turns out that in

classification problems sometimes you can have more than two values for the two

possible values for the output. As a concrete example maybe there are three

types of breast cancers and so you may try to predict the discrete value of zero,

one, two, or three with zero being benign. Benign tumor, so no cancer. And one may

mean, type one cancer, like, you have three types of cancer, whatever type one

means. And two may mean a second type of cancer, a three may mean a third type of

cancer. But this would also be a classification problem, because this other

discrete value set of output corresponding to, you know, no cancer, or cancer type

one, or cancer type two, or cancer type three. In classification problems there is

another way to plot this data. Let me show you what I mean. Let me use a slightly

different set of symbols to plot this data. So if tumor size is going to be the

attribute that I'm going to use to predict malignancy or benignness, I can also draw

my data like this. I'm going to use different symbols to denote my benign and

malignant, or my negative and positive examples. So instead of drawing crosses,

I'm now going to draw O's for the benign tumors. Like so. And I'm going to keep

using X's to denote my malignant tumors. Okay? I hope this is beginning to make

sense. All I did was I took, you know, these, my data set on top and I just

mapped it down. To this real line like so. And started to use different symbols,

circles and crosses, to denote malignant versus benign examples. Now, in this

example we use only one feature or one attribute, mainly, the tumor size in order

to predict whether the tumor is malignant or benign. In other machine learning

problems when we have more than one feature, more than one attribute. Here's

an example. Let's say that instead of just knowing the tumor size, we know both the

age of the patients and the tumor size. In that case maybe your data set will look

like this where I may have a set of patients with those ages and that tumor size and

they look like this. And a different set of patients, they look a little different,

7:15

whose tumors turn out to be malignant, as denoted by the crosses. So, let's say you

have a friend who tragically has a tumor. And maybe, their tumor size and age

falls around there. So given a data set like this, what the learning algorithm

might do is throw the straight line through the data to try to separate out

the malignant tumors from the benign ones and, so the learning algorithm may decide

to throw the straight line like that to separate out the two classes of tumors.

And. You know, with this, hopefully you can decide that your friend's tumor is

more likely to if it's over there, that hopefully your learning algorithm

will say that your friend's tumor falls on this benign side and is therefore more

likely to be benign than malignant. In this example we had two features, namely,

the age of the patient and the size of the tumor. In other machine learning problems

we will often have more features, and my friends that work on this problem, they

actually use other features like these, which is clump thickness, the clump thickness of

the breast tumor. Uniformity of cell size of the tumor. Uniformity of cell shape of

the tumor, and so on, and other features as well. And it turns out one of the interes-,

most interesting learning algorithms that we'll see in this class is a learning

algorithm that can deal with, not just two or three or five features, but an infinite

number of features. On this slide, I've listed a total of five different features.

Right, two on the axes and three more up here. But it turns out that for some learning

problems, what you really want is not to use, like, three or five features. But

instead, you want to use an infinite number of features, an infinite number of

attributes, so that your learning algorithm has lots of attributes or

features or cues with which to make those predictions. So how do you deal with an

infinite number of features. How do you even store an infinite number of

things on the computer when your computer is gonna run out of memory. It

turns out that when we talk about an algorithm called the Support Vector

Machine, there will be a neat mathematical trick that will allow a computer to deal

with an infinite number of features. Imagine that I didn't just write down two features

here and three features on the right. But, imagine that I wrote down an infinitely long list, I

just kept writing more and more and more features. Like an infinitely long list of

features. Turns out, we'll be able to come up with an algorithm that can deal with

that. So, just to recap. In this class we'll talk about supervised

learning. And the idea is that, in supervised learning, in every example in

our data set, we are told what is the "correct answer" that we would have

quite liked the algorithms have predicted on that example. Such as the price of the

house, or whether a tumor is malignant or benign. We also talked about the

regression problem. And by regression, that means that our goal is to predict a

continuous valued output. And we talked about the classification problem, where

the goal is to predict a discrete value output. Just a quick wrap up question:

Suppose you're running a company and you want to develop learning algorithms to

address each of two problems. In the first problem, you have a large inventory of

identical items. So imagine that you have thousands of copies of some identical

items to sell and you want to predict how many of these items you sell within the

next three months. In the second problem, problem two, you'd like-- you have lots of

users and you want to write software to examine each individual of your

customer's accounts, so each one of your customer's accounts; and for each account,

decide whether or not the account has been hacked or compromised. So, for each of

these problems, should they be treated as a classification problem, or as a

regression problem? When the video pauses, please use your mouse to select whichever

of these four options on the left you think is the correct answer. So hopefully,

you got that this is the answer. For problem one, I would treat this as a

regression problem, because if I have, you know, thousands of items, well, I would

probably just treat this as a real value, as a continuous value. And

treat, therefore, the number of items I sell, as a continuous value. And for the

second problem, I would treat that as a classification problem, because I might

say, set the value I want to predict with zero, to denote the account has not been

hacked. And set the value one to denote an account that has been hacked into. So just

like, you know, breast cancer, is, zero is benign, one is malignant. So I

might set this be zero or one depending on whether it's been hacked, and have an

algorithm try to predict each one of these two discrete values. And because there's a

small number of discrete values, I would therefore treat it as a classification

problem. So, that's it for supervised learning and in the next video I'll talk

about unsupervised learning, which is the other major category of learning algorithms.