0:00
A lot of the action in machine learning has focused on which
algorithms are best for
extracting information and using it to predict.
But it's important to step back and look at the entire prediction problem.
This is a little diagram that I made to illustrate
some of the key issues in building a predictor.
So you start off with, suppose I want to
predict for these dots whether they're red or blue.
Well, what you might do is have a big group of dots that you
want to predict about, and then you use
probability and sampling to pick a training set.
The training set will consist of some red dots and some blue
dots, and you'll measure a whole bunch of characteristics of those dots.
Then you'll use those characteristics to build what's called a
prediction function, and the prediction function will take a new dot,
whose color you don't know, but using those characteristics that
you measured will predict whether it's red or whether it's blue.
Then you can go off and try to
evaluate whether that prediction function works well or not.
1:03
A required component of building every machine learning algorithm
is deciding which samples you're going to use to build that algorithm.
But it's sometimes overlooked, because all of the action that you hear about in
machine learning happens down here, when you're
building the actual machine learning function itself.
1:19
One very high-profile example of the ways that this
can cause problems is the recent discussion about Google Flu Trends.
Google Flu Trends tried to use the terms that people were typing into
Google, terms like "I have a cough", to predict how often people would get the flu.
In other words, what was the rate of flu that was going
on in a particular part of the United States at a particular time?
1:41
And they compared their algorithm to the approach taken
by the United States government, where they went out
and actually measured how many people were
getting the flu in different places in the US.
And they found in their original paper that the
Google Flu Trends algorithm was able to very accurately represent
the number of flu cases that would appear in
various places in the US at any given time.
But it was quite a bit faster and quite a
bit less expensive to measure using search terms at Google.
The problem that they didn't realize at the time was that
the search terms that people would use would change over time.
They might use different terms when they were
searching, and that would affect the algorithm's performance.
And also, the way that those terms were actually
being used in the algorithm wasn't very well understood.
And so when the function of a particular search
term in the algorithm changed, it could cause problems.
And this led to highly inaccurate results for the Google
Flu Trends algorithm over time, as people's internet usage changed.
So this gives you an idea that choosing
the right dataset and knowing what the specific
question is are again paramount, just as they have
been in the other classes of the Data Science Specialization.
So here are the components of a predictor.
You need to start off, as always in any data science problem,
with a very specific and well-defined question.
What are you trying to predict, and what are you trying to predict it with?
2:56
Then you go out and you collect the best
input data that you can to be able to predict.
And from that data, you might either use measured
characteristics that you already have, or you might use computations
to build features that you think might be
useful for predicting the outcome that you care about.
At this stage you can actually start to use the machine learning
algorithms you may have read about, such as random forests or decision trees.
And then what you can do is estimate the
parameters of those algorithms, use those parameters to
apply the algorithm to a new dataset, and
then finally evaluate that algorithm on the new data.
So I'm going to show you one quick little
example of how this process works.
This is obviously a trivialized version of what would happen in a
real machine learning algorithm, but it gives you a flavor of what's going on.
So you start off by asking a question.
In general, people usually start with a quite general question.
So here it is: can I automatically detect emails
that are SPAM from those that are not?
SPAM emails are emails that come from
companies, get sent out to thousands of
people at the same time, and that you
might not be interested in.
4:02
So you might want to make your question a little bit more concrete;
you often need to when doing machine learning.
So the question might be: can I use
quantitative characteristics of those emails to classify them as
SPAM, or as what we're going to call HAM, which
is email that people would actually like to receive?
4:19
So once you have your question, then you need to find input data.
In this case, there's actually a bunch of data
that's available and already pre-processed for us in R.
It's in the kernlab package,
K-E-R-N-L-A-B, and it's the spam dataset.
So we can load that dataset into R directly, and it has some information
that's been collected about SPAM and HAM emails already available to us.
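Just as a sketch of what that looks like in R, assuming you have the kernlab package installed:

    library(kernlab)   # install.packages("kernlab") first if needed
    data(spam)         # loads a data frame called spam
    dim(spam)          # 4601 emails, 57 quantitative features plus the type label
    table(spam$type)   # how many emails are labeled "nonspam" vs. "spam"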
Now we might want to keep in mind that this might
not necessarily be the perfect data; in fact, we don't have all
of the emails that have been collected over time, and we
don't have the emails that are being sent to you personally.
So we need to be aware of the potential limitations of this
data when we're using it to build a prediction algorithm.
4:58
Then we want to calculate some features.
So imagine that you have a bunch of emails,
and here's an example email that's been sent to me:
"Dear Jeff, can you send me the address, so I can send you the invitation.
Thanks, Ben."
If we want to build a prediction algorithm,
we need to calculate some characteristics of
these emails that we can use to build it.
And so one example might be to
calculate the frequency with which a particular word appears.
So here, we're looking for the frequency with which the word "you" appears.
And in this case, it appears twice in this email, so 2 out
of 17 words, or about 11% of the words in this email, are "you".
We could calculate that same percentage for every single email that we have, and
now we have a quantitative characteristic that we can try to use to predict.
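Here's a minimal R sketch of that calculation, using the example email quoted above:

    email <- "Dear Jeff, can you send me the address so I can send you the invitation. Thanks, Ben."
    words <- tolower(unlist(strsplit(email, "[^[:alpha:]]+")))  # split on non-letters
    words <- words[words != ""]                                 # drop any empty strings
    mean(words == "you")                                        # 2 of 17 words, about 0.12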
5:43
So the data in the kernlab package that I've shown here are actually
information just like that: for every email, we
have the frequency with which certain words appear.
And so, for example, if "credit" appears very often in the email, or "money" appears
very often in the email, you might imagine that that email might be a SPAM email.
So, as one example of that, we looked at the frequency
of the word "your" in each email.
And so I've got a plot here that's a density plot of that data.
On the x-axis is the frequency
with which "your" appeared in the email,
and on the y-axis is the density, or how
often that frequency appears among the emails.
And so what you can see is that most of the emails that are SPAM, those are the
ones that are in red, tend to have
more appearances of the word "your".
Whereas the emails that are HAM, the
ones that we actually want to receive, have a much higher peak
right over here down near 0, so there are very few
HAM emails in which "your" appears a large number of times.
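A plot like this one can be generated with something along these lines, since the column spam$your holds the frequency of "your" for each email:

    # density of the frequency of "your": HAM (nonspam) in blue, SPAM in red
    plot(density(spam$your[spam$type == "nonspam"]),
         col = "blue", main = "", xlab = "Frequency of 'your' in email")
    lines(density(spam$your[spam$type == "spam"]), col = "red")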
6:49
So we can build an algorithm; in this case, let's build a very, very simple one.
We can estimate an algorithm where we just want to find a cutoff, a constant C, where
if the frequency of "your" is above C then
we predict SPAM, and otherwise we predict that it's HAM.
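Written out in R, that rule is just a comparison against the cutoff; predictType here is a hypothetical helper name, not something from the course:

    # predict SPAM when the frequency of "your" exceeds the cutoff C
    predictType <- function(freqYour, C) {
      ifelse(freqYour > C, "spam", "nonspam")
    }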
7:05
So going back to our data, we can try to figure out what
that best cutoff is, and here's an example of a cutoff that you could
choose: if the frequency is above 0.5, then we
say that it's SPAM, and if it's below 0.5, we say that it's HAM.
And we think this might work, because you can see that
the large spike of blue HAM messages is below that cutoff,
whereas one of the big spikes of the SPAM messages is above that cutoff.
So you might imagine that will catch quite a bit of that SPAM.
So then what we do is evaluate that.
What we would do is calculate, for
example, predictions for each of the different emails.
We take a prediction that says, if the frequency of "your" is
above 0.5, then you're spam, and if it's below, then you're nonspam.
And then we make a table of those predictions and divide
it by the number of observations that we have.
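Concretely, that evaluation looks something like this in R:

    prediction <- ifelse(spam$your > 0.5, "spam", "nonspam")
    # fraction of all emails in each (prediction, truth) cell;
    # the diagonal holds the correct calls
    table(prediction, spam$type) / length(spam$type)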
And so what we see is that about 45% of
the time, 46% of the time, the email is nonspam and we get it right.
About 29% of the time, it's spam and we get it right.
So in total, we get it right about 45% plus 29%, or about 75% of the time.
So our prediction algorithm is about 75% accurate in this particular case.
So that's how we would evaluate the algorithm.
This is, of course, on the same dataset where we actually built
the prediction function, and as we will see in later lectures,
this will be an optimistic estimate of the overall error rate.
So that's an overview of the basic steps in building a predictive algorithm.