So going back to our data, we can try to figure out what

that best cutoff is, and here's an example of a cutoff that you could

choose: if the frequency is above 0.5, then we

say that the message is SPAM, and if it's below 0.5, we say that it's HAM.

And so we think this might work because you can see that

the large spike of blue HAM messages is below that cutoff,

whereas one of the big spikes of the SPAM messages is above that cutoff.

So you might imagine that will catch quite a bit of that SPAM.

So then what we do is we evaluate that.

So what we would do is calculate, for

example, predictions for each of the different emails.

We use a prediction rule that says: if the frequency of "your" is

above 0.5, then you're spam, and if it's below, then you're nonspam.

And then we make a table of those predictions and divide

it by the total number of observations that we have.

And so we can say that, when you're nonspam, about

46% of the time we get you right.

When you're spam, about 29% of the time we get you right.

So in total, we get you right about 46% plus 29%, which is about 75% of the time.

So our prediction algorithm is about 75% accurate in this particular case.

So that's how we would evaluate the algorithm.
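The evaluation described above can be sketched in a few lines of code. This is a Python sketch with a small made-up dataset standing in for the spam data used in the lecture (the frequencies and labels here are hypothetical, just to illustrate the mechanics of the cutoff rule and the accuracy table):

```python
import numpy as np

# Hypothetical toy data: frequency of the word "your" in each message,
# plus the true label for each message.
your_freq = np.array([0.1, 0.0, 0.8, 1.2, 0.3, 0.9, 0.05, 2.0])
truth = np.array(["nonspam", "nonspam", "spam", "spam",
                  "nonspam", "spam", "nonspam", "nonspam"])

# Apply the cutoff: predict spam when the frequency is above 0.5.
prediction = np.where(your_freq > 0.5, "spam", "nonspam")

# Tabulate prediction vs. truth, dividing by the number of observations
# so each cell is a fraction of all messages.
labels = ["nonspam", "spam"]
table = {(p, t): np.mean((prediction == p) & (truth == t))
         for p in labels for t in labels}

# Accuracy is the sum of the cells where prediction matches truth.
accuracy = table[("nonspam", "nonspam")] + table[("spam", "spam")]
print(accuracy)
```

With real data, the two diagonal cells of the table play the role of the 46% and 29% quoted in the lecture, and their sum is the overall accuracy.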

This is, of course, on the same dataset where we actually built

the prediction function, so as we will see in later lectures,

this will be an optimistic estimate of the overall error rate.

So that's an overview of the basic steps in building a predictive algorithm.