Okay. In the last video,

we spoke, among other things, about the classification

of supervised learning algorithms into probabilistic and non-probabilistic models.

For probabilistic models, we

distinguished between generative and discriminative probabilistic models.

We also said that

some non-probabilistic models can be equivalently formulated as probabilistic ones.

For example, linear regression is equivalent to a discriminative Gaussian model.

Now, let us talk about probabilistic classification models in finance.

Classification models are obtained from

the probabilistic framework if the output variable Y is not a continuous variable,

but rather a discrete label.

This label takes values C_k, with k running from 1 to K,

where K is the number of different classes.

Note that the sum of probabilities over all classes should be one,

obviously, so we get this relation here.
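
The relation mentioned here is on a slide that the transcript does not reproduce. It is presumably the normalization condition below, where x denotes the input features (this notation is an assumption, not taken from the slide):

```latex
\sum_{k=1}^{K} P(y = C_k \mid x) = 1
```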

Let us now consider a special case of classification, when K equals two.

So, in this case, we have only two states.

We can call them plus and minus or zero and one,

or risky and non-risky and so on.

It does not matter as long as we have only two classes.

So, let's call them for simplicity or convenience,

zero and one, and call examples of class one positive examples.

Then, we have only one probability to model,

which is the probability of the positive class, Y

equal to one. The probability of the other class will simply be

given by one minus the first probability.

So, binary classification predicts

positive and negative outcomes for any given data set.

There are tons of problems in finance that fit very neatly into this framework.

For example, such diverse business cases as modeling

mortgage defaults, credit card fraud, corporate defaults,

bank failures, stock price direction,

anti-money laundering, can all be viewed as use cases for binary classification.

Now, I want to talk in more detail about

one particular method for binary classification called logistic regression.

To see how it works,

recall ordinary linear regression.

Linear regression fits a linear model Y equals theta transpose times X,

where X is an N-dimensional vector of features and Y is [inaudible] output.

This is not a good framework for a probabilistic approach

obviously because probabilities should be between zero and one.

But what if we squeeze Y by

some nonlinear squashing function H that maps a real-valued number into the unit interval?

Then, in this case,

the result can be interpreted as a probability.

One example of such a squashing function is given by the Sigmoid function.

The Sigmoid function is shown in this expression

and also shown by this blue line on this graph.

As you can see, this function quickly approaches zero for

negative values of its argument and quickly reaches one for positive arguments.

It only changes strongly in the vicinity of zero and takes the value of

one half when its argument is exactly equal to zero. Now, if we apply the Sigmoid function,

also known as a logistic function,

to the output of linear regression and use it as a model for probability of Y equal one,

we obtain logistic regression.
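
The construction just described can be sketched in a few lines of Python. This is only an illustration: the function names `sigmoid` and `predict_proba`, and the numbers in `theta` and `x`, are hypothetical and do not come from the lecture.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into the unit interval."""
    # Two branches so that math.exp is only called on non-positive arguments,
    # which avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(theta, x):
    """Model for P(Y = 1 | x): the sigmoid of the linear score theta^T x."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(score)

theta = [0.5, -1.0]   # hypothetical weights
x = [2.0, 1.0]        # hypothetical features, chosen so that theta^T x = 0

print(sigmoid(0.0))             # 0.5
print(predict_proba(theta, x))  # 0.5, because the linear score is zero
```

The probability of the other class is then simply `1 - predict_proba(theta, x)`.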

Logistic regression is one of

the simplest probabilistic models for binary classification.

While here we produced it out of thin air by some hand-waving arguments,

it can also be derived from some more fundamental models,

as we will discuss later in this specialization.

Now let's talk about estimation of a logistic regression model.

First, let's compute the likelihood for logistic regression.

If all observed pairs X and Y are independent,

then the total probability of observing all the data is given by

the product of individual probabilities for individual events.

Each term in this product is a Bernoulli distribution.

Note that if Y equals one,

this whole expression equals p_n of theta.

But if Y equals zero,

it is equal to one minus p_n of theta.
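
The expression itself is on a slide that the transcript does not show. Under the assumptions stated in the lecture (independent pairs, and the sigmoid applied to the linear score), it presumably has the following form. The symbols N for the number of observations, y_n for the n-th label, x_n for the n-th feature vector, and sigma for the sigmoid are notation assumed here:

```latex
L(\theta) = \prod_{n=1}^{N} p_n(\theta)^{y_n} \bigl(1 - p_n(\theta)\bigr)^{1 - y_n},
\qquad p_n(\theta) = \sigma\bigl(\theta^{T} x_n\bigr)
```

Setting y_n = 1 leaves only the factor p_n(theta), and setting y_n = 0 leaves only 1 - p_n(theta), which matches the two cases described above.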

Now, by taking the negative log of this expression,

we arrive at the loss function for binary classification.

It is given by this expression, also known as the cross-entropy loss.
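
As a rough sketch, the negative log-likelihood of the Bernoulli model (commonly called the binary cross-entropy) can be computed as below. The function name, the clipping constant `eps`, and the example numbers are illustrative choices, not part of the lecture.

```python
import math

def negative_log_likelihood(probs, labels, eps=1e-12):
    """Sum over observations of -[y*log(p) + (1 - y)*log(1 - p)].

    probs  : predicted probabilities of the positive class, one per observation
    labels : observed labels, each 0 or 1
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # Clip p away from 0 and 1 so that the logarithms stay finite.
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

labels = [1, 0, 1]                 # hypothetical observed classes
confident = [0.9, 0.1, 0.8]        # predictions that agree with the labels
uninformative = [0.5, 0.5, 0.5]    # predictions that carry no information

print(negative_log_likelihood(confident, labels))
print(negative_log_likelihood(uninformative, labels))  # equals 3 * log(2)
```

Predictions that agree with the observed labels give a smaller loss than uninformative ones, which is what the minimization exploits.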

You can explicitly check that this function is a convex function of parameter theta.

Therefore, it has no local minima other than the global minimum.

This minimum can be found using

the gradient-based algorithms that we discussed earlier for regression,

but this time applied to a different objective function.
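
A minimal sketch of such a gradient-based fit is given below, using plain batch gradient descent on the mean negative log-likelihood. The learning rate, the number of steps, and the toy data are all assumptions made for illustration. The lecture does not specify which gradient-based algorithm or settings to use.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, lr=0.1, n_steps=2000):
    """Batch gradient descent on the mean negative log-likelihood.

    The gradient with respect to theta is the average of (p_n - y_n) * x_n.
    """
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    for _ in range(n_steps):
        grad = [0.0] * d
        for xn, yn in zip(X, y):
            pn = sigmoid(sum(t * xi for t, xi in zip(theta, xn)))
            for j in range(d):
                grad[j] += (pn - yn) * xn[j] / n
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical toy data: the first feature is a constant 1 (intercept),
# the second is a made-up risk score. The classes overlap on purpose,
# so that the loss has a finite minimizer.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1, 1]

theta = fit_logistic(X, y)
p_high = sigmoid(theta[0] + theta[1] * 2.0)    # fitted P(Y = 1) at score  2
p_low = sigmoid(theta[0] + theta[1] * -2.0)    # fitted P(Y = 1) at score -2
print(theta, p_high, p_low)
```

On this toy data the fitted slope comes out positive, so a higher score is assigned a higher probability of the positive class.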

So, in this video,

we met probabilistic classification models and learned how they can be implemented using

neural network architectures such as logistic regression and feedforward neural networks.

As we saw, the changes that are needed for

neural networks to make them work for probabilistic models are minimal.

All we need to do is to replace the last output neuron by a Sigmoid neuron.
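
To illustrate this point, here is a small sketch of a feedforward network whose final neuron is a sigmoid, so that its output can be read as a probability. The single hidden layer, the tanh activation, and all the weights are assumptions chosen for the example, not something specified in the lecture.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """One hidden tanh layer followed by a single sigmoid output neuron."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_hidden, b_hidden)
    ]
    # A linear output neuron would return `score` itself;
    # the sigmoid turns it into a number strictly between 0 and 1.
    score = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return sigmoid(score)

# Hypothetical weights for a network with 2 inputs and 3 hidden units.
W_hidden = [[0.4, -0.2], [0.1, 0.3], [-0.5, 0.2]]
b_hidden = [0.0, 0.1, -0.1]
w_out = [0.7, -0.3, 0.5]
b_out = 0.05

p = forward([1.5, -0.5], W_hidden, b_hidden, w_out, b_out)
print(p)  # a value strictly between 0 and 1
```

With no hidden layer at all, the same construction reduces to logistic regression.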

What is important to understand here is that while

a classification model can be probabilistic, the algorithm used to estimate the model,

that is, a neural network in our case, is deterministic.

This is because the neural network directly models probabilities of events

as parametric or non-parametric functions of model inputs.

In the next course in this specialization,

we will talk more about classification algorithms and their performance metrics.

But for this course,

I want to show you in the next video

a very interesting and practically important financial problem

where these methods come in very handy. See you then.