The first type of supervised learning algorithm for regression and classification

that is not a neural network is called Support Vector Machines, or SVM for short.

SVM can be used for both classification and regression,

and their algorithms have many features in common.

We will be presenting SVM for regression,

which is also sometimes called Support Vector Regression or SVR.

But we will refer to it as SVM for regression, and talk about

both versions of SVM, for classification and regression, as the SVM.

The reason I want to talk mostly about SVM regression,

instead of SVM classification is two-fold.

First, in my opinion,

regression problems are more common in finance than classification problems.

The other reason is that

SVM classification is a non-probabilistic method that produces class labels,

but not class probabilities.

But as we said several times before, when it comes to

modelling, finance is mostly about noise and probabilities.

Therefore, if your problem is a classification problem,

I would recommend trying probabilistic methods such as logistic regression first.

But if your problem is a regression problem,

then a non-probabilistic method for regression is fine, because it's just equivalent to

a probabilistic model with Gaussian noise for

a quadratic loss function, as we discussed in our guided tour.

Therefore, we will talk about SVM regression in this lesson.

So what are SVMs in general,

and SVM for regression in particular?

SVMs were originally developed by Vladimir Vapnik of AT&T in

1992 as a classification method within the framework of statistical learning theory,

which is a subfield of machine learning that

deals with algorithms that come with performance guarantees.

In the mid-1990s, SVMs were extended to regression by Vapnik and his

collaborators. SVMs are based on simple and beautiful geometric ideas, such as

maximum margin hyperplane classifiers in the case of classification.

For both classification and regression,

SVM amounts to convex optimization.

Therefore, it leads to a unique solution.

The solution is also deterministic.

So multiple calculations with the same data will always produce the same result.

For regression, SVM can handle both linear and non-linear regression.

Here, there is a substantial difference between how

non-linearity is tackled by neural networks and an SVM.

Within neural networks, non-linearity is implicit via

the choice of architecture, activation functions, and parameters.

But SVM controls non-linearity via the choice of a kernel function.

I will explain what this means a bit later, after I explain the main ideas of the SVM.

It turns out that sometimes SVM may work better than

neural networks for small to medium

sized data, giving empirical support to the no free lunch theorem.

SVMs are widely used for such tasks as text or speech classification,

object identification, bioinformatics tasks, and so on.

They have also been used in finance for problems

such as stock price prediction or bankruptcy prediction.

So let's see how the SVM works for regression.

Let's assume for the start that we work with the usual task of linear regression.

We have some inputs x that belong to an input pattern space X of dimension p.

We want to find a linear function f(x) equal to

the scalar product of a parameter vector w

and x, plus an intercept b.

Here, I use bracket notation to denote the scalar product of the parameter vector w

and the input vector x.
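As a minimal sketch, the prediction function f(x) equal to the scalar product of w and x, plus b, looks like this in NumPy (the values of w, b, and x below are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical parameter vector w (dimension p = 3), intercept b, and input x
w = np.array([0.5, -1.2, 2.0])
b = 0.7
x = np.array([1.0, 0.0, 2.0])

# Bracket notation <w, x> denotes the scalar (dot) product of w and x
f_x = np.dot(w, x) + b
print(f_x)  # 0.5*1.0 + (-1.2)*0.0 + 2.0*2.0 + 0.7 = 5.2
```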

So far, this sounds the same as the linear regression that we discussed in the first course.

But the difference comes in how we penalize errors.

In regular linear regression,

we used the squared loss function to penalize all deviations from the straight line.

In the SVM, we proceed differently.

We want a function that deviates as little as possible

from the data, while not penalizing small deviations from the straight line.

Let me illustrate it on the graph that shows how the SVM works in one dimension.

Assume that we define the noise tolerance for

SVM regression by specifying the value of parameter epsilon.

Now, I assume that we have data points as shown on this graph on the left.

Assume that the right model is given by the central line in

the gray-painted band in the plane, of width two times epsilon.

This means that the boundaries of this band, called the epsilon

tube, lie at the distance epsilon from the true solution.

Now, we see that some data points fall within the epsilon tube while some fall outside.

In the SVM approach,

we do not penalize data points that are within the epsilon tube.

So they can be anywhere within the epsilon tube, and

the value of the error function will still be the same.

Clearly, this helps the algorithm to be more noise tolerant.

The loss function corresponding to such rules is shown on the figure on the right.

This function is called the epsilon-insensitive loss function.

It's zero on the interval from minus epsilon

to epsilon, and grows linearly outside of this interval.

For comparison, a regular squared loss function would

be represented by a parabola touching the origin on this graph.
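A short sketch of the epsilon-insensitive loss, side by side with the squared loss (the residual values here are hypothetical, just to show which deviations get penalized):

```python
import numpy as np

def eps_insensitive(residual, epsilon):
    """Epsilon-insensitive loss: zero for |residual| <= epsilon, linear beyond."""
    return np.maximum(0.0, np.abs(residual) - epsilon)

r = np.array([-1.0, -0.3, 0.0, 0.4, 1.5])   # hypothetical residuals y - f(x)
print(eps_insensitive(r, epsilon=0.5))      # points inside the tube cost nothing
print(r ** 2)                               # squared loss penalizes every deviation
```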

Now after we have understood the loss function for the SVM,

let's come back to the problem of finding optimal parameters for the SVM regression.

So we have two criteria.

We want our function to be as flat as possible, and also,

to admit deviations of at most epsilon from the optimal straight line.

We can formulate this as the minimization of

the norm of the vector w, subject to

the absolute value of the difference

between the prediction and the data being at most epsilon.

The latter condition can be formulated as two conditions.

The first one is given by the inequality: y sub i minus the

dot product of w and vector x sub i, minus b, should be less than or equal

to epsilon, and the second inequality

is the same, with the left-hand side having a flipped sign.

So we end up with the problem of minimization of

the norm of W subject to these two constraints.

But these constraints may not always be feasible,

as for any given epsilon,

we can still have points outside of the epsilon tube.

This is handled by introducing a pair of

non-negative slack variables, xi sub i and xi sub i star, for each data point, and

modifying the constraints to read instead: y sub i minus the dot product of

w and x sub i, minus b, this time is less than or equal to epsilon plus xi sub i.

And a similar adjustment is made in the second inequality.

Now, the new objective function is given by the squared norm of the vector w,

plus the sum of xi sub i and xi sub i star over

all data points, multiplied by some parameter C, as shown in this equation.
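In standard notation (assuming the usual primal formulation of SVR from the literature), the objective and constraints just described can be written as:

```latex
\min_{w,\,b,\,\xi,\,\xi^{*}} \;\; \frac{1}{2}\,\lVert w \rVert^{2}
  \;+\; C \sum_{i=1}^{n} \left( \xi_i + \xi_i^{*} \right)
```

subject to

```latex
y_i - \langle w, x_i \rangle - b \;\le\; \epsilon + \xi_i, \qquad
\langle w, x_i \rangle + b - y_i \;\le\; \epsilon + \xi_i^{*}, \qquad
\xi_i,\, \xi_i^{*} \;\ge\; 0 .
```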

The parameter C determines the trade-off between our requirement of

flatness of the function and our tolerance for deviations that are larger than epsilon.

If C is small, many points will be outside of

the epsilon tube, but the fitted line will be nearly flat. If C is large,

the line will follow the data more closely, but most of the points will be within the epsilon tube.

This function should be minimized subject to our constraints,

and subject to xi sub i and xi sub i star being non-negative.
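The effect of C can be sketched with scikit-learn's SVR, which minimizes this same objective; the data below are synthetic, generated just to illustrate the trade-off:

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic noisy linear data: y approximately 2x plus small Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = 2.0 * X.ravel() + 0.1 * rng.standard_normal(50)

for C in (0.01, 10.0):
    model = SVR(kernel="linear", C=C, epsilon=0.1)
    model.fit(X, y)
    # Small C keeps the norm of w small (a flatter line, more tube violations);
    # large C penalizes violations heavily, so the line follows the data
    print(C, model.coef_.ravel()[0])
```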

We will talk about the math of such optimization in our next video.

But first, let's have some Q and A.