
Now, let's talk about how to train an SVM regression model,

that is, how to find its tunable parameters.

We ended up formulating this problem as optimization of

the convex objective function that we saw in the last video,

where the optimization has to be performed under inequality constraints.

There exists a very useful mathematical technique called Karush-Kuhn-Tucker, or KKT,

which deals with optimization problems of exactly this sort.

So, let's talk about the KKT optimization before we go

back to our main task of optimizing the SVM objective function.

In this problem, we want to minimize some function, f of x,

where x belongs to an n-dimensional space of real numbers.

The optimization is constrained by the requirement that another function,

g of x, is non-negative at the solution point.

The KKT approach is to solve this problem

using a modified method of Lagrange multipliers.

What we do in this method is form

the so-called Lagrangian function, given by f of x minus lambda times g of x,

where lambda is a Lagrange multiplier.

This function should be minimized under

the following conditions that are called the KKT conditions.

First, the function g of x

should be non-negative at the optimal point.

Second, the Lagrange multiplier lambda should be non-negative as well.

And third, the product of lambda and g of x at an optimal point should be zero.

Note that the last condition implies that

only one of two scenarios can be realized at the solution.

The first case happens when lambda equals zero,

while g of x is strictly positive at the solution.

In this case, we say that the constraint is non-active.

It means that an optimal point in this case is the same

as for the corresponding non-constrained optimization problem.

The second case happens when the opposite occurs, namely,

g of x equals zero at the optimal point,

while lambda is larger than zero.

In this case, the constraint is active.

It forces the solution to lie on the boundary

between areas of the input space where the function

g of x changes sign from positive to negative.
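These two scenarios are easy to check numerically. Below is a minimal sketch using SciPy, on a toy problem of my own (not from the lecture): we minimize f of x equals (x - 2) squared under the constraint g of x equals 1 - x non-negative, for which the constraint turns out to be active.

```python
# Toy KKT illustration (hypothetical example): minimize f(x) = (x - 2)^2
# subject to g(x) = 1 - x >= 0. The unconstrained minimum x = 2 violates
# the constraint, so the constraint is active: the solution sits on the
# boundary g(x) = 0, i.e. x = 1, with a positive multiplier lambda.
from scipy.optimize import minimize

f = lambda x: (x[0] - 2.0) ** 2
g = lambda x: 1.0 - x[0]

res = minimize(f, x0=[0.0], constraints=[{"type": "ineq", "fun": g}])
x_opt = res.x[0]

print(round(x_opt, 4))   # ~1.0: the solution lies on the boundary g(x) = 0
# At this point df/dx = 2(x - 2) = -2 and dg/dx = -1, so lambda = 2 > 0,
# and complementary slackness lambda * g(x) = 0 holds.
```

If we instead used g of x equals 3 - x, the unconstrained minimum x equals 2 would already satisfy the constraint, and the solver would return it unchanged: the non-active case, with lambda equal to zero.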

You can find more details on the KKT method,

for example, in the book by Bishop.

But here, let's just move on and see how

this technique applies in our case for training of SVM.

Following the KKT method,

we form a Lagrangian function shown here.

Let's see what the different terms here mean.

The first two terms are just the two terms of our initial objective function.

The next term ensures positivity of the slack variables,

xi_i and xi_i-star, which is enforced via the Lagrange multipliers,

eta_i and eta_i-star.

The last two terms enforce our band constraints using

another pair of Lagrange multipliers, alpha_i and alpha_i-star.

This objective function should be minimized under

constraints of positivity of Lagrange multipliers alpha,

alpha-star, eta, and eta-star.

It turns out that optimization over w,

b, and eta can be performed analytically in this formula.

For details, I refer you to the freely available review paper by Smola and Scholkopf on support vector regression.

Here, I will only outline the steps which

lead to the final formulation of the training objective for an SVM.

First, we perform analytical minimization with

respect to w and eta in the above Lagrangian,

sometimes called the primal Lagrangian.

Then, we plug solutions found for w and eta back into the primal Lagrangian,

and what remains is only a function of

the remaining Lagrange multipliers, alpha_i and alpha_i-star.

And this is called the dual Lagrangian.

Minimization of this dual Lagrangian is the only numerical part of the SVM training.

The dual Lagrangian for our problem is shown here.

And as you can see,

this is a convex optimization problem with linear constraints.

It can be solved using standard convex optimization software.
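As a sketch of what such a solver does, here is the standard SVR dual (in the form given by Smola and Scholkopf) minimized directly with SciPy's SLSQP on a tiny one-dimensional dataset; the data, C, and epsilon values are hypothetical, chosen just for illustration.

```python
# Sketch: solve the standard SVR dual for a tiny 1-D linear dataset.
# Variables are stacked as [alpha_1..alpha_n, alpha*_1..alpha*_n].
import numpy as np
from scipy.optimize import minimize

X = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
y = 2.0 * X.ravel() + 0.5          # noiseless line y = 2x + 0.5
K = X @ X.T                        # Gram matrix of scalar products x_i . x_j
n, C, eps = len(y), 10.0, 0.05     # hypothetical C and epsilon

def neg_dual(a):
    # Negative of the dual Lagrangian (we minimize instead of maximize)
    d = a[:n] - a[n:]              # alpha_i - alpha*_i
    return 0.5 * d @ K @ d + eps * a.sum() - y @ d

res = minimize(neg_dual, np.zeros(2 * n),
               bounds=[(0.0, C)] * (2 * n),
               constraints=[{"type": "eq",
                             "fun": lambda a: np.sum(a[:n] - a[n:])}],
               method="SLSQP")
d = res.x[:n] - res.x[n:]
w = (d[:, None] * X).sum(axis=0)   # w = sum_i (alpha_i - alpha*_i) x_i
print(w)                           # should come out close to the true slope of 2
```

In practice, dedicated quadratic-programming solvers are used instead of a general-purpose routine like this, but the structure of the problem is the same.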

A very important point is that because the resulting problem is convex,

the solution to this problem is unique.

This means that the SVM has a unique solution

providing the minimum of its constrained loss function.

Therefore, SVMs are free of problems with local minima that

sometimes create all sorts of headaches for other machine learning methods.

For example, loss functions for neural networks are typically nonconvex,

which may create instabilities under small variations of the input data.

On the contrary, in SVM,

a solution is always unique and changes

continuously under continuous changes of the inputs.

I will skip the math, which you can find in the review by Smola and Scholkopf,

and give you the final answer: up to the value of the intercept b,

it is given by this equation for the function f of x.

The intercept b in this equation can be found using

the KKT conditions, as explained in the same review paper.

But I want to draw your attention to three important points with this equation.

First, by the KKT conditions,

coefficients alpha_i and alpha_i-star vanish inside the epsilon tube.

This means that points that are inside of the epsilon tube do not contribute to the sum.

In other words, the solution is a sparse sum over only a subset of inputs,

namely, over only points that are outside of the epsilon tube.

These points are called support vectors in the SVM.

And the expansion itself is called the support vector expansion.

In our previous graph,

all points inside of the gray band do not contribute to the resulting linear model,

while all points outside of it are support vectors.
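To see this sparsity in practice, here is a small sketch using scikit-learn's SVR; the data and the values of C and epsilon are hypothetical, chosen just for illustration.

```python
# Sketch: fit a linear SVR on noisy data and check that only points on or
# outside the epsilon tube become support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = np.linspace(0.0, 1.0, 40).reshape(-1, 1)
y = 2.0 * X.ravel() + 0.5 + rng.normal(0.0, 0.1, size=40)

model = SVR(kernel="linear", C=10.0, epsilon=0.15).fit(X, y)
resid = np.abs(y - model.predict(X))

# Points NOT listed in model.support_ lie inside the epsilon tube and carry
# zero coefficients alpha_i - alpha*_i: they do not enter the expansion.
inside = np.setdiff1d(np.arange(len(y)), model.support_)
print(len(model.support_), "support vectors out of", len(y))
print(bool(np.all(resid[inside] <= model.epsilon + 1e-2)))
```

The second line should print True (up to solver tolerance): every point that is not a support vector sits inside the tube and contributes nothing to the model.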

As we said before, this happens because these points get zero penalty

in the loss function, due to our use of the epsilon-insensitive loss function.

Reliance on such a loss function makes the SVM stable to noise in the data, and therefore,

it often performs better out of sample, where

noise tolerance is important for the ability to generalize well,

as we discussed in our first course.

Second, the dimensionality of the function representation for f of x depends

only on the number of support vectors but not on the dimensionality of the input space.

And third, note that the final solution for f of x

does not depend on x itself.

Instead, it depends on scalar products of two vectors,

x and one of the training data inputs, x_i.
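To make this concrete, here is a tiny sketch of how the final prediction is evaluated; the support vectors, dual coefficients, and intercept below are all hypothetical numbers, standing in for what training would produce.

```python
# Hypothetical support vectors x_i, dual coefficients d_i = alpha_i - alpha*_i,
# and intercept b, as they would come out of SVM training.
import numpy as np

x_sv = np.array([[0.1], [0.4], [0.9]])
d = np.array([-1.0, 0.3, 0.7])
b = 0.5

def f(x):
    # f(x) = sum_i d_i * (x_i . x) + b: the new input x enters the formula
    # only through scalar products with the support vectors.
    return sum(di * float(xi @ x) for di, xi in zip(d, x_sv)) + b

print(f(np.array([0.5])))
```

Note that f never touches the coordinates of x directly, only the scalar products x_i dot x.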

And this last observation will bring us to the topic of

the next video, where we will go after our traditional test question.