Welcome to week six. In this week,

we'll talk about non-parametric methods

and especially interesting would be the Gaussian process.

Let's start with seeing what are non-parametric methods.

We all know what parametric methods are.

We define a model that depends on

some parameters theta, and then we find optimal values for theta by

maximizing the likelihood or maybe taking the maximum a posteriori estimation.

We can fit a linear model for our data.

We'll have parameters A and B, in this case.

If the data becomes more complex,

we can add more parameters.

For example, we can fit a parabola.

In this case, we'll have three parameters,

but as data becomes more and more complex,

we need to add more and more parameters.

For example, in this case, I had to add

eight parameters and fit a polynomial of the eighth degree.
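As a quick sketch of this idea (not code from the lecture; the data here is synthetic and purely illustrative), fitting polynomials of increasing degree with NumPy shows how the parameter count grows with model complexity:

```python
import numpy as np

# Synthetic 1-D data (hypothetical, just for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

# A line has 2 parameters (a, b), a parabola 3, a degree-8 polynomial 9.
for degree in (1, 2, 8):
    coeffs = np.polyfit(x, y, degree)   # least-squares fit
    print(degree, len(coeffs))          # number of parameters = degree + 1
```

The point is only that the number of parameters is chosen up front, before seeing more data.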

The other case is non-parametric methods.

In non-parametric methods, the number of parameters depends on the data set size.

That is, as the number of data points increases,

the decision boundary becomes more and more complex.

In parametric methods, though,

the number of parameters is fixed, no matter how much data we have.

One non-parametric method that you should know is K-nearest neighbors.

In this case, we make a prediction by finding the K,

in this case five, nearest neighbors of x,

and then the prediction is the average of the target values of those neighboring points.

This would be, for example, 1/5 times the

sum over the targets of the five nearest neighbors of

the point, and the prediction would look like the red curve here.
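A minimal sketch of this K-nearest-neighbors prediction (the function name and data are my own, not from the lecture; 1-D inputs are assumed for simplicity):

```python
import numpy as np

def knn_predict(x_new, X, y, k=5):
    """Average the targets of the k nearest training points."""
    dist = np.abs(X - x_new)          # 1-D distances to every point
    nearest = np.argsort(dist)[:k]    # indices of the k closest points
    return y[nearest].mean()          # (1/k) * sum of their targets

# Tiny synthetic data set (hypothetical).
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])
print(knn_predict(2.1, X, y, k=3))  # averages the targets at x = 1, 2, 3
```

Note that the "parameters" here are the training points themselves, which is exactly why the method is non-parametric.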

It isn't smooth. To make it smoother,

we could use kernels and Nadaraya-Watson regression.

In this case, we weight the points by their distance.

That is, the points that are close to

our point have the highest weights, and the points that are far away have lower weights.

This can be written as follows.

That is, y of x, the prediction at the point x,

equals the sum over all points in our data set of

w_i of x, the weight of the i-th point, times y_i,

the target of the i-th point.

The weight can be computed as the kernel function of x,

the point where we want to predict, and x_i,

where x_i is the position of the i-th point.

We should also ensure that the weights sum up to 1, and so we have to

divide the kernel by the sum of the kernels over all points.
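The formula above can be sketched as follows (a minimal illustration with synthetic data; the Gaussian kernel used here is introduced just after this in the lecture):

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    # K(x1, x2) = exp(-(x1 - x2)^2 / (2 * sigma^2))
    return np.exp(-(x1 - x2) ** 2 / (2 * sigma ** 2))

def nadaraya_watson(x_new, X, y, sigma=1.0):
    """Prediction = sum_i w_i(x_new) * y_i, with weights summing to 1."""
    k = gaussian_kernel(x_new, X, sigma)  # kernel against every training point
    w = k / k.sum()                       # normalize so the weights sum to 1
    return np.sum(w * y)

# Tiny synthetic data set (hypothetical).
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 4.0, 9.0])
print(nadaraya_watson(1.5, X, y, sigma=0.5))
```

Unlike the K-nearest-neighbors average, every training point contributes here, just with a smoothly decaying weight, which is what makes the red curve smooth.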

We can use different choices for the kernel.

The most popular one is the so-called Gaussian kernel.

With the Gaussian kernel, we measure the similarity,

the kernel of x1 and x2,

as the exponent of minus one over two sigma squared,

where sigma is a parameter of the kernel,

times the squared distance between the points, and so the plots would look like this.

If we take a higher value of sigma,

the values would drop slower.

You would weight further points a bit higher

and if sigma is low then the kernel would quickly drop to zero.
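The effect of sigma can be checked numerically (a small sketch with made-up numbers, not from the lecture):

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma):
    # K(x1, x2) = exp(-(x1 - x2)^2 / (2 * sigma^2))
    return np.exp(-(x1 - x2) ** 2 / (2 * sigma ** 2))

d = 2.0  # a fixed distance between two points
# Larger sigma -> the kernel decays slower -> far points keep more weight.
print(gaussian_kernel(0.0, d, sigma=2.0))  # decays slowly
print(gaussian_kernel(0.0, d, sigma=0.5))  # drops to nearly zero
```

So sigma acts as a length scale: it controls how far a training point's influence reaches.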

We could also use a uniform kernel.

For the uniform kernel,

we equally weight the points that have

distance at most h. So, let's sum up what we've seen.

There are parametric methods and non-parametric methods.

In parametric methods, you have some fixed number of

parameters and so the complexity is limited.

However, in non-parametric methods,

the decision boundary becomes more and more complex as number of data points increases.

We can say that, roughly speaking,

the model can become arbitrarily complex as the number of data points goes to infinity.

For parametric methods, the inference is quite fast.

So, for example, for linear regression, if you

fit the weights, then the prediction is

just the scalar product between the new point and the weight vector.

For non-parametric method, though,

you'll have to process all the data points to make a prediction.

For example, in the Nadaraya-Watson algorithm you had to compute

the weights for all the data points.

In parametric methods, the training is sometimes slow.

For example, training a neural network may take hours or

even days, while for non-parametric methods,

the training is usually just remembering all the data.

This is actually not entirely true, since for k-nearest neighbors,

for example, we can pre-compute some information to make prediction a bit faster.

However, in most cases this is just how it is:

you have to simply remember all the data.

So, we've seen what non-parametric methods are,

and the one method that we're particularly interested in is the Gaussian process.

These are regression models that

are able to estimate the uncertainty of their predictions,

which is a very desirable property for applications like medicine.

We'll see in the next video what Gaussian processes are.