
In this example,

we will see linear regression.

But before we start, we need to define the multivariate and

univariate normal distributions.

The univariate normal distribution has the following probability density function.

It has two parameters, mu and sigma squared.

Mu is the mean of the random variable, and sigma squared is its variance.

Its functional form is given as follows.

It is some normalization constant, which ensures that this probability

density function integrates to 1, times the exponential of a downward parabola.

The maximum of this parabola is at the point mu,

and so the mode of the distribution is also at the point mu.

If we vary the parameter mu, we will get different probability densities.

For example, for the green one, we'll have the mu equal to -4, and for

the red one, we'll have mu equal to 4.

If we vary the parameter sigma squared,

we will get either a sharp distribution or a wide one.

The blue curve has the variance equal to 1, and

the red one has variance equal to 9.
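To make this concrete, here is a minimal numpy sketch of the density just described; the grid and the parameter values (mu = 4, variance = 9, matching the red curve) are chosen only for illustration.

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma2=1.0):
    """Univariate normal density: a normalization constant times exp of a downward parabola."""
    const = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    return const * np.exp(-((x - mu) ** 2) / (2.0 * sigma2))

x = np.linspace(-20.0, 20.0, 4001)        # grid with step 0.01
dens = normal_pdf(x, mu=4.0, sigma2=9.0)  # like the red curve: mu = 4, variance = 9

mode = x[np.argmax(dens)]                 # the mode sits at mu
area = dens.sum() * (x[1] - x[0])         # Riemann sum: should be close to 1
print(mode, area)
```

The two printed values confirm the two claims from the lecture: the density peaks at mu, and it integrates to 1.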

The multivariate case looks very similar.

We have two parameters, mu and sigma.

Mu is the mean vector, and sigma is the covariance matrix.

We, again, have some normalization constant, to ensure that the probability

density function integrates to 1, and some quadratic term under the exponent.

Again, the maximum value of the probability density function is at mu,

and so the mode of distribution will also be equal to mu.

Consider neural networks, for example, where we have a lot of parameters.

Let's denote the number of parameters as D.

The sigma matrix has a lot of parameters, about D squared.

Actually, since sigma is symmetric, we need D (D+1) / 2 parameters.

It may be really costly to store such a matrix, so we can use an approximation.

For example, we can use diagonal matrices.

In this case, all elements that are not on the diagonal will be zero,

and then we will have only D parameters.

An even simpler case has only one parameter;

it is called the spherical normal distribution.

In this case, the sigma matrix equals some scalar times the identity matrix.
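The parameter counts for the three covariance choices can be checked in a few lines of numpy; the dimensionality D = 1000 here is an assumed value for illustration.

```python
import numpy as np

D = 1000  # assumed dimensionality, e.g. the number of weights in a small network

full = D * (D + 1) // 2   # symmetric covariance: D(D+1)/2 free parameters
diag = D                  # diagonal covariance: one variance per dimension
spherical = 1             # spherical covariance: Sigma = s * I, a single scalar
print(full, diag, spherical)  # 500500 1000 1

# The spherical case written out (small D so the matrix fits on screen):
s = 2.5
Sigma = s * np.eye(4)
```

Even at D = 1000, the full covariance already needs half a million numbers, which is why the diagonal and spherical approximations are attractive.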

Now let's talk about linear regression.

In linear regression, we want to fit a straight line to the data.

We fit it in the following way.

We want to minimize the errors: the red line is the prediction,

the blue points are the true values, and

the black lines are the errors we want to minimize.

The line is usually found by solving the so-called least squares problem.

Our straight line is parameterized by a weight vector w.

The prediction of each point is computed as w transposed times xi,

where xi is our point.

Then we compute the total sum of squares, that is,

the squared difference between the prediction and the true value.

And we try to find the vector w that minimizes this function.
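The least squares problem just stated can be solved directly with numpy; the data here is synthetic and the true weights are assumed only so we have something to recover.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: rows of X are the points x_i, and y_i = true_w . x_i + noise.
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

# Least squares: find w minimizing the sum of (w^T x_i - y_i)^2 over all points.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # close to true_w
```

With plenty of data and little noise, the recovered w is close to the weights that generated the targets.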

Let's see how this works from the Bayesian perspective.

Here's our model.

We have three random variables, the weights, the data, and the target.

We're actually not interested in modeling the data, so we can write down the joint

probability of the weights and the target, given the data.

This will be given by the following formula.

It would be the probability of the target given the weights and the data,

times the probability of the weights.

Now we need to define these two distributions.

Let's assume them to be normal.

The probability of the target given the weights and

the data would be a Gaussian centered at the prediction, that is, w transposed X,

with covariance equal to sigma squared times the identity matrix.

Finally, the probability of the weights would be a Gaussian centered around zero,

with covariance matrix gamma squared times the identity matrix.
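The two Gaussian assumptions can be written as a log-probability function; this is a sketch under those assumptions, with a tiny synthetic dataset (all names and values invented for illustration), dropping the normalization constants since they will not matter for the maximization that follows.

```python
import numpy as np

def log_joint(w, X, y, sigma2=1.0, gamma2=1.0):
    """log p(y, w | X) = log p(y | X, w) + log p(w), dropping additive constants.

    Likelihood: y ~ N(X w, sigma2 * I); prior: w ~ N(0, gamma2 * I).
    """
    residual = y - X @ w
    return -0.5 * residual @ residual / sigma2 - 0.5 * w @ w / gamma2

# Tiny illustration: the joint is higher at the weights that generated the data.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
w0 = np.array([1.0, -1.0])
y = X @ w0 + 0.1 * rng.normal(size=50)
print(log_joint(w0, X, y) > log_joint(w0 + 2.0, X, y))  # True
```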


All right, so here are our formulas, and now let's train the linear regression.

So we'll do this in the following way.

Let's compute the posterior probability over the weights, given the data.

So this would be the probability of the parameters given the data,

that is, y and X.

Using the definition of conditional probability,

we can write that it is P(y, w | X) / P(y | X).

So let's try not to compute the full posterior distribution, but

only the value at which this posterior distribution attains its maximum.

So we'll try to maximize this with respect to the weights.

We can notice that the denominator does not depend on the weights,

and so we can maximize only the numerator, so we can cross it out.

All right, so now we should

maximize P (y, w | X).

And this is actually given by our model.

So we can plug in this formula:

this would be P(y | X, w) P(w).


We can plug in the formulas for the normal distribution and

obtain the following result.

So it will be the log of some normalization

constant C1 times exp(-1/2 of a quadratic term).

The mean is w transposed X, so

this would be (y - w transposed X) transposed,

times the inverse of the covariance matrix,

that is, (sigma squared I) inverse, and finally, (y - w transposed X).

And we have to close all the brackets, right?

And in a similar way, we can write down the second term,

so this would be the log of C2 times exp(-1/2),

and this would be w transposed (gamma squared I)

inverse w, since the mean is 0.

All right, so we can take the constants out of the logarithm, and

also the logarithm of the exponential is just the identity function.

So what we'll have left is minus one-half.

The inverse of the identity matrix is the identity matrix,

and the inverse of sigma squared is one over sigma squared.

So we'll have something like this.

(y - w transposed X) transposed times (y - w transposed X).

And finally, we'll have a term

-1/(2 gamma squared) times w transposed w.

This last thing is actually a norm, so we'll have the norm of w, squared.

And the first term is also a norm:

the norm of (y - w transposed X), squared.

So we try to maximize this thing, with respect to w.

We multiply it by -1 and

also by 2 sigma squared.

This turns the maximization problem into a minimization problem.

And finally, the formula would be the norm of (y - w transposed X) squared,

plus some constant lambda, equal to sigma

squared over gamma squared, times the norm of w squared.

And since we multiplied by -1, it is a minimization problem.
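The whole derivation, under the Gaussian assumptions above (y the targets, X the data, w the weights), can be summarized compactly as:

```latex
\log p(y, w \mid X)
  = -\frac{1}{2\sigma^2}\,\bigl\lVert y - w^{\top}X \bigr\rVert^2
    - \frac{1}{2\gamma^2}\,\lVert w \rVert^2 + \mathrm{const},
```

so maximizing over w is equivalent to

```latex
\hat{w} = \arg\min_{w}\;\bigl\lVert y - w^{\top}X \bigr\rVert^2
          + \lambda\,\lVert w \rVert^2,
\qquad \lambda = \frac{\sigma^2}{\gamma^2}.
```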

So actually, the first term is the sum of squares,

so we recover the least squares problem.

And the second term is an L2 regularizer.

And so by adding a normal prior on the weights,

we went from the least squares problem

to L2-regularized linear regression.
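As a sanity check, here is a small numpy sketch of this equivalence, with synthetic data and assumed parameter values, using the convention that rows of X are data points: maximizing the log-posterior by gradient descent lands on the same weights as the closed-form ridge solution with lambda = sigma squared over gamma squared.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
true_w = np.array([0.5, -1.0, 2.0, 0.0])
sigma2, gamma2 = 0.25, 1.0
y = X @ true_w + np.sqrt(sigma2) * rng.normal(size=200)

# MAP as ridge: minimize ||y - X w||^2 + lam * ||w||^2 with lam = sigma2 / gamma2.
lam = sigma2 / gamma2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# The same MAP estimate by maximizing the log-posterior directly.
w = np.zeros(4)
for _ in range(5000):
    grad = -X.T @ (y - X @ w) / sigma2 + w / gamma2  # gradient of the negative log-posterior
    w -= 1e-3 * grad
print(np.allclose(w, w_ridge, atol=1e-4))  # True
```

A stronger prior (smaller gamma squared) means a larger lambda and therefore heavier shrinkage of the weights, which matches the intuition from the derivation.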
