In the previous lecture, we showed that Gaussian dropout is equivalent to

a special kind of variational Bayesian inference.

We proved that Gaussian dropout in fact optimizes

the following ELBO, and in this ELBO the second term doesn't

depend on theta, so it can be ignored if we optimize only with respect to theta.

And now the question is: why not optimize

with respect to both theta and alpha? Remember that

our variational approximation, q(w), depends on theta and also on alpha.

And remember that the more variational parameters we

have, the closer we can get to the true posterior distribution.

So our approximation will only get better and better.

So then why not optimize the ELBO with respect to both theta and alpha?

It's important to note that

this wasn't possible until we came up with a Bayesian interpretation of Gaussian dropout.

Indeed, if we tried to optimize just the first term,

the data term, with respect to both theta and alpha,

we would quickly end up with zero values of alpha. Why so?

Because we know that the maximum value of the first term is achieved when

our distribution is a delta function at the maximum-likelihood point w_ML, and

a delta function means zero variance, and zero variance means zero alpha.

So we may obtain non-zero values of alpha only if we optimize both terms,

the data term and our regularizer.
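
To see why the data term alone pushes alpha to zero, here is a minimal sketch on a toy problem. The setup is an assumption made for illustration and does not come from the lecture: a single weight w, a single observation y with a unit-variance Gaussian likelihood, and q(w) = N(theta, alpha * theta^2). The function name `expected_data_term` is hypothetical. In this toy case the expected log-likelihood has a closed form, and it can only decrease as alpha grows.

```python
import numpy as np

def expected_data_term(theta, alpha, y):
    """E_q[log N(y | w, 1)] for q(w) = N(theta, alpha * theta**2).

    Toy setup (an assumption for illustration): one weight, one observation.
    Uses E[(y - w)^2] = (y - theta)^2 + Var[w] = (y - theta)^2 + alpha * theta^2.
    """
    return -0.5 * np.log(2.0 * np.pi) - 0.5 * ((y - theta) ** 2 + alpha * theta ** 2)

y = 2.0
alphas = np.array([0.0, 0.1, 1.0, 10.0])
# theta is placed at the maximum-likelihood point of this toy model (theta = y).
values = expected_data_term(theta=2.0, alpha=alphas, y=y)
for a, v in zip(alphas, values):
    print(f"alpha={a:5.1f}  data term={v:.4f}")
```

The printed values fall as alpha increases, so without the regularizer the best choice in this toy model is alpha = 0.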

So now our variational approximation looks as follows:

it is a fully factorized Gaussian distribution over all weights

w_ij, with mean theta_ij and variance alpha times theta_ij squared.

But we may go even further.

Why not assign an individual dropout rate to each weight?

Why not say that our variational approximation looks as follows:

it is a fully factorized Gaussian distribution over w_ij with

mean theta_ij and variance alpha_ij times theta_ij squared.

So we may now assign an individual dropout rate,

an individual alpha, to each of the weights.

And again, this will only make our approximation tighter.

We only come closer to the true posterior distribution.
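
As a rough illustration of this per-weight family, the sketch below draws weights from N(theta_ij, alpha_ij * theta_ij^2) by multiplying each mean by Gaussian noise, w = theta * (1 + sqrt(alpha) * eps). The function name `sample_weights`, the use of log(alpha) as the stored parameter, and the example values are assumptions for illustration, not something the lecture specifies.

```python
import numpy as np

def sample_weights(theta, log_alpha, rng, n_samples=1):
    """Draw w_ij ~ N(theta_ij, alpha_ij * theta_ij**2) elementwise.

    Uses w = theta * (1 + sqrt(alpha) * eps) with eps ~ N(0, 1),
    i.e. multiplicative Gaussian noise with an individual alpha per weight.
    Returns an array of shape (n_samples,) + theta.shape.
    """
    alpha = np.exp(log_alpha)
    eps = rng.standard_normal((n_samples,) + theta.shape)
    return theta * (1.0 + np.sqrt(alpha) * eps)

rng = np.random.default_rng(0)
theta = np.array([[1.0, -2.0], [0.5, 3.0]])
log_alpha = np.log(np.array([[0.01, 0.1], [1.0, 4.0]]))

samples = sample_weights(theta, log_alpha, rng, n_samples=50_000)
print("empirical mean:\n", samples.mean(axis=0))
print("empirical variance:\n", samples.var(axis=0))
print("target variance:\n", np.exp(log_alpha) * theta ** 2)
```

The empirical mean and variance should be close to theta_ij and alpha_ij * theta_ij^2, up to Monte Carlo error.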

But before we proceed, let us examine the behavior

of our regularizer as a function of alpha.

Remember that we may approximate it with a smooth differentiable function, and we see

that the maximum value of this regularizer is achieved when alpha goes to plus infinity.

This means that the second term of our ELBO encourages larger values of alpha.
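
The lecture does not spell out the smooth approximation here. As a sketch, the code below uses the form commonly quoted for sparse variational dropout, -KL ≈ k1 * sigmoid(k2 + k3 * log(alpha)) - 0.5 * log(1 + 1/alpha) - k1, with k1 = 0.63576, k2 = 1.87320, k3 = 1.48695. Treat both the formula and the constants as an assumption brought in from outside this lecture; the function name `neg_kl_approx` is hypothetical.

```python
import numpy as np

# Constants of the assumed approximation (not stated in the lecture).
K1, K2, K3 = 0.63576, 1.87320, 1.48695

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_kl_approx(log_alpha):
    """Approximate regularizer (negative KL) per weight as a function of log(alpha)."""
    # log(1 + 1/alpha) is computed as log1p(exp(-log_alpha)).
    return K1 * sigmoid(K2 + K3 * log_alpha) - 0.5 * np.log1p(np.exp(-log_alpha)) - K1

log_alphas = np.linspace(-8.0, 8.0, 9)
values = neg_kl_approx(log_alphas)
for la, v in zip(log_alphas, values):
    print(f"log_alpha={la:5.1f}  regularizer={v:.4f}")
```

Under this assumed form the regularizer is negative everywhere, increases with alpha, and approaches its supremum of zero as alpha goes to plus infinity, which matches the behavior described above.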

And that's quite interesting, because we may easily

prove that if alpha_ij goes to plus infinity,

then the corresponding theta_ij, which is the mean

in our variational approximation, converges to zero,

in such a way that alpha_ij times theta_ij squared also converges to zero.

This means that our variational approximation,

q(w_ij), becomes a delta function centered at zero

when alpha_ij goes to plus infinity.

And a delta function at zero means that the corresponding w_ij

is exactly zero, and this means that we may simply skip this connection:

simply remove the corresponding weight from

our neural network, thus effectively sparsifying it.

So the whole procedure, which is known as sparse variational dropout, looks as follows.

First, we assign a log-uniform prior distribution over the weights,

which is a fully factorized prior distribution.

Then we fix a variational family of distributions q(w | theta, alpha).

Again, this is a fully factorized distribution over weights w_ij with

mean theta_ij and variance given by alpha_ij times theta_ij squared.

And finally, we perform stochastic variational inference, trying to

optimize our ELBO with respect to both the thetas and all the alphas.

And in the end, we remove all weights

whose alphas exceed some predefined large threshold.
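
The final pruning step can be sketched as follows. The function name `prune_by_alpha`, the threshold value (log(alpha) > 3, that is, alpha of roughly 20), and the example arrays are all assumptions chosen for illustration; the lecture only says that the threshold is large and predefined.

```python
import numpy as np

def prune_by_alpha(theta, log_alpha, log_alpha_threshold=3.0):
    """Zero out weights whose log(alpha) exceeds the threshold.

    Returns the pruned weight means and the boolean keep-mask.
    The default threshold of 3.0 is an assumed value, not one given in the lecture.
    """
    keep = log_alpha <= log_alpha_threshold
    return np.where(keep, theta, 0.0), keep

# Hypothetical trained parameters: large alphas go with near-zero means.
theta = np.array([[0.8, 1e-3, -1.2], [2e-4, 0.5, -3e-3]])
log_alpha = np.array([[-2.0, 6.0, -1.0], [9.0, 0.5, 5.0]])

pruned, keep = prune_by_alpha(theta, log_alpha)
compression = keep.size / keep.sum()
print(pruned)
print(f"kept {keep.sum()} of {keep.size} weights, compression x{compression:.1f}")
```

The ratio of total weights to kept weights is one simple way to express the compression factor of a layer.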

And surprisingly, this procedure works quite well.

In this picture you see the behavior of convolution kernels from

convolutional layers and fragments of the weight matrix from fully connected layers.

You see that as training progresses, more and more weights

and more and more coefficients in the convolution kernels converge to zero.

The compression in fact exceeds 200 times, and note that the accuracy doesn't decrease.

So we're keeping the same accuracy

as the baseline while effectively

compressing the whole network by hundreds of times.

This only became possible due to this Bayesian dropout.

So to conclude, it is known that modern deep architectures are very redundant.

But it is quite problematic to remove this redundancy, and one of

the most successful ways to do this is

Bayesian dropout, or sparse variational dropout.

Variational Bayesian inference is a highly scalable procedure that allows us

to optimize millions of

variational parameters, and this

is just one of

many examples of successful combinations of Bayesian methods and deep learning.

You may find more examples of successful applications of Bayesian methods to DNNs

in the additional reading materials.