Hello, companero fans.

In the previous two lectures we made friends with Hebb

and learned that his learning rule implements principal component analysis.

We then learned about unsupervised learning which tells us how

the brain can learn models of its inputs with no supervision at all.

I left you with the question,

how do we learn models of natural images?

What does the brain do?

Well, as the saying goes "When in doubt, trust your eigenvectors."

Well, can we use eigenvectors or

equivalently principal component analysis to represent natural images? Well, let's see.

So here's a famous example from Turk and Pentland.

And what they did was they took a bunch of face images,

and let's say that each of these face images has N pixels.

So they took a bunch of face images and they computed

the eigenvectors of the input covariance matrix.

And when they did that they found that the eigenvectors looked like this,

and they called these eigenvectors "Eigenfaces".

Now we can represent any face image,

such as this one,

as a linear combination of all of our Eigenfaces.

So here is the equation that captures this relationship.

Now why can we do that?

Well, remember that the Eigenfaces are the eigenvectors of the input covariance matrix,

and since the covariance matrix is a real and symmetric matrix,

its eigenvectors form an orthonormal basis with

which we can represent these input vectors.

Now, here's something interesting.

You can use only the first M principal eigenvectors.

So, what do we mean by the first M principal eigenvectors?

Well, these are the eigenvectors associated with

the M largest eigenvalues of the covariance matrix.

So if we use only the first M principal eigenvectors to represent the image,

then we get an equation that looks like this.

And this equation just tells you that there

are some differences between the actual image and its reconstruction using

only the first M principal eigenvectors, and therefore we are going to model

those differences between the actual image and

the reconstructed image using a noise term.
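
The two equations referred to above are not captured in the transcript. A plausible reconstruction, assuming u denotes the image vector, e_i the i-th Eigenface, and v_i its coefficient (the symbol e_i is an assumption here), would be:

```latex
\mathbf{u} = \sum_{i=1}^{N} v_i \,\mathbf{e}_i
\qquad \text{(all } N \text{ Eigenfaces)}

\mathbf{u} = \sum_{i=1}^{M} v_i \,\mathbf{e}_i + \text{noise},
\qquad M < N
\qquad \text{(first } M \text{ principal eigenvectors)}
```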

So why is this a useful model?

It's a useful model because you can use it for image compression.

So suppose your input images were of size a thousand by

a thousand pixels, which means that N is going to equal one million pixels.

And now, if the first,

let's say 10 principal eigenvectors,

are sufficient, which means that the

10 largest eigenvalues account for most of the variance in your data,

then M is going to equal 10 which means that

just 10 numbers are enough to represent an image.

So 10 of these coefficients are sufficient to

represent any image which consists of 1 million pixels.

So what we have then is a tremendous dimensionality reduction or compression,

from one million pixels down to just 10 numbers for each image.
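
Here is a minimal sketch of that compression idea in NumPy. It is only an illustration under stated assumptions: the "images" are synthetic vectors with N = 64 rather than real faces, the mean image is subtracted before computing the covariance (a step the lecture does not mention), and all variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for face images: 200 samples with N = 64 "pixels",
# generated so that most of the variance lies in a 10-dimensional subspace.
N, M, num_images = 64, 10, 200
basis = rng.normal(size=(N, M))
images = rng.normal(size=(num_images, M)) @ basis.T
images += 0.01 * rng.normal(size=(num_images, N))   # small pixel noise

# Assumption: center the data before computing the covariance.
mean_image = images.mean(axis=0)
centered = images - mean_image

# Eigenvectors of the real, symmetric input covariance matrix.
cov = centered.T @ centered / num_images
eigvals, eigvecs = np.linalg.eigh(cov)               # ascending order
order = np.argsort(eigvals)[::-1]                    # largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

top = eigvecs[:, :M]                 # first M principal eigenvectors (N x M)
coeffs = centered @ top              # M numbers per image
reconstructed = coeffs @ top.T + mean_image

explained = eigvals[:M].sum() / eigvals.sum()
rel_error = np.linalg.norm(images - reconstructed) / np.linalg.norm(images)
print(f"variance explained by {M} components: {explained:.4f}")
print(f"relative reconstruction error: {rel_error:.4f}")
```

Each image is stored as the M entries of one row of `coeffs` instead of N pixel values; the difference between `images` and `reconstructed` plays the role of the noise term.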

Now wait a minute, not so fast, eigenvectors.

The eigenvector representation may be good for compression,

but it's not really very good if you want to extract

the local components or the parts of an image.

So for example, if you want to extract

the parts of a face such as the eyes, the nose, the ears,

you're not going to get that from an eigenvector analysis or

equivalently a principal component analysis of the face images.

And likewise, you're not going to be able to extract

the local components such as edges from natural scenes.

Now, this is certainly a sad day for the course

because eigenvectors have let us down for the first time.

But maybe we can resurrect the linear model so beloved to the eigenvectors.

So here is the linear model.

We have a natural scene for example that is represented by

a linear combination of a set of basis vectors or features,

so these do not have to be eigenvectors anymore.

And so, here is the equation again that captures this relationship.

And the difference now from the case of

eigenvectors that we had in the previous slide is that we are allowing M,

the number of these basis vectors or features to be larger than the number of pixels.

So why does that make sense?

Well, consider the fact that the number of

parts of objects and scenes can be much larger than the number of pixels.

So it does make sense to allow a larger value for M,

the number of basis vectors and features,

than the number of pixels.

And here's another way of writing the same equation.

So we're replacing the summation with a matrix multiplication,

G times v, where the columns of this matrix G are the different basis vectors or features,

and the vector v has elements which are

the coefficients for each of those basis vectors or features.
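
As a quick check that the summation and the matrix form say the same thing, here is a small NumPy sketch. The sizes (N = 16 pixels, M = 32 features) and the random values are assumptions chosen only to illustrate the case where M is larger than N.

```python
import numpy as np

rng = np.random.default_rng(1)

N, M = 16, 32                     # more features than pixels (overcomplete)
G = rng.normal(size=(N, M))       # columns are the basis vectors or features
v = rng.normal(size=M)            # one coefficient per feature

# Summation form: add up each feature weighted by its coefficient.
u_sum = sum(v[i] * G[:, i] for i in range(M))

# Matrix form: G times v.
u_mat = G @ v

print(np.allclose(u_sum, u_mat))
```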

So the challenge before us now is to learn this matrix G,

the different basis vectors, and for any given image we also need to be able to estimate

the coefficients, this vector v.

In order to learn the basis vectors G and estimate the causes v,

we need to specify a generative model for images.

And as you recall we can define the generative model by

specifying a prior probability distribution

for the causes as well as a likelihood function.

Let's first look at the likelihood function.

So we start with our linear model from the previous slide,

and if we assume that the noise vector is Gaussian,

and that it's Gaussian white noise, which means that there are

no correlations across the different components of the noise vector,

and if we assume that the Gaussian has

zero mean and unit variance, then we can show that the likelihood function amounts

to also a Gaussian distribution, with a mean

of G times v and a covariance of just the identity matrix.

And here is what

this likelihood function then is proportional to- it's this exponential function.

And finally, if you take the logarithm of the likelihood function,

we obtain the log likelihood which now is simply just this quadratic term.

So it's just negative one half of the square of the length of this vector, which is

simply the difference between the input image and the reconstruction of the image,

or the prediction of the image,

using your basis vectors.

Now, here's an interesting observation.

A lot of algorithms in engineering and in machine learning attempt to

minimize the squared reconstruction error which is just this term here.

And so, now you can see that when you are minimizing the reconstruction error,

it's the same thing as maximizing the log likelihood function,

or equivalently maximizing the likelihood of the data.

Isn't that interesting?
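
The likelihood equations on the slide are not in the transcript. Under the assumptions stated above (zero-mean Gaussian white noise with identity covariance), they can be sketched as follows, where the noise symbol n and the constant C are notation assumed for this reconstruction:

```latex
\mathbf{u} = G\mathbf{v} + \mathbf{n},
\qquad \mathbf{n} \sim \mathcal{N}(\mathbf{0}, I)

p(\mathbf{u} \mid \mathbf{v}; G) \;\propto\;
\exp\!\left(-\tfrac{1}{2}\,\lVert \mathbf{u} - G\mathbf{v} \rVert^{2}\right)

\log p(\mathbf{u} \mid \mathbf{v}; G) \;=\;
-\tfrac{1}{2}\,\lVert \mathbf{u} - G\mathbf{v} \rVert^{2} + C
```

Since C does not depend on v or G, making the squared reconstruction error smaller makes the log likelihood larger, which is the equivalence noted above.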

Now, let's define the prior probability distribution for the causes.

So one assumption you can make is that the causes are independent of each other.

And if you make that assumption,

then we have the result that the prior probability for the vector v is equal

to just the product of the individual prior probabilities for each of the causes.

Now, this assumption might not strictly hold

for natural images because some of the components

might depend on other components but let's start off

with the simplifying assumption and see where it takes us.

Now, if you take the logarithm of the prior probability distribution for v, then instead of

a product we now have the summation of

all the individual log prior probabilities for the causes.

Now, the question is how do we define

these individual prior probabilities for the causes?

Now, here's one answer.

We can begin with the observation that for any input

we want only a few of these causes v_i to be active.

Now, why does that make sense?

Well, if we are assuming that these causes

represent individual parts or components of natural scenes,

then for any given input which contains for example, a particular object,

only a few of these causes are going to

be activated in that particular image because those

are the parts of

that particular object and then the rest of the v_is are going to be zero.

So what we have then is that v_i for

any particular i is going to be zero most of the time,

but it's going to be high for some inputs.

And this leads to the notion of a sparse distribution for p(v_i).

And what this means is that the distribution p(v_i) is going to have a peak at zero.

So v_i is going to be zero most of the time.

But the distribution is going to have a heavy tail which means

that for some inputs it's going to have a high value.

And this kind of a distribution is also called a super-Gaussian distribution.

Now, here are some examples of these super-Gaussian or sparse prior distributions.

So this plot here shows three distributions, all of which can be

expressed as p(v) equals the exponential of G(v).

And the dotted distribution here is the Gaussian distribution,

and the other two distributions,

the dashed as well as the solid line here,

represent the examples of sparse distributions.

And if you take the log of p(v),

we get a clearer picture here of what these distributions look like.

So you can see that when G(v) equals minus the absolute value of v,

then we get a two-sided exponential distribution.

And when we have G(v) equals minus the logarithm of one plus v squared,

we get something called the Cauchy distribution.

So to summarize then,

the prior probability p(v) is equal to

just the product of these exponential functions, and therefore

the logarithm of the prior probability p(v) is going to equal

the summation of all these values G(v_i) plus some constant.
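
To get a feel for what "heavy tail" means for these priors, here is a rough numerical sketch. The function from the prior is written as lowercase `g` in the code to keep it apart from the matrix G. The Gaussian is assumed to be g(v) = -v^2/2 (the lecture does not give its exact form), and the distributions are normalized numerically on a truncated grid, so the numbers are approximations.

```python
import numpy as np

# Log-prior shapes g(v), each defined up to an additive constant.
log_priors = {
    "gaussian": lambda v: -0.5 * v**2,        # assumed form
    "exponential": lambda v: -np.abs(v),      # g(v) = -|v|
    "cauchy": lambda v: -np.log1p(v**2),      # g(v) = -log(1 + v^2)
}

v = np.linspace(-50.0, 50.0, 200001)          # truncated grid (approximation)
dv = v[1] - v[0]

tail_mass = {}
for name, g in log_priors.items():
    p = np.exp(g(v))
    p /= p.sum() * dv                         # normalize on the grid
    tail_mass[name] = p[np.abs(v) > 3.0].sum() * dv

for name, mass in tail_mass.items():
    print(f"{name:12s} P(|v| > 3) ~ {mass:.4f}")
```

The two sparse priors put noticeably more probability on large values of v than the Gaussian does, which is the heavy-tail property described above.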

Okay, after all that hard work,

we have finally arrived at the grand mathematical finale of figuring out how to find v given

any particular image and how to learn G.

And we're going to use a Bayesian approach to do that.

So by Bayesian we mean that we are going

to maximize the posterior probability of the causes.

So here is p of v given u.

And from Bayes' rule we can write p of v given u as

just the product of the likelihood times the prior.

And k here is just the normalization constant.

We can maximize the posterior probability by also maximizing

the log posterior- so that's the same thing as maximizing the posterior.

And so, here's the function F which is

the log posterior function and you can see how the function F has two terms.

One of them is a term containing the reconstruction error.

The other is a term containing the sparseness constraint and

we can maximize this function by essentially doing two things.

We have to minimize the reconstruction error,

but at the same time try to maximize the sparseness constraint.

And so, you can see how this function F trades off

the reconstruction error with the sparseness constraint.

And so, we would like our representation to be sparse.

We would like only a few of these components to be active.

But at the same time we would also like to preserve information in

the images and that's enforced by this reconstruction error term.
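
The function F itself appears only on the slide. Putting together the log likelihood and the log prior from earlier, a consistent reconstruction would be the following, where the lowercase g stands for the function from the prior (spoken as "G of v" in the lecture, written in lowercase here to avoid confusion with the matrix G) and K collects the constants:

```latex
p(\mathbf{v} \mid \mathbf{u}; G) = k \; p(\mathbf{u} \mid \mathbf{v}; G)\; p(\mathbf{v}; G)

F(\mathbf{v}, G) = \log p(\mathbf{v} \mid \mathbf{u}; G)
= \underbrace{-\tfrac{1}{2}\,\lVert \mathbf{u} - G\mathbf{v} \rVert^{2}}_{\text{reconstruction error term}}
\;+\; \underbrace{\sum_{i} g(v_i)}_{\text{sparseness term}}
\;+\; K
```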

One way of maximizing F with respect to v and G is to alternate between two steps.

The first step is maximizing F with respect to v keeping G fixed.

And the second step is maximizing F with respect to

G keeping v fixed to the value obtained from the previous step.

Now, this should remind you of the EM algorithm.

So just as in the EM algorithm,

where in the E step we computed the posterior probability of v,

here we are computing a value for v that maximizes F.

And similar to the EM algorithm, where in the M step we updated the parameters,

here we're updating the parameter G, the matrix G,

to maximize the function F.

Now, the big question is, how do we maximize F with respect to v and G?

Well, one potential answer is to use something called

"gradient ascent" which is that we change v

for example according to the gradient of

F with respect to v. So why does this make sense?

Here's why it makes sense.

So let me draw F as a function of v. So suppose F is

this function and you can see that the value of v which maximizes F is some value here.

So let's call that v*.

And if the current value of v is let's say to the left of v*,

let's say that this is where the current value of v is,

you can look at the gradient of F with respect to v.

So you can see that it's the slope of the tangent here.

Do you think the gradient is positive or negative at this particular value?

Well, if you answered positive, you would be correct.

So it is a positive value.

So what does that mean? It means that if you update v according to this equation,

then you're going to move v in this direction.

So you are going to add a small positive value to v and that's going to

move v in the right direction, towards v*.

Similarly, if you're on this side.

Let's say this is where your current value of v is.

I'm calling that v prime.

Then you can see that the gradient is in this case, you guessed right,

it's negative which means that you're going to subtract

a small value from your current value.

And that's going to move the value of v again,

towards the optimal value.

So either way, gradient ascent does the right thing.
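
The picture just described can be sketched in a few lines of Python. The objective used here, F(v) = -(v - 2)^2 with its maximum at v* = 2, is a made-up example, and the step size and number of steps are arbitrary choices.

```python
def F(v):
    # Hypothetical objective with a single peak at v* = 2.
    return -(v - 2.0) ** 2

def dF_dv(v):
    # Slope of the tangent to F at v.
    return -2.0 * (v - 2.0)

def gradient_ascent(v, step=0.1, num_steps=100):
    for _ in range(num_steps):
        v = v + step * dF_dv(v)   # move v along the gradient
    return v

v_left = gradient_ascent(-1.0)    # starts left of v*: gradient is positive
v_right = gradient_ascent(5.0)    # starts right of v*: gradient is negative
print(v_left, v_right)
```

Starting on either side, the updates move v towards the same peak.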

Okay, let's apply the idea of gradient ascent then to our problem.

So we would like to take the derivative of F with

respect to v. And here is the expression that we get.

So G prime here denotes the derivative of our function G(v) from the prior, not of the matrix G. And we can now look at

the way in which we should update the vector v and that's

given by this differential equation with some time constant.
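
The expression and the differential equation are shown only on the slide. Differentiating F, which is negative one half of the squared reconstruction error plus the sum of the prior terms, gives the following reconstruction, with the function from the prior written in lowercase as g and the time constant written as tau (both notational assumptions):

```latex
\frac{dF}{d\mathbf{v}} = G^{\mathsf{T}}\left(\mathbf{u} - G\mathbf{v}\right) + g'(\mathbf{v})

\tau \,\frac{d\mathbf{v}}{dt} = G^{\mathsf{T}}\left(\mathbf{u} - G\mathbf{v}\right) + g'(\mathbf{v})
```

Here g'(v) is applied to each element of v separately.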

And the interesting thing to note here is that we can

interpret the differential equation that we have

here for v as simply the firing rate dynamics of a recurrent network.

And so, what does this network do?

It takes the reconstruction error and it uses it

to update the activities of the recurrent network.

And it also takes into account the sparseness constraint that

encourages the output activities to be sparse.

And here is the recurrent network that implements our differential equation for v.

And so, you can see how it has both,

an input layer of neurons and an output layer of neurons.

But the interesting observation here is that

the network makes a prediction of what it expects the input to be.

So G times v is a prediction or a reconstruction of the input.

And then we take an error, so u minus

Gv is the reconstruction error or the prediction error.

And that is then passed back to the output layer,

and the output layer neurons then use the error to

correct the estimates they have of the causes of the image as

given by the vector v. For any given image, the network iterates by predicting and

correcting, and eventually converges to a stable value for v.
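
Here is a minimal sketch of this predict-and-correct loop, written as simple Euler steps of the gradient ascent dynamics. It assumes the Cauchy-type prior g(v) = -log(1 + v^2), random unit-length basis vectors, and a step size of 0.05 (standing in for dt divided by the time constant); none of these specific choices come from the lecture.

```python
import numpy as np

def g(v):
    # Assumed sparse log-prior: g(v) = -log(1 + v^2).
    return -np.log1p(v**2)

def g_prime(v):
    # Derivative of g, applied element by element.
    return -2.0 * v / (1.0 + v**2)

def F(u, G, v):
    # Log posterior up to a constant: reconstruction term plus sparseness term.
    residual = u - G @ v
    return -0.5 * residual @ residual + g(v).sum()

def infer_causes(u, G, step=0.05, num_steps=500):
    v = np.zeros(G.shape[1])
    for _ in range(num_steps):
        error = u - G @ v                           # prediction error
        v = v + step * (G.T @ error + g_prime(v))   # correct the estimate
    return v

rng = np.random.default_rng(2)
N, M = 16, 32
G = rng.normal(size=(N, M))
G /= np.linalg.norm(G, axis=0)    # unit-length basis vectors (assumption)
u = rng.normal(size=N)            # stand-in for an image patch

v_hat = infer_causes(u, G)
print("F at v = 0:   ", F(u, G, np.zeros(M)))
print("F after steps:", F(u, G, v_hat))
```

The product `G @ v` is the prediction, `u - G @ v` is the error fed back to the output layer, and the `g_prime` term pulls the activities towards zero.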

We can learn the synaptic weight matrix G, which contains

the basis vectors or the features that we are trying

to learn, by again applying gradient ascent.

So we can set dG/dt to be proportional

to the gradient of F with respect to G. So when we take

the derivative of F with respect to G, we get an expression that looks

like this: u minus Gv, times v transpose.

And so, what we end up with then is this learning rule for

updating the synaptic weights G. It has a time constant tau_G, and

that specifies the time scale at which we are going to update the weights G.

And so if we set tau_G to be bigger than the time constant we had for v,

that ensures that v converges faster than G. And so,

we have the desired property that for any given image,

v will converge fast to some particular value,

and then we can use that value for v to then update the weights for the network.

Now, if you look closely at the right hand side of the learning rule,

you'll see that it's actually Hebbian.

So you can see how it contains the term u times v. So u times v

transpose is basically the Hebbian term.

Now, it also contains a subtractive term and that actually makes this rule very,

very similar in fact almost identical to the Oja rule for learning.
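
A single step of this learning rule can be sketched as follows, with the Hebbian term and the subtractive term written out separately. The learning rate `eta` (standing in for dt divided by tau_G), the tiny matrix sizes, and the hand-picked v are all assumptions for illustration.

```python
import numpy as np

def update_G(G, u, v, eta=0.1):
    hebbian = np.outer(u, v)             # u times v transpose
    subtractive = np.outer(G @ v, v)     # (G v) times v transpose
    # Together: eta * (u - G v) v^T, the gradient of F with respect to G.
    return G + eta * (hebbian - subtractive)

rng = np.random.default_rng(3)
G = rng.normal(size=(3, 4))              # 3 "pixels", 4 features
u = rng.normal(size=3)                   # stand-in input
v = np.array([1.0, 0.0, 0.5, 0.0])       # a sparse set of causes

error_before = u - G @ v
G_new = update_G(G, u, v)
error_after = u - G_new @ v

print(np.linalg.norm(error_before), np.linalg.norm(error_after))
```

For a fixed v, one such step scales the reconstruction error by the factor 1 - eta * (v . v), so with a small enough learning rate the error shrinks.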

So if the learning rule is almost identical to Oja's rule,

why doesn't this network then just compute the eigenvectors?

So why isn't it just doing principal component analysis?

Well, the answer lies in the fact that the network is

actually trying to compute a sparse representation of the image.

And so that ensures that the network

does not just learn the eigenvectors of the covariance matrix,

it's actually learning a set of basis vectors

that can represent the input in a sparse manner.

Okay, so here's a pop quiz question.

If you feed your network some patches from natural images,

what do you think the network will learn in its matrix G?

What kind of basis vectors would you predict are learned for natural image patches?

Time for the drum roll.

The answer, as first discovered by Olshausen and

Field, is that the basis vectors remarkably

resemble the receptive fields in

the primary visual cortex, as originally discovered by Hubel and Wiesel.

So each of these square images is one vector

or one column of the matrix G. So you can obtain a vector from the square image by

collapsing each of the rows of the square image into

one long vector, and that would be one column of the matrix G.

So what is this result telling us?

It's telling us that the brain is perhaps optimizing

its receptive fields to code for natural images in an efficient manner.
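
The row-collapsing step mentioned a moment ago, going between a square image and a column of G, can be sketched like this. The 4 by 4 patch size and the three-column G are arbitrary assumptions.

```python
import numpy as np

side = 4
patch = np.arange(side * side, dtype=float).reshape(side, side)

# Lay the rows of the square image end to end to get one long vector.
column = patch.reshape(-1)

# Store it as one column of G (here a G with 3 features, mostly empty).
G = np.zeros((side * side, 3))
G[:, 0] = column

# To display a learned basis vector, reshape the column back into a square.
recovered = G[:, 0].reshape(side, side)
print(np.array_equal(recovered, patch))
```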

You can look at this model as an example

of an interpretive model, so this is going back to

the first week of our course where we

discussed the three different kinds of models in computational neuroscience.

So this would be an example of an interpretive model that provides

an ecological explanation for

the receptive fields that one finds in the primary visual cortex.

This sparse coding network that we have been discussing so far is in fact

a special case of a more general class of networks known as Predictive Coding Networks.

So here's a schematic diagram of a predictive coding network.

And the main idea here is to use feedback connections to convey predictions of the input,

and to use the feedforward connections to

convey the error signal between the prediction and the input.

And this box labeled "predictive estimator" maintains an estimate of the hidden causes,

the vector v of the input.

Now, here are some more details of the predictive coding network.

So as in the case of the sparse coding network,

there is a set of feedforward weights and a set of feedback weights.

But we also can potentially include a set of

recurrent weights and these would allow the network to model time varying input.

So for example, if the input is not just a static image, but natural movies,

then we could model the dynamics of the hidden causes of these natural movies by allowing

these estimates of the hidden causes to change

over time and that's modeled using a set of recurrent synapses.

And additionally, one could also include

a component which is a gain on the sensory errors.

So this allows you to model certain effects such as visual attention.

Well, this brings back some fun memories for me because I worked

on these predictive coding networks as a graduate student and as a postdoc.

If you're interested in more details of these predictive coding models,

I would encourage you to visit the supplementary materials on

the course website where you'll find some papers

that I wrote as a graduate student and as a postdoc.

And finally, the predictive coding model suggests an answer to

a longstanding puzzle about the anatomy of the visual cortex.

Here's a diagram by Gilbert and Li of the connections between

different areas of the visual cortex and the puzzle is this.

Every time you see a feedforward connection such as the one

from V1 to V2 given by the blue arrow,

you almost always also find a feedback connection from the second area to the first.

So in this case from V2 back to V1.

So why is there always

a feedback connection for every feedforward connection between two cortical areas?

And here's a schematic depiction of this puzzle.

Information from the retina, as you know, is passed

on to the LGN or lateral geniculate nucleus.

And then the information is passed on to cortical area V1,

cortical area V2, and so on.

But for every set of feedforward connections there seems to

be a corresponding set of feedback connections.

So what could be the role of these feedforward and feedback connections?

The predictive coding model suggests

interesting functional roles for the feedforward and feedback connections.

According to the predictive coding model the feedback connections convey predictions of

the activities in the lower cortical areas from a higher cortical area.

And the feedforward connections from one cortical area to the next

convey the error signal between the predictions and the actual activities.

It turns out that the model can explain

certain interesting phenomena that people have observed in the visual cortex

known as "contextual effects" or surround suppression or surround effects.

And these effects can be explained in an interesting manner by

the hierarchical predictive coding model that we

have here when it is trained on natural images.

So I would encourage you to go to

the supplementary materials on

the course website if you're interested in more details.

Okay amigos and amigas,

that wraps up this lecture.

Next week we'll learn how neurons can act as classifiers,

and how the brain can learn from rewards using reinforcement learning.

Until then adios and goodbye.