0:00

Welcome back. In the previous lecture, we considered this dataset and we asked the question: can a network of neurons learn to represent such data?

When we applied the covariance rule, as you'll recall, to this dataset, it ended up finding a weight vector that was aligned with the direction of maximum variance of this dataset. But we noted at the time that this was not a very satisfying model of this data set, because the input data appear to consist of these two clusters of data points.

So let's ask the question: can neurons learn to represent such clusters?

Here is one way in which we can use neurons to represent clusters.

So let's use a feedforward network with two output neurons, neuron A and neuron B. And let's use neuron A to represent cluster A, and neuron B to represent cluster B. We can do that by making the weight vector WA be the center of cluster A, and the weight vector WB be the center of cluster B.

And since this is a feedforward network, here are input component 1 and input component 2; U1 and U2 together comprise the input vector U. The output of each of these two neurons is just the dot product between the weight vector and the input vector. So the question that I would like to ask you is: if I give you a particular input, such as this one here, which neuron do you think will fire the most? In other words, which neuron will have a higher output firing rate, neuron A or neuron B, for this particular input? Notice that this particular input is closer to neuron B: the distance from this input to WB, the center of cluster B, seems to be shorter than the distance from this input to WA, the center of cluster A. So, which neuron do you think will have a higher activity, neuron A or neuron B?

If you answered neuron B, you would be correct. The most active neuron in the network is going to be the one whose weight vector is closest to the input. So in this case, for this particular input U, the closest weight vector is WB, and therefore the most active neuron is also going to be neuron B. And we can show that by looking at the Euclidean distance between the input and each of the two weight vectors, WA or WB. If you expand and simplify the terms, the square of the Euclidean distance turns out to be equal to the square of the length of the input vector, plus the square of the length of the weight vector, minus 2 times the output activity of the neuron: |U - W|^2 = |U|^2 + |W|^2 - 2 W · U. And so if we assume that the input vector has been normalized to have length 1, and similarly the weight vector has been normalized to length 1, then minimizing the squared distance between the input and the weight vector turns out to be the same as maximizing the activity of that particular neuron.

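To make that identity concrete, here is a small sketch (the specific 2-D vectors are made-up illustrative values, not from the lecture) showing that, for unit-length vectors, the neuron with the largest dot-product output is exactly the one whose weight vector is nearest the input:

```python
import numpy as np

# Illustrative unit-length vectors (hypothetical values, not from the lecture).
u   = np.array([0.6, 0.8])   # input vector, |u| = 1
w_a = np.array([1.0, 0.0])   # weight vector of neuron A, |w_a| = 1
w_b = np.array([0.0, 1.0])   # weight vector of neuron B, |w_b| = 1

for name, w in [("A", w_a), ("B", w_b)]:
    output  = w @ u                       # firing rate: dot product
    dist_sq = np.sum((u - w) ** 2)        # squared Euclidean distance
    # Expansion: |u - w|^2 = |u|^2 + |w|^2 - 2 w.u
    assert np.isclose(dist_sq, u @ u + w @ w - 2 * output)
    print(f"neuron {name}: output = {output:.2f}, squared distance = {dist_sq:.2f}")
```

Here neuron B both fires more strongly (0.8 versus 0.6) and lies closer to the input (squared distance 0.4 versus 0.8), which is exactly the equivalence that the normalization buys us.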
Now suppose I give you a new input, UT. And here's that new input.

How will you update the weights of the two neurons, given this new input UT?

Well, let's think about that for a little bit.

So the first thing that we need to do, is perhaps figure out which cluster this new

input belongs to. Is it cluster A, or cluster B?

And we can do that by looking at the distance between the new data point and each of the cluster centers. So in this case, we compare the distance from WA to this new data point with the distance from WB to this data point, and it appears in this case that cluster A is the one that this new input might belong to, because the distance from the input to that cluster's center is the shortest. And so now that we've figured out which cluster this new input might belong to, we can update the weight vector, which is the center of the cluster, to include this new data point. So how would we do that?

One way of doing that is to set the weight vector to be the running average

of all the inputs, in that particular cluster.

So the running average of all the inputs in this cluster, including this new

input. So do you remember the equation for

computing the running average? Well, here it is.

So we can derive the equation for the running average by starting with the expression for the average, which we all know: the sum of the data points in cluster A, divided by the number of such points, which we're calling T. If we express that sum as the sum of all the data points except data point T, and then simplify, what we find is an equation that has both the weight vector before we got the new data point, plus an additional term that includes the new data point: WA(T) = WA(T-1) + (1/T)(UT - WA(T-1)). And so we can now write this weight update rule: delta-WA, the change in the weight vector for neuron A, is equal to some epsilon times the difference between the new input and the weight vector for that particular neuron, delta-WA = epsilon (UT - WA). Now we can set epsilon equal to 1 over T, and that would make this equation compute the running average.

Or you can keep epsilon as some small positive value, which allows the method to adapt to new inputs for an indefinite period of time. The 1-over-T schedule makes epsilon go to 0 for very large T, but if you keep epsilon at some constant positive value, then the algorithm remains adaptive to new inputs indefinitely.

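As a quick check on that claim, here is a sketch (using synthetic Gaussian points with made-up center and spread) showing that the update delta-w = epsilon (u - w), with epsilon = 1/T, reproduces the exact running average of the inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic cluster: 100 points around a made-up center (illustrative only).
points = rng.normal(loc=[2.0, -1.0], scale=0.5, size=(100, 2))

w = points[0].copy()                 # start the average at the first input
for t, u in enumerate(points[1:], start=2):
    eps = 1.0 / t                    # epsilon = 1/T: exact running average
    w += eps * (u - w)               # the delta-w rule from the lecture

# With eps = 1/T the weight equals the mean of all inputs seen so far.
assert np.allclose(w, points.mean(axis=0))
```

With a small constant epsilon instead, w becomes an exponentially weighted average that keeps tracking new inputs indefinitely, at the price of never settling exactly on the mean.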
6:11

Okay, we're now ready to make our acquaintance with the competitive learning algorithm, which is a famous algorithm in neural network theory. So in competitive learning, given an input,

we pick the most active neuron in the network.

And this we can do, for example, by using the winner-take-all network that we

discussed in our previous lecture on recurrent networks.

And as we discussed earlier, the most active neuron also corresponds to the one

whose weights are closest to the new input.

And once we have a winning neuron, we can then update the weights for that neuron,

and we do that using the weight update rule from the previous slide.

Now what is the effect of this rule on the weights for the winning neuron?

Well, so here is what that looks like. So initially the weight vector for the

winning neuron, in this case, neuron A, was here in the center of the cluster.

And the effect of this learning rule, this weight update rule, is to move the weight

vector slightly towards the direction of the new input.

And in doing so, what we're doing is updating the weights to be equal to the

average, of all the data points in this cluster including the new data point.

7:33

Okay, let's look at an example of competitive learning.

Suppose you have these green points as the input, and these are 2-dimensional

points. So these constitute the U1 and U2 components of the

input vector. And suppose you have three output

neurons, and the output neurons have the weight W1, W2 and W3.

And let's randomly assign the values for these weights.

So W1 turns out to be here, W2 turns out to be here, and W3 down here.

Let's look at what happens to these weights, as we give the network these

green inputs one by one. So suppose I give the network this input

first. How do you think these weights will be

changed? Well, according to the competitive

learning algorithm, the closest weight to this particular input is W1, and

therefore W1 is moved to be closer to that particular input.

Now, if the second input, let's say, is this one, then, again, the winning neuron is going to be neuron number 2, so W2 is going to be adapted, and it will move closer to that particular input. And finally, if the third input, let's say, is this one, then as you would expect the winning neuron is going to be neuron number 3, and so its weight is going to be adapted to be closer to that particular input. Now if we keep repeating this procedure for more than just these three inputs, then here's what the weights might look like

after you've updated them, using the competitive learning rule.

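The whole procedure just described can be sketched as follows. This is a minimal illustrative implementation; the cluster centers, learning rate, and initialize-from-the-first-inputs scheme are my own choices, not from the lecture:

```python
import numpy as np

def competitive_learning(inputs, n_neurons, eps=0.05, n_epochs=50):
    """Online competitive learning: for each input, the neuron whose weight
    vector is closest wins, and only the winner's weights move toward it."""
    # Simple initialization: use the first few inputs as starting weights.
    w = inputs[:n_neurons].astype(float).copy()
    for _ in range(n_epochs):
        for u in inputs:
            winner = np.argmin(np.sum((w - u) ** 2, axis=1))  # most active neuron
            w[winner] += eps * (u - w[winner])                # update winner only
    return w

# Three synthetic clusters at made-up centers (illustrative data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
data = np.concatenate([c + 0.4 * rng.standard_normal((60, 2)) for c in centers])
# Interleave the clusters so the first three inputs come from different clusters.
data = data.reshape(3, 60, 2).transpose(1, 0, 2).reshape(-1, 2)

weights = competitive_learning(data, n_neurons=3)
```

After training, each weight vector should sit near the center of one of the three clusters, just as in the lecture's example.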
9:05

As you can see, the three neurons have partitioned the data set into three

different clusters, given by the red points, the blue points, and the green

points. And the weight vectors of the three

neurons, W1, W2, and W3, now represent the centers of these three clusters.

Competitive learning is also closely related to self-organizing maps, also known as Kohonen maps after the Finnish professor who first proposed these maps. And in self-organizing maps, just as in competitive learning, we pick the winning neuron given any particular input. And as in competitive learning, we update

the weights for the winning neuron. But unlike competitive learning, we also

update the weights of other neurons, which are in the neighborhood of the

winning neuron. So, what do we mean by the neighborhood

of the winning neuron? Well, in self-organizing maps, we have

locations assigned to the neurons in the network.

And so each neuron would be assigned a location on a grid, such as this 2-dimensional grid.

And so when we have some input, such as the one shown here in this cloud of data, we not only update the weights of the winning neuron, moving its weights towards that particular data point, but we also update the weights of its neighbors. So this neuron, the one here, and the one here: we would update these weights also to be closer to that particular data point, resulting in a transformation of the 2-dimensional grid to look something like this. And so this morphing of the grid is a result of updating the weights of the winning neuron and its neighbors. And if we keep doing that for all the

different points on this data cloud, so this could be a very high dimensional

cloud of data, then we end up with a representation that looks something like

this. And here you can see that we have now

transformed a very high dimensional data set, a potentially high dimensional data

set. We've mapped it now onto a 2-dimensional

representation, that preserves the topological properties of the input

space. Interestingly, in the brain, one also

finds a 2-dimensional map that preserves the topological properties of the input

space. In particular, if you look at the primary visual cortex, or V1, we find what are called orientation preference maps.

And here's an example of an orientation preference map, where each of these

colors represents one particular orientation, that the neurons in that

particular region of cortex prefer. And so if you're recording from one

particular location, let's say in blue here, and you move along in a particular

direction, you find that the neighboring neurons prefer similar orientations.

And that's very similar to the fact that in the self-organizing map also,

neighboring neurons are going to prefer similar inputs.

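A single self-organizing-map update, as described above, might be sketched like this. The grid size, learning rate eps, and the Gaussian neighborhood width sigma are illustrative choices of mine; many neighborhood functions are used in practice:

```python
import numpy as np

def som_step(weights, grid, u, eps=0.1, sigma=1.0):
    """One SOM update: find the winner, then move it and (more weakly)
    its grid neighbors toward the input u."""
    winner = np.argmin(np.sum((weights - u) ** 2, axis=1))   # closest weights win
    grid_d2 = np.sum((grid - grid[winner]) ** 2, axis=1)     # distance on the grid
    h = np.exp(-grid_d2 / (2 * sigma ** 2))                  # neighborhood function
    weights += eps * h[:, None] * (u - weights)              # scaled weight updates
    return weights, winner

# A 3x3 grid of neurons: neuron i lives at grid[i] and has weight vector weights[i].
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
weights = np.zeros((9, 2))
u = np.array([1.0, 1.0])
weights, winner = som_step(weights, grid, u)
```

The winner moves most, and neurons far away on the grid barely move; repeating this over many inputs is what produces the topology-preserving morphing of the grid.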
13:10

Well, the answer as you might have guessed, is a resounding yes.

There's a whole field, called unsupervised learning, that addresses

this question. In unsupervised learning, you assume that your inputs, these data points u, are being generated by a set of hidden causes that we call v. And the relationship between the causes and the data points, or inputs, that you're observing is given by a generative model. Now, the causes are selected according to some prior distribution, the prior probability distribution of the causes. And there's a set of parameters here, G,

the causes. And there's a set of parameters here G,

that we would like to learn. And once you have a particular cause that

is selected according to this distribution, then the generative model

assumes that the data point u is generated according to some likelihood

function, given by p of u given v, where again G is a set of parameters.

So what is the unsupervised learning problem then?

Well the unsupervised learning problem comes down to learning these parameters

G, that characterize both the likelihood function as well as the prior probability distribution of the causes. Let's look at what this means for our

example of the two clusters. So, we could use a generative model in

this case, which assumes there are two Gaussians that are generating these data

points. And each of these two Gaussians has a

mean and a standard deviation. And so the way that these points would be

generated then, according to this generative model, is that you first

select one of these two Gaussians according to some prior probability distribution given by p of v. And given a particular Gaussian, then you

generate one of these points within that cluster.

We can now write the probability distribution of all of our inputs as a mixture of Gaussians. So here's the expression for the mixture of Gaussians: p(u | G) is the sum over v of gamma-v times N(u; mu-v, sigma-v). And you can see that it contains a term for each of the Gaussians in our model, as well as the prior probability for that particular Gaussian. And therefore the parameters for this particular generative model include the mean and standard deviation of each of the Gaussians, as well as the prior probability, which we are calling gamma, for each Gaussian.

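The generative process just described, picking a cause v with prior probability gamma-v and then drawing u from that Gaussian, can be sketched as follows (the mixing proportions, means, and standard deviations are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)

gammas = np.array([0.5, 0.5])          # prior probability of each cause v
means  = np.array([[ 2.0,  2.0],       # mean of Gaussian A
                   [-2.0, -2.0]])      # mean of Gaussian B
stds   = np.array([0.6, 0.6])          # standard deviation of each Gaussian

# First select a cause v ~ p(v), then generate u ~ N(mean_v, std_v^2 I).
v = rng.choice(2, size=500, p=gammas)
u = means[v] + stds[v, None] * rng.standard_normal((500, 2))
```

Each generated point carries a hidden cause v that the learner never observes; recovering the parameters from u alone is exactly the unsupervised learning problem.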
So, to summarize, the goal of unsupervised learning is to learn a good

generative model for the data that you're seeing.

In other words, we would like to mimic the data generation process.

And we find that, in general, if we want to do that, we need to solve two sub-problems. So what are these two sub-problems? The first one corresponds to the problem of recognition. So what do we mean by recognition? We mean that we would like to estimate the causes v given any particular input. So for any particular input, we would like to estimate the posterior probability of the causes given the particular input u. And once we have done that, it turns out we can typically use that information to learn the parameters G.

So let's look at an example of how this approach can be applied to a specific

unsupervised learning problem. Let's look at the problem of clustering.

So as you recall, we were modelling this particular data set of two clusters, as a

mixture of Gaussians. And therefore, the parameters that we

would like to learn are the mean and standard deviation of each Gaussian, as

well as the prior probability for each Gaussian.

So how does the approach from the previous slide apply to this problem?

Well, given a particular data point such as this one, the first sub-problem involves finding the posterior probability of the causes, given this data point and the parameters G. And the second sub-problem then is, given the posterior probability distribution over causes for that particular data point, to update the parameters G. This is actually very similar to competitive learning. So in competitive learning, as you recall,

you also had two steps. In the first step, given a data point

such as this one here, we would assign the data point to a specific cluster.

So in this case, we would assign this data point to cluster A.

But now we're going to compute the posterior probability of this data point belonging to either cluster A or cluster B. And so presumably, in this case, we would have a higher posterior probability that this data point belongs to cluster A. So p of v given u and G would have a higher probability when v is equal to A than when v is equal to B. And the second step in competitive learning was changing the weights only for cluster A. Whereas now, we're going to use the posterior probabilities to change the parameters for both the Gaussian for cluster A and the Gaussian for cluster B.

So let's see how we can do that. The algorithm for learning the parameters

is called the Expectation-Maximization algorithm, or EM algorithm for short. The EM algorithm is actually a very general algorithm for unsupervised learning, and it has its roots in statistics. It involves alternating between two steps: one is called the E step, or expectation step, and the other the M step, or maximization step. So let's look at what these two steps mean in the case of our example of clustering.

18:45

So here's the E step. And as you might expect, if you pardon the pun, the E step involves computing the posterior probability distribution of the causes, which in this case is whether the cause is Gaussian A or Gaussian B, and we can do that by using Bayes' rule. So you compute the posterior probability of the cause given the input by using Bayes' rule, and we substitute the normal, or Gaussian, distribution for each of the likelihood functions for our inputs. And so we end up with an expression that looks like this. You can see that this expression implements a form of soft competition: it's not a winner-take-all type of competition, but a more democratic version of competition, where we assign a posterior probability to each cause, as opposed to just declaring one of these clusters, or causes, to be the winner.

And here's the M step. The M step involves changing the parameters G, using the results from the E step. So here are the equations for changing the mean, the variance, and the prior probability for each of our Gaussians. And you can see how the equations use the posterior probabilities for each Gaussian, computed in the E step.

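Putting the two steps together, here is a compact sketch of batch EM for a mixture of two isotropic Gaussians. The initialization from the first and last data points, the isotropy assumption, and the iteration count are my own simplifications, not from the lecture:

```python
import numpy as np

def gaussian(u, mu, sigma):
    """Isotropic Gaussian density N(u; mu, sigma^2 I) for each row of u."""
    d = u.shape[1]
    sq = np.sum((u - mu) ** 2, axis=1)
    return np.exp(-sq / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2) ** (d / 2)

def em_two_gaussians(u, n_iter=50):
    mu = u[[0, -1]].astype(float)          # crude init: first and last points
    sigma = np.array([1.0, 1.0])
    gamma = np.array([0.5, 0.5])           # prior probability of each cause
    for _ in range(n_iter):
        # E step: posterior probability of each cause, via Bayes' rule.
        lik = np.stack([g * gaussian(u, m, s)
                        for g, m, s in zip(gamma, mu, sigma)], axis=1)
        r = lik / lik.sum(axis=1, keepdims=True)
        # M step: re-estimate mean, variance, and prior from the posteriors.
        for v in range(2):
            nv = r[:, v].sum()
            mu[v] = (r[:, v, None] * u).sum(axis=0) / nv
            sigma[v] = np.sqrt((r[:, v] * np.sum((u - mu[v]) ** 2, axis=1)).sum()
                               / (u.shape[1] * nv))
            gamma[v] = nv / len(u)
    return mu, sigma, gamma

# Two synthetic clusters (made-up centers) to check parameter recovery.
rng = np.random.default_rng(2)
data = np.concatenate([[ 3.0,  3.0] + 0.5 * rng.standard_normal((200, 2)),
                       [-3.0, -3.0] + 0.5 * rng.standard_normal((200, 2))])
mu, sigma, gamma = em_two_gaussians(data)
```

Note that each point contributes to both Gaussians in proportion to its posterior, which is the soft competition discussed above, in contrast to the hard winner-take-all assignment of competitive learning.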
It's interesting to compare this algorithm to competitive learning.

So in the case of competitive learning, we only learn the mean of each cluster.

And so we did not have an estimate of the variance of each cluster, or the prior probability of each cluster. The other difference with competitive learning is that the EM algorithm assumes you have all the data points at once, and so you can use all the data to compute estimates such as these. Whereas in the case of competitive learning, we got the inputs one by one. So we can call the EM algorithm a batch learning algorithm, whereas the competitive learning algorithm was an online learning algorithm.

Here are the results of applying the EM algorithm to our data set of two clusters. The dashed line here represents the trajectory of the mean, and the circles represent two standard deviations from the mean, for each of Gaussians A and B. And as you see here, with each iteration the estimate of the mean gets better and better, as does the estimate of the standard deviation, until we get to iteration number 50, where you can see that the mean of each distribution is quite accurately modeled, and the standard deviation of each of these clusters seems to approximate quite well the standard deviation of the Gaussians that generated the data.