0:00

This video introduces the learning algorithm for a linear neuron.

This is quite like the learning algorithm for a perceptron, but it achieves

something different. In a perceptron, the weights are

always getting closer to a good set of weights.

In a linear neuron, the outputs are always getting closer to the target outputs.

0:25

The perceptron convergence procedure works by ensuring that when we change the

weights, we get closer to a good set of weights.

That type of guarantee cannot be extended to more complex networks.

Because in more complex networks, when you average two good sets of weights, you might

get a bad set of weights. So for multilayer neural networks, we

don't use the perceptron learning procedure.

And to prove that something is improving when they're learning, we don't use the

same kind of proof at all. They should never have been called

multilayer perceptrons. It's partly my fault and I'm sorry.

For multilayer nets we're gonna need a different way to show that the learning

procedure makes progress. Instead of showing that the weights get

closer to a good set of weights, we're gonna show that the actual output values

get closer to the target output values. This can be true even for non-convex

problems in which averaging the weights of two good solutions does not give you a

good solution. It's not true for perceptron learning.

In perceptron learning, the outputs as a whole can get further away from the target

outputs even though the weights are getting closer to good sets of weights.

1:45

The simplest example of learning in which you're making the outputs get closer to

the target outputs is learning in a linear neuron with a squared error measure.

Linear neurons, which are also called linear filters in electrical engineering,

have a real valued output that's simply the weighted sum of their inputs.

So the output Y, which is the neuron's estimate of the target value, is the sum

over all the inputs i of a weight wi times an input xi.

So we can write it in summation form or we can write it in vector notation.
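
As a quick sketch (the weights and inputs here are made-up numbers, not from the lecture), both notations compute the same output:

```python
import numpy as np

# A linear neuron's output is the weighted sum of its inputs.
w = np.array([1.5, 0.5, 1.0])   # weight vector (made-up values)
x = np.array([2.0, 5.0, 3.0])   # input vector  (made-up values)

y_summation = sum(w_i * x_i for w_i, x_i in zip(w, x))   # summation form
y_vector = w @ x                                         # vector (dot product) form

print(y_summation, y_vector)   # both give 8.5
```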

2:45

So one question is why don't we just solve it analytically.

It's straightforward to write down a set of equations with one equation per

training case, and to solve for the best set of weights.

That's the standard engineering approach, and so why don't we use it?

The first answer, and the scientific answer, is we'd like to understand what

real neurons might be doing, and they're probably not solving a set of equations

symbolically. An engineering answer is that we want a

method that we can then generalize to multilayer, nonlinear networks.

The analytic solution relies on it being linear and having a squared error measure.

An iterative method, which we're gonna see next, is usually less efficient, but much

easier to generalize to more complex systems.

3:36

So I'm now gonna go through a toy example that illustrates an iterative method for

finding the weights of a linear neuron. Suppose that every day, you get lunch at a

cafeteria. And your diet consists entirely of fish,

chips, and ketchup. Each day, you order several portions of

each, but on different days, it's different numbers of portions.

The cashier only shows you the total price of the meal, but after a few days, you

ought to be able to figure out what the price is for each portion of each kind of

thing. In the iterative approach, you start with

random guesses for the prices of portions. And then you adjust these guesses so that

you get a better fit to the prices that the cashier tells you.

Those are the observed prices of whole meals.

4:27

So each meal, you get a price and that gives you a linear constraint on the

prices of the individual portions. It looks like this: the price of the whole

meal is the number of portions of fish, x fish, times the cost of a portion of fish,

w fish. And the same for chips and ketchup.
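
Written out as an equation, with the same symbols as in the transcript, each meal gives one linear constraint:

```latex
% One linear constraint per meal: portions times per-portion prices.
\text{price} = x_{\text{fish}}\, w_{\text{fish}}
             + x_{\text{chips}}\, w_{\text{chips}}
             + x_{\text{ketchup}}\, w_{\text{ketchup}}
```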

5:12

So let's suppose that the true weights that the cashier is using to figure out the

price are 150 for a portion of fish, 50 for a portion of chips, and 100 for a

portion of ketchup. For the meal shown here, that will lead

to a price of 850. So that's going to be our target value.

5:40

So for the meal with two portions of fish, five of chips, and three of ketchup, we're

going to initially think that the price should be 500.

That gives us a residual error of 350. The residual error is the difference

between what the cashier says and what we think the price should be with our current

weights. We're then gonna use the delta rule for

revising our prices of portions. We make the change in a weight, delta WI

be equal to a learning rate, epsilon times the number of portions of the i-th thing,

times the residual error. The difference between the target and our

estimate. So if we make the learning rate be one

over 35, so the maths stays simple, then the learning rate times the residual error

for this particular example is ten. And so, our change in the weight for fish

will be two times ten. We'll increase that weight by twenty.

Our change in the weight for chips will be five times ten.

And our change in the weight for ketchup will be three times ten.

6:56

That'll give us new weights of 70, 100, and 80.

And notice, the weight for chips actually got worse.

There's no guarantee with this kind of learning that the individual weights will

keep getting better. What's getting better is the difference

between what the cashier says and our estimate.
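
This single delta-rule update can be sketched in a few lines, using the numbers from the lecture:

```python
# One delta-rule update on the cafeteria example (lecture's numbers).
x = [2, 5, 3]              # portions of fish, chips, ketchup
w = [50.0, 50.0, 50.0]     # initial guesses for the three prices
target = 850.0             # what the cashier charged
epsilon = 1 / 35           # learning rate chosen so the maths stays simple

y = sum(wi * xi for wi, xi in zip(w, x))   # our estimate: 500.0
residual = target - y                      # 350.0
w = [wi + epsilon * xi * residual for wi, xi in zip(w, x)]

print(w)   # new prices near [70, 100, 80]; note the chips weight got worse
```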

7:21

We start by defining the error measure, which is simply our squared residual

summed over all training cases. That is the squared difference between the

target and what the neural net predicts. Or the linear neuron predicts.

Squared, and summed over all training cases. And we put a one-half in front, which will

cancel the two, when we differentiate. We now differentiate that error measure

with respect to one of the weights, WI. To do that differentiation we need to use

the chain rule. The chain rule says that how the error

changes as we change a weight, will be how the output changes as we change the

weight, times how the error changes as we change the output.

The chain rule is easy to remember, you just cancel those two DYs but you can only

do that when there's no mathematicians looking.

8:17

The reason the first one, DY by DW is written with a curly D is because it's a

partial derivative. That is, there's many different weights

you can change to change the output. And here, we're just considering the

change to weight i. So, DY by DWi is actually equal to Xi,

and that's because the only term in Y that involves Wi is Wi times Xi. And DE by DY is just T minus Y, because

when we differentiate that T minus Y squared and use the half to cancel the

two, we just get T minus Y. So our learning rule is now: we change the

weights by an amount that's equal to the learning rate epsilon times the derivative

of the error with respect to a weight, dE by dWi.

And with a minus sign in front cuz we want the error to go down.

And that minus sign cancels the minus sign in the line above and we get that.

The change in a weight is the sum of all training cases of the learning rate times

the input value times the difference between the target and actual outputs.

9:49

There may be no perfect answer. It may be that we give the linear neuron a

bunch of training cases with desired answers.

And there's no set of weights that'll give the desired answer.

There's still some set of weights that gets the best approximation on all those

training cases, minimizes that error measure.

Summed over all training cases. And if we make the learning rate small

enough and we learn for long enough, we can get as close as we like to that best

answer. Another question is, how quickly do we get

towards the best answer. And even for a linear system,

the learning can be quite slow with this kind of iterative learning.

If two input dimensions are highly correlated, it's very hard to tell how much

of the weight on the two input dimensions should be attributed to each

input dimension. So if, for example, you always get the same

number of portions of ketchup and chips, we can't decide how much of the price

is due to the ketchup and how much is due to the chips.

And if they're almost always the same, it can take a long time for the learning to

correctly attribute the price to the ketchup and the chips.
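
The points above, that the outputs get closer to the targets while the weights themselves need not reach any particular "true" values, can be sketched with the batch delta rule (the two meals and all the numbers here are made up, not from the lecture):

```python
import numpy as np

def delta_rule_batch(w, X, t, epsilon):
    """One batch update: delta w_i = epsilon * sum_n x_i^n * (t^n - y^n)."""
    y = X @ w                            # linear neuron outputs, one per case
    return w + epsilon * X.T @ (t - y)

# Two made-up meals (portions of fish, chips, ketchup), priced with the
# lecture's true weights of 150, 50, and 100.
X = np.array([[2.0, 5.0, 3.0],
              [1.0, 2.0, 1.0]])
t = np.array([850.0, 350.0])

w = np.zeros(3)                          # start from a deliberately bad guess
for _ in range(10_000):
    w = delta_rule_batch(w, X, t, epsilon=0.02)

print(X @ w)   # very close to the targets [850, 350]
print(w)       # but w need not equal the "true" prices [150, 50, 100]
```

With only two meals the prices are underdetermined, so the outputs match the targets almost exactly while the learned weights settle on a different set of prices than the cashier's.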

There's an interesting relationship between the delta rule and the learning

rule for perceptrons. So, if you, you use the online version of

the delta rule, but we change the weights after each training case, it's quite

similar to the perceptron learning rule. In perceptron learning, we increment or

decrement the weight vector by the input vector, but we only change the weight

vector when we make an error. In the online version of the delta rule,

we increment or decrement the weight vector by the input vector.

But we scale that by both the residual error and the learning rate.

And one annoying thing about this is we have to choose a learning rate.

If we choose a learning rate that's too big, the system will be unstable.

And if we choose a learning rate that's too small, it will take an unnecessarily

long time to learn a sensible set of weights.
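
A side-by-side sketch of the two update rules described above (the data in the example calls is illustrative, not from the lecture):

```python
# Online delta rule: after every case, move the weights along the input
# vector, scaled by both the learning rate and the residual error.
def online_delta_step(w, x, t, epsilon):
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + epsilon * (t - y) * xi for wi, xi in zip(w, x)]

# Perceptron rule: add or subtract the whole input vector, but only
# when the binary decision is wrong (t is +1 or -1, threshold at 0).
def perceptron_step(w, x, t):
    y = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
    return w if y == t else [wi + t * xi for wi, xi in zip(w, x)]

print(online_delta_step([0.0, 0.0], [1.0, 2.0], 5.0, 0.1))  # [0.5, 1.0]
print(perceptron_step([0.0, 0.0], [1.0, 1.0], -1))          # [-1.0, -1.0]
```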