In this video, we're finally going to

learn how to fit our distribution of heights data.

Then in the following exercise,

you will actually do that for yourself.

And then in the next video, we will wrap up with looking

at how to do this practically in MATLAB or Python.

So we're looking at how to fit a function that's arbitrarily

complicated compared to the simplest case of linear regression,

y equals mx plus c,

that we looked at last time.

Of course, there are intermediate possibilities between

the very complicated and the simplest possible,

but this gives you the general case before we move on to look at

how to do this on a computer with existing tools instead of writing our own.

So let's say we have a function y of

some variable x, and that function has parameters a_k in it,

where k goes from 1 all the way up to M, so there are M of these parameters. So for example,

we could take y = (x - a1)^2 + a2 as an example of some function y.

This function isn't linear in a1.

If I double a1,

I don't double the function.

So it's nonlinear least squares we're going to do.

Now, say I want to fit the parameters a_k to some data.

So I've got some data observations.

I've got i = 1 to n of them,

and I've got pairs of data y_i and x_i.

So for every x_i, I've got a y_i,

and I have an associated uncertainty, sigma_i.

That is, the more uncertain I am about the data point y_i,

the bigger the uncertainty sigma_i is going to be.

So I can sketch that out,

something like this, for instance, as an example of my y_i's and my x_i's.

So I've got an x_i,

a y_i, and each one has got an uncertainty sigma_i there.

Then, I'm going to define a goodness-of-fit parameter, chi squared,

as being the sum over all the data points i

of the difference between y_i and

the model y(x_i) with its parameters a_k.

I'm going to take the squares of the differences,

and I'm going to divide each of those by sigma_i squared.

So what I'm doing here is I'm penalising each of

these differences by the uncertainty, sigma squared,

when I make chi squared, so that uncertain data points have

a low weight in my sum for chi squared and don't affect the fit too much.

If we don't know what the sigmas are,

we can just set them all to one, and they drop out.

But if we have a measure of the uncertainties,

this gives us a way to include it.
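As a concrete sketch, here's how that weighted goodness-of-fit measure might look in Python with NumPy. The data values here are my own illustrative choices, using the example function from the video:

```python
import numpy as np

def chi_squared(a, x, y, sigma, model):
    """Sum over the data points of (y_i - y(x_i; a))^2 / sigma_i^2.
    Dividing by sigma_i^2 gives uncertain points a low weight in the fit."""
    residuals = y - model(x, a)
    return np.sum((residuals / sigma) ** 2)

# The example function from the video, y = (x - a1)^2 + a2:
def model(x, a):
    return (x - a[0]) ** 2 + a[1]

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 0.0, 1.0, 4.0])
sigma = np.ones_like(y)   # no uncertainty estimates: set every sigma to 1
print(chi_squared(np.array([1.0, 0.0]), x, y, sigma, model))  # 0.0 -- a perfect fit
```

With all sigmas set to one, this is just the plain sum of squared residuals, exactly as described above.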

And my minimum of chi squared is of course going to be where

the grad of chi squared is equal to zero.

Now, in the general case, I might be able to write down an expression for the grad here,

but I might not be able to solve it algebraically.

So instead, I'm going to look to solve grad chi squared equals

zero by steepest descent, going down

the contours simply by updating the vector of fitting parameters a.

So I've got my vector a. I'm going to say that my next iteration is going to be

my current iteration minus some constant times the grad of chi squared.

So I'm going to go down the gradient here by an amount given by the constant

which is sort of a pull handle for how aggressive I

want to be in going down the steepest descent,

and that I'm going to use to make

my next guess for what the fitting parameters should be.

I'm going to write them down as a vector.

And I'll keep doing that until I reach the criterion that the grad of

chi squared is zero, which means I've found the minimum,

or, failing that, until chi squared stops changing, which should be the same thing,

or until I just give up because it's been

so many iterations and I get bored and decide that something's gone wrong.
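That iteration, with all three stopping conditions, can be sketched in a few lines of Python. The step size, tolerance, and iteration cap below are my own illustrative choices, and the gradient is passed in as a function:

```python
import numpy as np

def steepest_descent(a0, grad_chi2, step=0.1, tol=1e-8, max_iter=10_000):
    """Iterate a_next = a_current - step * grad(chi^2) until the gradient
    is (near) zero, or give up after max_iter steps."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        g = grad_chi2(a)
        if np.linalg.norm(g) < tol:   # grad chi^2 = 0: we're at the minimum
            break
        a = a - step * g
    return a

# Sanity check on a simple bowl, chi^2 = a1^2 + a2^2, whose gradient is 2a:
a_min = steepest_descent([3.0, -2.0], lambda a: 2.0 * a)
print(a_min)   # converges to (0, 0)
```

The `step` argument is the constant from the update rule: the handle for how aggressively we go down the gradient.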

So to do this grad,

I've got to differentiate chi squared.

So I've got to do d chi squared by d a_k for each of the k's in turn.

And when I do that, well, the sum has nothing to do with k,

because it's over i,

and the sigma squared has nothing to do with k either.

When I differentiate the square, I'll bring the 2 down,

and then I'll have the bracket itself,

y_i minus y of x_i and the a_k.

And then I'm going to get the differential of the bracket:

I'll get a minus sign out of that,

and I'll get dy by d a_k.

So that's going to be my differential, there.

The minus 2, I can just take out.

So when I come to update here,

if I wrap the minus 2

into the constant, the minus sign makes the update a plus, and I can ignore the 2.

So now I'm going to get a_current plus

this thing, the sum from i equals 1 to n: it's the minus signs that go,

and I wrap the 2 into the constant, so I'll just get this:

y_i minus y of x_i and the a_k, divided by sigma

squared, times the differential,

all evaluated at a_current, because I don't know a_next yet.

So the steepest descent formula here is just going to

be that a_next is equal to a_current plus this sum,

and I've got to be able to differentiate y with respect to a_k.

And this is one of those formulas that look really intimidating,

but really aren't when you actually try to use them.
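Written out in code, one update of the fitting parameters might look like the sketch below. The model and its partial derivatives are supplied as functions, and the array shapes and the constant c are my own choices:

```python
import numpy as np

def descent_step(a, x, y, sigma, model, dy_da, c=1e-3):
    """One steepest-descent update for nonlinear least squares:
    a_next = a_current + c * sum_i (y_i - y(x_i; a)) / sigma_i^2 * dy/da_k,
    with the 2 and the minus sign absorbed into the constant c."""
    weighted_residuals = (y - model(x, a)) / sigma ** 2   # shape (n,)
    J = dy_da(x, a)                                       # shape (m, n): dy/da_k at each x_i
    return a + c * J @ weighted_residuals

# With the video's example y = (x - a1)^2 + a2, one step should reduce chi squared:
model = lambda x, a: (x - a[0]) ** 2 + a[1]
dy_da = lambda x, a: np.vstack([-2.0 * (x - a[0]), np.ones_like(x)])
x = np.linspace(0.0, 4.0, 20)
y = (x - 2.0) ** 2 + 1.0
sigma = np.ones_like(x)
a0 = np.array([1.5, 0.5])
chi2 = lambda a: np.sum(((y - model(x, a)) / sigma) ** 2)
a_next = descent_step(a0, x, y, sigma, model, dy_da)
print(chi2(a0), chi2(a_next))   # chi squared goes down
```
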

For our example here,

we take this example function again.

If we differentiate that with respect to a1,

then we'll get that dy by d a1

is equal to minus 2 times (x minus a1):

I bring the 2 down, and I get a minus sign when I differentiate the stuff in the bracket.

And when I do dy by d a2,

I'm just going to get 1.

So, it's actually really easy when we come to finally use it,

but the expression looks intimidating.
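Putting it all together for that example, here is a sketch of a full fit in Python, on synthetic data generated from y = (x - a1)^2 + a2 with true parameters (2, 1). The data values, noise level, step size, and iteration count are all my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data from y = (x - 2)^2 + 1 plus a little noise:
x = np.linspace(0.0, 4.0, 50)
y = (x - 2.0) ** 2 + 1.0 + rng.normal(0.0, 0.1, x.size)
sigma = np.ones_like(x)           # no uncertainty estimates, so all ones

def model(x, a):
    return (x - a[0]) ** 2 + a[1]

def dy_da(x, a):
    # The two partial derivatives worked out above:
    # dy/da1 = -2(x - a1), dy/da2 = 1
    return np.vstack([-2.0 * (x - a[0]), np.ones_like(x)])

a = np.array([1.0, 0.0])          # initial guess
c = 1e-3                          # how aggressively we go downhill
for _ in range(5000):
    a = a + c * dy_da(x, a) @ ((y - model(x, a)) / sigma ** 2)

print(a)                          # close to the true parameters (2, 1)
```

If the constant c is made too large, the iteration overshoots and can diverge, which is one reason to watch whether chi squared actually decreases from step to step.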

So that's the steepest descent formula for the case of fitting a nonlinear function

when we're trying to minimise the sum of the squares of the residuals.

And this is therefore called nonlinear least squares fitting.

There are lots of methods more sophisticated than

steepest descent for solving these sorts of problems,

which we'll look at a little bit next time.

But first, I want you to try and give this a go,

and code it up and see how it looks for the sandpit problem that you were doing before.

So that's the simplest version of how to do a general fit: finding the minimum, or

least, value of the sum of the squares of the residuals for

a model that's nonlinear in both the function and the fitting parameters.

So it is called generalised nonlinear least squares fitting.