0:00

In this video, I'm going to talk about improving generalization by reducing the

overfitting that occurs when a network has too much capacity for the amount of data

it's given during training. I'll describe various ways of controlling

the capacity of a network. And I'll also describe how we determine

how to set the meta parameters when we use a method for controlling capacity.

I'll then go on to give an example where we control capacity by stopping the

learning early. Just to remind you, the reason we get

over-fitting is because as well as having information about the true regularities in

the mapping from the input to the output, any finite set of training data also contains

sampling error. There's accidental regularities in the

training set, just because of the particular training cases that were

chosen. So when we fit the model, it can't tell

which of the regularities are real, and would also exist if we sampled the

training set again, and which are caused by the sampling

error. So the model fits both kinds of

regularity. And if the model's too flexible, it'll fit

the sampling error really well, and then it'll generalize badly.

1:19

So we need a way to prevent this overfitting.

The first method I'll describe is by far the best.

And it's simply to get more data. There's no point coming up with fancy

schemes to prevent overfitting if you can get yourself more data.

Data has exactly the right characteristics to prevent overfitting.

The more of it you have the better. Assuming your computer's fast enough to

use it. A second method is to try and judiciously

limit the capacity of the model so that it's got enough capacity to fit the true

regularities but not enough capacity to fit the spurious regularities caused by

the sampling error. This of course is very difficult to do.

And I'll describe in the rest of this lecture, various approaches to trying to

regulate the capacity appropriately. In the next lecture, I'll talk about

averaging together many different models. If we average models that have different

forms and make different mistakes, the average will do better than the individual

models. We could make the models different just by

training them on different subsets of the training data.

This is a technique called bagging. There's also other ways to mess with the

training data to make the models as different as possible.
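
As a rough illustration of the bagging idea just mentioned, here is a minimal sketch. It assumes the usual form of bagging, where each model is trained on a sample drawn with replacement from the training data. The function names and the trivial "learner" that just predicts the mean target are hypothetical stand-ins, not anything from the lecture.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) training cases from data, with replacement."""
    return [rng.choice(data) for _ in data]

def train_mean_model(sample):
    """Hypothetical stand-in for a learner: it just predicts the mean target."""
    return sum(y for _, y in sample) / len(sample)

def bagged_prediction(data, n_models=25, seed=0):
    """Train one model per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    models = [train_mean_model(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return sum(models) / len(models)

# Toy (input, target) pairs, made up for the illustration.
data = [(x, 2.0 * x) for x in range(10)]
print(bagged_prediction(data))
```

Each bootstrap sample leaves out some cases and repeats others, which is what makes the individual models differ.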

3:45

A very common way to control the capacity of a neural network is to give it a number

of hidden layers or units per layer that is a little too large, but then to penalize the

weights using penalties or constraints using squared values of the weights or

absolute values of the weights. And finally, we can control the capacity

of a model by adding noise to the weights, or by adding noise to the activities.

Typically, we use a combination of several of these different capacity control

methods. Now for most of these methods, there's

meta parameters that you have to set. Like the number of hidden units, or the

number of layers, or the size of the weight penalty.
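
To make the weight penalties concrete, here is a small sketch of a cost that adds either a squared-weight penalty or an absolute-value penalty to the data error. The function names and the `strength` parameter (the size of the weight penalty, one of the meta parameters) are assumptions for the illustration.

```python
def l2_penalty(weights, strength):
    """Penalty proportional to the sum of squared weights."""
    return strength * sum(w * w for w in weights)

def l1_penalty(weights, strength):
    """Penalty proportional to the sum of absolute weights."""
    return strength * sum(abs(w) for w in weights)

def penalized_cost(data_error, weights, strength, kind="l2"):
    """Total cost: error on the training data plus the chosen weight penalty."""
    penalty = l2_penalty if kind == "l2" else l1_penalty
    return data_error + penalty(weights, strength)

# Made-up weights and data error, just to show the two penalties side by side.
weights = [0.5, -2.0, 0.0]
print(penalized_cost(1.0, weights, 0.1, "l2"))
print(penalized_cost(1.0, weights, 0.1, "l1"))
```

The squared penalty punishes the one large weight much more heavily than the absolute-value penalty does.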

4:32

An obvious way to try and set those meta parameters is to try lots of different

values of one of the meta parameters like, for example, the number of hidden units,

and see which gives the best performance on the test set.

But there's something deeply wrong with that.

It gives a false impression of how well the method will work if you give it

another test set. So the settings that work best for one

particular test set are unlikely to work as well on a new test set that's drawn

from the same distribution because they've been tuned to that particular test set.

And that means you get a false impression of how well you would do on a new test

set. Let me give you an extreme example of

that. Suppose the test set really is random,

quite a lot of financial data seems to be like that.

So the answers just don't depend on the inputs or can't be predicted from the

inputs. If you choose the model that does best on

your test set, that will obviously do better than chance because you selected it

to do better than chance. But if you take that model and try it on

new data that's also random, you can't expect it to do better than chance.

So by selecting a model, you got a false impression of how well a model will do on

new data and the question is, is there a way around that?
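
The extreme example can be simulated directly. The sketch below is an assumed setup, not something from the lecture: labels are coin flips that are independent of the inputs, each "model" is a fixed random lookup table from input to label, and we pick the model that scores best on the test set. All names are hypothetical.

```python
import random

def accuracy(model, cases):
    """Fraction of (input, label) cases the lookup-table model gets right."""
    return sum(model[x] == y for x, y in cases) / len(cases)

def random_cases(rng, n, n_inputs):
    """Cases whose labels are coin flips, independent of the inputs."""
    return [(rng.randrange(n_inputs), rng.randint(0, 1)) for _ in range(n)]

def selection_bias_demo(n_models=200, n_inputs=1000, n_test=100,
                        n_new=10000, seed=0):
    rng = random.Random(seed)
    test_set = random_cases(rng, n_test, n_inputs)
    new_set = random_cases(rng, n_new, n_inputs)
    # Each model is a fixed random mapping from input to predicted label.
    models = [[rng.randint(0, 1) for _ in range(n_inputs)]
              for _ in range(n_models)]
    # Select the model that does best on the test set.
    best = max(models, key=lambda m: accuracy(m, test_set))
    return accuracy(best, test_set), accuracy(best, new_set)

test_acc, new_acc = selection_bias_demo()
print("accuracy on the test set used for selection:", test_acc)
print("accuracy on new random data:", new_acc)
```

The selected model looks better than chance on the test set it was chosen on, but on the new data it should be close to 0.5.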

6:13

You hold back some validation data, which isn't going to be used for training.

But is going to be used for deciding how to set the meta parameters.

In other words, you're going to look at how well the model does on the validation

data to decide what's an appropriate number of hidden units or an appropriate

size of weight penalty. But then once you've done that, and

trained your model with what looks like the best number of hidden units and the

best weight penalty, you're then going to see how well it does

on the final set of data that you've held back which is the test data.

And you must only use that once. And that'll give you an unbiased estimate

of how well the network works. And in general that estimate will be a

little worse than on the validation data. Nowadays in competitions, the people

organizing the competitions have learned to hold back that true test data and get

people to send in predictions so they can see whether they really can predict on

true test data, or whether they're just over-fitting to the validation data by

selecting meta-parameters that do particularly well on the validation data

but won't generalize to new test sets. One way we can get a better estimate of

our weight penalties or number of hidden units or anything else we're trying to fix

using the validation data, is to rotate the validation set.

So, we hold back a final test set to get our final unbiased estimate.

But then we divide the other data into N equal sized subsets and we train on all

but one of those N, and use the Nth as a validation set.

Then we can rotate and hold back a

so we can get many different estimates of what the best weight penalty is, or the

best number of hidden units is. This is called N-fold cross-validation.
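
The rotation can be sketched as follows. This is a minimal illustration under assumptions: the helper names, the toy "shrunken mean" learner, and the candidate penalty values are all hypothetical, and the subsets are only equal sized when N divides the number of cases.

```python
def n_fold_splits(cases, n_folds):
    """Yield (training, validation) pairs, rotating which subset is held out."""
    folds = [cases[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        validation = folds[held_out]
        training = [c for i, fold in enumerate(folds) if i != held_out
                    for c in fold]
        yield training, validation

def cross_validate(cases, n_folds, candidates, fit, error):
    """Return the candidate meta parameter with the lowest mean validation error."""
    def mean_error(value):
        scores = [error(fit(train, value), val)
                  for train, val in n_fold_splits(cases, n_folds)]
        return sum(scores) / len(scores)
    return min(candidates, key=mean_error)

def fit(train, penalty):
    """Toy learner: a mean that is shrunk towards zero by the penalty."""
    return sum(train) / (len(train) + penalty)

def error(model, val):
    """Mean squared error of a constant prediction on the validation cases."""
    return sum((y - model) ** 2 for y in val) / len(val)

data = [float(v) for v in range(12)]
test_set, rest = data[:2], data[2:]  # the final test set is never used here
best_penalty = cross_validate(rest, 5, [0.0, 1.0, 10.0], fit, error)
print(best_penalty)
```

Only `rest` is rotated. The test set stays untouched until the meta parameters have been fixed.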

It's important to remember, the N different estimates we get are not

independent of one another. If for example, we were really unlucky and

all the examples of one class fell into one of those subsets,

we'd expect to generalize very badly. And we'd expect to generalize very badly,

whether that subset was the validation subset or whether it was in the training

data. So now I'm going to describe one

particularly easy to use method for preventing overfitting.

It's good when you have a big model on a small computer and you don't have the time

to train a model many different times with different numbers of hidden units or

different size weight penalties. What you do is you start with small

weights, and as the model trains, they grow.

And you watch the performance on the validation set.

And as soon as it starts to get worse, you stop training.

9:00

Now, the performance on the validation set may fluctuate, particularly if you're

measuring error rate rather than a squared error or cross-entropy error.

And so it's hard to decide when to stop, and so what you typically do is keep going

until you're sure things are getting worse and then go back to the point at which

things were best. The reason this controls the capacity of

the model, is because models with small weights generally don't have as much

capacity, and the weights haven't had time to grow big.
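
The "keep going until you're sure, then go back" procedure might look like the sketch below. The `patience` rule (stop after a fixed number of epochs without improvement) is one assumed way of deciding that things are getting worse. The function names and the toy model are hypothetical.

```python
import copy

def train_with_early_stopping(model, train_step, validation_error,
                              patience=5, max_epochs=1000):
    """Train until validation error has not improved for `patience` epochs,
    then return a copy of the model from the point at which it was best."""
    best_error = validation_error(model)
    best_model = copy.deepcopy(model)
    epochs_since_best = 0
    for _ in range(max_epochs):
        train_step(model)
        err = validation_error(model)
        if err < best_error:
            best_error = err
            best_model = copy.deepcopy(model)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break  # fairly sure things are getting worse
    return best_model, best_error

# Toy stand-in: one weight that starts small and grows every epoch, with a
# validation error that is lowest when the weight is 1.0.
model = {"w": 0.0}

def train_step(m):
    m["w"] += 0.1

def validation_error(m):
    return (m["w"] - 1.0) ** 2

best_model, best_error = train_with_early_stopping(model, train_step,
                                                   validation_error)
print(best_model["w"], model["w"])
```

Training runs a few epochs past the best point, but the returned weights are the ones saved at the best validation error.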

9:37

So consider a model with some input units, some hidden units, and some output units.

When the weights are very small, if the hidden units are logistic units, their

total inputs will be close to zero, and they'll be in the middle of their linear

range. That is, they'll behave very like linear

units. What that means is, when the weights are

small, the whole network is the same as a linear network that maps the inputs

straight to the outputs. So, if you multiply that weight matrix W1

by that weight matrix W2, you'll get a weight matrix that you can use to connect

the inputs to the outputs and provided the weights are small, a net with a layer of

logistic hidden units will behave pretty much the same as that linear net.

Provided we also divide the weights in the linear net by four, which takes into

account the fact that when the hidden units are in that linear region,

they have a slope of a quarter. So it's got no more capacity than the

linear net, so even though in that network I'm showing you there's 3 × 6 + 6 × 2

weights, it's really got no more capacity than a network with 3 × 2

weights. As the weights grow,

we start using the non-linear region of the sigmoids.

And then we start making use of all those parameters.
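
The small-weight argument can be checked numerically. The sketch below assumes a net with 3 inputs, 6 logistic hidden units and 2 linear outputs, no biases, and made-up random weights. Near zero the logistic is approximately 0.5 + z/4, so the hidden layer acts like a linear layer with a slope of a quarter. The constant 0.5 term works like a fixed offset on the outputs, a detail the argument above leaves out.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def logistic_net(w1, w2, x):
    """Inputs -> logistic hidden units -> linear outputs."""
    hidden = [logistic(z) for z in matvec(w1, x)]
    return matvec(w2, hidden)

def linear_approximation(w1, w2, x):
    """Replace each logistic unit by its tangent line at zero: 0.5 + z / 4."""
    hidden = [0.5 + z / 4.0 for z in matvec(w1, x)]
    return matvec(w2, hidden)

rng = random.Random(0)
scale = 0.01  # small weights, as at the start of learning
w1 = [[rng.uniform(-scale, scale) for _ in range(3)] for _ in range(6)]
w2 = [[rng.uniform(-scale, scale) for _ in range(6)] for _ in range(2)]
x = [1.0, -2.0, 0.5]

exact = logistic_net(w1, w2, x)
approx = linear_approximation(w1, w2, x)
max_gap = max(abs(a - b) for a, b in zip(exact, approx))
print("largest gap between the two nets:", max_gap)

n_weights_full = 3 * 6 + 6 * 2   # weights in the net with hidden units
n_weights_linear = 3 * 2         # weights in the equivalent linear net
print(n_weights_full, n_weights_linear)
```

With larger values of `scale` the gap grows, which is the point at which the extra parameters start to matter.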

11:06

So if the network has six weights at the beginning of learning and has 30 weights

at the end of learning, then we could think of the capacity as

changing smoothly from six parameters to 30 parameters as the weights get bigger.

And what's happening in early stopping is we're stopping the learning when it has

the right number of parameters to do as well as possible on the validation data.

That is when it's optimized the trade off between fitting the true regularities in

the data and fitting the spurious regularities that are just there because

of the particular training examples we chose.