0:00

In addition to L2 regularization and dropout regularization, there are a few other techniques for reducing overfitting in your neural network. Let's take a look.

Let's say you're fitting a cat classifier.

Â If you are over fitting getting more training data can help, but getting more

Â training data can be expensive and sometimes you just can't get more data.

Â But what you can do is augment your training set by taking image like this.

Â And for example, flipping it horizontally and

Â adding that also with your training set.

Â So now instead of just this one example in your training set,

Â you can add this to your training example.

Â So by flipping the images horizontally,

Â you could double the size of your training set.

Â Because you're training set is now a bit redundant this isn't as good as if you had

Â collected an additional set of brand new independent examples.

Â But you could do this Without needing to pay the expense of going out to take

Â more pictures of cats.

Â And then other than flipping horizontally,

Â you can also take random crops of the image.

Â So here we're rotated and sort of randomly zoom into the image and

Â this still looks like a cat.

Â So by taking random distortions and translations of the image you could

Â augment your data set and make additional fake training examples.

Â Again, these extra fake training examples they don't add as much information as they

Â were to call they get a brand new independent example of a cat.

Â But because you can do this, almost for free, other than for

Â some confrontational costs.

Â This can be an inexpensive way to give your algorithm more data and

Â therefore sort of regularize it and reduce over fitting.
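As a rough sketch (my own illustration, not code from the lecture), a horizontal flip and a random crop can be generated with plain NumPy; the image shape and crop size here are arbitrary assumptions, and a real pipeline would use a library such as torchvision or tf.image:

```python
import numpy as np

def augment(image, crop_size=56, rng=None):
    """Return a horizontally flipped copy and a random square crop of `image`.

    image: H x W x C array. Each output is one extra "fake" training example.
    """
    rng = rng or np.random.default_rng()
    flipped = image[:, ::-1, :]  # mirror left-right: a flipped cat is still a cat
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    crop = image[top:top + crop_size, left:left + crop_size, :]
    return flipped, crop

# Example: one 64x64 RGB "image" yields two additional training examples.
img = np.arange(64 * 64 * 3, dtype=np.float32).reshape(64, 64, 3)
flipped, crop = augment(img)
```

Flipping every image doubles the training set; crops multiply it further, at only the computational cost of generating them.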

Â And by synthesizing examples like this what you're really telling your algorithm

Â is that If something is a cat then flipping it horizontally is still a cat.

Â Notice I didn't flip it vertically,

Â because maybe we don't want upside down cats, right?

Â And then also maybe randomly zooming in to part of the image it's probably

Â still a cat.

Â For optical character recognition you can also bring your data set by taking digits

Â and imposing random rotations and distortions to it.

Â So If you add these things to your training set,

Â these are also still digit force.

Â 2:14

For illustration I applied a very strong distortion.

Â So this look very wavy for, in practice you don't need to distort the four quite

Â as aggressively, but just a more subtle distortion than what I'm showing here,

Â to make this example clearer for you, right?

Â But a more subtle distortion is usually used in practice,

Â because this looks like really warped fours.

Â So data augmentation can be used as a regularization technique,

Â in fact similar to regularization.

Â There's one other technique that is often used called early stopping.

Â So what you're going to do is as you run gradient descent you're going to plot

Â your, either the training error,

Â you'll use 01 classification error on the training set.

Â Or just plot the cost function J optimizing, and

Â that should decrease monotonically, like so, all right?

Â Because as you trade, hopefully,

Â you're trading around your cost function J should decrease.

Â So with early stopping, what you do is you plot this, and

Â you also plot your dev set error.

3:17

And again, this could be a classification error on the dev set, or something like the cost function on the dev set, such as the logistic loss or the log loss. What you find is that your dev set error will usually go down for a while, and then it will increase from there. So what early stopping does is say, well, it looks like your neural network was doing best around that iteration, so let's just stop training the neural network there and take whatever parameter values achieved this dev set error.

Â So why does this work?

Â Well when you've haven't run many iterations for

Â your neural network yet your parameters w will be close to zero.

Â Because with random initialization you probably initialize w to small random

Â values so before you train for a long time, w is still quite small.

Â And as you iterate, as you train, w will get bigger and bigger and bigger until

Â here maybe you have a much larger value of the parameters w for your neural network.

Â So what early stopping does is by stopping halfway you have only

Â a mid-size rate w.

Â And so similar to L2 regularization by picking a neural network with smaller

Â norm for your parameters w, hopefully your neural network is over fitting less.

Â And the term early stopping refers to the fact that you're just

Â stopping the training of your neural network earlier.
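As a minimal sketch (my own illustration, not code from the lecture), early stopping can be implemented by checkpointing the parameters whenever the dev error hits a new minimum, and halting once it has stopped improving for a fixed number of checks; the `patience` knob and the `train_step`/`dev_error` callbacks are assumptions for this sketch:

```python
import copy

def train_with_early_stopping(model, train_step, dev_error,
                              max_iters=1000, patience=5):
    """Stop training once dev error has not improved for `patience` checks.

    `model` is any object holding mutable parameters, `train_step(model)`
    runs one gradient-descent update, and `dev_error(model)` evaluates
    the dev set. Returns the checkpointed best model and its dev error.
    """
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    stale = 0
    for _ in range(max_iters):
        train_step(model)
        err = dev_error(model)
        if err < best_error:       # new best: checkpoint the parameters
            best_error, best_model, stale = err, copy.deepcopy(model), 0
        else:                      # dev error went up or plateaued
            stale += 1
            if stale >= patience:  # looks like overfitting has begun
                break
    return best_model, best_error
```

Returning the checkpoint, rather than the final parameters, is what gives you "whatever value achieved this dev set error" from the plot.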

Â I sometimes use early stopping when training a neural network.

Â But it does have one downside, let me explain.

Â I think of the machine learning process as comprising several different steps.

Â One, is that you want an algorithm to optimize the cost function j and

Â we have various tools to do that, such as grade intersect.

Â And then we'll talk later about other algorithms, like momentum and

Â RMS prop and Atom and so on.

Â But after optimizing the cost function j, you also wanted to not over-fit.

Â And we have some tools to do that such as your regularization,

Â getting more data and so on.

Â Now in machine learning, we already have so many hyper-parameters it surge over.

Â It's already very complicated to choose among the space of possible algorithms.

Â And so I find machine learning easier to think about

Â when you have one set of tools for optimizing the cost function J,

Â and when you're focusing on authorizing the cost function J.

Â All you care about is finding w and b, so that J(w,b) is as small as possible.

Â You just don't think about anything else other than reducing this.

Â And then it's completely separate task to not over fit,

Â in other words, to reduce variance.

Â And when you're doing that, you have a separate set of tools for doing it.

Â And this principle is sometimes called orthogonalization.

Â And there's this idea, that you want to be able to think about one task at a time.

Â I'll say more about orthorganization in a later video, so

Â if you don't fully get the concept yet, don't worry about it.

Â But, to me the main downside of early stopping is that

Â this couples these two tasks.

Â So you no longer can work on these two problems independently,

Â because by stopping gradient decent early,

Â you're sort of breaking whatever you're doing to optimize cost function J,

Â because now you're not doing a great job reducing the cost function J.

Â You've sort of not done that that well.

Â And then you also simultaneously trying to not over fit.

Â So instead of using different tools to solve the two problems,

Â you're using one that kind of mixes the two.

Â And this just makes the set of

Â 6:52

things you could try are more complicated to think about.

Â Rather than using early stopping, one alternative is just use L2 regularization

Â then you can just train the neural network as long as possible.

Â I find that this makes the search space of hyper parameters easier to decompose,

Â and easier to search over.

Â But the downside of this though is that you might have to try a lot of values of

Â the regularization parameter lambda.

Â And so this makes searching over many values of lambda more computationally

Â expensive.

Â And the advantage of early stopping is that running the gradient descent process

Â just once, you get to try out values of small w, mid-size w, and

Â large w, without needing to try a lot of values of the L2 regularization

Â hyperparameter lambda.
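To make the trade-off concrete, here is a hedged sketch (my own, not from the lecture) of why trying many lambdas is expensive: each candidate requires a full training run. It uses L2-regularized logistic regression in NumPy as a stand-in for a neural network; the data, learning rate, and lambda grid are all arbitrary assumptions:

```python
import numpy as np

def train_l2_logreg(X, y, lam, lr=0.1, iters=500):
    """Gradient descent on J(w,b) = logistic loss + (lam / (2m)) * ||w||^2."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        a = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
        dw = X.T @ (a - y) / m + (lam / m) * w   # L2 term shrinks w each step
        db = np.sum(a - y) / m
        w -= lr * dw
        b -= lr * db
    return w, b

# One full training run per lambda -- this is the computational expense.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)
norms = [np.linalg.norm(train_l2_logreg(X, y, lam)[0])
         for lam in (0.0, 1.0, 10.0)]
# Larger lambda -> smaller ||w||, analogous to stopping training earlier.
```

The resulting weight norms shrink as lambda grows, which is the sense in which sweeping lambda explores small, mid-size, and large w, just as one pass of gradient descent with early stopping does.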

Â If this concept doesn't completely make sense to you yet, don't worry about it.

Â We're going to talk about orthogonalization in greater

Â detail in a later video, I think this will make a bit more sense.

Â Despite it's disadvantages, many people do use it.

Â I personally prefer to just use L2 regularization and

Â try different values of lambda.

Â That's assuming you can afford the computation to do so.

Â But early stopping does let you get a similar effect without

Â needing to explicitly try lots of different values of lambda.

Â So you've now seen how to use data augmentation as well as if you wish early

Â stopping in order to reduce variance or prevent over fitting your neural network.

Â Next let's talk about some techniques for

Â setting up your optimization problem to make your training go quickly.
