0:00

We've all been hearing that deep neural networks work really well for a lot of problems, and it's not just that they need to be big neural networks; it's that specifically, they need to be deep, or to have a lot of hidden layers.

So why is that? Let's go through a couple of examples and try to gain some intuition for why deep networks might work well.

So first, what is a deep network computing? If you're building a system for face recognition or face detection, here's what a deep neural network could be doing. Perhaps you input a picture of a face; then you can think of the first layer of the neural network as maybe being a feature detector or an edge detector.

In this example, I'm plotting what a neural network with maybe 20 hidden units might compute on this image. The 20 hidden units are visualized by these little square boxes. So for example, this little visualization represents a hidden unit that's trying to figure out where the edges of that orientation are in the image.

And maybe this other hidden unit is trying to figure out where the horizontal edges in this image are.

And when we talk about convolutional networks in a later course, this particular visualization will make a bit more sense.

But informally, you can think of the first layer of the neural network as looking at the picture and trying to figure out where the edges in the picture are, by grouping together pixels to form edges. It can then take the detected edges and group edges together to form parts of faces.

So for example, you might have a low-level neuron trying to see if it's finding an eye, or a different neuron trying to find a part of the nose. And so by putting together lots of edges, it can start to detect different parts of faces.

And then, finally, by putting together different parts of faces, like an eye or a nose or an ear or a chin, it can then try to recognize or detect different types of faces.

So intuitively, you can think of the earlier layers of the neural network as detecting simple functions, like edges, and then composing them together in the later layers of the neural network, so that it can learn more and more complex functions.

These visualizations will make more sense when we talk about convolutional nets.

And one technical detail of this visualization: the edge detectors are looking at relatively small areas of an image, maybe very small regions like that, and the face detectors might look at much larger areas of the image. But the main intuition you should take away from this is that the network finds simple things, like edges, first, then builds them up, composing them together to detect more complex things, like an eye or a nose, and then composing those together to find even more complex things.
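To make the edge-detector idea a bit more concrete, here is a small illustrative sketch, not from the lecture: a fixed 3x3 filter (a trained network would learn its own filters) that responds where the image goes from bright to dark as you move down, i.e. along horizontal edges. The `convolve2d` helper is my own minimal implementation.

```python
import numpy as np

# Hypothetical hand-designed filter: responds to horizontal edges, where
# bright pixels sit above dark pixels. A real network learns such filters.
horizontal_edge_filter = np.array([[ 1,  1,  1],
                                   [ 0,  0,  0],
                                   [-1, -1, -1]])

def convolve2d(image, kernel):
    """Naive 'valid' 2-D cross-correlation (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A tiny 6x6 "image": bright on top, dark on the bottom -> one horizontal edge.
image = np.zeros((6, 6))
image[:3, :] = 1.0

response = convolve2d(image, horizontal_edge_filter)
# The response is largest in the rows where the bright and dark regions meet,
# and zero in regions of uniform brightness.
```

A first hidden layer can be thought of as a bank of many such filters, each lighting up for a different edge orientation.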

And this type of simple-to-complex hierarchical representation, or compositional representation, applies to other types of data than images and face recognition as well.

For example, if you're trying to build a speech recognition system, it's hard to visualize speech, but if you input an audio clip, then maybe the first layer of a neural network might learn to detect low-level audio waveform features, such as: is this tone going up or down? Is it white noise, or a hissing sound like [SOUND]? And what is the pitch? So it can learn to detect low-level waveform features like that.

And then by composing low-level waveforms, maybe it'll learn to detect basic units of sound, which in linguistics are called phonemes. For example, in the word "cat", the C is a phoneme, the A is a phoneme, and the T is another phoneme. So it learns to find the basic units of sound, and then by composing those together, maybe it learns to recognize words in the audio.

And then maybe it composes those together, in order to recognize entire phrases or sentences. So a deep neural network with multiple hidden layers might be able to have the earlier layers learn these lower-level simple features, and then have the later, deeper layers put together the simpler things it's detected in order to detect more complex things, like recognizing specific words, or even phrases or sentences being uttered, in order to carry out speech recognition.

And what we see is that whereas the earlier layers are computing what seem like relatively simple functions of the input, such as where the edges are, by the time you get deep into the network you can actually do surprisingly complex things, such as detect faces, or detect words or phrases or sentences.

Some people like to make an analogy between deep neural networks and the human brain, where we believe, or neuroscientists believe, that the human brain also starts off detecting simple things, like edges in what your eyes see, and then builds those up to detect more complex things, like the faces that you see.

I think analogies between deep learning and the human brain are sometimes a little bit dangerous. But there is a lot of truth to this being how we think the human brain works: the human brain probably detects simple things like edges and then puts them together to form more and more complex objects, and so that has served as a loose form of inspiration for deep learning as well. We'll see a bit more about the human brain, or about the biological brain, in a later video this week.

Â 5:35

The other piece of intuition about why deep networks seem to work well is the following. This result comes from circuit theory, which pertains to thinking about what types of functions you can compute with different logic gates. Informally, there are functions you can compute with a relatively small but deep neural network, where by small I mean that the number of hidden units is relatively small. But if you try to compute the same function with a shallow network, so with fewer hidden layers, then you might require exponentially more hidden units.

So let me just give you one example and illustrate this a bit informally. Let's say you're trying to compute the exclusive OR, or the parity, of all your input features. So you're trying to compute x1 XOR x2 XOR x3 XOR ... up to xn, if you have n input features. You can build an XOR tree like this: first compute the XOR of x1 and x2, then take x3 and x4 and compute their XOR.

And technically, if you're just using AND, OR, and NOT gates, you might need a couple of layers to compute the XOR function rather than just one layer, but with a relatively small circuit you can compute the XOR, and so on. And then you can keep building an XOR tree like so, until eventually you have a circuit here that outputs, well, let's call this y, the output y hat, which equals y, the exclusive OR, the parity, of all these input bits.

So to compute the XOR, the depth of the network will be on the order of log n; we'll just have an XOR tree. So the number of nodes, or the number of circuit components, or the number of gates in this network is not that large. You don't need that many gates in order to compute the exclusive OR.

But now, if you are not allowed to use a neural network with multiple hidden layers, in this case on the order of log n hidden layers, and you're forced to compute this function with just one hidden layer, then you have all these inputs going into the hidden units, and these then output y.

Then in order to compute this XOR function, this hidden layer will need to be exponentially large, because essentially you need to exhaustively enumerate all 2^n possible configurations of the input bits that result in the exclusive OR being either 1 or 0. So you end up needing a hidden layer that is exponentially large in the number of bits. I think technically you could do this with 2^(n-1) hidden units, but that's still on the order of 2^n, so it's exponentially large in the number of bits.
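The XOR-tree argument can be sketched in code. This is my own illustrative sketch (the `xor_tree` helper is hypothetical, not from the lecture): a balanced tree computes the parity of n bits in about log2(n) levels, using only about n - 1 XOR gates in total, whereas the shallow-circuit argument above requires on the order of 2^n units.

```python
def xor_tree(bits):
    """Compute the parity of the input bits with a balanced XOR tree.
    Returns (parity, depth); depth is on the order of log2(n)."""
    level = list(bits)
    depth = 0
    while len(level) > 1:
        # Each tree level XORs adjacent pairs; a leftover odd element
        # simply passes through to the next level.
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth

bits = [1, 0, 1, 1, 0, 1, 0, 0]   # n = 8 input bits
parity, depth = xor_tree(bits)
# parity equals sum(bits) % 2, and depth == log2(8) == 3
```

With n = 8 the tree is only 3 levels deep; doubling n adds just one more level, while the single-hidden-layer construction roughly doubles in width.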

So I hope this gives you a sense that there are mathematical functions that are much easier to compute with deep networks than with shallow networks. Actually, I personally find this result from circuit theory less useful for gaining intuition, but it's one of the results that people often cite when explaining the value of having very deep representations.

Now, in addition to these reasons for preferring deep neural networks, I think the other reason the term deep learning has taken off is just branding. These things we used to just call neural networks with a lot of hidden layers, but the phrase "deep learning" is a great brand; it's just so deep. So I think that once that term caught on, and neural networks, or neural networks with many hidden layers, were rebranded, it helped to capture the popular imagination as well. But regardless of the PR branding, deep networks do work well.

Sometimes people go overboard and insist on using tons of hidden layers. But when I'm starting out on a new problem, I'll often really start out with even logistic regression, then try something with one or two hidden layers, and use the number of hidden layers as a parameter, or hyperparameter, that you tune in order to try to find the right depth for your neural network.
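As a rough sketch of treating depth as a hyperparameter, here is a hypothetical example, not from the lecture: the `init_network` helper and the candidate architectures are my own, and a real search would train each candidate and compare dev-set error rather than just count parameters.

```python
import numpy as np

def init_network(layer_dims, seed=0):
    """Initialize a fully connected net described by layer_dims,
    e.g. [n_x, 4, 4, 1] means two hidden layers of 4 units each."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        # Small random weights and zero biases for layer l.
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

n_x = 12288  # e.g. a 64x64x3 input image, flattened
# Candidate depths, from logistic regression (no hidden layer) upward.
candidates = {
    "logistic regression": [n_x, 1],
    "1 hidden layer":      [n_x, 4, 1],
    "2 hidden layers":     [n_x, 4, 4, 1],
}
param_counts = {name: sum(p.size for p in init_network(dims).values())
                for name, dims in candidates.items()}
# In practice, train each candidate and keep the depth with the best
# dev-set performance.
```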

But over the last several years there has been a trend toward people finding that for some applications, very, very deep neural networks, with maybe many dozens of layers, can sometimes be the best model for a problem.

So that's it for the intuitions for why deep learning seems to work well. Let's now take a look at the mechanics of how to implement not just forward propagation, but also back propagation.
