Hi, my name is Andrey. This week, you will learn how to solve computer vision tasks with neural networks. You already know about the multilayer perceptron, which has lots of hidden layers. In this video, we will introduce a new layer of neurons specifically designed for image input. What is image input? Let's take a grayscale image as an example. It is actually a matrix of pixels, or picture elements. The dimensions of this matrix are called the image resolution; for example, it can be denoted as 300 by 300. Each pixel stores its brightness, or intensity, ranging from 0 to 255. Zero intensity corresponds to black. You can see that in the example of the giraffe image: in the close-up of the left ear, dark colors correspond to values close to zero, and light colors are close to 255. Color images store pixel intensities for three different channels: red, green, and blue. You know that neural networks like normalized inputs. Now you know that an image is just a matrix of numbers, so let's normalize them: divide by 255 and subtract 0.5. This way our numbers are normalized and have roughly zero mean. What do we do next? You already know about the multilayer perceptron, right? So what if we use it for the same task? We take our pixels, which are the green nodes here, and for each of these pixels, we train a weight W. Our perceptron takes all those inputs, multiplies them by the weights, adds a bias term, and passes the result through an activation function. It seems like we can use it for images, right? But actually it doesn't work like that. Let's look at an example. Say we want to train a cat detector. On this training image, where we have a cat in the lower right corner, the red weights will change during backpropagation to better detect a cat. But let's take a different example, where we have a cat in the upper left corner. Then the green weights will change. So what is the problem here?
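The normalization step described above can be sketched in a few lines of Python. The image here is just random stand-in data, not an image from the lecture:

```python
import numpy as np

# Stand-in for a 300x300 grayscale image: integer intensities in [0, 255].
image = np.random.randint(0, 256, size=(300, 300)).astype(np.float64)

# Normalize as described in the lecture: divide by 255, subtract 0.5.
# Values now lie in [-0.5, 0.5] and have roughly zero mean.
normalized = image / 255.0 - 0.5

print(normalized.min(), normalized.max(), normalized.mean())
```

Centering the inputs around zero like this typically makes gradient-based training better behaved.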
The problem is that we learn the same cat features in different areas, and hence we don't fully utilize the training set. The red weights are only trained on the images where the cat is in that corner, and the same goes for the green weights. What if cats in the test set appear in different places? Then our neurons are just not ready for that. Luckily, we have convolutions. A convolution is a dot product of a kernel, or filter, and a patch of the image of the same size, which is also called a local receptive field. Let's see an example of how it works. We have an input, which can be an image, and we have a sliding window, which has a red border. Let's extract the first patch, a local receptive field, and multiply it by the kernel. We're actually taking a dot product, and what we get here is one plus four, which is five. Then we slide that window across the image, and for all possible locations, we take dot products with the kernel. For example, somewhere in the middle of the image, our patch can be 1, 1, 0, 1, and if we take a dot product with the kernel, we will have 1 + 2 + 4, which is seven. Actually, convolutions have been used for a while. Let's see an example. We have an original image, and we have a kernel with an eight in the center, and all the rest are minus ones. How does it work? The kernel entries sum up to zero, so when the patch is a solid fill, when all the inputs of our patch are the same color, the output is zero, which corresponds to black. It works like edge detection, because anywhere we have an edge, as opposed to a solid fill, we get a non-zero activation. Another example is a sharpening filter. It has a five in the center and minus ones to the north, west, east, and south, so it doesn't sum up to zero, and it doesn't work like edge detection.
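The sliding-window dot product can be sketched as a naive Python function. The 2x2 kernel values `[[1, 2], [3, 4]]` below are an assumption, chosen so that the patch 1, 1, 0, 1 gives 1 + 2 + 4 = 7, matching the example in the video (note that deep learning "convolution" is technically cross-correlation, since the kernel is not flipped):

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' convolution: slide the kernel over the image
    and take a dot product with each patch (local receptive field)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # dot product with the kernel
    return out

# The patch 1, 1, 0, 1 with an assumed kernel [[1, 2], [3, 4]]:
# 1*1 + 1*2 + 0*3 + 1*4 = 7, as in the video.
patch = np.array([[1, 1],
                  [0, 1]])
kernel = np.array([[1, 2],
                   [3, 4]])
print(conv2d(patch, kernel))  # [[7.]]
```

Real libraries implement this with highly optimized routines, but the two nested loops are the whole idea.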
For solid fills, it actually outputs the same color, but when we have an edge, it adds a little bit of intensity on the edges, because it is somewhat similar to the edge detection kernel. That's why we perceive it as an increase in sharpness. Last but not least is a simple convolution that takes the average of its inputs; this way we lose details, and it acts like blurring. Convolution is actually similar to correlation. Let's take an input where a backslash is painted on the image. If we convolve it with a kernel that looks like a backslash, then for two locations of our sliding window we will have a non-zero dot product. They're denoted by a red border here. In the output, we have one and two, and all the rest are zeros. If we take a different image, where the slash is not a backslash but a forward slash, and we convolve it with the same backslash kernel, then in the output we will have something like this: two activations of one, and the rest are zero. What can we see here? If we take the maximum value of activations from our convolutional layer, for the first example it will be two, and for the second one, one. So it looks like we've made a simple classifier of backslashes in our image. Another interesting property of convolution is translation equivariance. It means that if we translate the input and then apply convolution, it will act the same as if we first applied convolution and then translated the result. Let's look at the example. We moved our backslash on the image, and in the convolution result we have the same numbers, but they're translated. So if we take the maximum of these outputs, it will actually stay the same, which means our simple classifier is invariant to translation. How does a convolutional layer in a neural network work? First, we have an input, which can be an image, and we add so-called padding.
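A minimal sketch of the backslash classifier and its translation invariance, assuming a 2x2 backslash kernel on a 4x4 image (the exact sizes in the video may differ):

```python
import numpy as np

def conv2d(image, kernel):
    # Naive 'valid' convolution, as introduced above.
    kh, kw = kernel.shape
    return np.array([[np.sum(image[i:i + kh, j:j + kw] * kernel)
                      for j in range(image.shape[1] - kw + 1)]
                     for i in range(image.shape[0] - kh + 1)])

# A 2x2 "backslash" kernel and a 4x4 image with a backslash drawn on it.
kernel = np.array([[1, 0],
                   [0, 1]])
backslash = np.zeros((4, 4))
np.fill_diagonal(backslash, 1)

# The "classifier" score is the maximum activation:
# 2 for a backslash, only 1 for a forward slash.
print(conv2d(backslash, kernel).max())           # 2.0
forward = np.fliplr(backslash)
print(conv2d(forward, kernel).max())             # 1.0

# Translation equivariance: shifting the input shifts the output,
# so the maximum (the classifier score) does not change.
shifted = np.roll(backslash, 1, axis=1)
print(conv2d(shifted, kernel).max())             # 2.0
```

Taking the maximum over the feature map is what makes the shift-equivariant convolution output a shift-invariant score.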
It is denoted as the gray area, and it is necessary so that our convolution result will have the same dimensions as the input. Okay, let's look at how it works. We'll take the first three-by-three patch from our image with padding. If we take a dot product with our kernel, which has weights that we need to train, from W1 to W9, then in the output we will have W6 plus W8 plus W9, plus a bias term, and then we apply an activation function, which can be a sigmoid. If we move that window, then we will get a different neuron, right? The step with which we move that window is called a stride. In this example, we have a stride of one, and we have a new output, which is W5 plus W7 plus W8, and notice that W8 is reused, we actually shared that weight, plus a bias term, and then we apply the sigmoid activation. If we continue to do that for all our output neurons, we will get a so-called feature map, and it has the same dimensions as the input image, three by three, and we employed only 10 parameters to calculate it. How does backpropagation work for convolutional neural networks? Let's look at this simple example. We have a three-by-three input, and we have a two-by-two convolution, which means that we have four weights to train. Let's take the first patch from the image, which has a purple border, and denote the copy of W4 used there with B. If we move the window to the right, then the next patch will use the parameter A in place of W4. Actually, we have four different locations where W4 is used. Let's assume for a moment that these are not all W4 but different parameters. How will backpropagation work then? We compute the gradients dL/dA, dL/dB, and so forth, and we have to make a step in the direction opposite to the gradient, right? If we look at all these update rules, you can see that we are updating A, B, C, and D with some rule, but actually A, B, C, and D are the same parameter W4, because we shared it in the convolutional layer.
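The shared-weight gradient argument can be checked numerically. This is a sketch with a toy loss (the sum of all output neurons, chosen for illustration): the gradient of the shared weight W4 equals the sum of its per-location contributions, the "A, B, C, D" from the video:

```python
import numpy as np

# A 3x3 input and a 2x2 kernel of shared weights, as in the video's example.
x = np.arange(9.0).reshape(3, 3)
w = np.array([[0.1, 0.2],
              [0.3, 0.4]])

def conv2d(image, kernel):
    kh, kw = kernel.shape
    return np.array([[np.sum(image[i:i + kh, j:j + kw] * kernel)
                      for j in range(image.shape[1] - kw + 1)]
                     for i in range(image.shape[0] - kh + 1)])

def loss(kernel):
    # Toy loss: just sum every output neuron.
    return conv2d(x, kernel).sum()

# W4 is kernel[1, 1]; at output position (i, j) it multiplies x[i+1, j+1].
# Summing its gradient over all four locations where it is used:
grad_w4_shared = sum(x[i + 1, j + 1] for i in range(2) for j in range(2))

# Sanity check with a finite-difference gradient on kernel[1, 1].
eps = 1e-6
w_plus = w.copy()
w_plus[1, 1] += eps
grad_w4_numeric = (loss(w_plus) - loss(w)) / eps

print(grad_w4_shared, grad_w4_numeric)  # both close to 4 + 5 + 7 + 8 = 24
```

This is exactly the update rule from the lecture: the step for a shared weight is the sum of the gradients of all its copies.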
That means that we're effectively changing the value of W4, and the step that we make is equal to the sum of the gradients for all the parameters A, B, C, and D. That's how backpropagation works for a convolutional layer: we just sum up the gradients for the same shared weight. In a convolutional layer, the same kernel is used for every output neuron, and that way we share the parameters of the network and train a better model. Remember the cat problem, when the cat appeared in different regions of the image? With a convolutional layer, we will train the same cat features no matter where the cat is. Let's look at the example: we have a 300 by 300 input, an output of the same size, and a five-by-five convolutional kernel. In a convolutional layer, we will have only 26 parameters to train, but if we want to make it a fully connected layer, where each output is a perceptron, then we will need about eight billion parameters. That is too much. A convolutional layer can be viewed as a special case of a fully connected layer, where all the weights outside the local receptive field of each neuron equal zero, and kernel parameters are shared between neurons. To wrap it up, we have introduced the convolutional layer, which works better than a fully connected layer for images. This layer will be used as a building block for large and deep neural networks. In the next video, we will introduce one more layer that we will need to build our first fully working convolutional network.
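The parameter counts from the example can be verified with simple arithmetic (the fully connected layer's biases are ignored here, matching the lecture's rough "eight billion" estimate):

```python
# 300x300 input, 300x300 output, 5x5 convolutional kernel.
h = w = 300
kernel_size = 5

# Convolutional layer: 5*5 shared weights plus 1 bias.
conv_params = kernel_size * kernel_size + 1
print(conv_params)  # 26

# Fully connected layer: every output neuron connects to every input pixel.
fc_params = (h * w) * (h * w)
print(fc_params)  # 8100000000, about eight billion
```

The roughly 300-million-fold reduction in parameters is exactly what weight sharing and local receptive fields buy us.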