My name is Hanlin Tang. I'm a senior algorithms engineer in Intel's AI Products Group. In this lecture, we will discuss convolutional neural networks, a category of neural networks that has proven very effective in areas such as image recognition and classification. They are used to identify faces, to power self-driving cars, and they form an integral part of many of our robotic applications. We will start this session by reviewing some key concepts from the previous lecture and then provide an overview of convolutional neural networks. We will then look at some of the important component layers in a CNN, such as convolution, pooling, dropout, and more. Finally, we will end with an overview of many of the popular CNN architectures in use today. During the last lecture, we discussed the training procedure for deep neural networks. Take the example of digit classification, where the input to the model is the pixels of an image, and the output is a probability distribution over the 10 digits. Note that this figure shows only three output units, but imagine there are 10 instead. If the image corresponds to the number four, the vector y will have all entries zero except for the one corresponding to the digit four, which contains a one, representing the confidence score of the model. To train the network, we slowly adjust the weights such that the actual output of the model matches the expected output, causing our cost to decrease. During the last lecture, we also went over some of the key optimization algorithms used, such as gradient descent and stochastic gradient descent, and other key concepts such as momentum and backpropagation. We also looked at the initializers, optimizers, activations, costs, and metrics used in deep learning models. These are many of the important components that form the deep learning models that we want to build. Today, we'll pay particular attention to convolution, pooling, deconvolution, dropout, and local response normalization layers, as these are many of the important components inside a typical convolutional neural network. In a convolutional neural network, we take an input image, say NxN pixels, and pass it through a sequence of convolution operations that ends in a classification. The individual layers learn which complex features to extract, and these features become more and more sophisticated as you go deeper into the network. For example, the early layers here in Conv1 may only be detecting simple edges in the image, whereas subsequent, deeper layers will begin to learn complex textures, object parts, and finally object classes. In addition to classification, convolutional neural networks can be used with great results in a number of applications. They are popular with any type of two-dimensional input where the sense of space is an important consideration. This includes not just videos and images, but also synthetic aperture radar, for example, or even spectrograms, where you have a plot of time against frequency and understanding the spatial relationship between those two dimensions is important for the network to perform its task. Additionally, convolutional neural networks are used in a variety of other applications beyond just image classification. In this example here, we have SegNet, which is a popular model for semantic image segmentation.
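As a minimal sketch of the training recap above, here is a one-hot target for the digit four and a single gradient descent weight update in NumPy. The learning rate, weight shapes, and the placeholder gradient are illustrative assumptions, not values from the lecture:

```python
import numpy as np

# One-hot target for the digit 4: all zeros except the entry for class 4.
y = np.zeros(10)
y[4] = 1.0

# A single gradient descent step: nudge the weights in the direction that
# decreases the cost. In practice `grad` would come from backpropagation;
# here it is a placeholder just to show the update rule.
learning_rate = 0.1                 # illustrative value
W = np.random.randn(10, 784) * 0.01  # e.g., weights mapping 784 pixels to 10 classes
grad = np.zeros_like(W)             # placeholder gradient of the cost w.r.t. W
W -= learning_rate * grad
```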
So the input to this model is an image, and the output is an image of exactly the same size, except that the pixels are colored according to the particular category that each object belongs to: building, road, sidewalk, cars, or trees. Finally, convolutional networks are also often used to take an image and not just identify where the individual objects are in that scene, but also classify them into particular categories. You can see some examples here on the right, where we have several images and the model learns to put bounding boxes around each individual object and also attach a particular category to it. Here we want to motivate why you would want to use convolutional neural networks. In the previous lecture, we used a feedforward network to classify images in the MNIST dataset. So you may ask, why do we need a different method to classify images in other datasets? If you recall, in the MNIST dataset the input image had 784 pixels. That meant that the first layer of weights had about 100,000 parameters, which is manageable for this particular model. But a lot of the natural images that we encounter are not so small. For example, a typical image might be 256 by 256 pixels, or around 65,000 pixels total. That would mean that the first layer weights have about 8 million parameters. This approach does not scale to real images: you end up building networks with so many parameters that they are difficult to train with the amount of data that you have. Additionally, we also want to take advantage of the spatial locality present in images. That is the intuition that nearby pixels are more correlated than pixels that are farther away, and we can use this prior knowledge to reduce the number of parameters. So what we're going to do in this example is take the affine layer, the linear layer that we had shown before for the MNIST dataset, and delete connections. Importantly, we are going to delete connections in a way that is consistent with the understanding that nearby elements are related to each other. Specifically, we will delete connections such that each unit in the subsequent layer is connected only to a local patch of the input layer. This takes advantage of the intuition about spatial locality that we discussed previously. But the fundamental operations are exactly the same: we take the two input elements, multiply them by a set of weights, add a bias, and pass the result through a non-linear activation function to produce the output in the hidden layer. We now call these weights filters. So why do we call them filters? Here's an example. As you recall, the output z is a weighted sum of the inputs according to our weights w1 and w2. Suppose the two weights were one and minus one. Then the output value would be z = x1 - x2. In that case, z would be maximal when x1 and x2 are one and zero. What this means is that this filter is very good at detecting edges in the input space, where there is a large jump between x1 and x2. And that's what we mean by filters: these sets of weights are very good at extracting certain features, and the exact filter values are learned during the training process. By doing this, we can take advantage of the spatial locality and reduce the number of parameters even further. What we do here is replicate that exact same filter weight across different positions of the input.
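To make the edge-detector intuition concrete, here is a minimal NumPy sketch, assuming the filter weights [1, -1] from the example above, slid across a small 1D input that contains a sharp jump:

```python
import numpy as np

# The filter from the example: w1 = 1, w2 = -1, so z = x1 - x2.
w = np.array([1.0, -1.0])

# A 1D input with a sharp jump between positions 3 and 4.
x = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Slide the filter across the input with a stride of 1 (no padding).
z = np.array([w @ x[i:i + 2] for i in range(len(x) - 1)])
print(z)  # [ 0.  0. -1.  0.  0.] -- the large-magnitude response marks the edge
```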
So we're essentially taking this filter and tiling it across space, as you can see here with the red and blue colored weights. Importantly, the weights are exactly the same; they are being duplicated. This takes advantage of our intuition that edges are important in an image regardless of where they are in the image itself. So then we have a set of filters to extract important features from these images. This is nothing more than a one-dimensional convolution with one filter, W1 and W2. That filter has a size of two, meaning that it has two elements, and we have a stride of 2, meaning that we take this filter, move it two positions forward, and duplicate it at the next spatial location. Here's what one-dimensional convolution with a stride of 1 looks like. As you can see, each element now takes in an overlapping section of the input space to form its output, because we're taking this filter, W1 and W2, and moving it forward by just one position. When the number of filters is two, as with this network, we can see that the depth of the next layer increases. If you focus on the hidden layer here, you will see that the first column of activations represents the output from the first filter, and the second column, moving out in the z direction, represents the activations from applying the second filter. Additionally, in this example we have a padding of one, meaning that we pad the image with zeros on its boundary. Padding allows us to control the spatial size of the output. Generally, for a one-dimensional convolution, if the input has dimension D x 1 and we apply C-tilde filters, the depth of the next layer is equal to C-tilde, the number of filters that we have, while the height depends on the stride and padding parameters that we provided. Typically, inputs are not one-dimensional, but rather two-dimensional, as with images. Here we use a very similar process and instead produce a three-dimensional output that has a depth equal to the number of filters. Again, the height and the width depend on the stride, the padding, and the input dimensions H and W. So now you can see we have an input image with a particular height and width, we have a number of two-dimensional filters, and that produces an output volume of H-tilde by W-tilde by C-tilde. Images are often not just grayscale, but have many channels, such as RGB, for each individual pixel. To accommodate this, we simply increase the depth of the filters to match the number of channels. To imagine the convolutional network in action, consider the filters being applied to the pixels in the top left corner of the image. Each filter then computes a value, and these values fill the upper left column of the output here. Here's a concrete example that runs through the exact computations that we apply when we use a two-dimensional convolutional neural network. In this example, we apply two filters, W0 and W1, to an input that has three channels, shown here as three different matrices. Because three-dimensional volumes are hard to visualize, the depth is shown with stacks of two-dimensional arrays. The blue here is the input volume, the red are the weight volumes, and the green is the output volume. We start with the 3x3 filter W0, which we apply with a stride of 2.
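As a minimal sketch of how the output dimensions follow from these choices, the standard formula below relates the output size along one dimension to the input size, filter size, stride, and padding. The helper name and the example numbers are just for illustration:

```python
def conv_output_size(input_size, filter_size, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - filter_size) // stride + 1

# 2D example: an H x W x C input and C_out filters of size F x F.
H, W = 224, 224
F, stride, padding = 3, 1, 1
C_out = 64  # number of filters -> depth of the output volume

H_out = conv_output_size(H, F, stride, padding)  # 224
W_out = conv_output_size(W, F, stride, padding)  # 224
print((H_out, W_out, C_out))  # (224, 224, 64): spatial size preserved, depth = number of filters
```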
Notice how we've applied a padding of one to the input volume, filling the outer border of the input volume with zeros. The output is calculated by multiplying, in an element-wise fashion, the highlighted input in blue with the filter in red, summing the products, and adding a bias; that gives us the value in the upper left corner of the output volume. We then repeat this process two pixels to the right to get the next value in the output. So you can see here, we're sliding the filter across the input volume, performing an element-wise multiplication, a sum, and a bias addition to fill in the output volume. Similarly with the second filter W1, we take this filter and slide it through the input volume on a second pass, now multiplying with this new set of weights to fill in the second channel of the output volume. The convolution operation can be implemented naively as a set of seven for loops, as shown here. Of course, much more efficient implementations using matrix multiplication exist, but they are not discussed in this particular lecture. As discussed earlier, in the convolutional layer each kernel or filter, shown here in red, searches for unique patterns. For example, in the lower layers, kernels may search for edges, and in higher layers, faces or other complex structures. The fascinating phenomenon in deep learning is that the kernels specialize on their own during the training phase to detect different types of features. In many cases those features generalize across multiple datasets, because they capture the important statistics that exist in the natural world. The pooling layer is another important component of convolutional networks and serves to downsample its input. For example, in this particular figure, each 2x2 region is pooled to one value by selecting the maximum value in that region. This also provides some invariance to translation, as small shifts in the network input may not affect the output of the pooling layer, given the max operation that is being used. This operation, common in the VGG network, takes an input, in this case with dimensions 4x4xC, and applies a convolutional layer with a stride of 1 and a padding of 1. We then apply a 2x2 max pooling with a stride of 2. In this fashion, the convolution maintains the spatial size of the output while increasing the depth, that is, the number of filters, and we rely on the pooling layers to reduce the size of the feature map. This is a common approach in designing convolutional neural networks such as VGG. The number of feature channels increases as we go deeper into the network: we start off with just 3 channels in the input space and eventually increase to about 512 channels. But as we go deeper and deeper, the height and width decrease from 224x224 to, eventually, 7x7. We choose specific convolution stride, padding, and filter sizes to retain the feature map size after a convolution has taken place. So essentially, as we go deeper into the network, we are trading off spatial information and pushing it into a more complex feature representation, reflected by the larger number of channels in many of these convolutional layers. Local response normalization, or LRN, consists of normalizing values using nearby features, within or across channels.
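Here is a minimal sketch of the naive seven-loop implementation mentioned above, with one loop each over the batch, output channels, output rows, output columns, input channels, filter rows, and filter columns. The N x C x H x W layout and the random test shapes below are assumptions for illustration; they mirror the worked example of a 3-channel input, two 3x3 filters, stride 2, and padding 1:

```python
import numpy as np

def naive_conv2d(x, w, b, stride=1, padding=0):
    """Naive 2D convolution: x is (N, C, H, W), w is (C_out, C, F, F), b is (C_out,)."""
    N, C, H, W = x.shape
    C_out, _, F, _ = w.shape
    H_out = (H + 2 * padding - F) // stride + 1
    W_out = (W + 2 * padding - F) // stride + 1

    # Zero-pad the two spatial dimensions only.
    xp = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)),
                mode='constant')
    out = np.zeros((N, C_out, H_out, W_out))

    for n in range(N):                      # 1. images in the batch
        for k in range(C_out):              # 2. output channels (one per filter)
            for i in range(H_out):          # 3. output rows
                for j in range(W_out):      # 4. output columns
                    acc = b[k]
                    for c in range(C):      # 5. input channels
                        for fi in range(F):         # 6. filter rows
                            for fj in range(F):     # 7. filter columns
                                acc += w[k, c, fi, fj] * \
                                       xp[n, c, i * stride + fi, j * stride + fj]
                    out[n, k, i, j] = acc
    return out

# Shapes matching the lecture's example: 5x5x3 input, two 3x3 filters, stride 2, padding 1.
x = np.random.randn(1, 3, 5, 5)
w = np.random.randn(2, 3, 3, 3)
b = np.zeros(2)
print(naive_conv2d(x, w, b, stride=2, padding=1).shape)  # (1, 2, 3, 3)
```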
While this was used in AlexNet, it is not commonly used anymore, and it has been replaced by dropout and batch normalization as important regularization techniques. Dropout reduces overfitting by preventing co-adaptation on the training data. During training, dropout ignores a fraction of the units, and that selection is randomized from mini-batch to mini-batch. What that does is prevent the model from relying on the specific features of individual units to drive the output. Instead, it must rely on a distribution of units and the information they carry in order to perform its computation. The other way to think about dropout is that at each mini-batch a slightly different model is being trained, because you're silencing a different subset of units. In that way, you're training very different models from mini-batch to mini-batch. This also mirrors a lot of the ensemble methods that we use in machine learning to combine information from multiple models. Batch normalization allows networks to converge faster and achieve lower error. In a batch normalization layer, the outputs of all the neurons are normalized to have zero mean and unit variance. This allows the network to be significantly more robust to bad initializations, and it reduces what's called internal covariate shift. Essentially, as the model continues to learn, the output statistics of a particular layer may change drastically, which makes it challenging for later layers, which receive that output as their input, to adjust to these shifting statistics. By inserting batch normalization layers to ensure a normalized output, we reduce this issue that later layers have to deal with. Here's an example of a network trained with batch normalization. What we're showing here is the accuracy over training time, and you can see that if you use batch normalization you can train significantly faster, and also reach a higher accuracy, compared to methods that do not use batch norm. An important question is to try to understand what the weights are learning after the training process has completed. Here are some visualizations from various layers of a convolutional neural network. You can see that in the first layer, the filter weights that have been learned are principally edge detectors, or they are detecting contrast and color. Here's an example of these edge detectors at work, applied to a particular image. You can see that edge detectors of different orientations give rise to output feature maps that highlight edges lying along those particular angles. As we go deeper into the network, the filter weights become much more complex, and often difficult to interpret. You can see that in Layer 2 and Layer 3, particular combinations of edges or lines are being learned, and the model is beginning to be able to extract more complex features and patterns from the image. Deeper layers, such as these, give rise to the more complex features that are ultimately used to classify an image. As you can see with Layer 4 and Layer 5, we begin to find patterns that are specific to particular objects in the scene. Interestingly, the way convolutional neural networks interpret images is very similar to that of the human brain. In fact, the way we architect our convolutional neural networks is very much inspired by experiments in computational neuroscience over the past several decades.
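To make the dropout and batch normalization descriptions above concrete, here is a minimal NumPy sketch of both operations at training time. The inverted-dropout scaling, the learned scale and shift, and the small epsilon are common conventions assumed here rather than details given in the lecture:

```python
import numpy as np

def dropout_train(x, drop_prob=0.5):
    """Randomly silence a fraction of units; a new mask is drawn each mini-batch.
    The surviving activations are scaled up (inverted dropout) so no rescaling
    is needed at inference time."""
    mask = (np.random.rand(*x.shape) >= drop_prob)
    return x * mask / (1.0 - drop_prob)

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch to zero mean and unit
    variance, then apply a learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Tiny usage example: a mini-batch of 4 samples with 3 features each.
x = np.random.randn(4, 3)
h = dropout_train(x, drop_prob=0.5)
y = batch_norm_train(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0), y.var(axis=0))  # approximately zeros and ones
```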
So on the left here, you can see a model that was published in 2007, based on neuroscience experiments, that also has an alternating mixture of convolutional layers and pooling layers. In fact, if you were to go into a monkey brain, for example, and record the filters of the early visual cortex layers, you would see very much the same filters as are learned by our convolutional neural networks in deep learning. You will still find edge detectors that are very similar between these two different modalities. Here's a plot of classification error on ImageNet, a popular image classification benchmark, over the years. You can see in blue the error rate of the model, and in orange the number of layers in that network. You can see a significant drop in the error rate as we move from the approaches of 2011 to CNN and deep learning based approaches in 2012. But it is really the ability to stack deeper and deeper layers, moving from eight layers with AlexNet to 19 layers with VGG, and finally to more recent approaches with deep residual networks, which can have hundreds or in many cases thousands of layers, that finally reaches a near-human level of performance in image recognition. In this last section, I will cover many of the popular architectures that are being used today. These include AlexNet, which was one of the early networks that won the ImageNet competition, and then GoogLeNet, VGG, and deep residual networks. AlexNet was very much one of the early neural networks that proved that these deep learning techniques, which had been around for many years, finally had relevance in a practical setting. You can see here we have an alternation of convolution and pooling layers, finally ending with fully connected affine layers that do the classification at the end. In GoogLeNet, the designers decided instead to use several different filter sizes within a single layer. Here is what it looks like. On the left, you have the image, and you have a number of filters being applied, and those filter outputs are then stacked together in the channel dimension to form the output. So the output has some spatial size, H and W, but its channels receive contributions from each individual set of filters: 5x5, 3x3, and 1x1. The advantage here is that you can maintain the same output volume but dramatically reduce the number of parameters. Here's an example of an inception module, highlighted in red. The output size is 14x14 in height and width, with 512 channels. Importantly, those 512 channels receive contributions from a variety of different filter configurations: 1x1 filters, 3x3 filters, 5x5 filters, and more. This particular module has 437,000 parameters. But if we had simply used dense 3x3 filters, as in other networks, that exact same layer would have something on the order of 2.4 million parameters. So not only have we dramatically reduced the number of parameters, but we have also incorporated different types of convolutions in the same layer. VGG demonstrated that the depth of the network was critical for good performance, and it is also attractive because of its simplicity. All of the convolutions are 3x3, and there are only convolution and pooling layers, all with the exact same kernel sizes. In practical experience, we have found that the VGG network is much better for fine-tuning tasks than GoogLeNet.
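As a rough sketch of the parameter arithmetic behind this comparison, the snippet below counts weights for a dense 3x3 layer mapping 512 channels to 512 channels, which lands near the 2.4 million figure quoted above, and for one illustrative branch that uses a 1x1 reduction before a 3x3 convolution. The channel split in the branch is an assumed example, not GoogLeNet's actual configuration:

```python
def conv_params(in_ch, out_ch, k):
    """Weight count for a k x k convolution (biases ignored for simplicity)."""
    return k * k * in_ch * out_ch

# Dense 3x3 layer, 512 channels in and 512 channels out.
dense = conv_params(512, 512, 3)
print(dense)  # 2,359,296 -- roughly the 2.4M quoted in the lecture

# Illustrative inception-style branch: reduce 512 -> 96 channels with a 1x1
# convolution, then apply a 3x3 convolution producing 208 channels.
branch = conv_params(512, 96, 1) + conv_params(96, 208, 3)
print(branch)  # 228,864 -- roughly an order of magnitude fewer parameters
```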
And as I mentioned before, the design philosophy here is that, as you go deeper into the network, the height and width of the feature map decrease, but we transform much of that spatial information into a richer representation, because the number of channels increases as we go deeper into the network. Residual networks, introduced in 2015, feature skip connections that bypass certain layers. As an example, this 34-layer residual network takes the VGG-19 model, adds new layers, and then adds skip connections, which allow the gradient to flow through without vanishing, an issue with extremely deep networks. Interestingly, the network does not rely on pooling, but rather uses convolutions with a stride of two to reduce the feature map size. As networks get deeper and deeper, challenges such as vanishing gradients become pressing. Innovations such as skip connections and normalization have allowed networks to grow to hundreds of layers, but new techniques must be used in order to build convolutional networks that have thousands or eventually tens of thousands of layers. In neon, convolutional networks are quite easy to implement. By simply using the built-in convolution, pooling, dropout, and affine layers of neon, AlexNet can be written in just a few lines of code.
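Returning to the residual networks discussed above, here is a minimal NumPy sketch of a residual block, where the skip connection adds the input back to the output of a small stack of layers. For simplicity it uses fully connected layers and a ReLU activation as assumed stand-ins for the convolutional layers in an actual residual network:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x): the skip connection adds the input back to the
    output of the weighted layers, giving the gradient a direct path during
    backpropagation."""
    out = relu(x @ w1)       # first weighted layer plus non-linearity
    out = out @ w2           # second weighted layer
    return relu(out + x)     # skip connection: add the original input

# Tiny usage example with matching feature sizes so the addition is valid.
features = 8
x = np.random.randn(4, features)
w1 = np.random.randn(features, features) * 0.1
w2 = np.random.randn(features, features) * 0.1
print(residual_block(x, w1, w2).shape)  # (4, 8)
```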