Hello everyone. This is Alex. Shawn asked me to come in and give another guest lecture. Today I'm going to talk about artificial neural networks. I'm going to try to give you an intuitive understanding of how dense neural networks actually work. Towards the end of the presentation, there will be some equations, but do not worry if you don't understand all of it, because the aim today is to give you a good overview of how neural networks actually work. With that, I hope that you're excited, because now we will jump into the presentation. Let us start this introduction to deep neural networks. Today we will talk about dense neural networks. That is, artificial neural network models with hidden layers, where every layer is densely connected with the previous and the next one. Throughout this presentation, we're going to use an example that illustrates how dense neural networks work. We will look at the classic deep learning example of recognizing handwritten digits. Now, if you look at these three images on this slide, many of you probably recognize that all of them are a representation of the number zero. Up until the deep learning revolution, this was a classification task that was very difficult for computers to solve, but it is something that has now become trivial with the introduction of deep learning models. Today we will go over one example of how it works. Just as a reminder, when we're talking about dense neural networks, we are talking about a supervised machine learning algorithm. That means that in order to train this model, we need to have input and output pairs of data, that is, x and y on the slide here. These are the observations and target values that we need in order to train a function that can then map an input value x that doesn't have a target to its correct target value y. If it does this with high accuracy, then we say that the model generalizes well. The fundamental unit of a neural network is the perceptron.
That is basically one node in our neural network. The perceptron model was introduced back in the 1950s by a researcher named Rosenblatt. It was inspired by the understanding at the time of how biological neurons work and operate in the human brain. The perceptron model received a lot of attention when we developed more complex models with it as the fundamental building block, for example, multilayer perceptron or neural network models. In order to explain how the perceptron model and dense neural networks work, we can go back to our example of the handwritten digit. To be able to feed this image of a zero into the perceptron model, we need to convert it from a two-dimensional matrix representation, where every value in the matrix is a pixel value, into a one-dimensional array. If the original image is 28 by 28 pixels, then the flattened array contains 784 pixels in total. This is the vector that we use as the input to our perceptron model. The perceptron model will multiply each individual input value with a weight, and here on the slide, the weights are represented by the w values to the left. Then these weighted values are summed up together with a bias term. We call this sum the weighted sum of the input. Each weight that we apply to the input represents just how important that feature is in the calculation of the result. The bias term that we add allows us to shift the decision boundary up or down accordingly. Finally, we take this weighted sum and we apply an activation function to it. We choose this activation function in order to get some desired range for the output. For classic perceptron models, the Heaviside step function has been used, which outputs a one if the value exceeds a certain threshold and a zero otherwise. But there are other ones that are much more common for dense neural networks, and we will see them now on the upcoming slide. Here you can see several examples of activation functions.
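Before moving on, the perceptron computation just described, a weighted sum of the inputs plus a bias term, passed through a step function, can be sketched in a few lines of Python. This is a hypothetical toy example with four inputs rather than the full 784-pixel vector, and the weights and bias are made-up values for illustration:

```python
import numpy as np

def heaviside(z):
    # Classic perceptron activation: 1 if the weighted sum exceeds 0.
    return 1 if z > 0 else 0

def perceptron(x, w, b):
    # Weighted sum of the inputs plus a bias term,
    # passed through the activation function.
    z = np.dot(w, x) + b
    return heaviside(z)

x = np.array([0.0, 0.5, 1.0, 0.2])   # flattened pixel values (toy size)
w = np.array([0.4, -0.2, 0.8, 0.1])  # one weight per input feature
b = -0.5                             # bias shifts the decision boundary

print(perceptron(x, w, b))  # 1, since the weighted sum 0.72 - 0.5 = 0.22 > 0
```

Changing the bias to a more negative value shifts the decision boundary and can flip the output to zero, which is exactly the role the bias plays in the slide.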
The main idea behind the usage of activation functions is that they should introduce non-linearity in our neural networks. Without them, the dense neural network models would basically just be a big linear regression model. In the early days of deep learning models, most network architectures used a sigmoid or a tanh activation function. However, it has been shown empirically that simpler activation functions like ReLU (rectified linear unit) are more efficient and stable during training, and they have become the de facto standard for deep learning models. A schematic representation of a dense neural network can be found to the right here on this slide. It is a one-layer neural network, also called a shallow neural network. It consists of three parts: we have an input layer, a hidden layer, and an output layer. The nodes in the hidden layer are connected with all nodes in the input layer, and they take the weighted sum of the input into their respective node. They apply an activation function and then they send the results to the output nodes. This is what is happening in the hidden layers. We can have several hidden layers in deep learning models; however, here we only have one. Every connection between the layers is represented by an arrow in the picture, and each one is a model parameter or a model weight that we can adjust. These are the weights and the values that we tweak during the training process in order to optimize the predictions that the network carries out. The predictions are the output nodes of the network. For our example with handwritten digits, we would have ten output nodes, and every node would represent the probability of our input image being a specific number from zero up to nine. All of these probabilities sum up to one, and the highest probability will be the prediction that we make. These are the predictions that we would like to optimize with our neural networks.
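As a rough sketch, the activation functions mentioned here, plus the softmax function that is commonly used to turn the ten output values into probabilities that sum to one (the slide does not name it, but it is the standard choice for this), can be written as:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Rectified linear unit: zero for negative inputs, identity otherwise.
    return np.maximum(0.0, z)

def softmax(z):
    # Turns a vector of scores into probabilities that sum to one.
    # Subtracting the max is a standard numerical-stability trick.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))               # [0. 0. 3.]
print(sigmoid(0.0))          # 0.5
print(round(softmax(z).sum(), 6))  # 1.0 -- probabilities sum to one
```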
How can a neural network come up with the optimized weights in order to make good predictions? Well, we have to train the model. Here on the slide you can see a step-by-step procedure that you have to follow in order to train dense neural networks. First, we need to initialize the model with random weights, or our best prior assumptions of what good model weights are. Then we feed in an input example, and that can be, for example, one of our handwritten digits that we saw earlier. We then run a full forward pass of the network, calculating the output for that one example. If it's a handwritten digit, then we would calculate the probability of what digit we think it is. Then we compare that probability with the true value of what it actually is. Because we have the true value, we can calculate the loss function for the full network, so we have a value of the loss and the performance of the network. After that, we use backpropagation in order to propagate all of the errors back across every layer in the network. We calculate the loss gradient, and then we update all of our model parameters. We repeat this from step number 2 until we achieve a desired or acceptable performance of the network. In order to make some of these concepts even more clear, I would like to show you the simplest prediction function that we have and how we can optimize its loss function. Here you can see a simple linear regression model where we have a slope parameter a and an intercept parameter b. We would like to model a function that fits closely to the data that we have at hand. You can see the data scatter-plotted in the image, and our realization of the function is a linear regression model, which is the red line. Now, we are measuring the loss of this function. We can do that in many different ways. One of the most common ones, which you can see here, is to take the squared error of our predictions.
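The squared-error loss for that linear model can be made concrete with a short Python sketch. The data points here are made up for illustration; the idea is just that a line that fits the data well gives a small loss, and a poor fit gives a large one:

```python
import numpy as np

# Minimal sketch of the squared-error loss for the linear model y = a*x + b.
def predict(x, a, b):
    return a * x + b

def mse_loss(x, y, a, b):
    # Mean of the squared distances between predictions and true values.
    residuals = predict(x, a, b) - y
    return np.mean(residuals ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.2])   # hypothetical scattered observations

print(mse_loss(x, y, a=2.0, b=1.0))  # small (about 0.015): the line fits well
print(mse_loss(x, y, a=0.0, b=0.0))  # much larger: a poor fit
```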
Then basically we calculate the distance between our predicted values, those that are on the line, and the true values that we have scattered in the plot. The way that we would optimize this loss function and minimize it is that, first of all, we would have some initial value, and you can see that up here in the slide. This is where we start. This is our initial loss. Then we calculate the gradient of this loss value, and we take a step in the negative gradient direction in order to minimize the loss. If we took a step in the positive gradient direction, we would increase the loss. Here we are taking the first step, and we are approaching the global minimum of the loss function. Then we take the second step in the negative gradient direction, from where we are at this point, in order to optimize this function. Then we continue to iterate like this until we reach the global minimum. The way that this optimization of loss functions works in the case of neural networks is that we make use of something called backpropagation in order to calculate the gradient, or the derivative, of the loss function for the full network, so that we can then make updates to the weights of the network and how we make our predictions. I'm not going to get into a lot of details here. If you're interested in how backpropagation works, there are plenty of great sources to explore online. But at a very high level, we first calculate the loss by looking at the network output and the true target value. When we have the loss, we can compute the gradient of the full network with respect to all of the network's activation functions and model parameters. In order to do this, we need to make use of the chain rule that you might remember from calculus. Here on the slide, I have outlined how the chain rule actually works. Now, one way of representing the chain rule for vector-valued functions is to use Jacobian matrices.
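The gradient descent procedure just described, start from an initial guess, compute the gradient, step in the negative gradient direction, and repeat, can be sketched for the linear regression example. The learning rate and step count here are arbitrary choices, and the data is generated from known parameters so we can check the result:

```python
import numpy as np

# Gradient descent on the squared-error loss of y = a*x + b.
# The gradients below follow from differentiating mean((a*x + b - y)^2).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])   # generated by a=2, b=1, so we know the answer

a, b = 0.0, 0.0          # initial parameters -> large initial loss
lr = 0.05                # step size (learning rate)

for step in range(2000):
    residuals = a * x + b - y
    grad_a = 2 * np.mean(residuals * x)   # dL/da
    grad_b = 2 * np.mean(residuals)       # dL/db
    a -= lr * grad_a                      # step in the negative gradient direction
    b -= lr * grad_b

print(round(a, 2), round(b, 2))  # 2.0 1.0
```

Stepping in the positive gradient direction instead (flipping the two minus signs to plus) would increase the loss at every step, which is exactly the point made on the slide.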
Here, the gradient of multivariate vector-valued functions can be expressed as a matrix multiplication, like you can see on the slide. This trick with Jacobians can be used in order to easily calculate the full gradient of our neural network. Since we wish to optimize weights which transform the input from one layer to the next, we need to calculate the gradients of all of these nested and chained vector-valued functions. Now, these gradients can be very far away from the loss itself, high up in the network. However, this n-step chain rule calculation of the gradient can be carried out using the matrix multiplication of all of the Jacobians. When we have calculated the gradient for the network, we can see which weights should be adjusted so that we make better predictions during the next forward pass through the network. I'm sorry if all of these equations, mathematical notations, and expressions have been a little bit confusing. If you didn't understand all of it, that's perfectly okay. We are now going to summarize all of this. The only thing that is important is that you have a high-level overview of what is actually happening, so that you can implement all of this for your embedded devices. To summarize: when we train a neural network, we first carry out a forward pass, feeding in one sample or a batch of samples through our network. At the end of the forward pass, when we, for example, have fed an image of a handwritten digit into our network, we calculate the loss of the network by comparing our predictions with the actual outputs. Then, when we have the value of the loss, we can use backpropagation to traverse the steps taken to calculate the final predictions backwards. At each node, we calculate the gradient value in order to find out how to minimize the loss and the error of the network. Based on the prediction that was output, we can see which weights can be adjusted for an even higher accuracy score.
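To make the Jacobian trick concrete, here is a minimal sketch: a two-input linear layer followed by an element-wise tanh activation, where the Jacobian of the composition is the matrix product of the individual Jacobians. The numbers are made up, and the chain-rule result is verified against a finite-difference approximation:

```python
import numpy as np

# Chain rule with Jacobians: for f(g(x)), J_fog(x) = J_f(g(x)) @ J_g(x).
# Tiny two-"layer" example: g is a linear layer, f applies element-wise tanh.
W = np.array([[1.0, 2.0],
              [0.5, -1.0]])

def g(x):
    return W @ x               # linear layer

def f(h):
    return np.tanh(h)          # element-wise activation

def J_g(x):
    return W                   # Jacobian of a linear map is its matrix

def J_f(h):
    return np.diag(1 - np.tanh(h) ** 2)   # diagonal Jacobian of element-wise tanh

x = np.array([0.3, -0.2])

# Analytical Jacobian of the composition via the chain rule:
J_chain = J_f(g(x)) @ J_g(x)

# Numerical check with central finite differences:
eps = 1e-6
J_num = np.column_stack([
    (f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
    for e in np.eye(2)
])

print(np.allclose(J_chain, J_num, atol=1e-6))  # True
```

A full network is just more of these factors multiplied together, one Jacobian per layer, which is exactly what backpropagation computes efficiently.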
This is how you optimize a neural network end-to-end. We run this stepwise procedure several times until we reach a satisfactory performance of the network. Last but not least, I also want to mention that one full forward and backward pass of the network is called an iteration. Passing a subset of the data during an iteration is called a mini-batch, while passing all of the data at once is called a batch. An epoch is a pass over the entire dataset. That was a little bit of terminology towards the end. I hope you found this presentation and this high-level overview of dense neural networks valuable. Thank you so much for listening, and I'm looking forward to seeing you in future lectures.