Hi. In this video, we will talk about the simplest neural network: the multi-layer perceptron. But first, let's recall linear binary classification. In this task, we have features x_1 and x_2, and we have a target y, which can be plus or minus one; it is binary classification after all. We try to find a decision function d, which is a linear combination of our features. How do we make a prediction? We take the sign of that decision function. And why do we do that? Let's look at the image. Here we have some examples, pluses and minuses, and what we try to do is find a separating line. All the points that lie exactly on that line satisfy the equation d(x) = 0. If you take any point above that line, you will have a positive d(x). If you take any point below that line, you will have a negative d(x). So, applying a sign function to our decision function, we can make a prediction. Now, let's move to another task. It's called logistic regression, but it is really very similar. It predicts the probability of the positive class. So, it doesn't just output plus or minus one; it also outputs the probability of the plus one class. In this task, we have a decision function that looks exactly the same, but we make a decision based on it differently. We replace the sign function with a sigmoid function that looks like this. The sigmoid is a nice function because it can transform any value into the range from zero to one, so it can output valid probabilities. Why does it work? What is the intuition behind that? You can actually see on the image that d(x) not only gives you the sign of that linear combination, it also gives you the distance of the point from our separating line. You can convert that distance into confidence. If a point lies right around the line, then we are somewhere on the border between classes, and we're not that confident whether it's a plus or a minus; maybe it's just noise.
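The two decision rules described above can be sketched in a few lines of NumPy. The weights and the input point here are made up purely for illustration:

```python
import numpy as np

def sigmoid(t):
    """Squash any real value into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical coefficients for a decision function d(x) = w1*x1 + w2*x2 + b.
w = np.array([2.0, -1.0])
b = 0.5

def decision(x):
    # Linear combination of the features; its sign tells us the side of the line,
    # its magnitude is proportional to the distance from the line.
    return x @ w + b

x = np.array([1.0, 3.0])
d = decision(x)
label = np.sign(d)   # linear classifier: outputs -1 or +1
proba = sigmoid(d)   # logistic regression: outputs P(y = +1 | x)
print(label, round(proba, 3))
```

The same decision function d(x) drives both models; only the final transformation (sign vs. sigmoid) differs.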
But if a point is far from the line, then we're pretty sure that this is a plus. That is exactly what the sigmoid function does. If the decision function yields zero, the sigmoid gives you 0.5: exactly on the line, we are not sure whether it's a plus or a minus. But as you go far from that line, you get large sigmoid values, and that is where we become more confident that this is a plus one. Now, let's look at this task. Okay. Linear models are simple; we just reviewed them to have some common ground. But this one is a little more complicated, because you cannot cut it with a single line; you have to come up with something else. But I want to show you that you can actually build several lines here, and they will help you solve this problem. For example, let's build the line z_1. It doesn't try to solve the original problem perfectly. All it does is try to solve a sub-problem: separating the minuses on the left. We can actually use that prediction as a valuable feature for further classification, because it tells us whether a minus is on the left or on the right, so it's valuable information. What I would suggest now is to use three of those lines, because you can already see them with your eyes. So, we build lines z_1, z_2, and z_3, each of them a logistic regression. What do we do next? Let's use the predictions of these lines as new features in our machine learning task. How does that look? We take a point, the one in the green border here, and see how it translates to our new coordinates z_1, z_2, and z_3. Let's see what prediction we get with respect to z_1. z_1 is a logistic regression: if you take a point that lies exactly on its line, you get a prediction of 0.5, right? But if you go above the line of that sub-problem, you get a probability of a plus of roughly 0.6. Then let's move to z_2, for example.
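The feature transformation described here is easy to sketch: each of the three lines is a logistic regression, and together they map a 2D point into three new coordinates. The weights and biases below are invented for illustration (they are not the actual lines from the slide), chosen so the outputs resemble the values quoted in the lecture:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Three illustrative logistic regressions; each row of W holds (w1, w2)
# for one line z_i(x) = sigmoid(w1*x1 + w2*x2 + b_i).
W = np.array([[1.0,  0.0],
              [0.0, -1.0],
              [1.0,  1.0]])
B = np.array([-0.4, 0.65, 1.0])

def new_features(x):
    """Map a point (x1, x2) to the three new coordinates (z1, z2, z3)."""
    return sigmoid(W @ x + B)

x = np.array([0.8, 1.5])
z = new_features(x)
print(z.round(2))  # roughly (0.6, 0.3, 0.96), matching the lecture's example point
```

A point near one of the lines produces a z-value near 0.5; a point far on the positive side produces a value near 1.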
With respect to that line, we're below it, and the probability of a plus shrinks: it becomes 0.3. The last one is z_3. With respect to that line, we have a huge probability of a plus, because we are far away from it, and we're pretty sure this is a plus. So, what have we done? We have translated our point from two coordinates, x_1 and x_2, into three coordinates, z_1, z_2, and z_3. We believe those three features might be much more valuable for our classification task. So, we just add y, our target vector, to the task and try to solve it, right? But what model do we use here? What model should we apply? Let's keep it as simple as possible and try to solve it with a linear model as well. So, let's build a final linear model on top of these new features, z_1, z_2, and z_3. This looks pretty neat, but we still don't know how to find these four lines, right? Imagine that we knew them. Imagine that we knew all the coefficients we need to make the prediction: then the prediction is made easily. We just apply three linear models to get z_1, z_2, and z_3, then apply the final linear model to get our final prediction. Let's rewrite this function composition as a so-called computational graph. This graph consists of nodes that correspond to computed variables, like x_1, x_2, or a, and edges that correspond to dependencies. So, if we have an edge coming from x_1 to z_1, that means we need x_1 to compute the value of z_1. You can verify that our function composition translates to this computational graph. As a matter of fact, our computational graph has a name: it's called a multi-layer perceptron. You can see that it has layers. The first layer is called the input layer; it usually contains the features that we have as input. Then, the layer that contains z_1, z_2, and z_3 is called the first hidden layer. Each node here is called a neuron.
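Putting the pieces together, the whole composition of "three logistic regressions, then one more on top" is a complete forward pass through this small MLP. All weights below are made up for the sketch; in practice they would be learned:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Hidden layer: three neurons z1..z3, each a logistic regression on (x1, x2).
W_hidden = np.array([[1.0,  0.0],
                     [0.0, -1.0],
                     [1.0,  1.0]])
b_hidden = np.array([-0.4, 0.65, 1.0])

# Output layer: one final logistic regression on top of (z1, z2, z3).
w_out = np.array([1.5, -2.0, 1.0])
b_out = -0.5

def predict_proba(x):
    z = sigmoid(W_hidden @ x + b_hidden)  # first hidden layer
    return sigmoid(w_out @ z + b_out)     # output layer: P(y = +1 | x)

p = predict_proba(np.array([0.8, 1.5]))
print(round(float(p), 3))
```

Each arrow in the computational graph corresponds to one multiplication here: x feeds into every hidden neuron, and every hidden neuron feeds into the output.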
We will call a neuron anything that takes a linear combination of its inputs and applies some non-linear activation function, for example, sigmoid activation. Then we have an output layer that contains our prediction, and this is it. We should know these layers because we will distinguish between them. Why is it called a neuron then? Why this name? When people came up with that name, they already knew that our brain contains neurons. A neuron is a complex cell that gets signals from other cells like it, and then, based on some complicated logic inside, decides whether to output a signal or not; that signal is transmitted to other neurons like it. So, we need a complex cell that has inputs and that outputs something based on some non-trivial rule. We need to approximate that, right? We need a mathematical neuron, an artificial neuron. Let's approximate it with logistic regression, why not? Let's see how it is similar to a neuron in the human brain. We have inputs x_1, x_2, and one. Then we have some weights: we multiply our inputs by those weights and take the sum. Based on that sum, we decide whether to output a signal or not; here we apply the sigmoid activation function. Why is it called an activation function? Look on the left: the sigmoid function is a smooth approximation of an indicator. The indicator function says: if a value is positive, output one, otherwise output zero. Here, we smooth that indicator so that our problem becomes differentiable. That is exactly why it is called an activation function: it allows our artificial neuron to activate when a certain sum is reached. You can think of an artificial neuron as something that is correlation-activated: when it sees an input that is similar to the pattern it tries to find in the data, it has a huge activation in its output.
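The "smooth indicator" idea is easy to see numerically. A small sketch comparing the hard threshold with the sigmoid on a few test values:

```python
import numpy as np

def indicator(t):
    """Hard threshold: fire (1) if the weighted sum is positive, else 0."""
    return (t > 0).astype(float)

def sigmoid(t):
    """Smooth, differentiable approximation of the indicator."""
    return 1.0 / (1.0 + np.exp(-t))

t = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(indicator(t))         # [0. 0. 0. 1. 1.]
print(sigmoid(t).round(2))  # approaches 0 and 1 at the extremes, 0.5 at zero
```

Both functions agree far from zero; the sigmoid differs near zero, where it transitions smoothly instead of jumping, which is what makes gradient-based training possible.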
You need to notice one thing, though: we need those non-linear activation functions. We can't make the neuron simpler than what we have already discussed. Let's see what happens if we throw away the sigmoid activation. Take a simple network, actually a multi-layer perceptron with one hidden layer. If you look at those equations, you can come up with an idea of what our network becomes. Yes, you've got it right: our algorithm turns into a fancy linear function. If we strip away those sigmoid activations, we can regroup the sums and see that the whole thing is no better than a linear model. So, we need activation functions; we can't make our neuron simpler than that. This thing we have overviewed is called an MLP, and it is the simplest example of an artificial neural network. Of course, an MLP can have more than one hidden layer, and the number of hidden layers, the number of neurons in each hidden layer, and the choice of activation function all constitute the so-called architecture of the MLP. You have to come up with those values to define an architecture. A hidden layer in an MLP is called a dense layer, or a fully-connected layer, because all neurons in neighboring layers are connected to each other. Okay. But how do we train our MLP? We know how to train one neuron, which could be a logistic regression, and we train logistic regression with stochastic gradient descent, right? Let's try to do the same thing here, but for the whole MLP. Why can we do that? Because of our choice of functions: each neuron is a linear combination, which is differentiable, and if we take an activation function which is also differentiable, then the whole composition, this whole complex function, is still differentiable, so we can apply SGD. The problem, though, is that the loss surface might have a lot of local optima, and you can actually get stuck in them, like Bob in this image, and end up with a sub-optimal solution.
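The collapse into a linear model can be verified directly: with identity activations, the two layers regroup into a single weight vector and a single bias. The random weights below are just placeholders:

```python
import numpy as np

# Sketch of why the non-linearity matters: with no activation function,
# a one-hidden-layer "network" is exactly one linear map.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)  # hidden layer
w2, b2 = rng.normal(size=3), rng.normal()             # output layer

def two_layer_no_activation(x):
    z = W1 @ x + b1      # hidden layer WITHOUT sigmoid
    return w2 @ z + b2   # output layer

# Regroup the sums: the same function as a plain linear model d(x) = w.x + b.
w_equiv = w2 @ W1
b_equiv = w2 @ b1 + b2

x = rng.normal(size=2)
assert np.isclose(two_layer_no_activation(x), w_equiv @ x + b_equiv)
print("collapses to a plain linear function")
```

No matter how many layers you stack, composing linear maps gives another linear map; the sigmoid (or any non-linearity) is what breaks this collapse.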
This might happen, and as a matter of fact it will happen in practice, and we still don't have a perfect remedy for it. The simplest fix is just to start over from some other point and hope that, with luck, you end up in the global optimum. You can actually check out how MLP training works, how SGD is applied, right in your browser. Please stop the video, go to that website, and play around with architectures; try to solve that complex problem with an MLP and watch how the neural network converges to a solution. It's pretty cool. So, the main problems that remain in training our MLP are the following. First, we can have many hidden layers, right? It's a hyperparameter; it is part of our architecture, and that means we need to calculate our gradients automatically, no matter what architecture we have. We need to code an algorithm that can do that for any number of hidden layers; that is the first problem. The second problem is that we can have many neurons. In fact, in modern architectures you can have a billion of them, and we need to calculate gradients fast, because if you have a billion neurons, you have to do that fast. In the next video, we will solve these problems.