Okay. We just had our first demo of TensorFlow.

It seems pretty straightforward, right?

We saw how TensorFlow creates a graph and evaluates it.

Then we implemented linear regression in TensorFlow

and compared it with linear regression in scikit-learn.

And now, I have a little surprise for you.

The surprise is that in fact,

what we just implemented is

a very special limiting case of a very general class of neural network models.

In other words, to reiterate,

linear regression is a very special neural network.

Therefore, we can say that you have just implemented

your very first neural network called linear regression in TensorFlow.

Let me now explain what I mean in more detail,

and why this is useful in practice.

Let's take another look at

linear regression and view it this time as a functional transform.

The transform takes real-valued inputs x,

computes their linear combination using the set of weights W,

then adds an intercept W0,

and finally adds Gaussian noise epsilon to get the final output y.

We can think of it as a linear transform of inputs,

which is controlled by a set of parameters

W. These parameters can be tuned to fit your training data.
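As a rough sketch of this view of linear regression as a functional transform, here is a minimal NumPy version (the weights, intercept, and data below are purely illustrative, and NumPy is used instead of TensorFlow for clarity):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 3                              # number of real-valued inputs
W = np.array([0.5, -1.0, 2.0])     # weights, one per input (illustrative)
w0 = 0.1                           # intercept (illustrative)

x = rng.normal(size=n)             # a single input vector
epsilon = rng.normal(scale=0.01)   # Gaussian noise

# the transform: linear combination of inputs, plus intercept, plus noise
y = x @ W + w0 + epsilon
```

Tuning the entries of W and w0 to fit training data is exactly what the TensorFlow implementation did.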

We can represent this linear transform as a kind of computational node like this one.

Our node has a number n of real-valued inputs.

In addition, I added here a constant input x0 equal to 1,

so that we can include the intercept w0 in the sum over x,

which now runs from i equals 0 to n.

The output of this node is

just a linear combination of inputs that entered the linear regression equation.

We can write the same thing in a bit more general form as

a function f of this whole expression.

In our particular case of linear regression,

this function is the identity, with f of z equal to z.

In a more general case when it's not identity,

this function has a name and it's called the activation function.

Now what if we make this function somewhat more complex, that is, non-linear?

In this case, we arrive at a non-linear regression.

The only difference from the previous case is that now the output of

the node is some non-linear transform f of a linear combination of inputs.

Such a function would be referred to as a non-linear activation function.

One example of a non-linear activation function is given by a sigmoid function.

The sigmoid function is defined by this expression,

and its behavior as a function of its argument is shown here.

The sigmoid quickly approaches zero for negative values of

its argument and quickly approaches 1 for positive arguments.
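The sigmoid can be sketched in a couple of lines; the checks of its limiting behavior below match what was just described:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# approaches 0 for large negative z, 1 for large positive z,
# and passes through 0.5 at z = 0
```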

We will need this function many times in this specialization, but for now,

I just want to use it for an illustration of

a non-linear transform that can be computed by such a node.

Now, our node becomes

a much more versatile thing that can do non-linear transformations of its inputs.

Such construction has a name and it's called an artificial neuron.
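Putting the pieces together, a single artificial neuron can be sketched as follows, assuming a sigmoid activation (the input and weight values are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, W, w0, f=sigmoid):
    # linear combination of inputs plus intercept,
    # passed through the activation function f
    return f(x @ W + w0)

x = np.array([1.0, 2.0, -1.0])
W = np.array([0.5, -0.25, 1.0])
out = neuron(x, W, w0=0.1)   # for a sigmoid, a value between 0 and 1
```

With f as the identity, this same node reduces to the linear regression we started from.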

An artificial neuron, or perceptron,

was introduced by Frank Rosenblatt in 1957 as

an extreme caricature model of

the functioning of a real physical neuron in the human brain.

A physical neuron has a number,

sometimes a huge number, of axons that pass electric signals

to it as inputs, and it passes its own signal on to other neurons.

In a similar way, an artificial neuron takes

its inputs and produces an output that goes somewhere else.

Now you see the reason why we called the function f here an activation function.

It's a function that defines the activity,

or the firing rate of this artificial neuron.

Actually, from this point and onwards,

we will be only talking about artificial neurons, not physical ones.

So, for simplicity, we will just say neuron when we mean an artificial neuron.

Okay.

So, a neuron can have different activation functions.

For example, a sigmoid, a tanh function,

or a so-called rectified linear activation function

shown here and given by the maximum of z and 0.

The most interesting insight comes here.

What if we send to the inputs of our neuron,

not the raw data X,

but rather some transformation of data,

which we can call h of x?

In this case, the output becomes a compound nonlinearity.

That is, it would be non-linear due to both the function h and the non-linear activation function f.

Such a nonlinear compound function can fit very complex input distributions because it

can be made very flexible by a proper choice of functions h and f. Okay,

so far so good.
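To make the compound construction concrete, here is a sketch where the inputs are first passed through some transform h and then through a sigmoid neuron; the particular choice of h here (an element-wise square) is purely an illustrative assumption:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def h(x):
    # an example feature transform, chosen only for illustration
    return x ** 2

x = np.array([1.0, -2.0, 0.5])
W = np.array([0.3, 0.1, -0.4])
w0 = 0.05

# non-linear in x through both h and the activation f
out = sigmoid(h(x) @ W + w0)
```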

But how do we make these transformations f, h?

And here's the answer that sounds amazingly simple,

yet it turned out to be worth

billions of dollars once pursued to its full logical consequences.

And the answer is this,

let's just produce such transforms h using

the same types of neurons that we had just made,

and this brings us to a neural network.

What is shown here is called a feedforward neural network.

A feedforward network is a highly stylized model of how a part of the mammalian brain,

called the cortex,

processes visual and audio signals.

How does it work?

A feedforward network is made of layers of

neurons that pass signals only in one direction from the left to the right,

or from the bottom to the top, depending on how the network is pictured.

The signals x0 to xn, three in this case,

are first copied to the input layer of the network.

The input layer doesn't do anything beyond just copying the inputs.

Then the inputs pass to neurons in a hidden layer.

Let me call the outputs of these neurons z with a superscript one.

Neurons in the hidden layer work in the same way as we just described.

That is, they take their inputs, transform them,

and pass them to the final neuron.

Now we will call the output of this final neuron z with the superscript two,

to emphasize that this neuron lives at the second layer of the neural network.

Please note that we don't count the input layer as a real layer,

as it only copies the inputs and doesn't do anything else.

Therefore the superscript index of the output layer is two, and not three.
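The forward pass just described can be sketched as follows, with z1 the hidden-layer outputs and z2 the output-layer value; the weights are random and the linear output is one possible choice, both purely illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

n_in, n_hidden = 3, 4
W1 = rng.normal(size=(n_in, n_hidden))   # input -> hidden weights
b1 = np.zeros(n_hidden)                  # hidden-layer intercepts
W2 = rng.normal(size=n_hidden)           # hidden -> output weights
b2 = 0.0                                 # output intercept

x = np.array([0.5, -1.0, 2.0])           # an illustrative input
z1 = sigmoid(x @ W1 + b1)                # first (hidden) layer
z2 = z1 @ W2 + b2                        # second (output) layer, linear here
```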

The choice of the activation function for the output layer,

z(2), depends on the problem we are trying to solve.

If you deal with a regression problem,

then a linear activation function would do the job,

and if you deal with a classification problem,

then the sigmoid function is what you will need.

But no matter whether it's a regression or a classification problem,

the general construction will remain the same.

It's only the very last output neuron that would need to be replaced,

depending on what problem your neural network tries to solve.

The most important idea here is the idea of

a cascade-like transformation of data in the neural network from one layer to another.

We can extend this even further and ask: why just one hidden layer?

And indeed we can add one more hidden layer and

in this case we get a two-layered neural network.

We can even continue and have more than two hidden layers like three,

four, five, seven, and so on.

Any neural network with more than two hidden layers,

not counting the output layer,

is called a deep neural network.

Deep neural networks proved to be extremely powerful for many applications in tech,

such as image or face recognition,

and there are some very good reasons for

this that we will be discussing later in this course.

We will be talking a lot about deep learning in this specialization,

but for now please know that,

at least qualitatively, deep networks of this sort are not

different from non-deep shallow networks that have one or two hidden layers.

In particular, both deep and shallow neural networks are

trained using an optimization method called gradient descent.
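As a minimal preview of gradient descent, here is a sketch that trains the simplest "neural network" above, linear regression, on synthetic data; the data, learning rate, and iteration count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: y = 2*x + 1 plus a little Gaussian noise
x = rng.normal(size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=100)

w, b = 0.0, 0.0    # parameters to learn
lr = 0.1           # learning rate

for _ in range(500):
    y_hat = w * x + b
    err = y_hat - y
    # gradients of the mean squared error with respect to w and b
    grad_w = 2.0 * np.mean(err * x)
    grad_b = 2.0 * np.mean(err)
    # step against the gradient
    w -= lr * grad_w
    b -= lr * grad_b
```

After training, w and b end up close to the true slope 2 and intercept 1.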

In the next video,

we will see how it all works.