0:00

[MUSIC]

Hi. In this video, we'll discuss linear models, one of the simplest models in machine learning. But linear models are the building blocks for the deep neural networks that we will discuss in this course, so they are quite important for us. Let's start with an example.

Suppose you are given an image and the goal is to count the sea lions in it. This is a real-world problem that was hosted on kaggle.com. So we want to write a program, a function a, that takes an image as input and returns the number of sea lions in the image. Of course, we could come up with some heuristics, like detecting the edges of the objects in the photograph and trying to count connected components. But this approach is inferior to machine learning. In machine learning, we try to collect a labelled set of images.

Â 1:16

Let's give some definitions that will be very useful for us. An image, or any other object that we try to analyze in machine learning, is called an example. And if it's an example we train our model on, it's a training example. We describe each example with characteristics that we call features. For example, for images, the features could be the intensities of every pixel in the image, or something else.

So, we have examples. And in supervised learning, we also have target values: a ground truth answer for each example. For example, in the problem of counting sea lions, we have the number of sea lions for every example, for every image. We denote these target values by y. So for an example xi, the target value is yi.

As I said, in machine learning we try to collect a set of labelled examples. We denote it by X, and it's a set of l pairs, each consisting of an example's feature description and its target value. And finally, we want to find a model, a function that maps examples to target values. We denote the model, or hypothesis, by a(x), and the goal of machine learning is to find the model that fits the training set X best.

There are two main classes of supervised learning problems: regression and classification. In regression, the target value is a real value. For example, if we count sea lions, the target value is real. Actually, it's a natural number, but it's still regression. Or, for example, if we are given a job description and try to predict what salary will be offered for this job, that's also regression, since salary is a real value. Or, if we are given a movie review from some user and try to determine what rating the user will give to the movie on a scale from one to five, that can also be solved as a regression problem.

On the other hand, if the number of target values is finite, it's a classification task. For example, if we want to recognize objects in images, say, find out whether there are cats or dogs or grass or maybe clouds or a bicycle in the image, that's an object recognition task. Since the number of answers is finite, there is a finite number of objects, we are solving a classification task. Or, for example, if we are analyzing news articles and want to find out what topic an article belongs to, whether it's about politics or sports or entertainment, then it's also a classification task, since the number of target values is, once again, finite.

Â 4:04

Let's discuss a very simple dataset. Each object, each example, is described with one feature, and we have a real-valued target. Here is the dataset; we can see that there is a linear trend: if the feature increases two times, then the target decreases about two times. So maybe we could use some linear model to describe this data, to build a predictive model. Here's a linear model. It's very simple and has just two parameters, w1 and w0. And if we find the best weights w1 and w0, then we'll have a model like this one. It describes the data very well. It isn't perfect: it doesn't predict the exact target value for each example, but it fits the data quite well.

Of course, in most machine learning tasks there are many features, so we can use a generic linear model like this one. It takes each feature xj, multiplies it by its weight wj, sums these products over all the features, and then adds a bias term, b.

Â 5:15

This is a linear model. It has d+1 parameters, where d is the number of features in our dataset: there are d weights, or coefficients, and one bias term, b. It's a very simple model; for comparison, neural networks have many more parameters for the same number of features. And to make it even simpler, we'll suppose that every example has a fake feature that always has the value one, so the coefficient of this feature acts as the bias. So in the following slides, we don't analyze the bias separately; we suppose it is among the weights.
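As a small sketch of this bias trick (the data and weight values here are made up for illustration), appending a constant column of ones to the feature matrix lets one extra weight play the role of the bias b:

```python
import numpy as np

# Toy data: 4 examples, 2 features (made-up values).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])

# Append a "fake" constant feature that always equals 1.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # shape (4, 3)

# The last weight now acts as the bias b.
w = np.array([0.5, -0.5, 2.0])  # [w1, w2, b]
predictions = X_aug @ w         # same as X @ w[:2] + 2.0
```

This is why the following slides can treat the bias as just another weight.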

It would be very convenient to write our linear model in vector form. It's known from linear algebra that a dot product is exactly what's written on the previous slide: we multiply the vectors component-wise and then sum it up. So our linear model is basically the dot product of the weight vector w and the feature vector x. And if we want to apply our model to the whole training set, or maybe to another set of examples, then we do the following. We form a matrix X from our samples. It has l rows and d columns: each row corresponds to one sample, to one example, and each column corresponds to the values of one feature over all examples. Then, to apply our model to this set X, we multiply the matrix X by the vector w, and that gives our predictions. This multiplication yields a vector of size l, where each component is the prediction of our linear model for one example.
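A minimal numpy sketch of this (the matrix X and weights w are toy values, not from the lecture): multiplying X by w produces a vector of l predictions, and each component equals the dot product of one example's feature row with the weights.

```python
import numpy as np

# l = 3 examples, d = 2 features (toy values).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 3.0]])
w = np.array([10.0, -1.0])

predictions = X @ w  # vector of length l

# Each component is the dot product of one row of X with w.
for i in range(X.shape[0]):
    assert predictions[i] == X[i] @ w
```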

Â 7:08

One of the most popular choices of loss function in regression is mean squared error. It goes like this. We take a particular example, xi. We calculate the prediction of our model for this example; for the linear model, that's the dot product of w and xi. Then we subtract the target value, so we calculate the deviation of the predicted value from the target value. Then we take the square of it, and average these squared deviations over the whole training set. This is mean squared error. It measures how well our model fits the data: the smaller the mean squared error, the better the model fits the data. And of course, we can write mean squared error in vector form. We multiply the matrix X by the vector w to get a vector of predictions for all the examples in the set, then we subtract the vector of target values, of real answers, and then we take the squared Euclidean norm of this vector, divided by l. That is the same as the mean squared error I described before.
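The equivalence of the two forms can be checked directly (the data, weights, and targets below are made-up toy values): averaging per-example squared deviations gives the same number as the squared Euclidean norm of Xw - y divided by l.

```python
import numpy as np

# Toy values, for illustration only.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 3.0]])
w = np.array([10.0, -1.0])
y = np.array([9.0, 0.0, 18.0])

# Per-example form: average of squared deviations.
mse_loop = np.mean([(X[i] @ w - y[i]) ** 2 for i in range(len(y))])

# Vector form: squared Euclidean norm of (Xw - y), divided by l.
mse_vector = np.linalg.norm(X @ w - y) ** 2 / len(y)

assert np.isclose(mse_loop, mse_vector)
```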

Â 8:38

Actually, if you do some calculus, if you take derivatives and solve the equations, then you get an analytical solution for this optimization problem. It goes like this, but it involves inverting a matrix, which is a very complicated operation. And if you have more than 100 or 1,000 features, then it's very hard to invert the matrix X transposed times X. We can reduce this problem to solving a system of linear equations, but it's still quite hard and requires lots of computational resources. So later, we'll try to find a framework for better, more scalable optimization of such problems.
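For small problems, the analytical least-squares solution w = (XᵀX)⁻¹Xᵀy can be sketched as below (the synthetic data and the true weight vector are invented for the demo). Solving the linear system XᵀX w = Xᵀy avoids forming the explicit inverse, which is exactly the reduction to a system of linear equations mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: l = 100 examples, d = 3 features (made up).
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w  # noiseless targets, so the exact weights are recoverable

# Normal equations: solve X^T X w = X^T y instead of inverting X^T X.
w = np.linalg.solve(X.T @ X, X.T @ y)
```

Even so, building and solving this d-by-d system gets expensive as d grows, which is the scalability problem the lecture points at.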

In this video, we discussed linear models for regression. They are very simple, but they are very useful building blocks for deep neural networks. We discussed mean squared error, a loss function for regression problems, and found out that it has an analytical solution, but that this solution is not very good and is hard to compute. So in the following videos, we'll try to find a better way to optimize such models. But first of all, in the next video, we'll discuss how to apply linear models to classification tasks.

[MUSIC]
