In the last video,

you learned how to use a convolutional implementation of sliding windows.

That's more computationally efficient,

but it still has a problem of not quite outputting the most accurate bounding boxes.

In this video, let's see how you can get

your bounding box predictions to be more accurate.

With sliding windows, you take

these three sets of locations and run the classifier through them.

And in this case,

none of the boxes really match up perfectly with the position of the car.

So, maybe that box is the best match.

And also, it looks like, in the ground truth,

the perfect bounding box isn't even quite square,

it actually has a slightly wider rectangle or slightly horizontal aspect ratio.

So, is there a way to get this algorithm to output more accurate bounding boxes?

A good way to get more accurate bounding boxes as output is with the YOLO algorithm.

YOLO stands for, You Only Look Once.

And it is an algorithm due to Joseph Redmon,

Santosh Divvala, Ross Girshick and Ali Farhadi.

Here's what you do.

Let's say you have an input image that is 100 by 100,

you're going to place down a grid on this image.

And for the purposes of illustration,

I'm going to use a 3 by 3 grid.

Although in an actual implementation,

you use a finer one,

like maybe a 19 by 19 grid.

And the basic idea is you're going to take the image classification and

localization algorithm that you saw in the first video of

this week and apply that to each of the nine grid cells of this image.

So to be more concrete,

here's how you define the labels you use for training.

So for each of the nine grid cells, you specify a label Y,

where the label Y is this eight dimensional vector,

same as you saw previously.

The first output pc is 0 or 1, depending on whether or

not there's an object in that grid cell, and then bx,

by, bh, bw specify the bounding box

if there is an object associated with that grid cell.

And then, say, c1, c2, c3,

if you're trying to recognize three classes, not counting the background class.

So if you're trying to recognize pedestrians, cars,

motorcycles and the background class,

then c1, c2, c3 can be the pedestrian,

car and motorcycle classes.

So in this image,

we have nine grid cells,

so you have a vector like this for each of the grid cells.

So let's start with the upper left grid cell,

this one up here.

For that one, there is no object.

So, the label vector Y for the upper left grid cell would have zero as its first component,

and then don't cares for the rest of these.

The output label Y would be the same for this grid cell,

and this grid cell, and all the grid cells with nothing,

with no interesting object in them.

Now, how about this grid cell?

To give a bit more detail,

this image has two objects.

And what the YOLO algorithm does is it takes the midpoint of each of

the two objects and then assigns the object to the grid cell containing the midpoint.

So the left car is assigned to this grid cell,

and the car on the right,

which is this midpoint,

is assigned to this grid cell.
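
As a rough sketch of that assignment rule, here is one way you might find the grid cell containing a midpoint. The function name `assign_to_cell` and the midpoint pixel coordinates are made up for illustration; only the 100 by 100 image and the 3 by 3 grid come from this example.

```python
def assign_to_cell(mid_x, mid_y, image_w=100, image_h=100, grid=3):
    """Return (row, col) of the grid cell that contains a midpoint.

    mid_x and mid_y are pixel coordinates, with (0, 0) at the upper left.
    """
    cell_w = image_w / grid
    cell_h = image_h / grid
    # Clamp so a midpoint exactly on the right or bottom edge stays in the grid.
    col = min(int(mid_x // cell_w), grid - 1)
    row = min(int(mid_y // cell_h), grid - 1)
    return row, col


# Hypothetical midpoints for the two cars (not measured from the slide).
left_car_cell = assign_to_cell(20, 55)
right_car_cell = assign_to_cell(80, 60)
print(left_car_cell, right_car_cell)
```

With these made-up midpoints, both cars land in the middle row, one in the left column and one in the right column, and the central cell gets nothing.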

And so even though the central grid cell has some parts of both cars,

we'll pretend the central grid cell has no interesting object, so that for

the central grid cell the label Y also looks like this vector with no object,

and so the first component pc is 0,

and then the rest are don't cares.

Whereas for this cell,

this cell that I have circled in green on the left,

the target label Y would be as follows.

There is an object,

and then you write bx, by, bh,

bw, to specify the position of this bounding box.

And then you have, let's see,

if class one was a pedestrian, then that was zero.

Class two is a car, that's one.

Class three was a motorcycle, that's zero.
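
To make the label layout concrete, here is a minimal sketch of building the eight-dimensional vector for one grid cell. The helper `make_cell_label`, the choice to store don't cares as zeros, and the box numbers in the usage line are all assumptions for illustration.

```python
import numpy as np

# Class order as described here: c1 pedestrian, c2 car, c3 motorcycle.
CLASSES = ["pedestrian", "car", "motorcycle"]


def make_cell_label(box=None, class_name=None):
    """Return [pc, bx, by, bh, bw, c1, c2, c3] for a single grid cell."""
    y = np.zeros(8, dtype=np.float32)
    if box is None:
        # No object: pc = 0. The other entries are don't cares;
        # storing them as zeros is just a convenient placeholder.
        return y
    bx, by, bh, bw = box
    y[0] = 1.0                                # pc: there is an object
    y[1:5] = [bx, by, bh, bw]                 # bounding box
    y[5 + CLASSES.index(class_name)] = 1.0    # one-hot class entry
    return y


empty_cell = make_cell_label()
# Made-up box values, only to show where each number goes.
car_cell = make_cell_label(box=(0.5, 0.5, 0.6, 0.8), class_name="car")
print(empty_cell)
print(car_cell)
```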

And then similarly, for the grid cell on

the right, because that does have an object in it,

it will also have some vector like

this as the target label corresponding to the grid cell on the right.

So, for each of these nine grid cells,

you end up with an eight-dimensional output vector.

And because you have 3 by 3 grid cells,

you have nine grid cells,

the total volume of the output is going to be 3 by 3 by 8.

So the target output is going to be 3 by 3 by 8 because you have 3 by 3 grid cells.

And for each of the 3 by 3 grid cells,

you have an eight-dimensional Y vector.

So the target output volume is 3 by 3 by 8.

Where for example, this 1 by 1 by 8 volume in

the upper left corresponds to

the target output vector for the upper left of the nine grid cells.

And so for each of the 3 by 3 positions,

for each of these nine grid cells,

there is a corresponding eight-dimensional target vector Y that you want to output.

Some of which could be don't cares,

if there's no object there.

And that's why the total target output,

the output label for this image is now itself a 3 by 3 by 8 volume.
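
Here is a small sketch of how the full 3 by 3 by 8 target volume might be assembled. The function `build_target`, the tuple layout, and the box values for the two cars are assumptions for illustration, and don't cares are again stored as zeros.

```python
import numpy as np


def build_target(objects, grid=3, num_classes=3):
    """Build a (grid, grid, 5 + num_classes) target volume.

    objects: list of (row, col, bx, by, bh, bw, class_index),
    at most one object per grid cell.
    """
    y = np.zeros((grid, grid, 5 + num_classes), dtype=np.float32)
    for row, col, bx, by, bh, bw, class_index in objects:
        y[row, col, 0] = 1.0                    # pc
        y[row, col, 1:5] = [bx, by, bh, bw]     # bounding box
        y[row, col, 5 + class_index] = 1.0      # one-hot class
    return y


# Two cars (class index 1), with made-up box values, in the
# middle-left and middle-right cells.
target = build_target([
    (1, 0, 0.5, 0.6, 0.5, 0.9, 1),
    (1, 2, 0.4, 0.3, 0.5, 0.9, 1),
])
print(target.shape)
```

Each `target[row, col]` slice is the 1 by 1 by 8 vector for that grid cell.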

So now, to train your neural network,

the input is 100 by 100 by 3,

that's the input image.

And then you have a usual convnet with conv layers,

max pool layers, and so on.

So that in the end,

you choose the conv layers and the max pool layers,

and so on, so that this eventually maps to a 3 by 3 by 8 output volume.
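
To see how layer choices can take 100 by 100 down to 3 by 3, here is some shape arithmetic using the standard output-size formula for conv and pooling layers with no padding. The particular stack of layers below is a hypothetical one that happens to work out; it is not the architecture from the YOLO paper.

```python
def out_size(n, f, stride=1, pad=0):
    """Spatial output size of a conv or pool layer with an f by f filter."""
    return (n + 2 * pad - f) // stride + 1


# Hypothetical stack: (kind, filter size, stride), all without padding.
layers = [
    ("conv", 5, 1),
    ("pool", 2, 2),
    ("conv", 5, 1),
    ("pool", 2, 2),
    ("conv", 5, 1),
    ("pool", 2, 2),
    ("conv", 3, 3),   # last conv layer, assumed to have 8 filters
]

n = 100
for kind, f, stride in layers:
    n = out_size(n, f, stride)
    print(kind, n)

final_shape = (n, n, 8)
print(final_shape)
```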

And so what you do is you have an input X which is the input image like that,

and you have these target labels Y which are 3 by 3 by 8,

and you use backpropagation to train the neural network to map

from any input X to this type of output volume Y.
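
One detail that matters in training is that the don't care entries shouldn't contribute to the loss. The sketch below assumes a plain squared error, in the spirit of the classification and localization loss from earlier this week; the function `masked_squared_error` is a made-up name, and real YOLO implementations use more elaborate losses.

```python
import numpy as np


def masked_squared_error(y_true, y_pred):
    """Squared error over a (grid, grid, 8) volume, skipping don't cares.

    The pc entry is always penalized. The box and class entries are
    penalized only in cells where the target pc is 1.
    """
    pc = y_true[..., 0:1]                                   # shape (grid, grid, 1)
    pc_error = (y_pred[..., 0:1] - pc) ** 2
    rest_error = pc * (y_pred[..., 1:] - y_true[..., 1:]) ** 2
    return float(pc_error.sum() + rest_error.sum())


y_true = np.zeros((3, 3, 8))
y_true[1, 2] = [1, 0.4, 0.3, 0.5, 0.9, 0, 1, 0]   # one car, made-up values

y_pred = y_true.copy()
y_pred[0, 0, 1:] = 7.0     # garbage in a don't care region: ignored
y_pred[1, 2, 1] += 0.5     # bx is off by 0.5 in the cell with the car
print(masked_squared_error(y_true, y_pred))
```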

So the advantage of this algorithm is that

the neural network outputs precise bounding boxes as follows.

So at test time,

what you do is you feed an input image X

and run forward prop until you get this output Y.

And then for each of the nine outputs,

for each of the 3 by 3 positions in the output,

you can then just read off 1 or 0:

is there an object associated with that one of the nine positions?

And if there is an object, what object it is,

and where is the bounding box for the object in that grid cell?
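
Reading off the output might look something like this sketch. The function `read_detections` and the 0.5 threshold on pc are assumptions, since a real network outputs a number between 0 and 1 rather than exactly 0 or 1.

```python
import numpy as np

CLASSES = ["pedestrian", "car", "motorcycle"]


def read_detections(y_pred, threshold=0.5):
    """Return (row, col, class_name, (bx, by, bh, bw)) for each cell with an object."""
    detections = []
    rows, cols, _ = y_pred.shape
    for row in range(rows):
        for col in range(cols):
            cell = y_pred[row, col]
            if cell[0] < threshold:
                continue    # pc says no object in this cell
            box = tuple(float(v) for v in cell[1:5])
            class_name = CLASSES[int(np.argmax(cell[5:]))]
            detections.append((row, col, class_name, box))
    return detections


# A made-up network output with one confident car in the middle-right cell.
y_pred = np.zeros((3, 3, 8))
y_pred[1, 2] = [0.9, 0.4, 0.3, 0.5, 0.9, 0.1, 0.8, 0.1]
detections = read_detections(y_pred)
print(detections)
```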

And so long as you don't have more than one object in each grid cell,

this algorithm should work okay.

And the problem of having multiple objects within

the grid cell is something we'll address later.

Here I've used a relatively small 3 by 3 grid;

in practice, you might use a much finer

grid, maybe 19 by 19.

So you end up with 19 by 19 by 8,

and that also makes your grid much finer.

It reduces the chance that there are multiple objects assigned to the same grid cell.

And just as a reminder,

the way you assign an object to a grid cell is

you look at the midpoint of an object and then

you assign that object to whichever one grid cell contains the midpoint of the object.

So each object, even if the object spans multiple grid cells,

that object is assigned only to one of the nine grid cells,

or one of the 3 by 3,

or one of the 19 by 19 grid cells.

And with a 19 by 19 grid,

the chance of the midpoints of two

objects appearing in the same grid cell is just a bit smaller.

So notice two things, first,

this is a lot like the image classification and

localization algorithm that we talked about in the first video of this week.

It also outputs the bounding box coordinates explicitly.

And so this allows your neural network to output bounding

boxes of any aspect ratio, as well as

output much more precise coordinates that aren't just

dictated by the stride size of your sliding windows classifier.

And second, this is

a convolutional implementation, and you're not running this algorithm nine

times on the 3 by 3 grid. Or if you're using a 19 by 19 grid, 19 squared is 361.

So, you're not running the same algorithm 361 times or 19 squared times.

Instead, this is one single convolutional implementation,

where you use one convnet with a lot of shared computation between

all the computations needed for all of your 3 by 3 or all of your 19 by 19 grid cells.

So, this is a pretty efficient algorithm.

And in fact, one nice thing about the YOLO algorithm,

which accounts for its popularity, is that because this is a convolutional implementation,

it actually runs very fast.

So this works even for real time object detection.

Now, before wrapping up,

there's one more detail I want to share with you,

which is, how do you encode these bounding boxes bx, by, bh, bw?

Let's discuss that on the next slide.

So, given these two cars,

remember, we have the 3 by 3 grid.

Let's take the example of the car on the right.

So, in this grid cell there is an object, and so the target label y will have

pc equal to one.

And then bx, by,

bh, bw, and then 0 1 0.

So, how do you specify the bounding box?

In the YOLO algorithm, relative to this square,

I take the convention that the upper left point here is

0 0 and this lower right point is 1 1.

So to specify the position of that midpoint,

that orange dot, bx might be,

let's say it looks like it's about 0.4.

Maybe it's about 0.4 of the way to the right.

And then by looks like, I guess, maybe 0.3.

And then the width of the bounding box is specified as

a fraction of the overall width of this grid cell.

So, the width of this red box is maybe 90% of that blue line.

And so bw is 0.9, and the height of

this is maybe one half of the overall height of the grid cell.

So in that case, bh would be, let's say, 0.5.

So, in other words, bx, by, bh,

bw are specified relative to the grid cell.

And so bx and by,

these have to be between 0 and 1, right?

Because pretty much by definition that

orange dot is within the bounds of the grid cell it's assigned to.

If it wasn't between 0 and 1, it would be outside the square,

and it would have been assigned to a different grid cell.

But bh and bw could be greater than one.

In particular if you have a car where the bounding box was that,

then the height and width of the bounding box,

this could be greater than one.
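
Here is a minimal sketch of this convention in code, assuming a square image and boxes given as pixel corners. The names `encode_box` and `decode_box` and the pixel coordinates in the usage lines are made up for illustration.

```python
def encode_box(x_min, y_min, x_max, y_max, image_size=100, grid=3):
    """Convert pixel corners to (row, col, (bx, by, bh, bw)).

    bx, by locate the midpoint inside its grid cell, from (0, 0) at the
    cell's upper left to (1, 1) at its lower right. bh and bw are the
    height and width as fractions of the cell size, so they can exceed 1.
    """
    cell = image_size / grid
    mid_x = (x_min + x_max) / 2
    mid_y = (y_min + y_max) / 2
    col = min(int(mid_x // cell), grid - 1)
    row = min(int(mid_y // cell), grid - 1)
    bx = mid_x / cell - col
    by = mid_y / cell - row
    bh = (y_max - y_min) / cell
    bw = (x_max - x_min) / cell
    return row, col, (bx, by, bh, bw)


def decode_box(row, col, box, image_size=100, grid=3):
    """Invert encode_box: recover pixel corners from the cell-relative box."""
    cell = image_size / grid
    bx, by, bh, bw = box
    mid_x = (col + bx) * cell
    mid_y = (row + by) * cell
    half_w = bw * cell / 2
    half_h = bh * cell / 2
    return (mid_x - half_w, mid_y - half_h, mid_x + half_w, mid_y + half_h)


# A made-up car box that fits inside one cell, and a large one that does not.
small = encode_box(65, 35, 95, 52)
large = encode_box(30, 30, 100, 75)
print(small)
print(large)
```

For the large box, the midpoint still falls in a single grid cell, so bx and by stay between 0 and 1 while bh and bw come out greater than one.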

So, there are multiple ways of specifying the bounding boxes,

but this would be one convention that's quite reasonable.

Although, if you read the YOLO research papers,

in the YOLO research line there were

other parameterizations that work even a little bit better,

but I hope this gives one reasonable convention that should work okay.

There are some more complicated parameterizations

involving sigmoid functions to make sure bx and by are between 0 and 1,

and using an exponential parameterization to make sure that bh and bw are non-negative,

since 0.9, 0.5, these have to be greater than or equal to zero.

There are some other more advanced parameterizations

that work a little bit better,

but the one you saw here should work okay.
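
Just to illustrate the idea behind those parameterizations, here is a simplified sketch: raw network outputs go through a sigmoid for bx and by and through an exponential for bh and bw. The function `box_from_raw` and the raw values are assumptions, and the exact formulas in the YOLO papers include details not shown here.

```python
import math


def sigmoid(t):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def box_from_raw(tx, ty, th, tw):
    """Map unconstrained raw outputs to a valid (bx, by, bh, bw)."""
    bx = sigmoid(tx)     # always between 0 and 1
    by = sigmoid(ty)     # always between 0 and 1
    bh = math.exp(th)    # always positive, may exceed 1
    bw = math.exp(tw)    # always positive, may exceed 1
    return bx, by, bh, bw


# Made-up raw outputs, including negative ones.
box = box_from_raw(-3.0, 0.0, -1.0, 2.0)
print(box)
```

Whatever raw numbers the network produces, the midpoint stays inside the grid cell and the height and width never go negative.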

So, that's it for the YOLO or the You Only Look Once algorithm.

And in the next few videos I'll show you

a few other ideas that will help make this algorithm even better.

In the meantime, if you want,

you can take a look at

the YOLO paper referenced at the bottom of these past couple of slides.

Although, just one warning,

if you take a look at these papers,

the YOLO paper is one of the harder papers to read.

I remember, when I was reading this paper for the first time,

I had a really hard time figuring out what was going on.

And I wound up asking a couple of my friends,

very good researchers to help me figure it out,

and even they had a hard time understanding some of the details of the paper.

So, if you look at the paper,

it's okay if you have a hard time figuring it out.

I wish it was more uncommon,

but it's not that uncommon, sadly,

for even senior researchers

to read research papers and have a hard time figuring out the details,

and have to look at open source code,

or contact the authors,

or something else to figure out the details of these algorithms.

But don't let me stop you from taking a look at the paper yourself though if you wish,

but this is one of the harder ones.

So, with that, you now understand the basics of the YOLO algorithm.

Let's go on to some additional pieces that will make this algorithm work even better.