One of the problems with object detection as you have seen it so far is

that each of the grid cells can detect only one object.

What if a grid cell wants to detect multiple objects?

Here is what you can do.

You can use the idea of anchor boxes.

Let's start with an example.

Let's say you have an image like this.

And for this example,

I am going to continue to use a 3 by 3 grid.

Notice that the midpoint of the pedestrian and the midpoint of the car are

in almost the same place and both of them fall into the same grid cell.

So, for that grid cell,

if Y outputs this vector where you are detecting three classes,

pedestrians, cars and motorcycles,

it won't be able to output two detections.

So I have to pick one of the two detections to output.

With the idea of anchor boxes,

what you are going to do,

is pre-define two different shapes, called anchor boxes or anchor box shapes.

And what you are now going to do

is associate two predictions with the two anchor boxes.

And in general, you might use more anchor boxes,

maybe five or even more.

But for this video, I am just going to use

two anchor boxes just to make the description easier.

So what you do is you define the class label to be,

instead of this vector on the left,

you basically repeat this twice.

So, you will have PC, BX, BY,

BH, BW, C1, C2, C3,

and these are the eight outputs associated with anchor box 1.

And then you repeat that PC,

BX and so on down to C1,

C2, C3, and those are the other eight outputs associated with anchor box 2.

So, because the shape of

the pedestrian is more similar to the shape of anchor box 1 than anchor box 2,

you can use these eight numbers to encode that PC as one,

yes there is a pedestrian.

Use this to encode the bounding box around the pedestrian,

and then use this to encode that that object is a pedestrian.

And then because the box

around the car is more similar to the shape of anchor box 2 than anchor box 1,

you can then use this to encode that the second object here is the car,

and have the bounding box and so

on be all the parameters associated with the detected car.
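
As a rough sketch of the 16-number label just described, here is how the target for that one grid cell could be laid out in Python. The helper name `encode_slot`, the class ordering, and the box numbers are made up for illustration; the video only fixes the layout of eight numbers per anchor box.

```python
# Hypothetical class order, matching C1, C2, C3 in the video.
CLASSES = ["pedestrian", "car", "motorcycle"]


def encode_slot(box, class_name):
    """Return the 8 numbers for one object: PC, four box numbers, then C1..C3."""
    one_hot = [1.0 if name == class_name else 0.0 for name in CLASSES]
    return [1.0, *box, *one_hot]


# Made-up bounding-box numbers, only to show where they sit in the vector.
pedestrian_slot = encode_slot((0.5, 0.5, 0.9, 0.3), "pedestrian")  # anchor box 1
car_slot = encode_slot((0.5, 0.6, 0.4, 1.2), "car")                # anchor box 2

# The label for the grid cell is the two slots stacked: 8 + 8 = 16 numbers.
y_cell = pedestrian_slot + car_slot
print(len(y_cell))
```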

So to summarize, previously,

before you were using anchor boxes,

you did the following,

which is for each object in a training set image,

it was assigned to the grid cell that corresponds to that object's midpoint.

And so the output Y was 3 by 3 by 8 because you have a 3 by 3 grid.

And for each grid position,

we had that output vector which is PC, then the bounding box, and C1, C2, C3.

With the anchor box,

you now do the following.

Now, each object is assigned to the same grid cell as before,

assigned to the grid cell that contains the object's midpoint,

but it is also assigned to the

anchor box with the highest IoU with the object's shape.

So, you have two anchor boxes,

you will take an object and see.

So if you have an object with this shape,

what you do is take your two anchor boxes.

Maybe one anchor box is this shape, that's anchor box 1,

maybe anchor box 2 is this shape,

and then you see which of the two anchor boxes has a higher IoU,

with the object's drawn bounding box.

And whichever it is,

that object then gets assigned not just to a grid cell but to a pair.

It gets assigned to a (grid cell, anchor box) pair.

And that's how that object gets encoded in the target label.
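
One way to sketch that assignment step in Python is below. It assumes each shape is a (height, width) pair and that IoU is computed with the two boxes centered on the same point, so only the shapes matter. The anchor shapes and the function names `shape_iou` and `best_anchor` are hypothetical, not something the video specifies.

```python
def shape_iou(shape_a, shape_b):
    """IoU of two (height, width) boxes that share the same center."""
    height_a, width_a = shape_a
    height_b, width_b = shape_b
    intersection = min(height_a, height_b) * min(width_a, width_b)
    union = height_a * width_a + height_b * width_b - intersection
    return intersection / union


def best_anchor(object_shape, anchors):
    """Index of the anchor box whose shape has the highest IoU with the object."""
    return max(range(len(anchors)), key=lambda i: shape_iou(object_shape, anchors[i]))


# Assumed anchor shapes: index 0 is tall and skinny, index 1 is wide.
ANCHORS = [(2.0, 0.5), (0.6, 1.8)]

pedestrian_anchor = best_anchor((1.8, 0.6), ANCHORS)  # tall object
car_anchor = best_anchor((0.5, 1.5), ANCHORS)         # wide object
print(pedestrian_anchor, car_anchor)
```

The object then goes into the slot for that anchor box inside the grid cell that contains its midpoint.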

And so now, the output Y is going to be 3 by 3 by 16.

Because as you saw on the previous slide,

Y is now 16 dimensional.

Or if you want,

you can also view this as 3 by 3 by 2 by 8

because there are now two anchor boxes and Y is eight dimensional.

And the dimension of Y being eight was because we have three object classes;

if you have more classes, then the dimension of Y would be even higher.
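
To make the bookkeeping concrete, here is a small sketch in plain Python of the two views of the output, 3 by 3 by 16 and 3 by 3 by 2 by 8. The names `label_depth` and `split_by_anchor` are made up for this example, and nested lists stand in for whatever array type you would really use.

```python
GRID = 3      # 3 by 3 grid
ANCHORS = 2   # two anchor boxes
CLASSES = 3   # pedestrian, car, motorcycle


def label_depth(num_anchors, num_classes):
    """Numbers per grid cell: each anchor gets PC, 4 box numbers, and one entry per class."""
    return num_anchors * (5 + num_classes)


SLOT = label_depth(1, CLASSES)         # 8 numbers per anchor box
DEPTH = label_depth(ANCHORS, CLASSES)  # 16 numbers per grid cell

# Flat view: 3 x 3 x 16, filled with zeros as a placeholder.
y_flat = [[[0.0] * DEPTH for _ in range(GRID)] for _ in range(GRID)]


def split_by_anchor(cell_vector):
    """Cut one 16-number cell vector into two 8-number anchor slots."""
    return [cell_vector[a * SLOT:(a + 1) * SLOT] for a in range(ANCHORS)]


# Same numbers viewed as 3 x 3 x 2 x 8.
y_by_anchor = [[split_by_anchor(cell) for cell in row] for row in y_flat]
print(len(y_flat[0][0]), len(y_by_anchor[0][0]), len(y_by_anchor[0][0][0]))
```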

So let's go through a complete example.

For this grid cell,

let's specify what Y is.

So the pedestrian is more similar to the shape of anchor box 1.

So for the pedestrian,

we're going to assign it to the top half of this vector.

So yes, there is an object,

there will be some bounding box associated with the pedestrian.

And I guess if a pedestrian is class one,

then C1 is one, and then zero, zero.

And then the shape of the car is more similar to anchor box 2.

And so the rest of this vector will be

one and then the bounding box associated with the car,

and then the car is C2,

so there's zero, one, zero.

And so that's the label Y for

that lower middle grid cell that this arrow was pointing to.

Now, what if this grid cell only had a car and had no pedestrian?

If it only had a car,

then assuming that the shape of the bounding box around

the car is still more similar to anchor box 2,

then the target label Y,

if there was just a car there and the pedestrian had gone away,

it will still be the same for the anchor box 2 component.

Remember that this is a part of the vector corresponding to anchor box 2.

And for the part of the vector corresponding to anchor box 1,

what you do is you just say there is no object there.

So PC is zero,

and then the rest of these will be don't cares.
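
Here is a small sketch of that car-only label. Representing a don't-care as `None` and building a 0/1 mask from it are assumptions made for illustration; the video does not say how the don't-cares are stored or how the training loss skips them. The box numbers are made up.

```python
DONT_CARE = None  # assumed placeholder for entries the label does not constrain


def empty_slot():
    """Anchor slot with no object: PC is 0 and the other seven entries are don't-cares."""
    return [0.0] + [DONT_CARE] * 7


def care_mask(vector):
    """1 where the label says something, 0 where it is a don't-care."""
    return [0 if value is DONT_CARE else 1 for value in vector]


# Anchor box 2 slot: PC = 1, made-up box numbers, then class car (0, 1, 0).
car_slot = [1.0, 0.5, 0.6, 0.4, 1.2, 0.0, 1.0, 0.0]

# Anchor box 1 is empty, anchor box 2 holds the car.
y_cell = empty_slot() + car_slot
mask = care_mask(y_cell)
print(mask)
```

With a mask like this, only PC counts for the empty anchor box 1 slot, while all eight numbers count for the anchor box 2 slot.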

Now, just some additional details.

What if you have two anchor boxes but three objects in the same grid cell?

That's one case that this algorithm doesn't handle well.

Hopefully, it won't happen.

But if it does, this algorithm doesn't have a great way of handling it.

I would just implement some default tiebreaker for that case.

Or what if you have two objects associated with the same grid cell,

but both of them have the same anchor box shape?

Again, that's another case that this algorithm doesn't handle well.

You implement some default way of tiebreaking if that happens.

Hopefully this won't happen with your data set,

or it won't happen much at all.

And so, it shouldn't affect performance as much.
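
The video leaves the tiebreaker up to you. Purely as one possible default, the sketch below handles objects in the order they are listed: each takes its best-matching anchor box if that slot is still free, otherwise the next-best free one, and any object left over once every slot is full is dropped from the label. This rule, the anchor shapes, and the names `shape_iou` and `assign_objects` are all assumptions, not the method from the video.

```python
def shape_iou(shape_a, shape_b):
    """IoU of two (height, width) boxes that share the same center."""
    intersection = min(shape_a[0], shape_b[0]) * min(shape_a[1], shape_b[1])
    union = shape_a[0] * shape_a[1] + shape_b[0] * shape_b[1] - intersection
    return intersection / union


def assign_objects(object_shapes, anchors):
    """Return ({anchor index: object index}, [dropped object indices]) for one grid cell."""
    taken = {}
    dropped = []
    for obj_index, shape in enumerate(object_shapes):
        # Rank anchors from best to worst match for this object.
        ranked = sorted(range(len(anchors)),
                        key=lambda i: shape_iou(shape, anchors[i]),
                        reverse=True)
        free = [i for i in ranked if i not in taken]
        if free:
            taken[free[0]] = obj_index
        else:
            dropped.append(obj_index)  # more objects than anchor boxes
    return taken, dropped


ANCHORS = [(2.0, 0.5), (0.6, 1.8)]  # assumed shapes: tall, then wide

# Two tall objects both prefer anchor 0; the second falls back to anchor 1.
same_shape = assign_objects([(1.8, 0.6), (1.9, 0.5)], ANCHORS)

# Three objects but only two anchor boxes; the third is dropped.
too_many = assign_objects([(1.8, 0.6), (0.5, 1.5), (1.0, 1.0)], ANCHORS)
print(same_shape, too_many)
```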

So, that's it for anchor boxes.

And even though I'd motivated anchor boxes as a way to

deal with what happens if two objects appear in the same grid cell,

in practice, that happens quite rarely,

especially if you use a 19 by 19 rather than a 3 by 3 grid.

The chance of two objects having their midpoints in the same one of these 361 cells,

it does happen, but it doesn't happen that often.

Maybe an even better motivation, or an even better result that

anchor boxes give you, is that they allow your learning algorithm to specialize better.

In particular, if your data set has some tall,

skinny objects like pedestrians,

and some wide objects like cars,

then this allows your learning algorithm to specialize so

that some of the outputs can specialize in detecting wide,

fat objects like cars,

and some of the output units can specialize in detecting tall,

skinny objects like pedestrians.

So finally, how do you choose the anchor boxes?

People used to just choose them by hand, choosing maybe five or 10 anchor box

shapes that span a variety of shapes that seem

to cover the types of objects you expect to detect.

As a much more advanced version,

just an advanced comment for those of you who have other knowledge in machine learning,

an even better way to do this, from one of the later YOLO research papers,

is to use a K-means algorithm

to group together the types of object shapes you tend to get,

and then to use that to select a set of anchor boxes that

is most stereotypically representative of the maybe multiple,

or maybe dozens, of object classes you're trying to detect.

But that's a more advanced way to automatically choose the anchor boxes.
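
For those curious, here is a minimal K-means sketch over object shapes in plain Python. Several choices are assumptions made for this example rather than details from the video: shapes are (height, width) pairs, similarity is the shape IoU with boxes sharing a center, cluster centers are updated by averaging, and the starting centers are picked deterministically from the shapes sorted by aspect ratio. The training shapes are made up.

```python
def shape_iou(shape_a, shape_b):
    """IoU of two (height, width) boxes that share the same center."""
    intersection = min(shape_a[0], shape_b[0]) * min(shape_a[1], shape_b[1])
    union = shape_a[0] * shape_a[1] + shape_b[0] * shape_b[1] - intersection
    return intersection / union


def kmeans_anchors(shapes, k, iterations=20):
    """Group (height, width) shapes into k clusters and return the cluster centers."""
    # Deterministic start (assumption): spread initial centers across aspect ratios.
    ordered = sorted(shapes, key=lambda s: s[0] / s[1])
    step = len(ordered) / k
    centers = [ordered[int(step * i + step / 2)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for shape in shapes:
            # Assign each shape to the center it overlaps most.
            nearest = max(range(k), key=lambda i: shape_iou(shape, centers[i]))
            clusters[nearest].append(shape)
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                mean_h = sum(m[0] for m in members) / len(members)
                mean_w = sum(m[1] for m in members) / len(members)
                new_centers.append((mean_h, mean_w))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break  # assignments have stopped changing
        centers = new_centers
    return centers


# Made-up training shapes: three tall, skinny objects and three wide ones.
shapes = [(1.8, 0.5), (2.0, 0.6), (1.9, 0.55),
          (0.6, 1.6), (0.5, 1.8), (0.55, 1.7)]
anchors = kmeans_anchors(shapes, k=2)
print(anchors)
```

On this toy data the two centers come out as one wide shape and one tall shape, which you would then use as your two anchor boxes.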

And if you just choose by hand a variety of shapes

that reasonably spans the set of object shapes

you expect to detect, some tall,

skinny ones, some fat, wide ones,

that should work reasonably well.

So that's it for anchor boxes.

In the next video,

let's take everything we've seen and tie it back together into the YOLO algorithm.