One of the problems with object detection as you have seen it so far is

that each of the grid cells can detect only one object.

What if a grid cell wants to detect multiple objects?

Here is what you can do.

You can use the idea of anchor boxes.

Let's start with an example.

Let's say you have an image like this.

And for this example,

I am going to continue to use a 3 by 3 grid.

Notice that the midpoint of the pedestrian and the midpoint of the car are

in almost the same place and both of them fall into the same grid cell.

So, for that grid cell,

if Y outputs this vector where you are detecting three classes,

pedestrians, cars and motorcycles,

it won't be able to output two detections.

So I have to pick one of the two detections to output.

With the idea of anchor boxes,

what you are going to do

is pre-define two different shapes called anchor boxes or anchor box shapes.

And what you are going to do now is

be able to associate two predictions with the two anchor boxes.

And in general, you might use more anchor boxes,

maybe five or even more.

But for this video, I am just going to use

two anchor boxes just to make the description easier.

So what you do is you define the class label to be,

instead of this vector on the left,

you basically repeat this twice.

So, you will have PC, BX, BY,

BH, BW, C1, C2, C3,

and these are the eight outputs associated with anchor box 1.

And then you repeat that PC,

BX and so on down to C1,

C2, C3, and those are the other eight outputs associated with anchor box 2.

So, because the shape of

the pedestrian is more similar to the shape of anchor box 1 than anchor box 2,

you can use these eight numbers to encode that PC is one,

yes there is a pedestrian.

Use this to encode the bounding box around the pedestrian,

and then use this to encode that that object is a pedestrian.

And then because the box

around the car is more similar to the shape of anchor box 2 than anchor box 1,

you can then use this to encode that the second object here is the car,

and have the bounding box and so

on be all the parameters associated with the detected car.

So to summarize, previously,

before you were using anchor boxes,

you did the following,

which is for each object in a training set image,

it was assigned to the grid cell that corresponds to that object's midpoint.

And so the output Y was 3 by 3 by 8 because you have a 3 by 3 grid.

And for each grid position,

we had that output vector which is PC, then the bounding box, and C1, C2, C3.

With anchor boxes,

you now do the following.

Now, each object is assigned to the same grid cell as before,

assigned to the grid cell that contains the object's midpoint,

but it is assigned to a grid cell and

anchor box with the highest IoU with the object's shape.

So, you have two anchor boxes,

you will take an object and see.

So if you have an object with this shape,

what you do is take your two anchor boxes.

Maybe one anchor box is this shape, that's anchor box 1,

maybe anchor box 2 is this shape,

and then you see which of the two anchor boxes has a higher IoU

with the ground truth bounding box.

And whichever it is,

that object then gets assigned not just to a grid cell but to a pair.

It gets assigned to a (grid cell, anchor box) pair.

And that's how that object gets encoded in the target label.
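
As a rough sketch of that assignment step, here is one way it could look in Python. The assumptions are mine, not from the video: boxes are given as (x, y, w, h) in image coordinates scaled to 0..1, the two anchor shapes are made-up numbers, and the IoU compares shapes only, as if the box and the anchor shared the same center.

```python
def shape_iou(box_wh, anchor_wh):
    """IoU of two boxes compared by shape only (both placed on the same center)."""
    bw, bh = box_wh
    aw, ah = anchor_wh
    inter = min(bw, aw) * min(bh, ah)
    union = bw * bh + aw * ah - inter
    return inter / union


def assign(box_xywh, anchors, grid_size=3):
    """Return (row, col, anchor_index) for one ground truth box."""
    x, y, w, h = box_xywh
    # The grid cell is the one that contains the object's midpoint.
    col = min(int(x * grid_size), grid_size - 1)
    row = min(int(y * grid_size), grid_size - 1)
    # The anchor box is the one whose shape has the highest IoU with the object.
    ious = [shape_iou((w, h), anchor) for anchor in anchors]
    best = max(range(len(anchors)), key=lambda i: ious[i])
    return row, col, best


# Hypothetical anchor shapes as (width, height): one tall and skinny, one wide.
ANCHORS = [(0.2, 0.6), (0.6, 0.3)]

# Made-up boxes whose midpoints land in the same lower middle grid cell.
pedestrian = (0.50, 0.80, 0.15, 0.50)
car = (0.52, 0.82, 0.55, 0.25)

print(assign(pedestrian, ANCHORS))  # (2, 1, 0): lower middle cell, anchor box 1
print(assign(car, ANCHORS))         # (2, 1, 1): lower middle cell, anchor box 2
```

Both objects get the same grid cell, and the anchor index is what keeps them apart.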

And so now, the output Y is going to be 3 by 3 by 16.

Because as you saw on the previous slide,

Y is now 16 dimensional.

Or if you want,

you can also view this as 3 by 3 by 2 by 8

because there are now two anchor boxes and Y is eight dimensional.

And the dimension of Y being eight was because we have three object classes;

if you have more object classes, then the dimension of Y would be even higher.
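
To make the two ways of viewing the output concrete, here is a small NumPy sketch. The constant names are my own, and the array is just zeros standing in for a real label.

```python
import numpy as np

GRID, NUM_ANCHORS, NUM_CLASSES = 3, 2, 3
PER_ANCHOR = 5 + NUM_CLASSES  # pc, four box numbers, one score per class = 8

# Flat view: one 16-dimensional vector per grid cell.
y_flat = np.zeros((GRID, GRID, NUM_ANCHORS * PER_ANCHOR))

# Split view: the same numbers as 3 by 3 by 2 by 8.
y_split = y_flat.reshape(GRID, GRID, NUM_ANCHORS, PER_ANCHOR)

# The reshape is a view on the same memory, so writing pc for anchor box 2
# in the lower middle cell shows up at position 8 of the flat vector.
y_split[2, 1, 1, 0] = 1.0

print(y_flat.shape, y_split.shape)  # (3, 3, 16) (3, 3, 2, 8)
print(y_flat[2, 1, 8])              # 1.0
```

With more classes, only NUM_CLASSES changes and the last dimension grows with it.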

So let's go through a complete example.

For this grid cell,

let's specify what is Y.

So the pedestrian is more similar to the shape of anchor box 1.

So for the pedestrian,

we're going to assign it to the top half of this vector.

So yes, there is an object,

there will be some bounding box associated with the pedestrian.

And I guess if a pedestrian is class one,

then C1 is one, and then zero, zero.

And then the shape of the car is more similar to anchor box 2.

And so the rest of this vector will be

one and then the bounding box associated with the car,

and then the car is C2,

so that's zero, one, zero.

And so that's the label Y for

that lower middle grid cell that this arrow was pointing to.
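
Here is how that label for the one grid cell might be built in code. This is only a sketch: the helper name encode_cell and the bounding box numbers are made up for illustration, and the anchor index for each object is assumed to be already known.

```python
import numpy as np


def encode_cell(objects, num_anchors=2, num_classes=3):
    """Build the label vector for one grid cell.

    objects: list of (anchor_index, box, class_index), where box holds the
    four bounding box numbers in the order used in the label vector.
    """
    per_anchor = 5 + num_classes
    y = np.zeros(num_anchors * per_anchor)
    for anchor, box, cls in objects:
        start = anchor * per_anchor
        y[start] = 1.0                 # pc: yes, there is an object
        y[start + 1:start + 5] = box   # bounding box numbers
        y[start + 5 + cls] = 1.0       # one-hot class
    return y


PEDESTRIAN, CAR, MOTORCYCLE = 0, 1, 2
ped_box = (0.5, 0.4, 0.9, 0.3)  # made-up numbers
car_box = (0.6, 0.5, 0.4, 0.9)  # made-up numbers

# Pedestrian goes with anchor box 1 (index 0), car with anchor box 2 (index 1).
y_cell = encode_cell([(0, ped_box, PEDESTRIAN), (1, car_box, CAR)])
print(y_cell[:8])  # top half: pc = 1, pedestrian box, classes 1, 0, 0
print(y_cell[8:])  # bottom half: pc = 1, car box, classes 0, 1, 0
```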

Now, what if this grid cell only had a car and had no pedestrian?

If it only had a car,

then assuming that the shape of the bounding box around

the car is still more similar to anchor box 2,

then the target label Y,

if there was just a car there and the pedestrian had gone away,

it will still be the same for the anchor box 2 component.

Remember that this is a part of the vector corresponding to anchor box 2.

And for the part of the vector corresponding to anchor box 1,

what you do is you just say there is no object there.

So PC is zero,

and then the rest of these will be don't cares.
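
One way to picture the don't cares in code is as entries that are masked out when the label is compared with a prediction. The sketch below is my own illustration, not something specified in the video: it marks don't cares with NaN and uses a plain squared error, both of which are assumptions.

```python
import numpy as np

PER_ANCHOR = 8       # pc, four box numbers, three class scores
DONT_CARE = np.nan   # hypothetical placeholder for "don't care" entries

# Anchor box 1: no object. Anchor box 2: the car (box numbers are made up).
y_car_only = np.array(
    [0.0] + [DONT_CARE] * 7
    + [1.0, 0.6, 0.5, 0.4, 0.9, 0.0, 1.0, 0.0]
)


def care_mask(y_cell, per_anchor=PER_ANCHOR):
    """True where an entry of the label should count toward the error."""
    slots = y_cell.reshape(-1, per_anchor)
    mask = np.zeros(slots.shape, dtype=bool)
    mask[:, 0] = True                     # pc always counts
    mask[slots[:, 0] == 1.0, 1:] = True   # the rest counts only if pc is 1
    return mask.reshape(-1)


def masked_squared_error(y_true, y_pred):
    """Squared error over the entries that are not don't cares."""
    mask = care_mask(y_true)
    target = np.where(mask, y_true, 0.0)  # replace NaN so arithmetic is safe
    return float(np.sum(mask * (y_pred - target) ** 2))


# A prediction that matches everywhere it matters and is arbitrary elsewhere.
pred = np.nan_to_num(y_car_only, nan=7.0)
print(masked_squared_error(y_car_only, pred))  # 0.0
```

Whatever the network outputs in the seven don't care positions for anchor box 1 has no effect on this error.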

Now, just some additional details.

What if you have two anchor boxes but three objects in the same grid cell?

That's one case that this algorithm doesn't handle well.

Hopefully, it won't happen.

But if it does, this algorithm doesn't have a great way of handling it.

You would just implement some default tiebreaker for that case.

Or what if you have two objects associated with the same grid cell,

but both of them have the same anchor box shape?

Again, that's another case that this algorithm doesn't handle well.

You implement some default way of tiebreaking if that happens.

Hopefully this won't happen with your data set,

or it won't happen much at all.

And so, it shouldn't affect performance much.
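
The video does not say what the default tiebreaker should be, so the following is just one possible choice, sketched under my own assumptions: objects are handled in the order given, each takes its best anchor box that is still free, and an object with no free anchor box left is dropped from the label.

```python
def assign_with_tiebreak(ranked_anchors_per_object):
    """First come, first served tiebreak within one grid cell.

    ranked_anchors_per_object: for each object in the cell, its anchor
    indices sorted from highest IoU to lowest.
    Returns one anchor index per object, or None if the object is dropped.
    """
    taken = set()
    result = []
    for ranking in ranked_anchors_per_object:
        choice = next((a for a in ranking if a not in taken), None)
        if choice is not None:
            taken.add(choice)
        result.append(choice)
    return result


# Two objects that both prefer anchor box 1 (index 0).
print(assign_with_tiebreak([[0, 1], [0, 1]]))          # [0, 1]

# Three objects but only two anchor boxes: the last one is dropped.
print(assign_with_tiebreak([[0, 1], [1, 0], [0, 1]]))  # [0, 1, None]
```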

So, that's it for anchor boxes.

And even though I'd motivated anchor boxes as a way to

deal with what happens if two objects appear in the same grid cell,

in practice, that happens quite rarely,

especially if you use a 19 by 19 rather than a 3 by 3 grid.

The chance of two objects having the same midpoint out of these 361 cells,

it does happen, but it doesn't happen that often.

Maybe an even better motivation, or an even better result that

anchor boxes give you, is that they allow your learning algorithm to specialize better.

In particular, if your data set has some tall,

skinny objects like pedestrians,

and some wide objects like cars,

then this allows your learning algorithm to specialize so

that some of the outputs can specialize in detecting wide,

fat objects like cars,

and some of the output units can specialize in detecting tall,

skinny objects like pedestrians.

So finally, how do you choose the anchor boxes?

People used to just choose them by hand, or choose maybe five or 10 anchor box

shapes that span a variety of shapes that seem

to cover the types of objects you expect to detect.

As a much more advanced version,

just an advanced comment for those of you who have other knowledge in machine learning,

an even better way to do this, in one of the later YOLO research papers,

is to use a K-means algorithm

to group together the types of object shapes you tend to get,

and then to use that to select a set of anchor boxes that

are most stereotypically representative of the maybe multiple,

of the maybe dozens of object classes you're trying to detect.

But that's a more advanced way to automatically choose the anchor boxes.
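
For those who want to see the K-means idea in code, here is a minimal sketch. Several things in it are my assumptions rather than anything stated in the video: shapes are (width, height) pairs, closeness is measured by shape IoU instead of Euclidean distance, the starting anchors are picked deterministically by aspect ratio instead of at random, and the example shapes are made up.

```python
import numpy as np


def pairwise_shape_iou(boxes, anchors):
    """IoU between every box shape (N, 2) and every anchor shape (K, 2)."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    area_boxes = boxes[:, 0] * boxes[:, 1]
    area_anchors = anchors[:, 0] * anchors[:, 1]
    return inter / (area_boxes[:, None] + area_anchors[None, :] - inter)


def kmeans_anchors(boxes, k, iters=50):
    """Group (width, height) shapes into k anchor shapes."""
    boxes = np.asarray(boxes, dtype=float)
    # Deterministic start: k boxes evenly spaced along the aspect ratio order.
    order = np.argsort(boxes[:, 0] / boxes[:, 1])
    picks = np.linspace(0, len(boxes) - 1, k).astype(int)
    anchors = boxes[order[picks]].copy()
    for _ in range(iters):
        # Each box joins the anchor it overlaps most (highest shape IoU).
        nearest = np.argmax(pairwise_shape_iou(boxes, anchors), axis=1)
        for j in range(k):
            members = boxes[nearest == j]
            if len(members):  # leave an anchor unchanged if its group is empty
                anchors[j] = members.mean(axis=0)
    return anchors


# Made-up shapes: three tall, skinny ones and three wide ones.
shapes = [(0.10, 0.50), (0.12, 0.55), (0.09, 0.45),
          (0.60, 0.30), (0.65, 0.28), (0.55, 0.32)]
anchors = kmeans_anchors(shapes, k=2)
print(anchors)  # one tall, skinny anchor and one wide anchor
```

On this toy data the two groups separate cleanly, so one anchor ends up tall and skinny and the other wide.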

And if you just choose by hand a variety of shapes

that reasonably spans the set of object shapes

you expect to detect, some tall,

skinny ones, some fat, wide ones,

that should work fine as well.

So that's it for anchor boxes.

In the next video,

let's take everything we've seen and tie it back together into the YOLO algorithm.
