0:00

[MUSIC]

This video will study human 2D pose estimation.

The problem of localizing anatomical key points, or

parts, has largely focused on finding the body parts of individuals.

Inferring the pose of multiple people in images,

especially socially engaged individuals, presents a unique set of challenges.

First, each image may contain an unknown number of people that can occur at

any position or scale.

Second, interactions between people induce complex spatial interference due to

occlusion and contact.

0:43

A common approach is to employ a person detector and

perform single-person pose estimation for each detection.

That's a top-down approach that directly leverages existing techniques for

single-person pose estimation, but it suffers from early commitment.

If the person detector fails, and it may do so

when people are in close proximity, there is no recourse to recovery.

Furthermore, the run time of these top-down approaches is proportional to

the number of people.

Because for each detection, a single person pose estimator is run.

And the more people there are, the greater the computational cost.

Nevertheless, we'll analyze several approaches to single-person

pose estimation.

The problem of human pose estimation can be formulated as the task of finding

the positions of human body key-points: head, shoulders, elbows, wrists, and so on.

And we can formulate this as a regression problem.

2:08

At training time, the ground truth labels, or heat maps,

are synthesized for each joint separately by placing a Gaussian

with fixed variance centered at the joint position.

This network can be fully convolutional, and

the loss can be an L2 loss, which penalizes the squared pixel-wise differences

between the produced heat map and the synthesized ground truth heat map.
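As a minimal sketch of this training-target construction, the following numpy code synthesizes a Gaussian ground-truth heat map for one joint and computes the pixel-wise squared-difference loss (the function names and the coordinate convention are illustrative, not from the original method):

```python
import numpy as np

def gaussian_heatmap(h, w, joint_xy, sigma=2.0):
    """Synthesize a ground-truth heat map: a Gaussian with fixed
    variance, centered at the joint position (x, y)."""
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - joint_xy[0]) ** 2 + (ys - joint_xy[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def l2_loss(pred, target):
    """Mean squared pixel-wise difference between the predicted
    and the synthesized ground-truth heat maps."""
    return np.mean((pred - target) ** 2)

# One 64x64 target heat map for a joint at x=20, y=30.
gt = gaussian_heatmap(64, 64, joint_xy=(20, 30), sigma=2.0)
```

At training time one such target is built per joint, and the network's per-joint output channels are regressed against them with this loss.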

Of course, this approach of regressing the locations of key points is naive for

more complicated tasks.

3:04

All the above detectors work well for estimating the pose of one person.

As we have said before, we can apply them after detecting people in the image, but

this top-down approach has some drawbacks.

The most serious is that the inference time is proportional to the number of

people.

In contrast,

bottom-up approaches are attractive as they offer robustness to early commitment.

And have the potential to decouple run time complexity from the number of people

in the image.

Yet, bottom-up approaches do not directly use global contextual cues from other

body parts and other people.

In practice, bottom-up methods do not retain these gains in

efficiency, as the final parsing requires costly global inference.

The DeepCut algorithm is a bottom-up approach that

jointly labels part detection candidates and

associates them to individual people.

It requires solving an integer linear program over a fully connected graph.

But that's an NP-hard problem, and

the average processing time is on the order of hours.

There was an attempt to build on DeepCut with stronger part detectors

based on residual networks and image-dependent pairwise scores.

That vastly improved the run time, but the method still takes several

minutes per image, with a limit on the number of part proposals.

Now, let's talk about the state of the art in the

multi-person pose estimation task that works in real time.

The key idea of this method is using part affinity fields.

This method takes the entire image as the input to

a two-branch CNN that jointly predicts confidence maps for body part detection,

and part affinity fields for part association, that is, limbs.

The parsing step performs a set of bipartite matchings to associate

body part candidates, which are finally assembled into full-body poses for

all people in the image.

The CNN that predicts feature maps for body parts and limbs is multistage.

Each stage in the first branch predicts confidence maps of body parts,

and each stage in the second branch predicts PAFs of limbs.

After each stage, the predictions from the two branches, along with

the image features, are concatenated for the next stage for refinement.
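The multistage refinement loop described above can be sketched as follows; the branch functions here are hypothetical stand-ins for the learned CNN branches, and the channels-first array layout is an assumption:

```python
import numpy as np

def run_stages(image_features, stage_branches):
    """Toy multistage loop: at each stage, the two branches predict
    confidence maps and PAFs, and their outputs are concatenated
    with the image features as input to the next stage."""
    conf_maps = paf_maps = None
    x = image_features
    for branch1, branch2 in stage_branches:
        conf_maps = branch1(x)  # confidence maps of body parts
        paf_maps = branch2(x)   # part affinity fields of limbs
        # concatenate predictions with image features for refinement
        x = np.concatenate([conf_maps, paf_maps, image_features], axis=0)
    return conf_maps, paf_maps
```

The returned maps are those of the final stage, whose receptive field has seen the intermediate predictions of all earlier stages.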

You can see this process in the left picture:

confidence maps of the right wrist in the first row, and

part affinity fields of the right forearm in the second row, across stages.

Though there is confusion between left and right body parts and

limbs in early stages, the estimates are increasingly refined through

global inference in later stages as shown in the highlighted areas.

Given a set of detected body parts,

how do we assemble them to form the full body poses of an unknown number of people?

We need a confidence measure of the association for

each pair of body part detections.

6:17

That is, that they belong to the same person.

In other words, we should predict not only the location, but also the orientation

across the region of support of the limb.

For this task, we could use part affinity fields.

A part affinity field is a 2D vector field for each limb.

For each pixel in the area belonging to a particular limb, a 2D vector

encodes the direction pointing from one part of the limb to the other.

For instance, consider the right picture, which shows the ground truth

feature map, or part affinity field, corresponding to the forearm.

The value at a point p is the unit vector from j1 to j2,

where k is the person ID in the image.

For all other points, the vector is zero-valued.
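A numpy sketch of this ground-truth construction for a single limb follows; the function name and the `limb_width` threshold for "pixels belonging to the limb" are illustrative assumptions:

```python
import numpy as np

def paf_ground_truth(h, w, j1, j2, limb_width=1.0):
    """Ground-truth part affinity field for one limb: at every pixel
    within limb_width of the segment j1 -> j2, store the unit vector
    pointing from j1 to j2; everywhere else the vector is zero."""
    j1, j2 = np.asarray(j1, float), np.asarray(j2, float)
    v = j2 - j1
    length = np.linalg.norm(v)
    u = v / (length + 1e-8)              # unit vector along the limb
    ys, xs = np.mgrid[0:h, 0:w]
    p = np.stack([xs - j1[0], ys - j1[1]], axis=-1)
    along = p @ u                        # projection onto the limb axis
    perp = np.abs(p[..., 0] * u[1] - p[..., 1] * u[0])  # distance to the line
    mask = (along >= 0) & (along <= length) & (perp <= limb_width)
    field = np.zeros((h, w, 2))
    field[mask] = u
    return field
```

For multiple people, one such field is built per person (indexed by the person ID k) and the per-person fields are averaged where they overlap.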

Given these key point candidate confidence maps and limb orientations

from our convolutional neural network,

we can find the optimal association of key points to people by

solving a maximum weight bipartite graph matching problem, on a

graph whose vertices are the key point candidates

and whose edges are weighted limbs, as in the bottom-left picture.
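For small candidate sets, the bipartite matching between the two endpoint types of one limb can be solved exhaustively; the following sketch (with an illustrative brute-force search rather than the Hungarian algorithm a real implementation would use) picks the assignment with the maximum total edge weight:

```python
import itertools
import numpy as np

def best_matching(score_matrix):
    """Exhaustive maximum-weight bipartite matching.
    Rows index candidates for one body part, columns index
    candidates for the other part of the same limb."""
    n = score_matrix.shape[0]
    best_total, best_perm = -np.inf, None
    for perm in itertools.permutations(range(score_matrix.shape[1]), n):
        total = sum(score_matrix[i, j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(enumerate(best_perm)), best_total
```

Running this independently per limb type, and then chaining limbs that share a key point, assembles the full-body poses.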

The weights are the confidences of the associations, and

they can be calculated as the line integral of the corresponding affinity

field along the line segment connecting the candidate part locations.
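In practice the line integral is approximated by sampling points along the segment and accumulating the dot product of the field with the segment's unit direction; a numpy sketch (function name and sample count are illustrative):

```python
import numpy as np

def association_score(paf, p1, p2, n_samples=10):
    """Approximate the line integral of the affinity field along the
    segment p1 -> p2: sample points on the segment and average the
    dot product of the field with the segment's unit direction."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    d = p2 - p1
    u = d / (np.linalg.norm(d) + 1e-8)
    score = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):
        x, y = (p1 + t * d).round().astype(int)
        score += paf[y, x] @ u   # field agrees with the limb direction?
    return score / n_samples
```

A pair of candidates connected by a true limb scores close to 1, because the field vectors align with the segment; a spurious pair scores near 0.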

7:48

To summarize, human pose estimation

aims to predict locations of anatomical keypoints for individual people.

And we can employ part-based methods for that,

like those used for keypoint regression.

Semantic segmentation machinery forms a natural basis for

pose estimation, along with some other tricks,

like part affinity fields.

[SOUND]