A summary of Joint Head Pose Estimation and Face Alignment Framework Using Global and Local CNN Features by Xiang Xu and A. Kakadiaris

Nicholas M. Synovic

12-04-2022 - 4 minutes read - 821 words

A summary of Joint Head Pose Estimation and Face Alignment Framework Using Global and Local CNN Features

Xiang Xu and A. Kakadiaris IEEE FG, 2017 DOI

For the summary of the paper, go to the Summary section of this article.

A summary of Joint Head Pose Estimation and Face Alignment Framework Using Global and Local CNN Features

First Pass

Read the title, abstract, introduction, section and sub-section headings, and conclusion

Problem

What is the problem addressed in the paper?

The problem is two fold:

To build a face detector that is robust to occlusion and “in the wild” images.
To incorporate facial landmarks via global and local features of the image to better isolate not only where the face(s) are within the image, but also the eyes, jaw, mouth, etc.

Motivation

Why should we care about this paper?

Because the solution proposed allows for the identification and isolation of facial features of an input image.

Context

What other types of papers is the work related to?

This paper is related to works involving face and facial landmark detection.

Contributions

What are the author’s main contributions?

Their contributions are a methodology for leveraging the relationship between head pose and face landmarks to find the best shape for initialization and the usage of global and local CNN features for pose estimation and face alignment in a cascading manner.

Second Pass

A proper read through of the paper is required to answer this

Background Work

What has been done prior to this paper?

Work has been done to detect faces and to identify the facial landmarks of a detected face.

Figures, Diagrams, Illustrations, and Graphs

Are the axes properly labeled? Are results shown with error bars, so that conclusions are statistically significant?

Nearly all of the figures, charts, and tables are clear. Figure 4 is lacking description for which set of images correspond to what model.

Clarity

Is the paper well written?

The paper is well written.

Relevant Work

Mark relevant work for review

The following relevant work can be found in the Citations section of this article.

Methodology

What methodology did the author’s use to validate their contributions?

The authors compared their solution (JFA) against other SOTA models including ERT on pose and facial landmark detection datasets.

Author Assumptions

What assumptions does the author(s) make? Are they justified assumptions?

That global features can capture fine grain detail of an image.

Correctness

Do the assumptions seem valid?

Not really, as global features are intended to look at the feature map as a whole to identify very large or prominent features (such as an entire face). To the authors credit, they do discuss this at the end of the paper. But as it stands, the usage of global features to identify fine grain detail (such as facial landmarks) is a waste of computational resources that only attempts to mimic the local feature analysis that their algorithm also performs.

Future Directions

My own proposed future directions for the work

I’d like to see if I could optimize their solution by removing the global feature analysis for facial landmarks component. The idea behind optimizing their solution is to implement it on a mobile or low powered system.

Open Questions

What open questions do I have about the work?

As the solution proposed involves a database lookup to identify areas of interest for facial landmarks, how robust is that database to rotation, scale, lighting, and occlusion variance?

Author Feedback

What feedback would I give to the authors?

This is an interesting solution to the problem of identifying facial landmarks, however

Summary

A summary of the paper

The paper Joint Head Pose Estimation and Face Alignment Framework Using Global and Local CNN Features by Xiang Xu and A. Kakadiaris [1] describes a solution to not only detect faces, but identify facial landmarks using a cascading CNN architecture. The reference model for the architecture was able to attain SOTA results using an off-the-shelf face detector (Dlib) and could accurately identify facial landmarks. It is able top detect facial landmarks by analyzing both the global and local features generated from the face detectors feature maps. A global feature detector is used for pose estimation. This is then passed into a database of poses used for estimating where facial features are most likely to be in the input image. A local feature detector than analyzes the feature map to identify all of the facial landmarks within the image.

Summarization Technique

This paper was summarized using a modified technique proposed by S. Keshav in his work How to Read a Paper [0].