Attended end-to-end architecture for age estimation from facial expression videos

Wenjie Pei; Hamdi Dibeklioglu; Tadas Baltrusaitis; David M.J. Tax

doi:10.1109/TIP.2019.2948288

Corresponding author: Wenjie Pei

Pattern Recognition and Bioinformatics

Research output: Contribution to journal › Article › Scientific › peer-review

34 Citations (Scopus)


Abstract

The main challenges of age estimation from facial expression videos lie not only in modeling the static facial appearance, but also in capturing the temporal facial dynamics. Traditional approaches to this problem focus on constructing handcrafted features to explore the discriminative information contained in facial appearance and dynamics separately. This relies on sophisticated feature refinement and framework design. In this paper, we present an end-to-end architecture for age estimation, called the Spatially-Indexed Attention Model (SIAM), which simultaneously learns both the appearance and dynamics of age from raw videos of facial expressions. Specifically, we employ convolutional neural networks to extract effective latent appearance representations and feed them into recurrent networks to model the temporal dynamics. More importantly, we propose to leverage attention models for salience detection in both the spatial domain for each single image and the temporal domain for the whole video. We design a specific spatially-indexed attention mechanism among the convolutional layers to extract the salient facial regions in each individual image, and a temporal attention layer to assign attention weights to each frame. This two-pronged approach not only improves performance by allowing the model to focus on informative frames and facial areas, but also offers an interpretable correspondence between the task of age estimation and both the spatial facial regions and the temporal frames. We demonstrate the strong performance of our model in experiments on a large, gender-balanced database of 400 subjects with ages spanning from 8 to 76 years. Experiments reveal that our model significantly outperforms the state-of-the-art methods given sufficient training data.
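The two attention steps named in the abstract can be illustrated with a minimal, generic sketch. This is not the SIAM implementation: the sigmoid gating, the softmax pooling, the scoring vectors `w` and `v`, and the function names are all assumptions chosen for illustration, and the convolutional and recurrent networks are left out, so plain feature vectors stand in for their outputs.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(regions, w):
    """Gate each facial region of one frame by a salience value in (0, 1).

    regions: list of per-location feature vectors (stand-in for a CNN feature map).
    w: hypothetical scoring vector (an assumption, not from the paper).
    Returns the gated regions and the gates themselves.
    """
    gates = [1.0 / (1.0 + math.exp(-dot(r, w))) for r in regions]
    gated = [[g * x for x in r] for r, g in zip(regions, gates)]
    return gated, gates


def temporal_attention(frames, v):
    """Pool per-frame features into one video-level vector.

    frames: list of per-frame feature vectors (stand-in for recurrent states).
    v: hypothetical scoring vector (an assumption, not from the paper).
    Returns the attention-weighted average and the per-frame weights.
    """
    weights = softmax([dot(f, v) for f in frames])
    dim = len(frames[0])
    pooled = [sum(wt * f[d] for wt, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# Usage: three frames with two-dimensional features.
frames = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
pooled, weights = temporal_attention(frames, [1.0, 0.0])
gated, gates = spatial_attention(frames, [0.5, -0.5])
```

The per-frame weights sum to one, so they can be read as the share of attention each frame receives. The spatial gates play the same interpretive role for facial regions within a single frame.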

Original language	English
Article number	8882508
Pages (from-to)	1972-1984
Number of pages	13
Journal	IEEE Transactions on Image Processing
Volume	29
DOIs	https://doi.org/10.1109/TIP.2019.2948288
Publication status	Published - 2020

Keywords

Age estimation
attention
end-to-end
facial dynamics

Access to Document

10.1109/TIP.2019.2948288

Age_estimation_TIP: Accepted author manuscript, 634 KB

Cite this

@article{bcc91ddb55e44091a25075603bb1b9cb,

title = "Attended end-to-end architecture for age estimation from facial expression videos",

abstract = "The main challenges of age estimation from facial expression videos lie not only in the modeling of the static facial appearance, but also in the capturing of the temporal facial dynamics. Traditional techniques to this problem focus on constructing handcrafted features to explore the discriminative information contained in facial appearance and dynamics separately. This relies on sophisticated feature-refinement and framework-design. In this paper, we present an end-to-end architecture for age estimation, called Spatially-Indexed Attention Model (SIAM), which is able to simultaneously learn both the appearance and dynamics of age from raw videos of facial expressions. Specifically, we employ convolutional neural networks to extract effective latent appearance representations and feed them into recurrent networks to model the temporal dynamics. More importantly, we propose to leverage attention models for salience detection in both the spatial domain for each single image and the temporal domain for the whole video as well. We design a specific spatially-indexed attention mechanism among the convolutional layers to extract the salient facial regions in each individual image, and a temporal attention layer to assign attention weights to each frame. This two-pronged approach not only improves the performance by allowing the model to focus on informative frames and facial areas, but it also offers an interpretable correspondence between the spatial facial regions as well as temporal frames, and the task of age estimation. We demonstrate the strong performance of our model in experiments on a large, gender-balanced database with 400 subjects with ages spanning from 8 to 76 years. Experiments reveal that our model exhibits significant superiority over the state-of-the-art methods given sufficient training data.",

keywords = "Age estimation, attention, end-to-end, facial dynamics",

author = "Pei, Wenjie and Dibeklioglu, Hamdi and Baltrusaitis, Tadas and Tax, {David M.J.}",

year = "2020",

doi = "10.1109/TIP.2019.2948288",

language = "English",

volume = "29",

pages = "1972--1984",

journal = "IEEE Transactions on Image Processing",

issn = "1057-7149",

publisher = "Institute of Electrical and Electronics Engineers (IEEE)",

}

TY - JOUR

T1 - Attended end-to-end architecture for age estimation from facial expression videos

AU - Pei, Wenjie

AU - Dibeklioglu, Hamdi

AU - Baltrusaitis, Tadas

AU - Tax, David M.J.

PY - 2020

Y1 - 2020

N2 - The main challenges of age estimation from facial expression videos lie not only in the modeling of the static facial appearance, but also in the capturing of the temporal facial dynamics. Traditional techniques to this problem focus on constructing handcrafted features to explore the discriminative information contained in facial appearance and dynamics separately. This relies on sophisticated feature-refinement and framework-design. In this paper, we present an end-to-end architecture for age estimation, called Spatially-Indexed Attention Model (SIAM), which is able to simultaneously learn both the appearance and dynamics of age from raw videos of facial expressions. Specifically, we employ convolutional neural networks to extract effective latent appearance representations and feed them into recurrent networks to model the temporal dynamics. More importantly, we propose to leverage attention models for salience detection in both the spatial domain for each single image and the temporal domain for the whole video as well. We design a specific spatially-indexed attention mechanism among the convolutional layers to extract the salient facial regions in each individual image, and a temporal attention layer to assign attention weights to each frame. This two-pronged approach not only improves the performance by allowing the model to focus on informative frames and facial areas, but it also offers an interpretable correspondence between the spatial facial regions as well as temporal frames, and the task of age estimation. We demonstrate the strong performance of our model in experiments on a large, gender-balanced database with 400 subjects with ages spanning from 8 to 76 years. Experiments reveal that our model exhibits significant superiority over the state-of-the-art methods given sufficient training data.

KW - Age estimation

KW - attention

KW - end-to-end

KW - facial dynamics

UR - http://www.scopus.com/inward/record.url?scp=85077494117&partnerID=8YFLogxK

U2 - 10.1109/TIP.2019.2948288

DO - 10.1109/TIP.2019.2948288

M3 - Article

AN - SCOPUS:85077494117

SN - 1057-7149

VL - 29

SP - 1972

EP - 1984

JO - IEEE Transactions on Image Processing

JF - IEEE Transactions on Image Processing

M1 - 8882508

ER -
