
SCIENTIA SINICA Informationis, Volume 49, Issue 4: 450-463 (2019) https://doi.org/10.1360/N112018-00060

A multi-pose face frontalization method based on encoder-decoder network

More info
  • Received: May 26, 2018
  • Accepted: Jun 25, 2018
  • Published: Apr 11, 2019

Abstract

Multi-pose face frontalization can alleviate the influence of pose variation on face analysis. Traditional methods that synthesize a frontal face image directly from a multi-pose face image tend to lose face details. To overcome this problem, we propose a face frontalization method based on an encoder-decoder network, namely the multitask convolutional encoder-decoder network (MCEDN). The MCEDN introduces a frontal raw feature network to synthesize the global raw features of the frontal face; the decoder then synthesizes a clearer frontal face image by fusing the local features extracted by the encoder with these global raw features. We use a multitask learning mechanism to build an end-to-end model that integrates three modules: local feature extraction, global raw feature synthesis, and frontal image synthesis. Sharing parameters across the tasks improves model performance. Compared with existing methods, MCEDN synthesizes frontal face images with a stable structure and rich details on multiple datasets. We further use the synthesized frontal images for face recognition and facial expression recognition, and the state-of-the-art results demonstrate that MCEDN preserves many face details.


Funded by

National Key Research and Development Program of China (2016YFB1001405)

National Natural Science Foundation of China (61661146002)

Key Research Program of Frontier Sciences, Chinese Academy of Sciences (QYZDY-SSW-JSC041)


References

[1] Zhu Z Y, Luo P, Wang X G, et al. Deep learning identity-preserving face space. In: Proceedings of the IEEE International Conference on Computer Vision, Sydney, 2013. 113--120.

[2] Zhu Z Y, Luo P, Wang X G, et al. Multi-view perceptron: a deep model for learning face identity and view representations. In: Proceedings of Advances in Neural Information Processing Systems, Montreal, 2014. 217--225.

[3] Argyriou A, Evgeniou T, Pontil M. Multi-task feature learning. In: Proceedings of the 20th Annual Conference on Neural Information Processing Systems, Vancouver, 2006. 41--48.

[4] Zhu X Y, Lei Z, Yan J J, et al. High-fidelity pose and expression normalization for face recognition in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 787--796.

[5] Asthana A, Marks T K, Jones M J, et al. Fully automatic pose-invariant face recognition via 3D pose normalization. In: Proceedings of the IEEE International Conference on Computer Vision, Barcelona, 2011. 937--944.

[6] Hassner T, Harel S, Paz E, et al. Effective face frontalization in unconstrained images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 4295--4304.

[7] Fang S Y, Zhou D K, Cao Y P, et al. Frontal face image synthesis based on pose estimation. Comput Eng, 2015, 41: 240--244.

[8] Prince S J D, Warrell J, Elder J H. Tied factor analysis for face recognition across large pose differences. IEEE Trans Pattern Anal Mach Intell, 2008, 30: 970--984.

[9] Chai X J, Shan S G, Chen X L. Locally linear regression for pose-invariant face recognition. IEEE Trans Image Process, 2007, 16: 1716--1725.

[10] Wang Y N, Su J B. Multipose face image recognition based on image synthesis. Pattern Recogn Artif Intel, 2015, 28: 848--856.

[11] Li Y L, Feng J F. Multi-view face synthesis using minimum bending deformation. J Comput-Aided Design Comput Graph, 2011, 23: 1085--1090.

[12] Yi X B, Chen Y. Frontal face synthesizing based on Poisson image fusion under piecewise affine warp. Comput Eng Appl, 2016, 52: 172--177.

[13] Kan M N, Shan S G, Chang H, et al. Stacked progressive auto-encoders (SPAE) for face recognition across poses. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 2014. 1883--1890.

[14] Ouyang N, Ma Y T, Lin L P. Multi-pose face reconstruction and recognition based on multi-task learning. J Comput Appl, 2016, 37: 896--900.

[15] Yim J, Jung H, Yoo B, et al. Rotating your face using multi-task deep neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 676--684.

[16] Ghodrati A, Jia X, Pedersoli M, et al. Towards automatic image editing: learning to see another you. arXiv preprint, 2015.

[17] Huang R, Zhang S, Li T Y, et al. Beyond face rotation: global and local perception GAN for photorealistic and identity preserving frontal view synthesis. arXiv preprint, 2017.

[18] Tran L, Yin X, Liu X M. Disentangled representation learning GAN for pose-invariant face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 1283--1292.

[19] Theis L, Shi W, Cunningham A, et al. Lossy image compression with compressive autoencoders. arXiv preprint, 2017.

[20] Goodfellow I, Bengio Y, Courville A. Deep Learning. Cambridge: MIT Press, 2016.

[21] Mayya V, Pai R M, Manohara Pai M M. Automatic facial expression recognition using DCNN. Procedia Comput Sci, 2016, 93: 453--461.

[22] Nair V, Hinton G E. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the International Conference on Machine Learning, Haifa, 2010. 807--814.

[23] Gross R, Matthews I, Cohn J, et al. Multi-PIE. Image Vision Computing, 2010, 28: 807--813.

[24] Gao W, Cao B, Shan S G, et al. The CAS-PEAL large-scale Chinese face database and baseline evaluations. IEEE Trans Syst Man Cybern A, 2008, 38: 149--161.

[25] Huang G B, Ramesh M, Berg T, et al. Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.

[26] Liu Z, Luo P, Wang X G, et al. Deep learning face attributes in the wild. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 3730--3738.

[27] Wang Z, Bovik A C, Sheikh H R, et al. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process, 2004, 13: 600--612.

[28] Ding H, Zhou S K, Chellappa R. FaceNet2ExpNet: regularizing a deep face recognition net for expression recognition. In: Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition, Washington, 2017. 118--126.

[29] Wu X, He R, Sun Z A, et al. A light CNN for deep face representation with noisy labels. arXiv preprint, 2015.

  • Figure 1

    Multi-pose face frontalization network structure

  • Figure 2

    Feature analysis subtask network structure

  • Figure 3

    Synthesis results by each subtask on Multi-PIE

  • Figure 4

    Synthesis results by MCEDN under different poses

  • Figure 5

    Synthesis results by MCEDN under different poses on CAS-PEAL-R1 dataset

  • Figure 6

    Synthesis results on different datasets by different methods

  • Figure 7

    Synthesis results under various illuminations

  • Algorithm 1 Feature synthesis

    Require: $D$: multi-pose face image set; $R$: frontal face image set; $B$: batch size; $T$: number of updates; $\eta$: learning rate; $\theta$: trainable parameter set of this subtask; $\theta_{i,j}$: trainable parameters of the $i$-th to $j$-th layers of the subtask; $\alpha$: weight of the similarity loss of this subtask; $\beta$: weight of the similarity loss of the image synthesis task.

    for $t = 0, \ldots, T$ do

    Sample $B$ images from the training set $D$ as the current training batch $X$, and take the $B$ frontal images in $R$ that correspond to the images in $X$ as the current target batch $Y$;

    $F_l \leftarrow f_{\theta_{1,3}}(X)$; // $F_l$: local features of the multi-pose face images

    $F_g \leftarrow f_{\theta_{4,8}}(F_l)$; // $F_g$: global raw features of the frontal face

    $\hat{X}_m \leftarrow f_{\theta_{9,11}}(F_g)$; // $\hat{X}_m$: output of the subtask

    $L_m \leftarrow \frac{1}{B}\sum_{i=1}^{B} \|\hat{X}_m^i - Y^i\|_2^2$; // $L_m$: similarity loss of the feature synthesis task

    $L \leftarrow \alpha L_m + \beta L_o$; // $L_o$: similarity loss of the image synthesis task; $L$: total similarity loss of MCEDN

    $\theta \leftarrow \mathrm{Adam}(\theta; L; \eta)$;

    end for
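
A minimal PyTorch sketch of this update loop, for illustration only: `f_local`, `f_global`, and `f_out` are hypothetical placeholders standing in for layers 1-3, 4-8, and 9-11 of the subtask network, and the loss weights, layer shapes, and learning rate are assumed values, not the paper's settings.

```python
# Sketch of the Algorithm 1 update in PyTorch (not the paper's code).
import torch
import torch.nn as nn

f_local = nn.Sequential(nn.Conv2d(3, 64, 5, padding=2), nn.ReLU())    # layers 1-3 (placeholder)
f_global = nn.Sequential(nn.Conv2d(64, 64, 5, padding=2), nn.ReLU())  # layers 4-8 (placeholder)
f_out = nn.Conv2d(64, 3, 5, padding=2)                                # layers 9-11 (placeholder)

params = list(f_local.parameters()) + list(f_global.parameters()) + list(f_out.parameters())
opt = torch.optim.Adam(params, lr=1e-4)  # Adam update, learning rate assumed
alpha, beta = 1.0, 1.0                   # loss weights (assumed)

def train_step(X, Y, L_o=torch.tensor(0.0)):
    """One update; X: multi-pose batch, Y: frontal targets, L_o: loss from Algorithm 2."""
    F_l = f_local(X)                                 # local features of multi-pose faces
    F_g = f_global(F_l)                              # global raw features of the frontal face
    X_m = f_out(F_g)                                 # subtask output image
    L_m = ((X_m - Y) ** 2).flatten(1).sum(1).mean()  # (1/B) sum_i ||X_m^i - Y^i||_2^2
    L = alpha * L_m + beta * L_o                     # joint MCEDN loss
    opt.zero_grad()
    L.backward()
    opt.step()
    return F_l.detach(), F_g.detach(), L_m.item()

# Usage with random stand-in data:
X, Y = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
F_l, F_g, loss = train_step(X, Y)
```

In the full MCEDN the two subtasks are trained jointly, so `L_o` would come from the image synthesis branch of Algorithm 2 rather than defaulting to zero.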

  • Table 1   Network parameters
    Layers Input size Kernel size/stride Output size
    Conv0 64$\times$64$\times$3 5$\times$5/1 64$\times$64$\times$64
    Conv1 64$\times$64$\times$64 5$\times$5/1 64$\times$64$\times$64
    Conv2 64$\times$64$\times$64 5$\times$5/2 32$\times$32$\times$128
    Conv3 32$\times$32$\times$128 5$\times$5/1 32$\times$32$\times$128
    Conv4 32$\times$32$\times$128 5$\times$5/2 16$\times$16$\times$256
    Conv5 16$\times$16$\times$256 5$\times$5/1 16$\times$16$\times$256
    Deconv6 16$\times$16$\times$256 5$\times$5/2 32$\times$32$\times$128
    Conv7 32$\times$32$\times$128 5$\times$5/1 32$\times$32$\times$128
    Deconv8 32$\times$32$\times$128 5$\times$5/2 64$\times$64$\times$64
    Conv9 64$\times$64$\times$64 5$\times$5/1 64$\times$64$\times$64
    Conv10 64$\times$64$\times$64 5$\times$5/1 64$\times$64$\times$3
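
For concreteness, a PyTorch sketch of the Table 1 stack follows. The kernel sizes, strides, and feature-map sizes are taken from the table; the paddings, output paddings, and ReLU activations (cf. Ref. [22]) are assumptions chosen so that the listed input/output sizes hold for 64$\times$64$\times$3 inputs.

```python
# Sketch of the Table 1 layer stack (paddings and activations assumed).
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=1, padding=2), nn.ReLU(),     # Conv0: 64x64x3 -> 64x64x64
            nn.Conv2d(64, 64, 5, stride=1, padding=2), nn.ReLU(),    # Conv1: 64x64x64
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(),   # Conv2: -> 32x32x128
            nn.Conv2d(128, 128, 5, stride=1, padding=2), nn.ReLU(),  # Conv3: 32x32x128
            nn.Conv2d(128, 256, 5, stride=2, padding=2), nn.ReLU(),  # Conv4: -> 16x16x256
            nn.Conv2d(256, 256, 5, stride=1, padding=2), nn.ReLU(),  # Conv5: 16x16x256
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),  # Deconv6: -> 32x32x128
            nn.Conv2d(128, 128, 5, stride=1, padding=2), nn.ReLU(),                             # Conv7: 32x32x128
            nn.ConvTranspose2d(128, 64, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),   # Deconv8: -> 64x64x64
            nn.Conv2d(64, 64, 5, stride=1, padding=2), nn.ReLU(),                               # Conv9: 64x64x64
            nn.Conv2d(64, 3, 5, stride=1, padding=2),                                           # Conv10: -> 64x64x3
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```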
  • Algorithm 2 Image synthesis

    Require: $D$: multi-pose face image set; $R$: frontal face image set; $B$: batch size; $T$: number of updates; $\eta$: learning rate; $F_l$: local features of the multi-pose face images, of size $B \times H \times W \times C_l$; $F_g$: global features of the frontal face images, of size $B \times H \times W \times C_g$; $\varphi$: trainable parameter set of this subtask; $\alpha$: weight of the similarity loss of the feature synthesis task; $\beta$: weight of the similarity loss of the image synthesis task.

    for $t = 0, \ldots, T$ do

    Sample $B$ images from the training set $D$ as the current training batch $X$, and take the $B$ frontal images in $R$ that correspond to the images in $X$ as the current target batch $Y$;

    Get $F_l$ and $F_g$ by Algorithm 1;

    $F_{\rm concat} \leftarrow {\rm Concat}(F_l, F_g)$; // $F_{\rm concat}$: of size $B \times H \times W \times (C_l + C_g)$

    $\hat{X}_o \leftarrow f_{\varphi}(F_{\rm concat})$; // $\hat{X}_o$: synthetic frontal face image

    $L_o \leftarrow \frac{1}{B}\sum_{i=1}^{B} \|\hat{X}_o^i - Y^i\|_2^2$; // $L_o$: similarity loss of the image synthesis task

    $L \leftarrow \alpha L_m + \beta L_o$; // $L_m$: similarity loss of the feature synthesis task; $L$: total similarity loss of MCEDN

    $\varphi \leftarrow \mathrm{Adam}(\varphi; L; \eta)$;

    end for
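
The fusion step of Algorithm 2 amounts to a channel-wise concatenation followed by a decoder. A minimal PyTorch sketch is given below (channels-first $B \times C \times H \times W$ layout, unlike the $B \times H \times W \times C$ layout above); `f_decoder` and the channel counts `C_l`, `C_g` are illustrative placeholders, not the paper's decoder.

```python
# Sketch of the Algorithm 2 fusion and synthesis step (placeholder decoder).
import torch
import torch.nn as nn

C_l, C_g = 64, 64                            # feature channel counts (assumed)
f_decoder = nn.Sequential(                   # decoder f_phi (placeholder)
    nn.Conv2d(C_l + C_g, 64, 5, padding=2), nn.ReLU(),
    nn.Conv2d(64, 3, 5, padding=2),
)

def synthesize(F_l, F_g):
    """Fuse local and global features and decode a frontal image."""
    F_concat = torch.cat([F_l, F_g], dim=1)  # B x (C_l + C_g) x H x W
    return f_decoder(F_concat)               # synthetic frontal face image X_o

# Usage: image-synthesis loss L_o against frontal targets Y, as in Algorithm 2.
F_l, F_g = torch.randn(4, C_l, 64, 64), torch.randn(4, C_g, 64, 64)
Y = torch.randn(4, 3, 64, 64)
X_o = synthesize(F_l, F_g)
L_o = ((X_o - Y) ** 2).flatten(1).sum(1).mean()
```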

  • Table 2   Training time for different models
    Model Dataset Training time (h)
    Basic convolutional encoder-decoder network (BCEDN) Multi-PIE 23
    Two-stage convolutional encoder-decoder network (TCEDN) Multi-PIE 54
    MCEDN Multi-PIE 48
    MCEDN (transfer training) CAS-PEAL-R1 1.5
  • Table 3   Similarity evaluation between the results of three models and targets
    Method $\pm~45^{\circ}$ $\pm~30^{\circ}$ $\pm~15^{\circ}$
    SSIM PSNR SSIM PSNR SSIM PSNR
    BCEDN 0.5964 16.9613 0.6278 17.3570 0.7053 19.6656
    TCEDN 0.7037 19.1310 0.7303 20.1150 0.8023 22.2932
    MCEDN 0.7690 23.1557 0.7917 23.4195 0.8719 26.9208
  • Table 4   Similarity evaluation between Task1, Task2 and targets
    Method $\pm~45^{\circ}$ $\pm~30^{\circ}$ $\pm~15^{\circ}$
    SSIM PSNR SSIM PSNR SSIM PSNR
    Task1 0.7511 22.4044 0.7615 21.9180 0.8021 23.0737
    Task2 0.7690 23.1557 0.7917 23.4195 0.8719 26.9208
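
The SSIM (Ref. [27]) and PSNR scores in Tables 3 and 4 can be reproduced with standard implementations; below is a sketch using scikit-image (version 0.19 or later for `channel_axis`), since the paper's exact evaluation code is not given.

```python
# Hypothetical evaluation helper: compares a synthesized frontal image with
# its ground-truth frontal target using SSIM (Ref. [27]) and PSNR.
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate(pred, target):
    """pred, target: H x W x 3 uint8 images (synthesized vs. target frontal face)."""
    ssim = structural_similarity(pred, target, channel_axis=2, data_range=255)
    psnr = peak_signal_noise_ratio(target, pred, data_range=255)  # in dB
    return ssim, psnr
```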
  • Table 5   Rank-1 facial expression recognition rate
    Method $\pm~45^{\circ}$ $\pm~30^{\circ}$ $\pm~15^{\circ}$
    Original images 0.6503 0.8557 0.9484
    Ref. [15] 0.8511 0.9127 0.9515
    Task1 0.8687 0.9285 0.9501
    Task2 $\mathbf{0.9224}$ $\mathbf{0.9439}$ $\mathbf{0.9618}$
  • Table 6   Rank-1 face recognition rate
    Method $\pm~45^{\circ}$ $\pm~30^{\circ}$ $\pm~15^{\circ}$
    Ref. [15] 0.8838 0.9526 0.9714
    Task1 0.8918 0.9584 0.9746
    Task2 $\mathbf{0.9136}$ $\mathbf{0.9615}$ $\mathbf{0.9803}$
