SCIENCE CHINA Information Sciences, Volume 62, Issue 12: 220101(2019) https://doi.org/10.1007/s11432-019-9935-6

## Feature context learning for human parsing

• Accepted Jun 13, 2019
• Published Nov 11, 2019

### Abstract

Parsing inconsistency, referring to the scatters and speckles in the parsing results as well as imprecise contours, is a long-standing problem in human parsing. It results from the fact that the pixel-wise classification loss independently considers each pixel. To address the inconsistency issue, we propose in this paper an end-to-end trainable, highly flexible and generic module called feature context module (FCM). FCM explores the correlation of adjacent pixels and aggregates the contextual information embedded in the real topology of the human body. Therefore, the feature representations are enhanced and thus quite robust in distinguishing semantically related parts. Extensive experiments are done with three different backbone models and four benchmark datasets, suggesting that FCM can be an effective and efficient plug-in to consistently improve the performance of existing algorithms without sacrificing the inference speed too much.
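The two steps the Figure 2 caption attributes to FCM (context affinity estimation over the 4-connected neighbours plus the pixel itself, then affinity-guided weighted aggregation) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The softmax over dot-product similarity stands in for the learned affinity estimation, and the edge padding, the residual addition, and all function names are assumptions made for illustration.

```python
import numpy as np

# Offsets (dy, dx): the pixel itself plus its 4-connected neighbours,
# giving the 5 affinity channels mentioned in the Figure 2 caption.
OFFSETS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]


def shift(feat, dy, dx):
    """Return feat (C, H, W) sampled at (y + dy, x + dx), with edge padding."""
    padded = np.pad(feat, ((0, 0), (1, 1), (1, 1)), mode="edge")
    h, w = feat.shape[1:]
    return padded[:, 1 + dy:1 + dy + h, 1 + dx:1 + dx + w]


def context_affinity(feat):
    """Assumed stand-in for affinity estimation: softmax over dot-product
    similarity between each pixel and its context pixels, shape (5, H, W)."""
    scores = np.stack([(feat * shift(feat, dy, dx)).sum(axis=0)
                       for dy, dx in OFFSETS])
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=0, keepdims=True)


def context_aggregation(feat, affinity):
    """Add the affinity-weighted sum of context features to the input."""
    context = sum(affinity[k] * shift(feat, dy, dx)
                  for k, (dy, dx) in enumerate(OFFSETS))
    return feat + context  # residual enrichment is an assumption


rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 6, 7))           # hypothetical (C, H, W) feature
affinity = context_affinity(feat)               # 5-channel affinity weight map
enhanced = context_aggregation(feat, affinity)  # same shape as feat
print(affinity.shape, enhanced.shape)
```

Because the weights at each position sum to one, a pixel whose neighbours carry similar features is pulled toward their average, which is the kind of local smoothing that would suppress isolated speckles.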

### Acknowledgment

This work was supported in part by National Key Research and Development Program of China (Grant No. 2018YFB1004600), National Natural Science Foundation of China (Grant No. 61703171), and Natural Science Foundation of Hubei Province of China (Grant No. 2018CFB199). This work was also supported by Alibaba Group through Alibaba Innovative Research (AIR) Program. The work of Yongchao XU was supported by Young Elite Scientists Sponsorship Program by CAST. The work of Xiang BAI was supported by National Program for Support of Top-Notch Young Professionals and in part by Program for HUST Academic Frontier Youth Team.

### References

[1] Gan C, Lin M, Yang Y, et al. Concepts not alone: exploring pairwise relationships for zero-shot video activity recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2016. 3487--3493.

[2] Han X, Wu Z X, Wu Z, et al. Viton: an image-based virtual try-on network. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2018. 7543--7552.

[3] Kalayeh M M, Basaran E, Gökmen M, et al. Human semantic parsing for person re-identification. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2018. 1062--1071.

[4] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2015. 3431--3440.

[5] Zhao H S, Shi J P, Qi X J, et al. Pyramid scene parsing network. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2017. 2881--2890.

[6] Zhou Y Y, Wang Y, Tang P, et al. Semi-supervised 3D abdominal multi-organ segmentation via deep multi-planar co-training. In: Proceedings of 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 2019. 1550--5790.

[7] Luo Y W, Zheng Z D, Zheng L, et al. Macro-micro adversarial network for human parsing. In: Proceedings of European Conference on Computer Vision, 2018. 418--434.

[8] Chen L C, Papandreou G, Kokkinos I. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell, 2018, 40: 834--848.

[9] Nie X C, Feng J S, Yan S C. Mutual learning to adapt for joint human parsing and pose estimation. In: Proceedings of European Conference on Computer Vision, 2018. 502--517.

[10] Gong K, Liang X D, Li Y C, et al. Instance-level human parsing via part grouping network. In: Proceedings of European Conference on Computer Vision, 2018. 770--785.

[11] Liu T, Ruan T, Huang Z, et al. Devil in the details: towards accurate single and multiple human parsing. 2018. arXiv.

[12] Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks. 2016. arXiv.

[13] Veličković P, Cucurull G, Casanova A, et al. Graph attention networks. 2017. arXiv.

[14] Xia F T, Wang P, Chen X J, et al. Joint multi-person pose estimation and semantic part segmentation. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2017. 6769--6778.

[15] Fang H-S, Lu G S, Fang X L, et al. Weakly and semi supervised human body part parsing via pose-guided knowledge transfer. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2018.

[16] Liu S, Sun Y, Zhu D F, et al. Cross-domain human parsing via adversarial feature and label adaptation. 2018. arXiv.

[17] Liang X, Gong K, Shen X. Look into person: joint body parsing & pose estimation network and a new benchmark. IEEE Trans Pattern Anal Mach Intell, 2019, 41: 871--885.

[18] Zhu B K, Chen Y Y, Tang M, et al. Progressive cognitive human parsing. In: Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

[19] Guo L H, Guo C G, Li L. Two-stage local constrained sparse coding for fine-grained visual categorization. Sci China Inf Sci, 2018, 61: 018104.

[20] Sun H Q, Pang Y W. GlanceNets - efficient convolutional neural networks with adaptive hard example mining. Sci China Inf Sci, 2018, 61: 109101.

[21] Xu Y, Wang Y, Zhou W, et al. TextField: learning a deep direction field for irregular scene text detection. IEEE Trans Image Process, 2019. doi: 10.1109/TIP.2019.2900589.

[22] Krähenbühl P, Koltun V. Efficient inference in fully connected crfs with gaussian edge potentials. In: Proceedings of Advances in Neural Information Processing Systems, 2011.

[23] Ke T-W, Hwang J-J, Liu Z W, et al. Adaptive affinity field for semantic segmentation. 2018. 587--602.

[24] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. In: Proceedings of Advances in Neural Information Processing Systems, 2014.

[25] Gong K, Liang X D, Zhang D Y, et al. Look into person: self-supervised structure-sensitive learning and a new benchmark for human parsing. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2017. 932--940.

[26] Jin J W, Liu Z L, Chen C L P. Discriminative graph regularized broad learning system for image recognition. Sci China Inf Sci, 2018, 61: 112209.

[27] Liang X D, Lin L, Shen X H, et al. Interpretable structure-evolving lstm. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2017. 1010--1019.

[28] Zhang H, Dana K, Shi J P, et al. Context encoding for semantic segmentation. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2018. 7151--7160.

[29] Wang X L, Girshick R, Gupta A, et al. Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 7794--7803.

[30] Huang Z, Wang X, Huang L, et al. Ccnet: criss-cross attention for semantic segmentation. 2018. arXiv.

[31] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2016. 770--778.

[32] Chen X J, Mottaghi R, Liu X B, et al. Detect what you can: detecting and representing objects using holistic models and body parts. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2014. 1971--1978.

[33] Luo P, Wang X G, Tang X O. Pedestrian parsing via deep decompositional network. In: Proceedings of IEEE International Conference on Computer Vision, 2013. 2648--2655.

[34] Lin T-Y, Maire M, Belongie S, et al. Microsoft coco: common objects in context. In: Proceedings of European Conference on Computer Vision, 2014.

[35] Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 2481--2495.

[36] Chen L-C, Yang Y, Wang J, et al. Attention to scale: scale-aware semantic image segmentation. In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2016. 3640--3649.

[37] Liang X, Shen X, Feng J, et al. Semantic object parsing with graph lstm. In: Proceedings of European Conference on Computer Vision, 2016.

[38] Luc P, Couprie C, Chintala S, et al. Semantic segmentation using adversarial networks. In: Proceedings of NIPS Workshop, 2016.

• Figure 1

(Color online) Overall pipeline of the proposed method. FCMs are inserted into a human parsing model (solid black path), resulting in a more robust contextual feature and a more consistent parsing result.

• Figure 2

(Color online) Illustration of feature context module (FCM). FCM consists of two parts: (1) context affinity estimation which computes the correlation of each pixel with its adjacent pixels (e.g., 4 adjacent pixels under 4-connectivity and the pixel itself), resulting in a 5-channel affinity weight map; (2) feature context aggregation which enriches the input feature with the contextual feature in a weighted sum manner guided by the affinity weight map.

• Figure 3

(Color online) Visualization of the effect of FCM on the feature maps. The right column shows the learned affinity weight map ($K=4$) for the positions marked by red crisscross in input images. Parsing result on $F_b$ exhibits inconsistency on the marked positions. The proposed FCM tends to assign higher weights to correctly parsed context pixels of the same semantics (e.g., up and down neighbors in the first row, and all four neighbors except the pixel itself in the second row), alleviating the inconsistency problem.

• Figure 4

(Color online) Some qualitative illustrations on LIP in the red box, PASCAL-Person-Part in the green box, and CIHP in the blue box. For each example, from left to right: input image, groundtruth, result given by DeepLab-v2 baseline, and result of DeepLab-v2+FCM. The red rectangle and oval markers refer to inconsistency regions containing scatters and imprecise boundaries, respectively.

• Figure 5

(Color online) Examples of parsing results on the PPSS testing dataset of DeepLab-v2+FCM. The model is trained on the LIP dataset without further finetuning. Severe occlusion is present for images in (b). (a) Normal cases; (b) cases with severe occlusion.

• Table 1   Quantitative comparison (IoU (%) for each class and mIoU (%)) with state-of-the-art methods on the LIP validation set$^{\rm~a)b)}$

| Method | Hat | Hair | Glov | Sung | Clot | Dress | Coat | Sock | Pant | Suit |
|---|---|---|---|---|---|---|---|---|---|---|
| SegNet [35] | 26.60 | 44.01 | 0.01 | 0.00 | 34.46 | 0.00 | 15.97 | 3.59 | 33.56 | 0.01 |
| FCN-8s [4] | 39.79 | 58.96 | 5.32 | 3.08 | 49.08 | 12.36 | 26.82 | 15.66 | 49.41 | 6.48 |
| Attention [36] | 58.87 | 66.78 | 23.32 | 19.48 | 63.20 | 29.63 | 49.70 | 35.23 | 66.04 | 24.73 |
| SSL [25] | 59.75 | 67.25 | 28.95 | 21.57 | 65.30 | 29.49 | 51.92 | 38.52 | 68.02 | 24.48 |
| MuLA [9]$^{\rm~c)}$ | – | – | – | – | – | – | – | – | – | – |
| MMAN [7] | 57.66 | 65.63 | 30.07 | 20.02 | 64.15 | 28.39 | 51.98 | 41.46 | 71.03 | 23.61 |
| MMAN+FCM | 60.58 | 68.98 | 30.78 | 25.00 | 65.01 | 29.40 | 51.57 | 43.09 | 70.19 | 21.77 |
| PGN [10]$^{\rm~d)}$ | 61.53 | 69.13 | 34.13 | 26.99 | 68.17 | 34.93 | 55.78 | 42.50 | 70.69 | 25.30 |
| PGN+FCM | 64.00 | 70.61 | 36.74 | 30.88 | 68.66 | 33.42 | 55.92 | 46.67 | 71.99 | 27.54 |
| DeepLab [8] | 59.46 | 67.54 | 32.62 | 25.49 | 65.78 | 31.94 | 55.43 | 39.80 | 70.45 | 24.70 |
| DeepLab+FCM | 65.70 | 71.32 | 37.96 | 33.37 | 68.26 | 33.74 | 54.96 | 47.79 | 72.58 | 28.43 |
| CE2P [11] | 65.29 | 72.54 | 39.09 | 32.73 | 69.46 | 32.52 | 56.28 | 49.67 | 74.11 | 27.23 |
| CE2P+FCM | 66.31 | 73.58 | 40.21 | 34.03 | 70.69 | 33.27 | 55.63 | 50.62 | 75.32 | 29.83 |

| Method | Scarf | Skirt | Face | l-arm | r-arm | l-leg | r-leg | l-sh | r-sh | bkg | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SegNet [35] | 0.00 | 0.00 | 52.38 | 15.30 | 24.23 | 13.82 | 13.17 | 9.26 | 6.47 | 70.62 | 18.17 |
| FCN-8s [4] | 0.00 | 2.16 | 62.65 | 29.78 | 36.63 | 28.12 | 26.05 | 17.76 | 17.70 | 78.02 | 28.29 |
| Attention [36] | 12.84 | 20.41 | 70.58 | 50.17 | 54.03 | 38.35 | 37.70 | 26.20 | 27.09 | 84.00 | 42.92 |
| SSL [25] | 14.92 | 24.32 | 71.01 | 52.64 | 55.79 | 40.23 | 38.80 | 28.08 | 29.03 | 84.56 | 44.73 |
| MuLA [9]$^{\rm~c)}$ | – | – | – | – | – | – | – | – | – | – | 49.30 |
| MMAN [7] | 9.65 | 23.20 | 69.54 | 55.30 | 58.13 | 51.90 | 52.17 | 38.58 | 39.05 | 85.75 | 46.81 |
| MMAN+FCM | 10.63 | 20.41 | 72.56 | 58.02 | 60.75 | 52.13 | 51.61 | 39.25 | 39.80 | 85.28 | 47.84 |
| PGN [10]$^{\rm~d)}$ | 16.05 | 24.79 | 73.74 | 59.33 | 60.78 | 47.47 | 46.62 | 32.74 | 33.75 | 85.67 | 48.51 |
| PGN+FCM | 21.60 | 24.42 | 73.49 | 61.76 | 63.14 | 52.13 | 50.93 | 40.00 | 40.45 | 86.58 | 51.05 |
| DeepLab [8] | 15.51 | 28.13 | 70.53 | 55.76 | 58.56 | 48.99 | 49.49 | 36.76 | 36.79 | 85.49 | 47.91 |
| DeepLab+FCM | 23.53 | 26.16 | 74.38 | 60.11 | 62.71 | 50.01 | 49.46 | 38.41 | 38.90 | 86.77 | 51.23 |
| CE2P [11] | 14.19 | 22.51 | 75.50 | 65.14 | 66.59 | 60.10 | 58.59 | 46.63 | 46.12 | 87.67 | 53.10 |
| CE2P+FCM | 20.66 | 21.93 | 76.32 | 67.73 | 68.17 | 61.09 | 58.27 | 47.52 | 47.38 | 88.56 | 54.36 |

• Table 2   Quantitative evaluation (IoU (%) for each class and mIoU (%)) on the PASCAL-Person-Part test set$^{\rm~a)b)}$

| Method | Head | Torso | u-arm | l-arm | u-leg | l-leg | bkg | mIoU |
|---|---|---|---|---|---|---|---|---|
| Attention [36] | 81.47 | 59.06 | 44.15 | 42.50 | 38.28 | 35.62 | 93.65 | 56.39 |
| SSL [25] | 83.26 | 62.40 | 47.80 | 45.58 | 42.32 | 39.48 | 94.68 | 59.36 |
| MMAN [7] | 82.58 | 62.83 | 48.49 | 47.37 | 42.80 | 40.40 | 94.92 | 59.91 |
| Structure-evolving LSTM [27] | 82.89 | 67.15 | 51.42 | 48.72 | 51.72 | 45.91 | 97.18 | 63.57 |
| DeepLab-ASPP [8] | – | – | – | – | – | – | – | 64.94 |
| MuLA [9]$^{\rm~c)}$ | – | – | – | – | – | – | – | 65.10 |
| PCNet [18] | 86.81 | 69.06 | 55.35 | 55.27 | 50.21 | 48.54 | 96.07 | 65.90 |
| DeepLab [8] | 85.67 | 67.12 | 54.00 | 54.41 | 47.06 | 43.63 | 95.16 | 63.86 |
| DeepLab+FCM | 86.42 | 70.37 | 59.57 | 59.35 | 51.22 | 47.12 | 95.33 | 67.05 |
| PGN [10] | 90.89 | 75.12 | 55.83 | 64.61 | 55.42 | 41.57 | 95.33 | 68.40 |
| PGN+FCM | 91.16 | 76.45 | 57.77 | 66.23 | 56.58 | 43.21 | 95.41 | 69.54 |


• Table 3   Quantitative evaluation (IoU (%) for each class and mIoU (%)) on the CIHP validation set$^{\rm~a)b)c)}$

| Method | Hat | Hair | Glov | Sung | Clot | Dress | Coat | Sock | Pant | Suit |
|---|---|---|---|---|---|---|---|---|---|---|
| DeepLab [8] | 65.70 | 78.89 | 22.33 | 50.81 | 64.18 | 52.28 | 62.57 | 30.95 | 70.60 | 70.23 |
| DeepLab+FCM | 70.11 | 81.71 | 25.04 | 56.92 | 68.95 | 56.76 | 65.61 | 33.32 | 73.75 | 73.99 |
| PGN [10] | 68.78 | 78.61 | 23.07 | 47.14 | 67.42 | 55.11 | 63.97 | 26.68 | 72.16 | 71.56 |
| PGN+FCM | 65.41 | 79.04 | 30.57 | 54.19 | 67.16 | 54.36 | 63.72 | 30.74 | 73.04 | 71.93 |

| Method | Scarf | Skirt | Face | l-arm | r-arm | l-leg | r-leg | l-sh | r-sh | bkg | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepLab [8] | 28.42 | 37.53 | 86.60 | 24.69 | 31.88 | 25.91 | 22.66 | 19.15 | 18.56 | 92.01 | 47.80 |
| DeepLab+FCM | 30.64 | 37.98 | 88.57 | 29.98 | 34.06 | 30.73 | 21.13 | 21.80 | 20.13 | 93.82 | 50.75 |
| PGN [10] | 29.55 | 38.56 | 86.54 | 67.88 | 68.94 | 54.21 | 54.62 | 38.60 | 38.74 | 93.21 | 57.27 |
| PGN+FCM | 33.41 | 37.74 | 86.86 | 68.20 | 69.19 | 55.75 | 56.07 | 40.00 | 40.66 | 93.37 | 58.57 |

• Table 4   Analysis of inference time (s) on the LIP dataset

| Backbone | Inference time (s) | +FCM (s) | Increase (s) |
|---|---|---|---|
| MMAN [7] | 0.048 | 0.051 | $\uparrow$0.003 |
| PGN [10] | 0.231 | 0.243 | $\uparrow$0.012 |
| DeepLab [8] | 0.021 | 0.025 | $\uparrow$0.004 |

• Table 5   Cross-dataset evaluation (IoU (%) for each class and mIoU (%)) on the PPSS test set using the model trained on the LIP dataset$^{\rm~a)b)}$

| Method | Hair | Face | u-c | Arms | l-c | Legs | bkg | mIoU |
|---|---|---|---|---|---|---|---|---|
| DL [33] | 22.0 | 29.1 | 57.3 | 10.6 | 46.1 | 12.9 | 68.6 | 35.2 |
| DDN [33] | 35.5 | 44.1 | 68.4 | 17.0 | 61.7 | 23.8 | 80.0 | 47.2 |
| ASN [38] | 51.7 | 51.0 | 65.9 | 29.5 | 52.8 | 20.3 | 83.8 | 50.7 |
| MMAN [7] | 53.1 | 50.2 | 69.0 | 29.4 | 55.9 | 21.4 | 85.7 | 52.1 |
| MMAN+FCM | 60.0 | 70.7 | 75.5 | 62.6 | 43.0 | 42.7 | 94.4 | 64.1 |
| PGN [10] | 55.5 | 62.4 | 70.3 | 56.3 | 29.3 | 24.4 | 97.9 | 56.6 |
| PGN+FCM | 62.0 | 67.4 | 74.0 | 64.3 | 39.2 | 35.1 | 96.8 | 62.7 |
| DeepLab [8] | 65.8 | 59.5 | 84.5 | 76.3 | 35.0 | 25.6 | 90.4 | 62.4 |
| DeepLab+FCM | 64.1 | 72.6 | 80.6 | 67.3 | 48.7 | 39.1 | 94.8 | 66.7 |


• Table 6   Ablation study on the LIP dataset$^{\rm~a)}$

| Method | mIoU (%) |
|---|---|
| DeepLab | 47.91 |
| DeepLab + extra resblocks | 48.07 |
| DeepLab + CRF [22] | 48.53 |
| DeepLab + Non-local [29] | 49.48 |
| DeepLab + FCM ($\lambda_b=0$) | 50.76 |
| DeepLab + FCM | 51.23 |
| DeepLab + FCM $^{\rm~b)}$ | 51.65 |

• Table 7   Quantitative analysis of the contextual region incorporated for the parsing performance on the LIP dataset$^{\rm~a)b)}$

| $M$ | $K$ | Contextual region | mIoU (%) |
|---|---|---|---|
| 1 | 4 | 4 | 49.13 |
| 1 | 8 | 8 | 50.04 |
| 2 | 4 | 12 | 51.23 |
| 2 | 8 | 24 | 49.95 |
| 3 | 4 | 24 | 50.16 |
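
The tables above report per-class IoU and mIoU. As a hedged illustration of how such numbers are commonly obtained, the sketch below computes per-class IoU from a confusion matrix and averages it over classes. Accumulating one confusion matrix and skipping classes that never occur are assumptions here, not the paper's stated protocol, and the function names are hypothetical. The averaging step itself is consistent with the tables: the 20 per-class values in the DeepLab+FCM row of Table 1 average to 51.23.

```python
import numpy as np


def confusion_matrix(gt, pred, num_classes):
    """Count pixels per (ground-truth class, predicted class) pair."""
    idx = gt.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def iou_per_class(conf):
    """IoU_c = TP_c / (TP_c + FP_c + FN_c); NaN for classes that never occur."""
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    return np.divide(tp, union, out=np.full_like(tp, np.nan), where=union > 0)


def mean_iou(conf):
    """Average the per-class IoU values, ignoring absent classes."""
    return float(np.nanmean(iou_per_class(conf)))


# Toy 2x3 label maps with three classes (hypothetical data).
gt = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
conf = confusion_matrix(gt, pred, num_classes=3)
print(iou_per_class(conf), mean_iou(conf))
```

Since mIoU weights every class equally, small and frequently confused classes such as gloves or scarves influence the score as much as the background does.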



Copyright 2020 Science China Press Co., Ltd. All rights reserved.