SCIENCE CHINA Information Sciences, Volume 63, Issue 2: 120106 (2020) https://doi.org/10.1007/s11432-019-2738-y

## Preserving details in semantics-aware context for scene parsing

• Accepted Dec 24, 2019
• Published Jan 15, 2020

### Abstract

Scene parsing (also known as semantic segmentation) has achieved great success with the pipeline of fully convolutional networks (FCNs). Nevertheless, many segmentation failures are caused by large similarities between local appearances. To alleviate the problem, most existing methods attempt to improve the global view of FCNs by introducing different contextual modules. Although the reconstructed high-resolution output of these methods is semantically rich, it cannot faithfully recover fine image details because it lacks the required precise low-level information. To overcome this problem, we propose to improve the spatial decoding process by embedding possibly lost low-level information in a principled way. To this end, we make the following three contributions. First, we propose a semantics conformity module that makes low-level features agnostic to variations. Second, we introduce semantics into the conformed low-level features through guidance from semantically aware features. Finally, we make the various available contextual features accessible at feature fusion to enrich context information. The proposed approach demonstrates competitive performance on the challenging PASCAL VOC 2012, Cityscapes, and ADE20K benchmarks in comparison with state-of-the-art methods.

### Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant No. 61632018) and Science and Technology Innovation 2030: the Key Project of Next Generation of Artificial Intelligence (Grant No. 2018AAA01028).

### References

[1] Chen S T, Jian Z Q, Huang Y H. Autonomous driving: cognitive construction and situation understanding. Sci China Inf Sci, 2019, 62: 81101

[2] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 3431–3440

[3] Lin G, Shen C, van den Hengel A, et al. Efficient piecewise training of deep structured models for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 3194–3203

[4] Eigen D, Fergus R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 2650–2658

[5] Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, 2015. 234–241

[6] Shah S, Ghosh P, Davis L-S, et al. Stacked U-Nets: a no-frills approach to natural image segmentation. 2018. arXiv preprint

[7] Zhou Q, Wang Y, Liu J. An open-source project for real-time image semantic segmentation. Sci China Inf Sci, 2019, 62: 227101

[8] Huang T, Xu Y, Bai S. Feature context learning for human parsing. Sci China Inf Sci, 2019, 62: 220101

[9] Chen L-C, Papandreou G, Kokkinos I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs. 2014. arXiv preprint

[10] Chen L-C, Papandreou G, Kokkinos I. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell, 2018, 40: 834–848

[11] Schwing A-G, Urtasun R. Fully connected deep structured networks. 2015. arXiv preprint

[12] Zheng S, Jayasumana S, Romera-Paredes B, et al. Conditional random fields as recurrent neural networks. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 1529–1537

[13] Sun H Q, Pang Y W. GlanceNets - efficient convolutional neural networks with adaptive hard example mining. Sci China Inf Sci, 2018, 61: 109101

[14] Liu W, Rabinovich A, Berg A-C. ParseNet: looking wider to see better. 2015. arXiv preprint

[15] Zhao H, Shi J, Qi X, et al. Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 2881–2890

[16] Fu J, Liu J, Tian H, et al. Dual attention network for scene segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, 2019. 3146–3154

[17] Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions. 2015. arXiv preprint

[18] Chen L-C, Papandreou G, Schroff F, et al. Rethinking atrous convolution for semantic image segmentation. 2017. arXiv preprint

[19] Yang M, Yu K, Zhang C, et al. DenseASPP for semantic segmentation in street scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 3684–3692

[20] Chen Y, Rohrbach M, Yan Z, et al. Graph-based global reasoning networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, 2019. 433–442

[21] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 770–778

[22] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 2014. arXiv preprint

[23] Chen J, Lian Z H, Wang Y Z. Irregular scene text detection via attention guided border labeling. Sci China Inf Sci, 2019, 62: 220103

[24] Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 1520–1528

[25] Jégou S, Drozdzal M, Vazquez D, et al. The one hundred layers tiramisu: fully convolutional densenets for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, 2017. 11–19

[26] Ghiasi G, Fowlkes C-C. Laplacian pyramid reconstruction and refinement for semantic segmentation. In: Proceedings of European Conference on Computer Vision, Amsterdam, 2016. 519–534

[27] Newell A, Yang K, Deng J. Stacked hourglass networks for human pose estimation. In: Proceedings of European Conference on Computer Vision, Amsterdam, 2016. 483–499

[28] Liu N, Han J, Yang M-H. PiCANet: learning pixel-wise contextual attention for saliency detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 3089–3098

[29] Shrivastava A, Sukthankar R, Malik J, et al. Beyond skip connections: top-down modulation for object detection. 2016. arXiv preprint

[30] Lin T-Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 2117–2125

[31] Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 7132–7141

[32] Zhang H, Dana K, Shi J, et al. Context encoding for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 7151–7160

[33] Zhu Z, Xu M, Bai S, et al. Asymmetric non-local neural networks for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, Seoul, 2019. 593–602

[34] Zhou Q, Zheng B, Zhu W. Multi-scale context for scene labeling via flexible segmentation graph. Pattern Recogn, 2016, 59: 312–324

[35] Zhou Q, Yang W, Gao G. Multi-scale deep context convolutional neural networks for semantic segmentation. World Wide Web, 2019, 22: 555–570

[36] Dai J, Qi H, Xiong Y, et al. Deformable convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, Venice, 2017. 764–773

[37] Huang G, Liu Z, van der Maaten L, et al. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 4700–4708

[38] Everingham M, Eslami S M A, Van Gool L. The Pascal visual object classes challenge: a retrospective. Int J Comput Vis, 2015, 111: 98–136

[39] Hariharan B, Arbeláez P, Bourdev L, et al. Semantic contours from inverse detectors. In: Proceedings of the IEEE International Conference on Computer Vision, Barcelona, 2011. 991–998

[40] Cordts M, Omran M, Ramos S, et al. The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 3213–3223

[41] Zhou B, Zhao H, Puig X, et al. Scene parsing through ADE20K dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 633–641

[42] Chen L-C, Zhu Y, Papandreou G, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 801–818

[43] Liu Z, Li X, Luo P, et al. Semantic image segmentation via deep parsing network. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 1377–1385

[44] Wu Z, Shen C, van den Hengel A. Wider or deeper: revisiting the ResNet model for visual recognition. Pattern Recogn, 2019, 90: 119–133

[45] Lin G, Milan A, Shen C, et al. RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 1925–1934

[46] Peng C, Zhang X, Yu G, et al. Large kernel matters: improve semantic segmentation by global convolutional network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 4353–4361

[47] Wang P, Chen P, Yuan Y, et al. Understanding convolution for semantic segmentation. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, 2018. 1451–1460

[48] Ke T-W, Hwang J-J, Liu Z, et al. Adaptive affinity fields for semantic segmentation. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 587–602

[49] Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 2481–2495

[50] Zhou B, Zhao H, Puig X. Semantic understanding of scenes through the ADE20K dataset. Int J Comput Vis, 2019, 127: 302–321

• Figure 1

(Color online) First column: original image; second column: baseline predictions; last column: predictions from the proposed approach. The baseline fails to recover many details inside objects or around object boundaries (marked with red boxes), whereas the proposed approach segments them convincingly, for instance, the fine spokes of the wheel in the first image, the leg of the rider in the second image, and the right wing boundaries of the plane in the last image.

• Figure 2

(Color online) The network architecture of our overall framework. We propose semantic conformity to adjust to local deformations in the possibly messy high-resolution representation, and design attention gating to compensate semantics in high-resolution features through guidance, for enriching spatial details in context-aware, semantically richer, low-resolution feature maps. Finally, we introduce hierarchical cues fusion along the proposed spatial decoding to enrich contextual information after the fusion of (adjusted and compensated) low-level features. We display the impact of SCM, SCM+AG, and SCM+AG+HCF on feature maps in Figure 2(b). SCM reduces noise by adjusting local features, AG on top of it improves the semantics, for example by enhancing car boundaries, and finally HCF further enhances the foreground (cars) while suppressing the background through the fusion of complementary contextual features. (a) Overview of our method; (b) the effect of the different modules.

• Figure 3

(Color online) Semantics conformity module. It adjusts local variations caused by various geometric deformations via a bottleneck block featuring deformable convolution, thereby preparing the low-level features before semantics are introduced.
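
For context, deformable convolution is usually written as below. This is the standard formulation from Dai et al. [36], which the module presumably follows; it is not notation taken from this paper, and the symbols $\mathcal{R}$, $p_n$, and $\Delta p_n$ are introduced here only for illustration.

$$
y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n + \Delta p_n), \qquad \mathcal{R} = \{(-1,-1), (-1,0), \ldots, (1,1)\}
$$

Here $x$ is the input feature map, $w$ holds the kernel weights, $\mathcal{R}$ is the regular $3\times3$ sampling grid, and $\Delta p_n$ is an offset predicted for each sampling location (generally fractional, so $x$ is read with bilinear interpolation). Setting $\Delta p_n = 0$ recovers the standard $3\times3$ convolution, and stretching the grid to $\{(-2,-2), (-2,0), \ldots, (2,2)\}$ with $\Delta p_n = 0$ gives the dilated convolution with rate 2.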

• Figure 4

(Color online) Semantics enhancement via attention gating. We compensate semantics in high-resolution features through guidance, for enriching spatial details in context-aware, semantically richer, low-resolution feature maps.
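
The general idea of gating high-resolution features with guidance from low-resolution semantic features can be sketched as follows. This is a generic, single-channel illustration and not the paper's AG design: the sigmoid gate, the nearest-neighbour upsampling, and the names `attention_gate` and `upsample_nearest` are all assumptions made for the example.

```python
import math
from typing import List

FeatureMap = List[List[float]]  # toy single-channel H x W feature map


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def upsample_nearest(fmap: FeatureMap, factor: int) -> FeatureMap:
    """Repeat every value `factor` times along both axes."""
    return [
        [v for v in row for _ in range(factor)]
        for row in fmap
        for _ in range(factor)
    ]


def attention_gate(low_level: FeatureMap, high_level: FeatureMap, factor: int) -> FeatureMap:
    """Scale high-resolution (low-level) features by a gate in (0, 1).

    Assumption: the gate is a sigmoid of the upsampled low-resolution,
    semantically rich features. The actual module may differ.
    """
    gate = [[sigmoid(v) for v in row] for row in upsample_nearest(high_level, factor)]
    if len(gate) != len(low_level) or len(gate[0]) != len(low_level[0]):
        raise ValueError("upsampled guidance must match the low-level feature size")
    return [
        [feat * g for feat, g in zip(feat_row, gate_row)]
        for feat_row, gate_row in zip(low_level, gate)
    ]


# A 2x2 detail map gated by a 1x1 semantic map: a guidance value of 0.0
# yields a gate of 0.5 everywhere.
print(attention_gate([[1.0, 2.0], [3.0, 4.0]], [[0.0]], factor=2))
```

Under this assumption, locations that the semantic features score highly pass their low-level detail almost unchanged, while low-scoring locations are suppressed.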

• Figure 5

(Color online) Hierarchical cue fusion. We aim to enrich the context information while fusing the contradictory (adjusted and compensated) low-level features by exploiting all possible previous contextual features.

• Figure 6

(Color online) Contribution of each proposed module towards final segmentation. Two different examples are shown. (a) Image; (b) baseline; (c) SCM; (d) SCM+AG; (e) all.

• Figure 7

(Color online) Qualitative comparison between the baseline and the proposed approach on the PASCAL VOC 2012 validation set. The baseline almost misses the bird's legs, whereas our approach recovers them finely. (a) Image; (b) baseline; (c) ours; (d) groundtruth.

• Figure 8

(Color online) Example segmentation maps from our approach on the Cityscapes dataset. The proposed approach distinguishes objects appearing at various scales well while preserving details along their boundaries. Example objects are people and cars.

• Figure 9

(Color online) Example segmentation maps from the proposed approach on the ADE20K validation set. Our approach accurately preserves quite delicate boundary details around variably sized objects, for instance, in the segmentation output for the intermingled leaves of the tree in the first example image. (a) Image; (b) baseline; (c) ours; (d) groundtruth.

• Table 1   Performance of SCM with different convolutional variants. The deformable convolution adjusts to local variations better than the others.

| SCM | mIoU (%) |
|---|---|
| Standard $3\times3$ convolution | 77.40 |
| Dilated $3\times3$ convolution (rate = 2) | 77.61 |
| Deformable $3\times3$ convolution | 77.76 |

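
All tables report mean intersection over union (mIoU). As a reminder of how this metric is conventionally computed, here is a minimal sketch from a confusion matrix. The function name `mean_iou` and the toy matrix are hypothetical, and benchmark-specific rules (such as ignored labels) are not modelled.

```python
from typing import List


def mean_iou(confusion: List[List[int]]) -> float:
    """Return mIoU in percent.

    confusion[i][j] counts pixels whose ground-truth class is i and whose
    predicted class is j.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # class c pixels predicted as something else
        fp = sum(confusion[r][c] for r in range(n)) - tp  # other pixels predicted as class c
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    if not ious:
        raise ValueError("confusion matrix contains no pixels")
    return 100.0 * sum(ious) / len(ious)


# Toy two-class example: each class has 3 correct pixels, 1 miss, 1 false alarm.
print(mean_iou([[3, 1], [1, 3]]))
```

Because every class contributes equally to the mean, small or thin categories weigh as much as large ones, which is why recovering fine details can move this metric.
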
• Table 2   Ablation study on the PASCAL VOC 2012 validation set$^{\rm a)}$. We observe that AG benefits from SCM, which adapts to local feature variations, and thus shows improvement when the two are combined. Further, HCF boosts performance both in isolation and together with the other components.

| Method | SCM | AG | HCF | mIoU (%) |
|---|---|---|---|---|
| DeepLabv3 [18] | – | – | – | 77.21 |
| Ours | ✓ | – | – | 77.76 |
| Ours | – | ✓ | – | 77.55 |
| Ours | – | – | ✓ | 77.91 |
| Ours | ✓ | ✓ | – | 78.00 |
| Ours | ✓ | – | ✓ | 78.11 |
| Ours | ✓ | ✓ | ✓ | 78.46 |

a) Abbreviations: SCM is the semantics conformity module, AG is the semantics enhancement via attention gating component, and HCF is the hierarchical cues fusion process.

• Table 3   Performance comparison of different inference strategies on the PASCAL VOC 2012 validation set$^{\rm a)}$. Each inference strategy boosts performance by a noticeable margin; the best performance of $80.46\%$ is obtained with MS and Flip inputs at output stride = 8.

| Method | OS = 16 | OS = 8 | MS | Flip | mIoU (%) |
|---|---|---|---|---|---|
| Ours | ✓ | – | – | – | 78.46 |
| Ours | ✓ | – | ✓ | – | 79.65 |
| Ours | ✓ | – | ✓ | ✓ | 80.09 |
| Ours | – | ✓ | – | – | 78.76 |
| Ours | – | ✓ | ✓ | – | 79.97 |
| Ours | – | ✓ | ✓ | ✓ | 80.46 |

a) OS: output stride. MS: multi-scale inputs. Flip: adding left-right flipped inputs.
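
The MS and Flip strategies in the footnote are commonly implemented as test-time averaging over rescaled and mirrored copies of the input. The sketch below illustrates that idea on toy single-channel maps. The function names, the nearest-neighbour resizing, and the scale values are assumptions made for the example; the source does not state which scales or interpolation were used.

```python
from typing import Callable, List, Sequence

Grid = List[List[float]]  # toy H x W map standing in for an image or a score map


def hflip(grid: Grid) -> Grid:
    """Mirror a map left-to-right."""
    return [row[::-1] for row in grid]


def resize_nearest(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Nearest-neighbour resize (a real pipeline would likely use bilinear)."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def ms_flip_inference(
    image: Grid,
    predict: Callable[[Grid], Grid],
    scales: Sequence[float] = (1.0,),
    use_flip: bool = True,
) -> Grid:
    """Average predictions over scaled (MS) and mirrored (Flip) inputs."""
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    n_views = 0
    for s in scales:
        sh, sw = max(1, round(h * s)), max(1, round(w * s))
        scaled = resize_nearest(image, sh, sw)
        views = [(scaled, False)]
        if use_flip:
            views.append((hflip(scaled), True))
        for view, flipped in views:
            scores = predict(view)
            if flipped:
                scores = hflip(scores)  # undo the flip so pixels line up again
            scores = resize_nearest(scores, h, w)  # back to the input size
            for y in range(h):
                for x in range(w):
                    total[y][x] += scores[y][x]
            n_views += 1
    return [[v / n_views for v in row] for row in total]


# Hypothetical stand-in for a network: scores grow with the column index.
def toy_predict(view: Grid) -> Grid:
    return [[float(x) for x in range(len(view[0]))] for _ in view]


print(ms_flip_inference([[0.0, 1.0], [2.0, 3.0]], toy_predict))
```

With the toy predictor, the flipped view cancels the left-to-right bias of the unflipped one, which is the kind of smoothing that averaging over views provides.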

• Table 4   Per-class comparison of the proposed approach with the state of the art on the PASCAL VOC 2012 test set

| Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow |
|---|---|---|---|---|---|---|---|---|---|---|
| FCN [2] | 76.8 | 34.2 | 68.9 | 49.4 | 60.3 | 75.3 | 74.7 | 77.6 | 21.4 | 62.5 |
| FSG [34] | – | – | – | – | – | – | – | – | – | – |
| DeepLab [10] | 84.4 | 54.5 | 81.5 | 63.6 | 65.9 | 85.1 | 79.1 | 83.4 | 30.7 | 74.1 |
| CRF-RNN [12] | 87.5 | 39.0 | 79.7 | 64.2 | 68.3 | 87.6 | 80.8 | 84.4 | 30.4 | 78.2 |
| DeconvNet [24] | 89.9 | 39.3 | 79.7 | 63.9 | 68.2 | 87.4 | 81.2 | 86.1 | 28.5 | 77.0 |
| DPN [43] | 87.7 | 59.4 | 78.4 | 64.9 | 70.3 | 89.3 | 83.5 | 86.1 | 31.7 | 79.9 |
| Piecewise [3] | 90.6 | 37.6 | 80.0 | 67.8 | 74.4 | 92.0 | 85.2 | 86.2 | 39.1 | 81.2 |
| MDCCNet [35] | 87.6 | 43.7 | 85.3 | 72.3 | 83.0 | 91.7 | 86.5 | 89.9 | 43.8 | 80.5 |
| ResNet38 [44] | 94.4 | 72.9 | 94.9 | 68.8 | 78.4 | 90.6 | 90.0 | 92.1 | 40.1 | 90.4 |
| PSPNet [15] | 91.8 | 71.9 | 94.7 | 71.2 | 75.8 | 95.2 | 89.9 | 95.9 | 39.3 | 90.7 |
| Ours | 95.4 | 72.7 | 93.2 | 69.6 | 77.1 | 95.3 | 91.5 | 94.9 | 40.2 | 87.6 |

| Method | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | mIoU (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FCN [2] | 46.8 | 71.8 | 63.9 | 76.5 | 73.9 | 45.2 | 72.4 | 37.4 | 70.9 | 55.1 | 62.2 |
| FSG [34] | – | – | – | – | – | – | – | – | – | – | 64.4 |
| DeepLab [10] | 59.8 | 79.0 | 76.1 | 83.2 | 80.8 | 59.7 | 82.2 | 50.4 | 73.1 | 63.7 | 71.6 |
| CRF-RNN [12] | 60.4 | 80.5 | 77.8 | 83.1 | 80.6 | 59.5 | 82.8 | 47.8 | 78.3 | 67.1 | 72.0 |
| DeconvNet [24] | 62.0 | 79.0 | 80.3 | 83.6 | 80.2 | 58.8 | 83.4 | 54.3 | 80.7 | 65.0 | 72.5 |
| DPN [43] | 62.6 | 81.9 | 80.0 | 83.5 | 82.3 | 60.5 | 83.2 | 53.4 | 77.9 | 65.0 | 74.1 |
| Piecewise [3] | 58.9 | 83.8 | 83.9 | 84.3 | 84.8 | 62.1 | 83.2 | 58.2 | 80.8 | 72.3 | 75.3 |
| MDCCNet [35] | 50.6 | 84.2 | 79.7 | 81.0 | 86.6 | 61.5 | 85.7 | 55.6 | 86.3 | 74.8 | 75.5 |
| ResNet38 [44] | 71.7 | 89.9 | 93.7 | 91.0 | 89.1 | 71.3 | 90.7 | 61.3 | 87.7 | 78.1 | 82.5 |
| PSPNet [15] | 71.7 | 90.5 | 94.5 | 88.8 | 89.6 | 72.8 | 89.6 | 64.0 | 85.1 | 76.3 | 82.6 |
| Ours | 69.4 | 91.0 | 91.0 | 92.3 | 88.7 | 64.8 | 89.4 | 61.1 | 84.7 | 74.2 | 81.9 |

• Table 5   Experiments with different inference strategies on the Cityscapes validation set. All inference strategies yield performance gains. Notably, the MS input strategy provides the highest absolute gains, $1.17\%$ and $1.14\%$ at output stride = 16 and output stride = 8, respectively$^{\rm a)}$.

| Method | OS = 16 | OS = 8 | MS | Flip | mIoU (%) |
|---|---|---|---|---|---|
| Ours | ✓ | – | – | – | 77.90 |
| Ours | ✓ | – | ✓ | – | 79.07 |
| Ours | ✓ | – | ✓ | ✓ | 79.20 |
| Ours | – | ✓ | – | – | 78.14 |
| Ours | – | ✓ | ✓ | – | 79.28 |
| Ours | – | ✓ | ✓ | ✓ | 79.40 |

a) OS: output stride. MS: multi-scale inputs. Flip: adding left-right flipped inputs.
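
The gains quoted in the Table 5 caption can be reproduced from the table's own numbers. The snippet below is a small arithmetic check; the dictionary keys (`single`, `ms`, `ms_flip`) and the function name are labels invented for this example.

```python
# mIoU (%) values copied from Table 5 (Cityscapes validation set).
TABLE5 = {
    16: {"single": 77.90, "ms": 79.07, "ms_flip": 79.20},
    8: {"single": 78.14, "ms": 79.28, "ms_flip": 79.40},
}


def strategy_gains(row: dict) -> dict:
    """Absolute gain of each strategy over the previous row at the same output stride."""
    return {
        "ms": round(row["ms"] - row["single"], 2),
        "flip": round(row["ms_flip"] - row["ms"], 2),
    }


gains = {os_: strategy_gains(row) for os_, row in TABLE5.items()}
# Gain from switching the output stride from 16 to 8 with single-scale input.
os8_gain = round(TABLE5[8]["single"] - TABLE5[16]["single"], 2)

for os_, g in gains.items():
    print(f"OS = {os_}: MS +{g['ms']:.2f}, Flip +{g['flip']:.2f}")
print(f"OS 16 -> 8 (single scale): +{os8_gain:.2f}")
```

The MS gains come out at 1.17 and 1.14 points, matching the caption, while Flip and the smaller output stride each add well under half a point.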

• Table 6   Category-wise comparison of the proposed approach with the existing state of the art on the Cityscapes test set. Note that our result is obtained without using coarse annotations.

| Method | road | swalk | build | wall | fence | pole | tlight | sign | veg | terrain |
|---|---|---|---|---|---|---|---|---|---|---|
| CRF-RNN [12] | 96.3 | 73.9 | 88.2 | 47.6 | 41.3 | 35.2 | 49.5 | 59.7 | 90.6 | 66.1 |
| FCN [2] | 97.4 | 78.4 | 89.2 | 34.9 | 44.2 | 47.4 | 60.1 | 65.0 | 91.4 | 69.3 |
| Dilation10 [17] | 97.6 | 79.2 | 89.9 | 37.3 | 47.6 | 53.2 | 58.6 | 65.2 | 91.8 | 69.4 |
| DeepLab [10] | 97.9 | 81.3 | 90.3 | 48.8 | 47.4 | 49.6 | 57.9 | 67.3 | 91.9 | 69.4 |
| RefineNet [45] | 98.2 | 83.3 | 91.3 | 47.8 | 50.4 | 56.1 | 66.9 | 71.3 | 92.3 | 70.3 |
| GCN [46] | – | – | – | – | – | – | – | – | – | – |
| DUC [47] | 98.5 | 85.5 | 92.8 | 58.6 | 55.5 | 65.0 | 73.5 | 77.9 | 93.3 | 72.0 |
| PSPNet [15] | 98.6 | 86.2 | 92.9 | 50.8 | 58.8 | 64.0 | 75.6 | 79.0 | 93.4 | 72.3 |
| AAF [48] | 98.5 | 85.6 | 93.0 | 53.8 | 59.0 | 65.9 | 75.0 | 78.4 | 93.7 | 72.4 |
| Ours | 98.6 | 86.2 | 93.1 | 54.2 | 60.5 | 67.4 | 74.9 | 79.1 | 93.6 | 71.3 |

| Method | sky | person | rider | car | truck | bus | train | mbike | bike | mIoU (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| CRF-RNN [12] | 93.5 | 70.4 | 34.7 | 90.1 | 39.2 | 57.5 | 55.4 | 43.9 | 54.6 | 62.5 |
| FCN [2] | 93.9 | 77.1 | 51.4 | 92.6 | 35.3 | 48.6 | 46.5 | 51.6 | 66.8 | 65.3 |
| Dilation10 [17] | 93.7 | 78.9 | 55.0 | 93.3 | 45.5 | 53.4 | 47.7 | 52.2 | 66.0 | 67.1 |
| DeepLab [10] | 94.2 | 79.8 | 59.8 | 93.7 | 56.5 | 67.5 | 57.5 | 57.7 | 68.8 | 70.4 |
| RefineNet [45] | 94.8 | 80.9 | 63.3 | 94.5 | 64.6 | 76.1 | 64.3 | 62.2 | 70.0 | 73.6 |
| GCN [46] | – | – | – | – | – | – | – | – | – | 76.9 |
| DUC [47] | 95.2 | 84.8 | 68.5 | 95.4 | 70.9 | 78.8 | 68.7 | 65.9 | 73.8 | 77.6 |
| PSPNet [15] | 95.4 | 86.5 | 71.3 | 95.9 | 68.2 | 79.5 | 73.8 | 69.5 | 77.2 | 78.4 |
| AAF [48] | 95.6 | 86.4 | 70.5 | 95.9 | 73.9 | 82.7 | 76.9 | 68.7 | 76.4 | 79.1 |
| Ours | 95.8 | 86.7 | 71.0 | 96.0 | 73.0 | 87.8 | 85.9 | 68.8 | 76.4 | 80.0 |

• Table 7   Performance comparison of our method with the existing state-of-the-art approaches on the ADE20K validation set. Employing the same backbone network, our method surpasses PSPNet.

| Method | Backbone | mIoU (%) |
|---|---|---|
| FCN [2] | – | 29.39 |
| SegNet [49] | – | 21.64 |
| DilatedNet [17] | – | 32.31 |
| CascadeNet [50] | – | 34.90 |
| RefineNet [45] | ResNet152 | 40.70 |
| PSPNet [15] | ResNet101 | 43.29 |
| Ours | ResNet101 | 43.76 |

Copyright 2020  CHINA SCIENCE PUBLISHING & MEDIA LTD.  中国科技出版传媒股份有限公司  版权所有