SCIENCE CHINA Information Sciences, Volume 61, Issue 5: 051101(2018) https://doi.org/10.1007/s11432-017-9189-6

## Survey of recent progress in semantic image segmentation with CNNs

• Accepted Jul 20, 2017
• Published Nov 17, 2017

### Abstract

In recent years, convolutional neural networks (CNNs) have led the way in many computer vision tasks, such as image classification, object detection, and face recognition. To produce more refined semantic image segmentation, we survey powerful CNNs and novel, elaborate layers, structures, and strategies, in particular those that have achieved state-of-the-art results on the Pascal VOC 2012 semantic segmentation challenge. Moreover, we discuss their different working stages and the various mechanisms they use to exploit structural and contextual information in the image and feature spaces. Finally, drawing on popular reference methods for related problems, we propose several possible directions and approaches for incorporating existing effective methods as components to enhance CNNs for the segmentation of specific semantic objects.

### Acknowledgment

This work was supported by the National High-tech R&D Program of China (863 Program) (Grant No. 2015AA016403) and the National Natural Science Foundation of China (Grant Nos. 61572061, 61472020).

### References

[1] Liang G, Ca J, Liu X. Smart world: a better world. Sci China Inf Sci, 2016, 59: 043401.

[2] Wang J, Lu Y, Liu J. A robust three-stage approach to large-scale urban scene recognition. Sci China Inf Sci, 2017, 60: 103101.

[3] Wang W, Hu L, Hu Z. Energy-based multi-view piecewise planar stereo. Sci China Inf Sci, 2017, 60: 032101.

[4] Hoiem D, Efros A A, Hebert M. Recovering surface layout from an image. Int J Comput Vis, 2007, 75: 151-172.

[5] Saxena A, Sun M, Ng A Y. Make3D: learning 3D scene structure from a single still image. IEEE Trans Pattern Anal Mach Intell, 2009, 31: 824-840.

[6] Gould S, Fulton R, Koller D. Decomposing a scene into geometric and semantically consistent regions. In: Proceedings of the IEEE International Conference on Computer Vision, Kyoto, 2009. 1-8.

[7] Gupta A, Efros A A, Hebert M. Blocks world revisited: image understanding using qualitative geometry and mechanics. In: Proceedings of European Conference on Computer Vision, Crete, 2010. 482-496.

[8] Zhao Y B, Zhu S C. Image parsing via stochastic scene grammar. In: Proceedings of the Conference and Workshop on Neural Information Processing Systems, Granada, 2011. 73-81.

[9] Liu C, Yuen J, Torralba A. Nonparametric scene parsing via label transfer. IEEE Trans Pattern Anal Mach Intell, 2011, 33: 2368-2382.

[10] Yu S X, Zhang H, Malik J. Inferring spatial layout from a single image via depth-ordered grouping. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Anchorage, 2008.

[11] Lee D C, Hebert M, Kanade T. Geometric reasoning for single image structure recovery. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, 2009. 2136-2143.

[12] Zheng Y, Byeungwoo J, Xu D, et al. Image segmentation by generalized hierarchical fuzzy C-means algorithm. J Intell Fuzzy Syst, 2015, 28: 4024-4028.

[13] Liu C, Yuen J, Torralba A. SIFT flow: dense correspondence across scenes and its applications. IEEE Trans Pattern Anal Mach Intell, 2011, 33: 978-994.

[14] Papandreou G, Chen L C, Murphy K P, et al. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 1742-1750.

[15] Ghiasi G, Fowlkes C C. Laplacian pyramid reconstruction and refinement for semantic segmentation. In: Proceedings of European Conference on Computer Vision, Amsterdam, 2016. 519-534.

[16] Peng C, Zhang X Y, Yu G, et al. Large kernel matters -- improve semantic segmentation by global convolutional network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017. 4353-4361.

[17] Everingham M, Van Gool L, Williams C K I. The Pascal visual object classes (VOC) challenge. Int J Comput Vis, 2010, 88: 303-338.

[18] Zheng S, Jayasumana S, Romera-Paredes B, et al. Conditional random fields as recurrent neural networks. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 1529-1537.

[19] Lin G S, Shen C H, van den Hengel A, et al. Efficient piecewise training of deep structured models for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016. 3194-3203.

[20] Liu Z W, Li X X, Luo P, et al. Semantic image segmentation via deep parsing network. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 1377-1385.

[21] Lin G S, Shen C H, Reid I, et al. Deeply learning the messages in message passing inference. Comput Sci, 2015, 71: 866-872.

[22] Shuai B, Zuo Z, Wang B, et al. DAG-recurrent neural networks for scene labeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016. 3620-3629.

[23] Kuen J, Wang Z H, Wang G. Recurrent attentional networks for saliency detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016. 3668-3677.

[24] Liang X D, Shen X H, Xiang D L, et al. Semantic object parsing with local-global long short-term memory. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016. 3185-3193.

[25] Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 1520-1528.

[26] Yu F, Koltun V. Multi-scale context aggregation by dilated convolutions. In: Proceedings of International Conference on Learning Representations, San Juan, 2016.

[27] Chen L C, Papandreou G, Kokkinos I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv.

[28] Sermanet P, Fergus R, LeCun Y, et al. OverFeat: integrated recognition, localization and detection using convolutional networks. In: Proceedings of International Conference on Learning Representations, Banff, 2014.

[29] Zeiler M D, Fergus R. Visualizing and understanding convolutional networks. In: Proceedings of European Conference on Computer Vision, Zurich, 2014. 818-833.

[30] Krähenbühl P, Koltun V. Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Proceedings of Advances in Neural Information Processing Systems, Granada, 2011. 109-117.

[31] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: Proceedings of International Conference on Learning Representations, San Diego, 2015.

[32] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016. 770-778.

[33] Gao W, Zhou Z H. Dropout Rademacher complexity of deep neural networks. Sci China Inf Sci, 2016, 59: 072104.

[34] Wu Z F, Shen C H, van den Hengel A. High-performance semantic segmentation using very deep fully convolutional networks. arXiv.

[35] Hariharan B, Arbeláez P, Girshick R, et al. Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 447-456.

[36] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 3431-3440.

[37] Xie S N, Tu Z W. Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 1395-1403.

[38] Lin G S, Milan A, Shen C H, et al. RefineNet: multi-path refinement networks with identity mappings for high-resolution semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017. 1925-1934.

[39] Wu Z F, Shen C H, van den Hengel A. Wider or deeper: revisiting the ResNet model for visual recognition. arXiv.

[40] Hong S, Oh J, Lee H, et al. Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016. 3204-3212.

[41] Chen L C, Yang Y, Wang J, et al. Attention to scale: scale-aware semantic image segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016. 3640-3649.

[42] Liu S, Qi X J, Shi J P, et al. Multi-scale patch aggregation (MPA) for simultaneous detection and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016. 3141-3149.

[43] Bertasius G, Shi J, Torresani L. Semantic segmentation with boundary neural fields. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016. 3602-3610.

[44] Mostajabi M, Yadollahpour P, Shakhnarovich G. Feedforward semantic segmentation with zoom-out features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 3376-3385.

[45] Hong S, Noh H, Han B. Decoupled deep neural network for semi-supervised semantic segmentation. In: Proceedings of Advances in Neural Information Processing Systems, Montreal, 2015. 1495-1503.

[46] Arnab A, Jayasumana S, Zheng S, et al. Higher order conditional random fields in deep neural networks. In: Proceedings of European Conference on Computer Vision, Amsterdam, 2016. 524-540.

[47] Vemulapalli R, Tuzel O, Liu M Y, et al. Gaussian conditional random field network for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016. 3224-3233.

[48] Zhao H S, Shi J P, Qi X J, et al. Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, 2017. 2881-2890.

[49] Yang J, Price B, Cohen S, et al. Object contour detection with a fully convolutional encoder-decoder network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016. 193-202.

[50] Lee C Y, Xie S, Gallagher P, et al. Deeply-supervised nets. In: Proceedings of Artificial Intelligence and Statistics, San Diego, 2015. 562-570.

[51] Kokkinos I. Pushing the boundaries of boundary detection using deep learning. In: Proceedings of International Conference on Learning Representations, San Juan, 2016.

[52] Giusti A, Ciresan D C, Masci J, et al. Fast image scanning with deep max-pooling convolutional neural networks. In: Proceedings of the 20th IEEE International Conference on Image Processing (ICIP), Melbourne, 2013. 4034-4038.

[53] Sutton C, McCallum A. Piecewise training for undirected models. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence. Edinburgh: AUAI Press, 2005. 568-575.

[54] Adams A, Baek J, Davis M A. Fast high-dimensional filtering using the permutohedral lattice. Comput Graphics Forum, 2010, 29: 753-762.

[55] Dai J F, He K M, Sun J. BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, Santiago, 2015. 1635-1643.

[56] Rother C, Kolmogorov V, Blake A. "GrabCut". ACM Trans Graph, 2004, 23: 309-314.

[57] Uijlings J R R, van de Sande K E A, Gevers T. Selective search for object recognition. Int J Comput Vis, 2013, 104: 154-171.

[58] Arbeláez P, Pont-Tuset J, Barron J, et al. Multiscale combinatorial grouping. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 2014. 328-335.

[59] Krähenbühl P, Koltun V. Geodesic object proposals. In: Proceedings of European Conference on Computer Vision, Zurich, 2014. 725-739.

[60] Lin D, Dai J F, Jia J Y, et al. ScribbleSup: scribble-supervised convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016. 3159-3167.

[61] Romera-Paredes B, Torr P H S. Recurrent instance segmentation. In: Proceedings of European Conference on Computer Vision, Amsterdam, 2016. 312-329.

[62] Dai J F, He K M, Sun J. Instance-aware semantic segmentation via multi-task network cascades. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016. 3150-3158.

• Figure 1

(Color online) Examples from the segmentation subset [17].

• Figure 2

(Color online) Different semantic feature maps obtained from different VGG-16 layers. The feature maps have been resized for ease of observation.

• Figure 3

Distinct stages of applicability and different underlying mechanisms of the analyzed methods and structures.

• Figure 4

(Color online) Obvious misclassifications appear in column (b); they can be corrected with a CRF model. The results after CRF post-processing are shown in column (c).

• Figure 5

(Color online) Remarkable results produced by PSPNet [34].

• Figure 6

(Color online) Failure cases of PSPNet [34].

• Figure 7

(Color online) The general framework for image segmentation. We divide it into two components: a feature extractor and a segmentation generator. The green boxes represent vital data used throughout the entire framework; the pink boxes represent functions used to process different data, and those drawn with gray frames are optional structures. Different connections represent different paths.

• Table 1   PASCAL VOC 2012 Challenge Leaderboard (2015/10/2)

| Architecture | Mean | Architecture | Mean |
|---|---|---|---|
| Adelaide_Context_CNN_CRF_COCO [19] | 77.8 | DeepLab-CRF-COCO-LargeFOV [27] | 72.7 |
| CUHK_DPN_COCO [20] | 77.5 | POSTECH_EDeconvNet_CRF_VOC [25] | 72.5 |
| CentraleSuperBoundaries [31] | 75.7 | Oxford_TVG_CRF_RNN_VOC [18] | 72.0 |
| Adelaide_Context_CNN_CRF_VOC [19] | 75.3 | DeepLab-MSc-CRF_LargeFOV [27] | 71.6 |
| MSRA_BoxSup [32] | 75.2 | DeepLab-CRF-COCO-Strong [27] | 70.4 |
| POSTECH_DeconvNet_CRF_VOC [25] | 74.8 | DeepLab-CRF-LargeFOV [27] | 70.3 |
| Oxford_TVG_CRF_RNN_COCO [18] | 74.7 | TTI_zoomout_v2 [33] | 69.6 |
| DeepLab-MSc-CRF-LargeFOV-COCO-CrossJoint [27] | 73.9 | | |

• Table 2   PASCAL VOC 2012 Challenge Leaderboard (2017/2/1)

| Architecture | Mean | Architecture | Mean |
|---|---|---|---|
| PSPNet [34] | 85.4 | CentraleSupelec Deep G-CRF | 80.2 |
| ResNet-38_COCO [35] | 84.9 | CMT-FCN-ResNet-CRF | 80.0 |
| Multipath-RefineNet [36] | 84.2 | DeepLabv2-CRF [27] | 79.7 |
| ResNet-38_MS [35] | 83.1 | CASIA_SegResNet_CRF_COCO | 79.3 |
| R4D_MultiScale_CRF | 82.2 | LRR_4x_ResNet_COCO [15] | 79.3 |
| SegModel | 81.8 | Adelaide_VeryDeep_FCN_VOC | 79.1 |
| HikSeg_COCO | 81.4 | LRR_4x_COCO [15] | 78.7 |
| DP_ResNet_CRF | 81.0 | CASIA_IVA_OASeg | 78.3 |
| OBP-HJLCN | 80.4 | Oxford_TVG_HO_CRF [37] | 77.9 |

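The "Mean" columns in the leaderboards are assumed here to be mean intersection-over-union (IoU) scores in percent, the usual metric of the PASCAL VOC segmentation benchmark. The following is a minimal NumPy sketch of that metric; the function name `mean_iou` and the toy label maps are illustrative only, and the sketch omits benchmark details such as ignoring "void" pixels.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over the classes that occur in pred or target."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # Confusion matrix: rows index ground-truth classes, columns index predictions.
    conf = np.bincount(target * num_classes + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    valid = union > 0  # skip classes absent from both maps
    return float((intersection[valid] / union[valid]).mean())

# Toy 2x3 label maps with three classes (hypothetical data).
pred = np.array([[0, 0, 1],
                 [1, 1, 2]])
target = np.array([[0, 1, 1],
                   [1, 1, 2]])
score = mean_iou(pred, target, num_classes=3)
print(f"mean IoU = {100 * score:.1f}%")
```

Per-class IoU values here are 1/2, 3/4, and 1/1, so the mean is 0.75.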
• Table 3   Effects of various methods for combining multi-granularity features

| Implementation | Mean | Relative improvement |
|---|---|---|
| FCN-8s [43] | 62.2 | 0 |
| DeconvNet [25] | 69.6 | 7.4 |
| EDeconvNet [25] | 71.7 | 9.5 |
| DecoupledNet-Full [49] | 66.6 | 4.4 |
| DPN [20] | 74.1 | 11.9 |
| DPN_With_COCO [20] | 77.5 | 15.3 |
| Hypercolumn Sys1 [42] | 54.6 | -7.6 |
| Hypercolumn Sys2 [42] | 62.6 | 0.4 |
| Zoom-out [33] | 69.6 | 7.4 |
| LRR_4x_ResNet_COCO [15] | 79.3 | 17.1 |
| LRR_4x_COCO [15] | 78.7 | 16.5 |
| Multipath-RefineNet-Res152 [36] | 83.4 | 21.2 |
| DeepLab-CRF-COCO-LargeFOV-Attention [46] | 75.1 | 12.9 |
| DeepLab-CRF-COCO-LargeFOV-Attention+ [46] | 75.7 | 13.5 |
| TransferNet [45] | 51.2 | -11 |

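The relative improvements in Table 3 appear to be plain differences from the FCN-8s score of 62.2. A small sketch that recomputes a few of them under that assumption (the helper name `relative_improvement` is illustrative):

```python
# Mean scores copied from Table 3; FCN-8s is taken as the baseline.
BASELINE = "FCN-8s"
mean_scores = {
    "FCN-8s": 62.2,
    "DeconvNet": 69.6,
    "DPN": 74.1,
    "Hypercolumn Sys1": 54.6,
    "TransferNet": 51.2,
}

def relative_improvement(scores, baseline):
    """Return each score minus the baseline score, rounded to one decimal."""
    base = scores[baseline]
    return {name: round(value - base, 1) for name, value in scores.items()}

improvements = relative_improvement(mean_scores, BASELINE)
for name, delta in improvements.items():
    print(f"{name}: {mean_scores[name]}/{delta:+.1f}")
```

The recomputed values agree with the table entries, including the two negative ones.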
• Table 4   Transforming a CRF into an RNN

| Step of iteration | Implementation in the RNN |
|---|---|
| Initialization | Softmax layer |
| Message passing | Gaussian filters |
| Weighting of filter outputs | $1\times1$ convolutional layer |
| Compatibility transform | $1\times1$ convolutional layer |
| Addition of unary potentials | Elementwise layer |
| Normalization | Softmax layer |

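The six steps in Table 4 can be sketched as a mean-field loop in NumPy. This is a simplified illustration, not the implementation from [18]: it assumes a single spatial Gaussian filter (no appearance or bilateral kernel), a scalar filter weight in place of a learned $1\times1$ convolution, and a fixed Potts-style compatibility matrix. All function names and parameter values are hypothetical.

```python
import numpy as np

def softmax(x, axis=0):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gaussian_kernel1d(sigma, radius):
    t = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()

def spatial_gaussian_filter(q, sigma=1.0, radius=2):
    """Separable Gaussian blur of each class map; q has shape (labels, H, W).

    Assumes H and W are at least 2 * radius + 1.
    """
    k = gaussian_kernel1d(sigma, radius)
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, q)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 2, out)

def mean_field_step(unary, q, filter_weight, compatibility):
    msg = spatial_gaussian_filter(q)             # message passing
    weighted = filter_weight * msg               # weighting of filter outputs
    # Compatibility transform: mixes label channels at every pixel (a 1x1 convolution).
    pairwise = np.einsum("lk,khw->lhw", compatibility, weighted)
    logits = unary - pairwise                    # addition of unary potentials
    return softmax(logits, axis=0)               # normalization

def crf_as_rnn(unary, num_iterations=5, filter_weight=3.0):
    num_labels = unary.shape[0]
    compatibility = 1.0 - np.eye(num_labels)     # penalize disagreeing labels
    q = softmax(unary, axis=0)                   # initialization
    for _ in range(num_iterations):
        q = mean_field_step(unary, q, filter_weight, compatibility)
    return q

# Toy input: label 0 is favored everywhere except one noisy center pixel.
unary = np.zeros((2, 7, 7))
unary[0] = 2.0
unary[:, 3, 3] = [0.0, 0.5]
q = crf_as_rnn(unary)
print("center label before:", softmax(unary)[:, 3, 3].argmax())
print("center label after: ", q[:, 3, 3].argmax())
```

In this toy case the isolated center pixel starts with label 1 and is pulled to label 0 by its neighbors, which is the kind of smoothing the CRF is meant to provide.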
• Table 5   Underlying concepts of the summarized studies

| Concept | Related structures |
|---|---|
| Feature encoder | VGG [38], ResNet [39] |
| Upsampling of low-resolution features or score maps | Unpooling layers [25,43], Deconvolution layers [25,43], Reconstruction [15] |
| Reduction of the resolution loss | Atrous and dilated convolution [26,27], Removing pooling layers [14], Shifting input and interlacing output [28], Multi-pass method [41] |
| Enhancement of features | Hypercolumns [42], Attentional model [23,46], Zoom-out [33], Context_CNN_CRF [19], CentraleSuperBoundaries [31] |
| Selection of features | DecoupledNet [49] |
| Step-by-step refinement of intermediate segmentations | Skip-layer architecture [43], Cascade-like structure [36], DeconvNet [25], DSN [52] |
| Utilization of heterogeneous annotations | BoxSup [32], DecoupledNet [49], Weakly and semi-supervised learning [14] |
| Explicit propagation of context | DAG [22], LG-LSTM [24], PSPNet [34] |
| Learning of potentials | Context_CNN_CRF [19], GCRF [50], High-order potential CRF [21,37], DPN [20] |
| Solving of CRFs | CRF-RNN [18], DPN [20], Adelaide_Learning_Messages [21] |

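Among the structures listed in Table 5 for reducing resolution loss, atrous (dilated) convolution [26,27] is easy to illustrate in one dimension: the kernel taps are spread apart, so the receptive field grows without any pooling or striding. The sketch below is a hypothetical NumPy helper, not code from the cited works.

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D cross-correlation with (dilation - 1) skipped samples between taps."""
    signal = np.asarray(signal, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    span = (len(kernel) - 1) * dilation + 1  # receptive field of one output sample
    n_out = len(signal) - span + 1
    if n_out <= 0:
        raise ValueError("signal is shorter than the dilated kernel span")
    out = np.zeros(n_out)
    for tap, weight in enumerate(kernel):
        start = tap * dilation
        out += weight * signal[start:start + n_out]
    return out

x = np.arange(10.0)
kernel = [1.0, 1.0, 1.0]
dense = dilated_conv1d(x, kernel, dilation=1)   # each output sees 3 adjacent samples
atrous = dilated_conv1d(x, kernel, dilation=2)  # each output sees 5 samples, same 3 weights
print(dense[:2], atrous[:2])
```

With the same three weights, the dilated version covers a span of five input samples per output while still producing one output per input position (up to border effects).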

