SCIENCE CHINA Information Sciences, Volume 63, Issue 4: 140305 (2020) https://doi.org/10.1007/s11432-019-2791-7

Hybrid first and second order attention Unet for building segmentation in remote sensing images

  • Received: Nov 1, 2019
  • Accepted: Feb 11, 2020
  • Published: Mar 9, 2020


Recently, building segmentation (BS) has drawn significant attention in remote sensing applications. Convolutional neural networks (CNNs) have become the mainstream approach in this field owing to their powerful representation ability. However, because building appearance varies widely, designing an effective CNN architecture for BS remains challenging. Most CNN-based BS methods focus on deeper or wider network architectures and neglect the correlation among intermediate features. To address this problem, we propose a hybrid first and second order attention network (HFSA) that explores both the global mean and the inner product among different channels to adaptively rescale intermediate features. As a result, the HFSA makes full use of first order feature statistics and also incorporates second order feature statistics, which leads to more representative features. We conduct comprehensive experiments on three widely used aerial building segmentation data sets and one satellite building segmentation data set. The experimental results show that the proposed model achieves better segmentation performance than state-of-the-art models, both quantitatively and qualitatively.
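The rescaling idea described above can be illustrated with a minimal sketch. It is not the HFSA architecture itself: the learned layers and the decoder gating feature are omitted, and the function names, the sigmoid squashing, and the additive fusion of the two weight sets are all assumptions made for illustration. It only shows how a per-channel weight could be derived from the global mean (first order) and from inner products among channels (second order). Plain Python lists are used so the example is self-contained.

```python
import math


def sigmoid(v):
    """Squash a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def first_order_weights(features):
    """One weight per channel from its global mean (global average pooling).

    `features` is a list of channels, each a flat list of pixel values.
    """
    return [sigmoid(sum(ch) / len(ch)) for ch in features]


def second_order_weights(features):
    """One weight per channel from its inner products with all channels."""
    n_pixels = len(features[0])
    n_channels = len(features)
    weights = []
    for ch_i in features:
        # Inner product of channel i with every channel, normalized by pixel count.
        row = [sum(a * b for a, b in zip(ch_i, ch_j)) / n_pixels for ch_j in features]
        # Assumption: summarize the row by its mean before squashing.
        weights.append(sigmoid(sum(row) / n_channels))
    return weights


def hybrid_rescale(features):
    """Rescale each channel by the sum of both weights (hypothetical fusion)."""
    w1 = first_order_weights(features)
    w2 = second_order_weights(features)
    return [[(a + b) * v for v in ch] for ch, a, b in zip(features, w1, w2)]


# Two channels with two pixels each.
rescaled = hybrid_rescale([[0.0, 0.0], [1.0, 1.0]])
```

In this toy input the all-zero channel stays zero, while the second channel is scaled by the sum of its first order and second order weights.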


This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61922029, 61771192), the National Natural Science Foundation of China for International Cooperation and Exchanges (Grant No. 61520106001), and the Huxiang Young Talents Plan Project of Hunan Province (Grant No. 2019RS2016).


[1] Jensen J R, Cowen D C. Remote sensing of urban/suburban infrastructure and socio-economic attributes. Photogramm Eng Remote Sens, 1999, 65: 611-622

[2] Yuan J. Learning building extraction in aerial scenes with convolutional networks. IEEE Trans Pattern Anal Mach Intell, 2018, 40: 2793-2798

[3] Liow Y T, Pavlidis T. Use of shadows for extracting buildings in aerial images. Comput Vision Graph Image Process, 1990, 49: 242-277

[4] Ok A O. Automated detection of buildings from single VHR multispectral images using shadow information and graph cuts. ISPRS J Photogrammetry Remote Sens, 2013, 86: 21-40

[5] Inglada J. Automatic recognition of man-made objects in high resolution optical remote sensing images by SVM classification of geometric image features. ISPRS J Photogrammetry Remote Sens, 2007, 62: 236-248

[6] Karantzalos K, Paragios N. Recognition-driven two-dimensional competing priors toward automatic and accurate building detection. IEEE Trans Geosci Remote Sens, 2009, 47: 133-144

[7] Kim T, Muller J. Development of a graph-based approach for building detection. Image Vision Comput, 1999, 17: 3-14

[8] Li E, Femiani J, Xu S. Robust rooftop extraction from visible band images using higher order CRF. IEEE Trans Geosci Remote Sens, 2015, 53: 4483-4495

[9] Yang H L, Yuan J, Lunga D. Building extraction at scale using convolutional neural network: mapping of the United States. IEEE J Sel Top Appl Earth Observations Remote Sens, 2018, 11: 2600-2614

[10] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems 25, Curran Associates, 2012. 1097-1105

[11] Zhou Q, Wang Y, Liu J. An open-source project for real-time image semantic segmentation. Sci China Inf Sci, 2019, 62: 227101

[12] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015

[13] Wang W, Gao W, Hu Z. Effectively modeling piecewise planar urban scenes based on structure priors and CNN. Sci China Inf Sci, 2019, 62: 29102

[14] Ronneberger O, Fischer P, Brox T. Unet: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015. Cham: Springer International Publishing, 2015. 234-241

[15] Lu Y, Zhen M, Fang T. Multi-view based neural network for semantic segmentation on 3D scenes. Sci China Inf Sci, 2019, 62: 229101

[16] Badrinarayanan V, Kendall A, Cipolla R. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 2481-2495

[17] Geng Q, Zhou Z, Cao X. Survey of recent progress in semantic image segmentation with CNNs. Sci China Inf Sci, 2018, 61: 051101

[18] Haut J M, Paoletti M E, Plaza J. Visual attention-driven hyperspectral image classification. IEEE Trans Geosci Remote Sens, 2019, 57: 8065-8080

[19] He N, Fang L, Li S. Remote sensing scene classification using multilayer stacked covariance pooling. IEEE Trans Geosci Remote Sens, 2018, 56: 6899-6910

[20] He N, Fang L, Li S, et al. Skip-connected covariance network for remote sensing scene classification. In: IEEE Transactions on Neural Networks and Learning Systems, IEEE, 2019. 1-14

[21] Lin T Y, Maji S. Improved bilinear pooling with CNNs. In: British Machine Vision Conference (BMVC), 2017

[22] Lin T Y, RoyChowdhury A, Maji S. Bilinear CNN models for fine-grained visual recognition. In: International Conference on Computer Vision (ICCV), 2015. 1449-1457

[23] Mnih V. Machine learning for aerial image labeling. PhD Dissertation. Toronto: University of Toronto, 2013

[24] Ji S, Wei S, Lu M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans Geosci Remote Sens, 2019, 57: 574-586

[25] Maggiori E, Tarabalka Y, Charpiat G, et al. Can semantic labeling methods generalize to any city? The Inria aerial image labeling benchmark. In: IEEE International Geoscience and Remote Sensing Symposium (IGARSS). Fort Worth: IEEE, 2017. 3226-3229

  • Figure 1

    (Color online) Block diagram of the proposed HFSA-Unet for building segmentation. The HFSA is equipped with skip connections to adaptively rescale intermediate features in the encoding stage with weights learned from the correlation of intermediate features in the decoding stage (i.e., the gating feature).

  • Figure 2

    Block diagram of the proposed HFSA network. The HFSA consists of two basic modules: first order channel attention (FOCA) and second order channel attention (SOCA). The GAP denotes global average pooling. $\bigotimes$ denotes element-wise multiplication, while $\bigoplus$ denotes element-wise addition.

  • Figure 3

    (Color online) Some samples from the four considered image data sets. (a) Massachusetts buildings data set. (b) Inria building data set. (c) Hunan University building data set. The part above the yellow dotted line (i.e., Pailou road) is the test set, while the part below is the training set. (d) Wuhan University building data set.

  • Figure 4

    (Color online) Segmentation maps obtained by Unet and the proposed model of one representative image on the MBD dataset. From the first column to the last column, we display the input image, the ground-truth, the segmentation result obtained by Unet, and the segmentation result obtained by the proposed HFSA.

  • Figure 7

    Segmentation maps obtained by Unet and the proposed model of two representative images on the WHUBD dataset. From the first column to the last column we display the input image, the ground-truth, the segmentation result obtained by Unet, and the segmentation result obtained by the proposed HFSA-Unet.

  • Table 1   Comparison of segmentation results on four different data sets$^{\rm~a)}$
    Data    Method                     Precision  Recall  F1-score  IoU
    MBD     FCN[12]                    77.22      71.19   73.96     58.75
    MBD     MLP[1]                     75.26      76.69   75.87     61.20
    MBD     Unet[14]                   81.76      77.45   79.36     65.95
    MBD     Unet$^{\rm~b)}$[24]        68.10      74.60   -         55.20
    MBD     SegNet[16]                 69.82      75.21   72.08     56.57
    MBD     HFSA-Unet                  84.75      79.08   81.75     69.23
    IBD     FCN[12]                    88.15      88.47   88.29     79.07
    IBD     MLP[1]                     85.46      87.88   86.62     76.43
    IBD     Unet[14]                   91.57      86.08   88.68     79.72
    IBD     Unet$^{\rm~b)}$[24]        84.60      82.10   -         71.40
    IBD     SegNet[16]                 88.97      89.30   89.12     80.41
    IBD     HFSA-Unet                  92.30      89.89   91.07     83.63
    HNUBD   FCN[12]                    72.94      71.14   72.03     56.29
    HNUBD   MLP[1]                     68.50      67.74   68.12     51.66
    HNUBD   Unet[14]                   76.01      66.83   71.13     55.19
    HNUBD   SegNet[16]                 69.40      68.51   68.95     52.61
    HNUBD   HFSA-Unet                  76.31      71.65   73.90     58.61
    WHUBD   FCN[12]                    91.25      92.56   91.89     85.00
    WHUBD   MLP[1]                     90.84      91.25   91.04     83.56
    WHUBD   Unet[14]                   94.73      92.15   93.42     87.66
    WHUBD   Unet$^{\rm~b)}$[24]        90.03      94.50   -         86.80
    WHUBD   SegNet[16]                 91.93      91.97   91.95     85.10
    WHUBD   SiU-net$^{\rm~b)}$[24]     93.80      93.90   -         88.40
    WHUBD   HFSA-Unet                  95.09      95.18   95.13     90.72

    a) The best value is highlighted in bold. The average values over the whole data set are reported. b) These results are taken directly from the cited paper, while the others were obtained with our own implementations.

  • Table 2   Comparison of segmentation performance between the proposed model and its variants
    Data    FOCA           SOCA           Precision  Recall  F1-score  IoU
    MBD     -              -              81.76      77.45   79.36     65.95
    MBD     $\checkmark$   -              82.80      77.92   80.19     67.02
    MBD     -              $\checkmark$   80.84      80.92   80.69     67.79
    MBD     $\checkmark$   $\checkmark$   84.75      79.08   81.75     69.23
    IBD     $\checkmark$   -              90.74      90.25   90.24     82.64
    IBD     -              $\checkmark$   91.13      88.72   89.89     81.66
    IBD     $\checkmark$   $\checkmark$   92.30      89.89   91.07     83.63
    HNUBD   $\checkmark$   -              76.21      69.21   72.54     56.91
    HNUBD   -              $\checkmark$   75.52      69.32   72.28     56.60
    HNUBD   $\checkmark$   $\checkmark$   76.31      71.65   73.90     58.61
    WHUBD   $\checkmark$   -              93.17      95.34   94.25     89.12
    WHUBD   -              $\checkmark$   94.42      94.34   94.38     89.36
    WHUBD   $\checkmark$   $\checkmark$   95.09      95.18   95.13     90.72
  • Table 3   Generalization ability comparison via transfer learning from source dataset to ($\rightarrow$) target dataset$^{\rm~a)}$
    Method        MBD$\rightarrow$HNUBD      WHUBD$\rightarrow$HNUBD
                  F1-score  IoU              F1-score  IoU
    FCN[12]       20.48     11.41            20.91     11.59
    MLP[1]        30.69     18.13            34.21     20.64
    Unet[14]      20.45     11.38            39.26     24.42
    SegNet[16]    22.44     12.64            37.37     22.97
    HFSA-Unet     35.88     21.86            42.04     26.62
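
The metrics reported in Tables 1 to 3 are not defined on this page. As a rough sketch, assuming the standard pixel-wise definitions for binary building masks (1 = building, 0 = background), they could be computed as follows. The function name and the flat-list input format are hypothetical choices for illustration.

```python
def segmentation_metrics(pred, truth):
    """Return (precision, recall, f1, iou) for two flat sequences of 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)  # true positives
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)  # false positives
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou


# Toy example: 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 1, 0, 0, 0]
truth = [1, 1, 0, 1, 0, 0]
precision, recall, f1, iou = segmentation_metrics(pred, truth)
```

When all four values come from the same pixel counts, IoU and F1-score are tied by IoU = F1 / (2 - F1), which is a quick consistency check on a single result. Values averaged over a whole data set need not satisfy it exactly.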

