
SCIENCE CHINA Information Sciences, Volume 64, Issue 3: 130105 (2021) https://doi.org/10.1007/s11432-020-3065-4

Deep graph cut network for weakly-supervised semantic segmentation

  • Received: Mar 15, 2020
  • Accepted: Jul 30, 2020
  • Published: Feb 7, 2021

Abstract


Acknowledgment

This work was supported by National Natural Science Foundation of China (Grant Nos. 61876212, 61733007) and Zhejiang Lab (Grant No. 2019NB0AB02).


References

[1] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems, 2012. 1097--1105.

[2] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. 3431--3440.

[3] Huang Z L, Wang X G, Wei Y C, et al. CCNet: criss-cross attention for semantic segmentation. 2020. arXiv:1811.11721.

[4] Kolesnikov A, Lampert C H. Seed, expand and constrain: three principles for weakly-supervised image segmentation. In: Proceedings of European Conference on Computer Vision. Berlin: Springer, 2016. 695--711.

[5] Huang Z, Wang X, Wang J, et al. Weakly-supervised semantic segmentation network with deep seeded region growing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 7014--7023.

[6] Zhou Z H. A brief introduction to weakly supervised learning. Natl Sci Rev, 2018, 5: 44--53.

[7] Wei Y, Xiao H, Shi H, et al. Revisiting dilated convolution: a simple approach for weakly- and semi-supervised semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 7268--7277.

[8] Tang P, Wang X G, Bai S, et al. PCL: proposal cluster learning for weakly supervised object detection. IEEE Trans Pattern Anal Mach Intell, 2020, 42: 176--191.

[9] Wang X G, Deng X B, Fu Q, et al. A weakly-supervised framework for COVID-19 classification and lesion localization from chest CT. IEEE Trans Med Imaging, 2020, 39: 2615--2625.

[10] Zhou B, Khosla A, Lapedriza A, et al. Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 2921--2929.

[11] Lee J, Kim E, Lee S, et al. FickleNet: weakly and semi-supervised semantic image segmentation using stochastic inference. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 5267--5276.

[12] Wei Y, Feng J, Liang X, et al. Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 1568--1576.

[13] Ahn J, Kwak S. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 4981--4990.

[14] Fan J S, Zhang Z X, Tan T N. CIAN: cross-image affinity net for weakly supervised semantic segmentation. 2018. arXiv.

[15] Boykov Y Y, Jolly M P. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In: Proceedings of the 8th IEEE International Conference on Computer Vision, 2001. 105--112.

[16] Boykov Y, Kolmogorov V. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans Pattern Anal Mach Intell, 2004, 26: 1124--1137.

[17] Dai J, He K, Sun J. BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, 2015. 1635--1643.

[18] Tang M, Perazzi F, Djelouah A, et al. On regularized losses for weakly-supervised CNN segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), 2018. 507--522.

[19] Papandreou G, Chen L C, Murphy K, et al. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. 2015. arXiv.

[20] Tang M, Djelouah A, Perazzi F, et al. Normalized cut loss for weakly-supervised CNN segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 1818--1827.

[21] Bearman A, Russakovsky O, Ferrari V, et al. What's the point: semantic segmentation with point supervision. In: Proceedings of European Conference on Computer Vision. Berlin: Springer, 2016. 549--565.

[22] Lee J, Kim E, Lee S, et al. Frame-to-frame aggregation of active regions in web videos for weakly supervised semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, 2019. 6808--6818.

[23] Yang J, Sun X, Lai Y K, et al. Recognition from web data: a progressive filtering approach. IEEE Trans Image Process, 2018, 27: 5303--5315.

[24] Wang Y, Zhang J, Kan M, et al. Self-supervised scale equivariant network for weakly supervised semantic segmentation. 2019. arXiv.

[25] Zhang B, Xiao J, Wei Y, et al. Reliability does matter: an end-to-end weakly supervised semantic segmentation approach. 2019. arXiv.

[26] Gao L, Song J, Nie F, et al. Graph-without-cut: an ideal graph learning for image segmentation. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016.

[27] Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. 2013. arXiv.

[28] Selvaraju R R, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, 2017. 618--626.

[29] Jiang H, Wang J, Yuan Z, et al. Salient object detection: a discriminative regional feature integration approach. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013. 2083--2090.

[30] Everingham M, Eslami S M A, Van Gool L, et al. The PASCAL visual object classes challenge: a retrospective. Int J Comput Vis, 2015, 111: 98--136.

[31] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: common objects in context. In: Proceedings of European Conference on Computer Vision. Berlin: Springer, 2014. 740--755.

[32] Hariharan B, Arbeláez P, Bourdev L, et al. Semantic contours from inverse detectors. In: Proceedings of 2011 International Conference on Computer Vision, 2011. 991--998.

[33] Chen L C, Papandreou G, Kokkinos I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs. 2014. arXiv.

[34] Lin D, Dai J, Jia J, et al. ScribbleSup: scribble-supervised convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 3159--3167.

[35] Wang X, You S, Li X, et al. Weakly-supervised semantic segmentation by iteratively mining common object features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 1354--1362.

[36] Chaudhry A, Dokania P K, Torr P H. Discovering class-specific pixels for weakly-supervised semantic segmentation. 2017. arXiv.

[37] Hou Q, Jiang P, Wei Y, et al. Self-erasing network for integral object attention. In: Proceedings of Advances in Neural Information Processing Systems, 2018. 549--559.

[38] Shimoda W, Yanai K. Self-supervised difference detection for weakly-supervised semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, 2019. 5208--5217.

[39] Zhou D, Bousquet O, Lal T N, et al. Learning with local and global consistency. In: Proceedings of Advances in Neural Information Processing Systems, 2004. 321--328.

[40] Paszke A, Gross S, Chintala S, et al. Automatic differentiation in PyTorch. In: Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, 2017.

[41] Wu Z, Shen C, van den Hengel A. Wider or deeper: revisiting the ResNet model for visual recognition. Pattern Recogn, 2019, 90: 119--133.

  • Figure 1

    (Color online) The top row shows the original images and the bottom row shows the CAM results, which localize only the most discriminative regions of objects.
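
The CAMs in Figure 1 follow Zhou et al. [10]: each class map is the sum of the last convolutional feature maps weighted by the classifier weights of that class. Below is a minimal sketch of this computation; the function and variable names are ours, and the normalization to [0, 1] is a common convention rather than something the paper specifies.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weight, class_idx):
    """Compute a CAM in the style of Zhou et al. [10] for one class.

    features:  (C, H, W) feature maps from the last conv layer.
    fc_weight: (num_classes, C) weights of the linear classifier that
               follows global average pooling.
    class_idx: index of the target class.
    """
    # M_c(x, y) = sum_k w_k^c * f_k(x, y): weight each channel by the
    # classifier weight for the class and sum over channels.
    cam = torch.einsum('c,chw->hw', fc_weight[class_idx], features)
    cam = F.relu(cam)                   # keep positive evidence only
    cam = cam - cam.min()
    cam = cam / (cam.max() + 1e-8)      # normalize to [0, 1]
    return cam                          # (H, W); high values = discriminative regions
```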

  • Figure 2

    (Color online) (a) A simple 3$\times$3 image with two seed pixels. The pixel marked with "F" represents the foreground seed, the other marked pixel represents the background seed, and the remaining pixels are unknown. (b) The constructed graph, in which the cost of each edge is reflected by its thickness. (c) The min-cut algorithm tends to cut the edges with minimum costs. (d) Segmentation results.
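
The toy example in Figure 2 can be reproduced with any max-flow/min-cut solver. The sketch below uses networkx (our choice, not the paper's); the image values, seed positions, and contrast-based n-link weights are illustrative, and the t-link convention follows the classic interactive graph cuts formulation [15].

```python
import networkx as nx
import numpy as np

# Toy 3x3 "image": high values look like foreground.
img = np.array([[0.9, 0.8, 0.1],
                [0.9, 0.2, 0.1],
                [0.2, 0.1, 0.1]])
fg_seed, bg_seed = (0, 0), (2, 2)   # seeds as in Figure 2(a)

G = nx.DiGraph()
S, T = 's', 't'                     # foreground / background terminals
HARD = 1e9                          # hard-constraint weight (the role of C_1)
eps = 1e-6

for p in np.ndindex(*img.shape):
    if p == fg_seed:                # seed t-links are hard constraints
        G.add_edge(S, p, capacity=HARD)
    elif p == bg_seed:
        G.add_edge(p, T, capacity=HARD)
    else:                           # data-term t-links for unknown pixels
        G.add_edge(S, p, capacity=-np.log(1.0 - img[p] + eps))
        G.add_edge(p, T, capacity=-np.log(img[p] + eps))
    # 4-connected n-links: similar neighbors are expensive to separate.
    for q in ((p[0] + 1, p[1]), (p[0], p[1] + 1)):
        if q[0] < img.shape[0] and q[1] < img.shape[1]:
            w = float(np.exp(-(img[p] - img[q]) ** 2 / 0.1))
            G.add_edge(p, q, capacity=w)
            G.add_edge(q, p, capacity=w)

cut_value, (fg_side, bg_side) = nx.minimum_cut(G, S, T)
print(sorted(n for n in fg_side if n != S))   # pixels labeled foreground
```

For real images, a dedicated solver such as the Boykov-Kolmogorov max-flow implementation [16] is far faster than a general-purpose graph library.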

  • Figure 3

    (Color online) The framework of the deep graph cut network for weakly supervised semantic segmentation. Using image-level labels, we train a classification network to localize parts of objects as seed cues. The graph cut module then takes the seed cues, the feature maps extracted from the backbone, and the segmentation map as input, and produces more accurate pixel-level supervision. The segmentation network is trained with two loss functions.
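
To make the data flow in Figure 3 concrete, here is a schematic PyTorch-style training step. It is a sketch only: `seg_net`, `cls_net.seed_cues`, and `graph_cut` are hypothetical placeholders for the segmentation network, the seed-cue generator, and the graph cut module, and the two cross-entropy terms stand in for the paper's two loss functions.

```python
import torch
import torch.nn.functional as F

def train_step(images, image_labels, seg_net, cls_net, graph_cut, optimizer):
    """One schematic training step for the Figure 3 pipeline (names are placeholders)."""
    seg_logits, features = seg_net(images)   # segmentation map + backbone features
    with torch.no_grad():
        # Sparse seed labels from the trained classification network; 255 = unknown.
        seeds = cls_net.seed_cues(images, image_labels)
        # The graph cut module expands seeds into denser pseudo ground truth.
        pseudo_gt = graph_cut(seeds, features.detach(), seg_logits.detach())

    # Loss 1: supervision on the sparse, high-confidence seed pixels.
    loss_seed = F.cross_entropy(seg_logits, seeds, ignore_index=255)
    # Loss 2: supervision from the graph-cut pseudo labels.
    loss_gc = F.cross_entropy(seg_logits, pseudo_gt, ignore_index=255)

    loss = loss_seed + loss_gc
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```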

  • Figure 4

    (Color online) Qualitative segmentation results on the PASCAL VOC 2012 validation set. Two failure cases are shown in the last row.

  • Table 1  

    Table 1  Weights of edges in $E$

    Edge      | Weight             | For
    $\{p,q\}$ | $C_0$              | $W_{pq} \neq 0$
    $\{p,S\}$ | $-\log(H_u^c)$     | $p \in P$, $p \notin F \cup B$
              | $C_1$              | $p \in F$
              | $0$                | $p \in B$
    $\{p,T\}$ | $-\log(1 - H_u^c)$ | $p \in P$, $p \notin F \cup B$
              | $0$                | $p \in F$
              | $C_1$              | $p \in B$
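
As a concrete reading of Table 1, the terminal-link weights ($\{p,S\}$ and $\{p,T\}$) could be assigned as below. This is a sketch: the function name, the clipping epsilon, and the numeric stand-in for the large constant $C_1$ are our assumptions; the n-links $\{p,q\}$ with weight $C_0$ where $W_{pq} \neq 0$ are omitted.

```python
import numpy as np

def t_link_weights(H_c, fg_mask, bg_mask, C1=1e9, eps=1e-6):
    """Terminal-link weights following Table 1.

    H_c:     (H, W) class heatmap H_u^c in [0, 1] for class c.
    fg_mask: boolean mask of foreground seed pixels (set F).
    bg_mask: boolean mask of background seed pixels (set B).
    Returns (w_S, w_T): weights of the {p,S} and {p,T} edges per pixel.
    """
    # Unknown pixels: data terms from the heatmap, as in Table 1.
    w_S = -np.log(np.clip(H_c, eps, 1.0))
    w_T = -np.log(np.clip(1.0 - H_c, eps, 1.0))
    # Seeds are hard constraints: C_1 ties a seed to its own terminal,
    # 0 makes the edge to the other terminal free to cut.
    w_S[fg_mask], w_T[fg_mask] = C1, 0.0
    w_S[bg_mask], w_T[bg_mask] = 0.0, C1
    return w_S, w_T
```
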
  • Table 2  

    Table 2  Comparison of weakly-supervised semantic segmentation methods on PASCAL VOC 2012 validation and test image sets

    Method                    | Supervision    | Training set | Val  | Test
    FCN$\dagger$ [2]          | Pixel-level    | 9k           | 62.2 | –
    DeepLab$\dagger$ [33]     | Pixel-level    | 10k          | 67.6 | 70.3
    BoxSup$\dagger$ [17]      | Box-level      | 10k          | 62.0 | 64.6
    ScribbleSup$\dagger$ [34] | Scribble-level | 10k          | 63.1 | –
    SEC$\dagger$ [4]          | Image-level    | 10k          | 50.7 | 51.1
    AE-PSL$\dagger$ [12]      | Image-level    | 10k          | 55.0 | 55.7
    MCOF [35]                 | Image-level    | 10k          | 60.3 | 61.2
    DCSP [36]                 | Image-level    | 10k          | 60.8 | 61.9
    SeeNet$\dagger$ [37]      | Image-level    | 10k          | 61.1 | 60.7
    SeeNet [37]               | Image-level    | 10k          | 63.1 | 62.8
    DSRG [5]                  | Image-level    | 10k          | 61.4 | 63.2
    AffinityNet$\dagger$ [13] | Image-level    | 10k          | 58.4 | 60.5
    AffinityNet* [13]         | Image-level    | 10k          | 61.7 | 63.7
    CIAN [14]                 | Image-level    | 10k          | 64.1 | 64.7
    SSENet* [24]              | Image-level    | 10k          | 63.3 | 64.9
    FickleNet [11]            | Image-level    | 10k          | 64.9 | 65.3
    SSDD* [38]                | Image-level    | 10k          | 64.9 | 65.5
    RRM [25]                  | Image-level    | 10k          | 66.3 | 66.5
    DGCN*                     | Image-level    | 10k          | 64.0 | 64.6
  • Table 3  

    Table 3  Comparison of mIoU of our approach with different settings on PASCAL VOC 2012 validation set

    Method mIoU bkg airplane bike bird boat bottle bus car cat chair
    DSRG 59.3 87.0 65.3 32.5 71.1 38.2 66.9 78.1 68.4 80.4 27.6
    + Graph Cut 60.8 87.6 72.0 34.5 71.5 39.1 67.3 80.3 70.5 80.7 24.7
    + ResNet-38 62.7 88.0 76.4 35.2 76.9 44.7 72.8 82.5 74.6 79.0 25.4
    + Retrain 64.0 88.7 77.8 35.9 78.5 45.8 73.7 82.1 76.1 79.6 26.6
    Method cow table dog horse mbike person plant sheep sofa train tv
    DSRG 63.1 19.1 76.4 67.9 68.7 70.8 42.0 76.2 33.8 52.1 59.1
    + Graph Cut 63.8 29.9 73.1 69.5 69.9 71.8 43.1 79.2 33.9 55.7 59.0
    + ResNet-38 71.6 29.2 74.4 71.4 71.9 70.8 49.5 76.5 34.0 51.5 60.1
    + Retrain 73.2 30.5 75.6 72.9 71.2 73.1 50.8 79.7 35.3 55.2 61.3
  • Table 4  

    Table 4  Performance with different values of $K$ on PASCAL VOC 2012 validation set with $\theta=0.7$

    Method     | $K=6$ | $K=8$–$12$ | $K=14$
    Our method | 62.5  | 62.7       | 62.4
  • Table 5  

    Table 5  Performance with different values of $\theta$ on PASCAL VOC 2012 validation set with $K=9$

    Method     | $\theta=0.0$ | $\theta=0.3$ | $\theta=0.5$ | $\theta=0.7$ | $\theta=0.9$
    Our method | 62.4         | 62.3         | 62.6         | 62.7         | 62.0
  • Table 6  

    Table 6  Comparisons of per-class IoU on COCO validation set

    Class DSRG Ours Class DSRG Ours Class DSRG Ours
    background 78.3 81.1 handbag 4.1 3.6 pizza 16.8 16.7
    person 58.7 58.3 tie 7.1 3.5 donut 23.5 22.8
    bicycle 28.2 32.6 suitcase 27.0 27.0 cake 9.0 14.6
    car 30.0 30.6 frisbee 21.1 17.8 chair 16.8 15.7
    motorcycle 44.2 47.5 skis 8.4 10.4 couch 18.3 19.3
    airplane 41.0 45.3 snowboard 10.4 13.6 potted plant 19.6 12.6
    bus 47.0 49.3 sports ball 18.3 19.0 bed 31.7 32.3
    train 46.2 47.9 kite 21.3 22.1 dining table 20.0 10.6
    truck 30.6 31.6 baseball bat 4.6 7.6 toilet 49.1 47.4
    boat 21.8 27.2 baseball glove 5.1 8.5 tv 17.8 20.3
    traffic light 23.6 21.8 skateboard 15.0 16.4 laptop 33.6 29.5
    fire hydrant 40.5 43.8 surfboard 23.2 25.1 mouse 18.7 21.5
    stop sign 59.2 58.5 tennis racket 23.4 34.0 remote 26.6 21.2
    parking meter 29.8 33.9 bottle 12.6 20.4 keyboard 28.4 37.5
    bench 20.2 25.7 wine glass 21.7 23.3 cell phone 32.8 26.2
    bird 30.0 33.3 cup 16.7 21.3 microwave 19.1 20.7
    cat 58.0 55.4 fork 10.7 9.6 oven 22.2 25.9
    dog 43.0 39.8 knife 3.9 4.3 toaster 0.0 0.0
    horse 39.9 38.2 spoon 2.3 3.3 sink 28.5 28.4
    sheep 40.4 40.7 bowl 23.7 19.2 refrigerator 21.2 30.2
    cow 33.7 37.7 banana 37.8 31.6 book 12.1 10.1
    elephant 62.0 62.9 apple 19.9 24.0 clock 33.4 40.5
    bear 50.1 48.1 sandwich 7.1 6.5 vase 22.6 20.1
    zebra 68.4 75.0 orange 18.0 15.8 scissors 13.3 15.5
    giraffe 63.7 66.1 broccoli 24.8 29.5 teddy bear 37.2 37.1
    backpack 11.6 9.1 carrot 11.8 3.1 hair drier 0.0 0.0
    umbrella 33.2 37.5 hot dog 12.2 11.2 toothbrush 1.5 3.6