This work was supported by the National Natural Science Foundation of China (Grant Nos. 61876212, 61733007) and Zhejiang Lab (Grant No. 2019NB0AB02).
[1] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems, 2012. 1097--1105.
[2] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. 3431--3440.
[3] Huang Z L, Wang X G, Wei Y C, et al. CCNet: criss-cross attention for semantic segmentation. 2020. arXiv:1811.11721.
[4] Kolesnikov A, Lampert C H. Seed, expand and constrain: three principles for weakly-supervised image segmentation. In: Proceedings of European Conference on Computer Vision. Berlin: Springer, 2016. 695--711.
[5] Huang Z, Wang X, Wang J, et al. Weakly-supervised semantic segmentation network with deep seeded region growing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 7014--7023.
[6] Zhou Z H. A brief introduction to weakly supervised learning. Natl Sci Rev, 2018, 5: 44--53.
[7] Wei Y, Xiao H, Shi H, et al. Revisiting dilated convolution: a simple approach for weakly- and semi-supervised semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 7268--7277.
[8] Tang P, Wang X G, Bai S, et al. PCL: proposal cluster learning for weakly supervised object detection. IEEE Trans Pattern Anal Mach Intell, 2020, 42: 176--191.
[9] Wang X G, Deng X B, Fu Q, et al. A weakly-supervised framework for COVID-19 classification and lesion localization from chest CT. IEEE Trans Med Imaging, 2020, 39: 2615--2625.
[10] Zhou B, Khosla A, Lapedriza A, et al. Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 2921--2929.
[11] Lee J, Kim E, Lee S, et al. FickleNet: weakly and semi-supervised semantic image segmentation using stochastic inference. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019. 5267--5276.
[12] Wei Y, Feng J, Liang X, et al. Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 1568--1576.
[13] Ahn J, Kwak S. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 4981--4990.
[14] Fan J S, Zhang Z X, Tan T N. CIAN: cross-image affinity net for weakly supervised semantic segmentation. 2018. arXiv preprint.
[15] Boykov Y Y, Jolly M P. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In: Proceedings of the 8th IEEE International Conference on Computer Vision, 2001. 105--112.
[16] Boykov Y, Kolmogorov V. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans Pattern Anal Mach Intell, 2004, 26: 1124--1137.
[17] Dai J, He K, Sun J. BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, 2015. 1635--1643.
[18] Tang M, Perazzi F, Djelouah A, et al. On regularized losses for weakly-supervised CNN segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), 2018. 507--522.
[19] Papandreou G, Chen L C, Murphy K, et al. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. 2015. arXiv preprint.
[20] Tang M, Djelouah A, Perazzi F, et al. Normalized cut loss for weakly-supervised CNN segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 1818--1827.
[21] Bearman A, Russakovsky O, Ferrari V, et al. What's the point: semantic segmentation with point supervision. In: Proceedings of European Conference on Computer Vision. Berlin: Springer, 2016. 549--565.
[22] Lee J, Kim E, Lee S, et al. Frame-to-frame aggregation of active regions in web videos for weakly supervised semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, 2019. 6808--6818.
[23] Yang J, Sun X, Lai Y K, et al. Recognition from web data: a progressive filtering approach. IEEE Trans Image Process, 2018, 27: 5303--5315.
[24] Wang Y, Zhang J, Kan M, et al. Self-supervised scale equivariant network for weakly supervised semantic segmentation. 2019. arXiv preprint.
[25] Zhang B, Xiao J, Wei Y, et al. Reliability does matter: an end-to-end weakly supervised semantic segmentation approach. 2019. arXiv preprint.
[26] Gao L, Song J, Nie F, et al. Graph-without-cut: an ideal graph learning for image segmentation. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016.
[27] Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. 2013. arXiv preprint.
[28] Selvaraju R R, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, 2017. 618--626.
[29] Jiang H, Wang J, Yuan Z, et al. Salient object detection: a discriminative regional feature integration approach. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013. 2083--2090.
[30] Everingham M, Eslami S M A, Van Gool L, et al. The PASCAL visual object classes challenge: a retrospective. Int J Comput Vis, 2015, 111: 98--136.
[31] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: common objects in context. In: Proceedings of European Conference on Computer Vision. Berlin: Springer, 2014. 740--755.
[32] Hariharan B, Arbeláez P, Bourdev L, et al. Semantic contours from inverse detectors. In: Proceedings of 2011 International Conference on Computer Vision, 2011. 991--998.
[33] Chen L C, Papandreou G, Kokkinos I, et al. Semantic image segmentation with deep convolutional nets and fully connected CRFs. 2014. arXiv preprint.
[34] Lin D, Dai J, Jia J, et al. ScribbleSup: scribble-supervised convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 3159--3167.
[35] Wang X, You S, Li X, et al. Weakly-supervised semantic segmentation by iteratively mining common object features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 1354--1362.
[36] Chaudhry A, Dokania P K, Torr P H. Discovering class-specific pixels for weakly-supervised semantic segmentation. 2017. arXiv preprint.
[37] Hou Q, Jiang P, Wei Y, et al. Self-erasing network for integral object attention. In: Proceedings of Advances in Neural Information Processing Systems, 2018. 549--559.
[38] Shimoda W, Yanai K. Self-supervised difference detection for weakly-supervised semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, 2019. 5208--5217.
[39] Zhou D, Bousquet O, Lal T N, et al. Learning with local and global consistency. In: Proceedings of Advances in Neural Information Processing Systems, 2004. 321--328.
[40] Paszke A, Gross S, Chintala S, et al. Automatic differentiation in PyTorch. In: Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, 2017.
[41] Wu Z, Shen C, van den Hengel A. Wider or deeper: revisiting the ResNet model for visual recognition. Pattern Recogn, 2019, 90: 119--133.
Figure 1
(Color online) The top row shows the original images, and the bottom row shows the CAM results, which only localize the most discriminative regions of objects.
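The CAM itself is simple to compute. The sketch below is a minimal illustration of the idea from [10], not the authors' code: the feature maps of the last convolutional layer are weighted by the classifier weights of the target class. The function name and tensor shapes are our assumptions.

```python
import torch
import torch.nn.functional as F

def class_activation_map(features, fc_weight, class_idx):
    """Illustrative CAM in the spirit of Zhou et al. [10].

    features:  (C, H, W) feature maps from the last conv layer
    fc_weight: (num_classes, C) weights of the GAP classifier
    class_idx: index of the target class
    """
    # Weight each feature channel by its contribution to the class score.
    cam = torch.einsum('c,chw->hw', fc_weight[class_idx], features)
    cam = F.relu(cam)               # keep only positive evidence
    cam = cam / (cam.max() + 1e-8)  # normalize to [0, 1]
    return cam
```

Thresholding such a map yields the sparse, high-confidence seed regions visible in the bottom row.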
Figure 2
(Color online) (a) A simple 3$\times$3 image with two seed pixels. The pixel marked with “F” is a “foreground” seed, the other marked pixel is a “background” seed, and the remaining pixels are unknown. (b) The constructed graph, in which the cost of each edge is reflected by the edge's thickness. (c) The min-cut algorithm tends to cut the edges with minimum costs. (d) Segmentation results.
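For readers unfamiliar with the min-cut step in (c), the following is a minimal, self-contained sketch, not the solver used in the paper (which follows [15, 16]): a plain Edmonds-Karp max-flow that returns the source side of the minimum cut. The graph encoding and the toy capacities at the end are our assumptions.

```python
from collections import deque

def min_cut_source_side(capacity, source, sink):
    """Edmonds-Karp max-flow on a directed graph, then return the set of
    nodes on the source side of the minimum cut.
    capacity: dict mapping edge (u, v) -> non-negative capacity."""
    # Add zero-capacity reverse edges so residual edges always exist.
    for (u, v) in list(capacity):
        capacity.setdefault((v, u), 0)
    flow = {e: 0 for e in capacity}
    adj = {}
    for (u, v) in capacity:
        adj.setdefault(u, []).append(v)

    def residual(e):
        return capacity[e] - flow[e]

    def find_augmenting_path():
        # Breadth-first search in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, []):
                if v not in parent and residual((u, v)) > 0:
                    parent[v] = u
                    if v == sink:
                        return parent
                    queue.append(v)
        return None

    while True:
        parent = find_augmenting_path()
        if parent is None:
            break
        # Walk back from the sink to collect the path edges.
        path, v = [], sink
        while v != source:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual(e) for e in path)
        for (u, v) in path:
            flow[(u, v)] += bottleneck
            flow[(v, u)] -= bottleneck

    # The cut separates the nodes still reachable from the source in the
    # residual graph from the rest.
    reachable, queue = {source}, deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in reachable and residual((u, v)) > 0:
                reachable.add(v)
                queue.append(v)
    return reachable

# Toy example: two pixels, each linked to both terminals "S" and "T".
caps = {("S", 0): 5, (0, "T"): 1, ("S", 1): 1, (1, "T"): 5, (0, 1): 2}
print(min_cut_source_side(caps, "S", "T"))  # {'S', 0}: pixel 0 is foreground
```

As in panel (d), the pixels that remain reachable from the source terminal after the cut form the foreground segment.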
Figure 3
(Color online) The framework of the deep graph cut network for weakly supervised semantic segmentation. Using image-level labels, we can train a classification network to localize parts of objects as seed cues. The graph cut module then takes the seed cues, the feature maps extracted from the backbone, and the segmentation map as input, and produces more accurate pixel-level supervision. We train our segmentation network with two loss functions.
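As a rough illustration of one training iteration of this pipeline (the module names, the choice of the two losses, and the `ignore_index=255` convention for unlabeled pixels are our assumptions, not the authors' API):

```python
import torch
import torch.nn.functional as F

def train_step(backbone, seg_head, graph_cut_module, images, seed_cues, optimizer):
    """Hypothetical training step matching Figure 3.
    seed_cues: (N, H, W) long tensor of seed class indices, 255 = unlabeled."""
    features = backbone(images)      # (N, C, H, W) backbone feature maps
    seg_logits = seg_head(features)  # (N, num_classes, H, W) segmentation map

    # The graph cut module is treated as non-differentiable: it refines the
    # sparse seed cues into denser pixel-level pseudo labels using the
    # backbone features and the current segmentation map.
    with torch.no_grad():
        pseudo_labels = graph_cut_module(seed_cues, features,
                                         seg_logits.softmax(dim=1))

    # Two losses: one against the original seed pixels, one against the
    # graph-cut-refined labels.
    loss_seed = F.cross_entropy(seg_logits, seed_cues, ignore_index=255)
    loss_graph = F.cross_entropy(seg_logits, pseudo_labels, ignore_index=255)
    loss = loss_seed + loss_graph

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```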
Figure 4
(Color online) Qualitative segmentation results on the PASCAL VOC 2012 validation set. Two failure cases are shown in the last row.
Edge | Weight | For
$\{p,q\}$ | $C_0$ | $W_{pq} \neq 0$
$\{p,S\}$ | $-\log (H_u^c)$ | $p \in P,\ p \notin F \cup B$
$\{p,S\}$ | $C_1$ | $p \in F$
$\{p,S\}$ | $0$ | $p \in B$
$\{p,T\}$ | $-\log (1-H_u^c)$ | $p \in P,\ p \notin F \cup B$
$\{p,T\}$ | $0$ | $p \in F$
$\{p,T\}$ | $C_1$ | $p \in B$
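The terminal (t-link) weights in this table are straightforward to assemble from a per-class probability map; the n-link weights $C_0$ for $\{p,q\}$ with $W_{pq} \neq 0$ are added separately. Below is a minimal sketch under our own naming assumptions (`prob` for $H_u^c$, boolean seed masks for $F$ and $B$); the small `eps` is for numerical stability only and is not part of the table.

```python
import numpy as np

def terminal_weights(prob, fg_seeds, bg_seeds, c1, eps=1e-8):
    """Build the t-link weights of the table above for one class c.

    prob:     (H, W) array of H_u^c, the predicted probability of class c
    fg_seeds: (H, W) bool mask of pixels in F
    bg_seeds: (H, W) bool mask of pixels in B
    c1:       large constant C_1 that makes seed links effectively uncuttable
    """
    # Unknown pixels get the unary terms -log(H) and -log(1 - H).
    w_source = -np.log(prob + eps)       # weight of edge {p, S}
    w_sink = -np.log(1.0 - prob + eps)   # weight of edge {p, T}
    # Seed pixels are hard-wired to their terminal (cost C_1, 0 to the other).
    w_source[fg_seeds], w_sink[fg_seeds] = c1, 0.0
    w_source[bg_seeds], w_sink[bg_seeds] = 0.0, c1
    return w_source, w_sink
```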
Method | Supervision | Training set | Val | Test
FCN$\dagger$ | Pixel-level | 9k | – | 62.2
Deeplab$\dagger$ | Pixel-level | 10k | 67.6 | 70.3
BoxSup$\dagger$ | Box-level | 10k | 62.0 | 64.6
ScribbleSup$\dagger$ | Scribble-level | 10k | 63.1 | –
SEC$\dagger$ | Image-level | 10k | 50.7 | 51.1
AE-PSL$\dagger$ | Image-level | 10k | 55.0 | 55.7
MCOF | Image-level | 10k | 60.3 | 61.2
DCSP | Image-level | 10k | 60.8 | 61.9
SeeNet$\dagger$ | Image-level | 10k | 61.1 | 60.7
SeeNet | Image-level | 10k | 63.1 | 62.8
DSRG | Image-level | 10k | 61.4 | 63.2
AffinityNet$\dagger$ | Image-level | 10k | 58.4 | 60.5
AffinityNet* | Image-level | 10k | 61.7 | 63.7
CIAN | Image-level | 10k | 64.1 | 64.7
SSENet* | Image-level | 10k | 63.3 | 64.9
FickleNet | Image-level | 10k | 64.9 | 65.3
SSDD* | Image-level | 10k | 64.9 | 65.5
RRM | Image-level | 10k | |
DGCN* | Image-level | 10k | 64.0 | 64.6
Method | mIoU | bkg | airplane | bike | bird | boat | bottle | bus | car | cat | chair
DSRG | 59.3 | 87.0 | 65.3 | 32.5 | 71.1 | 38.2 | 66.9 | 78.1 | 68.4 | 80.4 |
+ Graph Cut | 60.8 | 87.6 | 72.0 | 34.5 | 71.5 | 39.1 | 67.3 | 80.3 | 70.5 | | 24.7
+ ResNet-38 | 62.7 | 88.0 | 76.4 | 35.2 | 76.9 | 44.7 | 72.8 | 74.6 | 79.0 | | 25.4
+ Retrain | | | 82.1 | | 79.6 | | | | | | 26.6
Method | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv
DSRG | 63.1 | 19.1 | 67.9 | 68.7 | 70.8 | | 42.0 | 76.2 | 33.8 | 52.1 | 59.1
+ Graph Cut | 63.8 | 29.9 | 73.1 | 69.5 | 69.9 | 71.8 | 43.1 | 79.2 | 33.9 | | 59.0
+ ResNet-38 | 71.6 | 29.2 | 74.4 | 71.4 | 70.8 | | 49.5 | 76.5 | 34.0 | 51.5 | 60.1
+ Retrain | 75.6 | | | 71.2 | | | | | | 55.2 |
$K=6$ | $K=8$ | $K=10$ | $K=12$ | $K=14$
Our method | 62.5 | | | | 62.4
$\theta=0.0$ | $\theta=0.3$ | $\theta=0.5$ | $\theta=0.7$ | $\theta=0.9$ | |
Our method | 62.4 | 62.3 | 62.6 | 62.0 |
Class | DSRG | Ours | Class | DSRG | Ours | Class | DSRG | Ours |
background | 78.3 | handbag | 3.6 | pizza | 16.7 | |||
person | 58.3 | tie | 3.5 | donut | 22.8 | |||
bicycle | 28.2 | suitcase | 27.0 | 27.0 | cake | 9.0 | ||
car | 30.0 | frisbee | 17.8 | chair | 15.7 | |||
motorcycle | 44.2 | skis | 8.4 | couch | 18.3 | |||
airplane | 41.0 | snowboard | 10.4 | potted plant | 12.6 | |||
bus | 47.0 | sports ball | 18.3 | bed | 31.7 | |||
train | 46.2 | kite | 21.3 | dining table | 10.6 | |||
truck | 30.6 | baseball bat | 4.6 | toilet | 47.4 | |||
boat | 21.8 | baseball glove | 5.1 | tv | 17.8 | |||
traffic light | 21.8 | skateboard | 15.0 | laptop | 29.5 | |||
fire hydrant | 40.5 | surfboard | 23.2 | mouse | 18.7 | |||
stop sign | 58.5 | tennis racket | 23.4 | remote | 21.2 | |||
parking meter | 29.8 | bottle | 12.6 | keyboard | 28.4 | |||
bench | 20.2 | wine glass | 21.7 | cell phone | 26.2 | |||
bird | 30.0 | cup | 16.7 | microwave | 19.1 | |||
cat | 55.4 | fork | 9.6 | oven | 22.2 | |||
dog | 39.8 | knife | 3.9 | toaster | 0.0 | 0.0 | ||
horse | 38.2 | spoon | 2.3 | sink | 28.4 | |||
sheep | 40.4 | bowl | 19.2 | refrigerator | 21.2 | |||
cow | 33.7 | banana | 31.6 | book | 10.1 | |||
elephant | 62.0 | apple | 19.9 | clock | 33.4 | |||
bear | 48.1 | sandwich | 6.5 | vase | 20.1 | |||
zebra | 68.4 | orange | 15.8 | scissors | 13.3 | |||
giraffe | 63.7 | broccoli | 24.8 | teddy bear | 37.1 | |||
backpack | 9.1 | carrot | 3.1 | hair drier | 0.0 | 0.0 | ||
umbrella | 33.2 | hot dog | 11.2 | toothbrush | 1.5 |