
SCIENCE CHINA Information Sciences, Volume 64, Issue 2: 120104 (2021) https://doi.org/10.1007/s11432-020-3156-7

Task-wise attention guided part complementary learning for few-shot image classification

  • Received: Nov 8, 2020
  • Accepted: Dec 24, 2020
  • Published: Jan 20, 2021

Abstract


Acknowledgment

This work was supported by the Science, Technology and Innovation Commission of Shenzhen Municipality (Grant No. JCYJ20180306171131643) and the National Natural Science Foundation of China (Grant No. 61772425).


References

[1] Ren S Q, He K M, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. In: Proceedings of Advances in Neural Information Processing Systems, 2015. 91--99.

[2] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2117--2125.

[3] Redmon J, Divvala S, Girshick R, et al. You only look once: unified, real-time object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 779--788.

[4] Cheng G, Zhou P C, Han J W. RIFD-CNN: rotation-invariant and Fisher discriminative convolutional neural networks for object detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 2884--2893.

[5] Liu W, Anguelov D, Erhan D, et al. SSD: single shot multibox detector. In: Proceedings of European Conference on Computer Vision, 2016. 21--37.

[6] Cheng G, Han J, Zhou P. Learning Rotation-Invariant and Fisher Discriminative Convolutional Neural Networks for Object Detection. IEEE Trans Image Process, 2019, 28: 265-278.

[7] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 770--778.

[8] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 2014. arXiv preprint.

[9] Cheng G, Yang C, Yao X. When Deep Learning Meets Metric Learning: Remote Sensing Image Scene Classification via Learning Discriminative CNNs. IEEE Trans Geosci Remote Sens, 2018, 56: 2811-2821.

[10] Cheng G, Gao D C, Liu Y, et al. Multi-scale and discriminative part detectors based features for multi-label image classification. In: Proceedings of International Joint Conference on Artificial Intelligence, 2018. 649--655.

[11] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015. 3431--3440.

[12] Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, 2015. 1520--1528.

[13] Wang N, Ma S, Li J. Multistage attention network for image inpainting. Pattern Recognition, 2020, 106: 107448.

[14] Song L, Wang C, Zhang L. Unsupervised domain adaptive re-identification: Theory and practice. Pattern Recognition, 2020, 102: 107173.

[15] Wei X S, Wang P, Liu L. Piecewise Classifier Mappings: Learning Fine-Grained Learners for Novel Categories With Few Examples. IEEE Trans Image Process, 2019, 28: 6116-6125.

[16] Ji Z, Chai X, Yu Y. Improved prototypical networks for few-shot learning. Pattern Recognition Lett, 2020, 140: 81-87.

[17] Ji Z, Sun Y, Yu Y. Attribute-Guided Network for Cross-Modal Zero-Shot Hashing. IEEE Trans Neural Netw Learning Syst, 2020, 31: 321-330.

[18] Wang Y Q, Yao Q M, Kwok J T, et al. Generalizing from a few examples: a survey on few-shot learning. 2019. arXiv preprint.

[19] Ji Z, Yan J T, Wang Q, et al. Triple discriminator generative adversarial network for zero-shot image classification. Sci China Inform Sci, 2020. doi: 10.1007/s11432-020-3032-8.

[20] Vilalta R, Drissi Y. A perspective view and survey of meta-learning. Artificial Intelligence Rev, 2002, 18: 77-95.

[21] Bertinetto L, Henriques J F, Torr P H, et al. Meta-learning with differentiable closed-form solvers. In: Proceedings of International Conference on Learning Representations, 2019. 1--15.

[22] Snell J, Swersky K, Zemel R. Prototypical networks for few-shot learning. In: Proceedings of Advances in Neural Information Processing Systems, 2017. 4077--4087.

[23] Vinyals O, Blundell C, Lillicrap T, et al. Matching networks for one shot learning. In: Proceedings of Advances in Neural Information Processing Systems, 2016. 3630--3638.

[24] Andrychowicz M, Denil M, Gomez S, et al. Learning to learn by gradient descent by gradient descent. In: Proceedings of Advances in Neural Information Processing Systems, 2016. 3981--3989.

[25] Ravi S, Larochelle H. Optimization as a model for few-shot learning. In: Proceedings of International Conference on Learning Representations, 2017. 1--11.

[26] Santoro A, Bartunov S, Botvinick M, et al. Meta-learning with memory-augmented neural networks. In: Proceedings of the 33rd International Conference on Machine Learning, 2016. 1842--1850.

[27] Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning, 2017. 1126--1135.

[28] Li Z G, Zhou F W, Chen F, et al. Meta-SGD: learning to learn quickly for few-shot learning. 2017. arXiv preprint.

[29] Jamal M, Qi G J. Task agnostic meta-learning for few-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 11719--11727.

[30] Zhou F W, Wu B, Li Z G. Deep meta-learning: learning to learn in the concept space. 2018. arXiv preprint.

[31] Sun Q R, Liu Y Y, Chua T, et al. Meta-transfer learning for few-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 403--412.

[32] Lee K, Maji S, Ravichandran A, et al. Meta-learning with differentiable convex optimization. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 10657--10665.

[33] Lifchitz Y, Avrithis Y, Picard S, et al. Dense classification and implanting for few-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 9258--9267.

[34] Munkhdalai T, Yu H. Meta networks. In: Proceedings of the 34th International Conference on Machine Learning, 2017. 2554--2563.

[35] Sung F, Yang Y X, Zhang L, et al. Learning to compare: relation network for few-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 1199--1208.

[36] Wang P, Liu L Q, Shen C H, et al. Multi-attention network for one shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 2721--2729.

[37] Li W, Xu J, Huo J. Distribution Consistency Based Covariance Metric Networks for Few-Shot Learning. AAAI, 2019, 33: 8642-8649.

[38] Li W B, Wang L, Xu J L, et al. Revisiting local descriptor based image-to-class measure for few-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 7260--7268.

[39] Li H Y, Eigen D, Dodge S, et al. Finding task-relevant features for few-shot learning by category traversal. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 1--10.

[40] Zhang H G, Zhang J, Koniusz P. Few-shot learning via saliency-guided hallucination of samples. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 2770--2779.

[41] Alfassy A, Karlinsky L, Aides A, et al. LaSO: label-set operations networks for multi-label few-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 6548--6557.

[42] Chen Z T, Fu Y W, Wang Y X, et al. Image deformation meta-networks for one-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 8680--8689.

[43] Chu W H, Li Y J, Chang J C, et al. Spot and learn: a maximum-entropy patch sampler for few-shot image classification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 6251--6260.

[44] Bearman A, Russakovsky O, Ferrari V, Li F. What's the point: semantic segmentation with point supervision. In: Proceedings of the 14th European Conference on Computer Vision, 2016. 549--565.

[45] Wah C, Branson S, Welinder P, et al. The Caltech-UCSD Birds-200-2011 dataset. 2011. https://authors.library.caltech.edu/27452/.

[46] Hilliard N, Phillips L, Howland S, et al. Few-shot learning with metric-agnostic conditional embeddings. 2018. arXiv preprint.

[47] Kim J, Kim T, Kim S, Yoo C D. Edge-labeling graph neural network for few-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 11--20.

[48] Chen W Y, Liu Y C, Kira Z, et al. A closer look at few-shot classification. 2019. arXiv preprint.

[49] Zhang C, Cai Y J, Lin G S, et al. DeepEMD: few-shot image classification with differentiable earth mover's distance and structured classifiers. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2020. 12203--12213.

[50] Ye H J, Hu H X, Zhan D C, et al. Learning embedding adaptation for few-shot learning. 2018. arXiv preprint.

[51] Yang L, Li L L, Zhang Z L, et al. DPGN: distribution propagation graph network for few-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2020. 13390--13399.

[52] Schwartz E, Karlinsky L, Feris R, et al. Baby steps towards few-shot learning with multiple semantics. 2019. arXiv preprint.

[53] Zhang X L, Wei Y C, Feng J S, et al. Adversarial complementary learning for weakly supervised object localization. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 1325--1334.

  • Figure 1

    (Color online) Illustration of the classification network utilized in our method during the meta-training phase. The network framework includes a backbone network, a novel layer, a global average pooling layer, and a classifier. “ConvBlock3” represents the first three blocks of VGG Net. The number at the top (e.g., “28”) denotes the spatial size of the feature map, and that at the bottom (e.g., “512”) denotes the number of channels.
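
    Figure 1 names only the components, so the following is a minimal PyTorch sketch of such a classification network. The exact novel layer, channel widths, and number of meta-training classes are not given in this excerpt; the 3×3 convolution, the 512-channel width, and the 64-class head below are assumptions for illustration.

```python
# Minimal sketch of the Figure 1 classification network (details are assumptions).
import torch
import torch.nn as nn
from torchvision.models import vgg16  # torchvision >= 0.13 for the weights= keyword


class MetaTrainClassifier(nn.Module):
    def __init__(self, num_classes=64):  # number of meta-training classes: assumption
        super().__init__()
        vgg = vgg16(weights=None)
        # "ConvBlock3": first three VGG blocks (up to and including the third max-pool),
        # which map a 224x224 input to a 28x28 feature map with 256 channels.
        self.backbone = nn.Sequential(*list(vgg.features.children())[:17])
        # "Novel layer": assumed here to be one 3x3 convolution widening to 512 channels.
        self.novel = nn.Sequential(
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling layer
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):
        f = self.novel(self.backbone(x))        # B x 512 x 28 x 28
        v = self.gap(f).flatten(1)              # B x 512
        return self.classifier(v)               # B x num_classes


if __name__ == "__main__":
    logits = MetaTrainClassifier()(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 64])
```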

  • Figure 2

    (Color online) Illustration of the proposed TPNet for the 5-way 1-shot task in the training phase. As shown, the TPNet framework consists of a task-wise attention module, which increases the sensitivity of the network to discriminative information, and a part complementary learning module, which learns complementary descriptions. $\tau$ indicates the pre-defined threshold used by the “erasing” operation of the PCL module.
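
    The excerpt only states that $\tau$ thresholds the erasing operation. Below is a minimal sketch of one common form of threshold-based erasing in the spirit of adversarial complementary learning [53]; using the channel-mean activation map and per-image min-max normalization are assumptions, and TPNet's exact formulation may differ.

```python
# Hedged sketch of threshold-based feature erasing for the complementary branch.
import torch


def erase_discriminative_regions(features: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """features: B x C x H x W feature map from the first (discriminative) branch."""
    # Channel-mean activation map, normalized to [0, 1] per image (assumption).
    attn = features.mean(dim=1, keepdim=True)                  # B x 1 x H x W
    amin = attn.amin(dim=(2, 3), keepdim=True)
    amax = attn.amax(dim=(2, 3), keepdim=True)
    attn = (attn - amin) / (amax - amin + 1e-6)
    # Zero out positions whose activation exceeds tau, so the second branch
    # must rely on complementary parts rather than the most discriminative ones.
    keep_mask = (attn < tau).float()
    return features * keep_mask


if __name__ == "__main__":
    x = torch.randn(4, 512, 28, 28)
    print(erase_discriminative_regions(x, tau=0.5).shape)  # torch.Size([4, 512, 28, 28])
```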

  • Figure 3

    (Color online) Illustration of the proposed TPNet for the 5-way 1-shot task in the testing phase. As shown, the part complementary learning module is composed of two branches that learn discriminative and complementary features, respectively. The max fusion module obtains multiple representative features by integrating the information of the two branches. $\tau$ indicates the pre-defined threshold used by the “erasing” operation of the PCL module.
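
    As a rough illustration of the max fusion described above, the sketch below takes the element-wise maximum of the two branch feature maps at test time. Whether the fusion acts on feature maps or on pooled vectors is not stated in this excerpt, so the granularity chosen here is an assumption.

```python
# Hedged sketch of test-time max fusion of the two PCL branches.
import torch


def max_fusion(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Element-wise maximum of branch A (discriminative) and branch B (complementary)."""
    return torch.maximum(feat_a, feat_b)


if __name__ == "__main__":
    a = torch.randn(4, 512, 28, 28)   # branch A features
    b = torch.randn(4, 512, 28, 28)   # branch B features (computed from erased features)
    fused = max_fusion(a, b)          # keeps the stronger response at each position
    print(fused.shape)                # torch.Size([4, 512, 28, 28])
```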

  • Figure 4

    (Color online) The influence of parameters $\lambda$ and $\tau$ on the CUB dataset. (a) corresponds to the 5-way 1-shot setting, while (b) corresponds to the 5-way 5-shot setting. The proposed TPNet achieves the best performance in both settings when $\lambda = 0.1$ and $\tau = 0.5$.

  • Figure 5

    (Color online) The influence of parameters $\lambda$ and $\tau$ on the miniImageNet dataset. (a) corresponds to the 5-way 1-shot setting, while (b) corresponds to the 5-way 5-shot setting. The proposed TPNet achieves the best performance in both settings when $\lambda = 0.5$ and $\tau = 0.4$.

  • Figure 6

    (Color online) Visualization of the proposed method. Red regions indicate positive evidence that supports the classification result, while blue regions indicate negative evidence that reduces the recognition confidence. From top to bottom: input images; branch A; branch B; the fused features of branches A and B without the TWA module; the fused features of branches A and B with the TWA module. In the fifth row, comprehensive feature representations are obtained by fusing the information of the two branches together with the TWA module. (Best viewed in color.)
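
    The visualization procedure itself is not described in this excerpt. One common way to produce such signed heat maps is a class-activation-map style projection of the classifier weights onto the feature map, sketched below under that assumption; positive values would be rendered red and negative values blue.

```python
# Hedged sketch of a CAM-style signed heat map (the paper's exact visualization may differ).
import torch
import torch.nn.functional as F


def signed_cam(features: torch.Tensor, fc_weight: torch.Tensor, cls: int,
               out_size: int = 224) -> torch.Tensor:
    """features: B x C x H x W, fc_weight: num_classes x C, cls: target class index."""
    w = fc_weight[cls].view(1, -1, 1, 1)                  # 1 x C x 1 x 1
    cam = (features * w).sum(dim=1, keepdim=True)         # B x 1 x H x W, signed map
    cam = F.interpolate(cam, size=(out_size, out_size), mode="bilinear",
                        align_corners=False)
    # Normalize by the largest magnitude so positive (red) and negative (blue)
    # contributions are comparable across images.
    return cam / (cam.abs().amax(dim=(2, 3), keepdim=True) + 1e-6)


if __name__ == "__main__":
    feats = torch.randn(1, 512, 28, 28)
    weights = torch.randn(5, 512)     # 5-way classifier weights (assumption)
    print(signed_cam(feats, weights, cls=0).shape)  # torch.Size([1, 1, 224, 224])
```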

  • Table 1  

    Table 1  Few-shot classification accuracy on the miniImageNet dataset with 95% confidence intervals$^{\rm a)}$

    Model | 1-shot accuracy (%) | 5-shot accuracy (%)
    Meta net [34] | 49.21 $\pm$ 0.96 | –
    Matching net [23] | 46.6 | 60.0
    Prototypical net [22] | 49.42 $\pm$ 0.78 | 68.20 $\pm$ 0.66
    Relation net [35] | 50.44 $\pm$ 0.82 | 65.32 $\pm$ 0.70
    DN4 net [38] | 51.24 $\pm$ 0.74 | 71.02 $\pm$ 0.64
    EGNN+transduction [47] | – | 76.37
    MAML [27] | 48.70 $\pm$ 1.84 | 63.11 $\pm$ 0.92
    MTL [31] | 61.2 $\pm$ 1.8 | 75.5 $\pm$ 0.8
    LR-D2 [21] | 51.9 $\pm$ 0.2 | 68.7 $\pm$ 0.2
    Spot and learn [43] | 51.03 $\pm$ 0.78 | 67.96 $\pm$ 0.71
    Saliency hallucination [40] | 57.45 $\pm$ 0.88 | 72.01 $\pm$ 0.67
    TPNet | 59.31 $\pm$ 0.99 | 79.21 $\pm$ 0.64

    a) Both 5-way 1-shot and 5-way 5-shot experimental settings are taken into consideration. The best results are presented in boldface. “–” indicates that the result was not reported.
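
    The confidence intervals in Tables 1 and 2 are of the usual mean $\pm$ 1.96·std/√N form over test episodes; the snippet below shows that computation. The number of test episodes (600) is an assumption, since the evaluation protocol is not quoted in this excerpt.

```python
# Sketch of the mean accuracy and 95% confidence interval over few-shot test episodes.
import numpy as np


def mean_and_ci95(episode_accuracies):
    """episode_accuracies: 1-D array of per-episode accuracies in percent."""
    acc = np.asarray(episode_accuracies, dtype=np.float64)
    mean = acc.mean()
    ci95 = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))   # normal-approximation 95% CI
    return mean, ci95


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    accs = rng.normal(loc=59.3, scale=12.0, size=600)   # 600 episodes is an assumption
    m, ci = mean_and_ci95(accs)
    print(f"{m:.2f} +/- {ci:.2f}")
```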

  • Table 2  

    Table 2  Few-shot classification accuracy on the CUB dataset with 95% confidence intervals$^{\rm a)}$

    Model | 1-shot accuracy (%) | 5-shot accuracy (%)
    Matching net* [23] | 61.16 $\pm$ 0.89 | 72.86 $\pm$ 0.70
    Prototypical net* [22] | 51.31 $\pm$ 0.91 | 70.77 $\pm$ 0.69
    Relation net* [35] | 62.45 $\pm$ 0.98 | 76.11 $\pm$ 0.69
    DN4-DA net [38] | 53.15 $\pm$ 0.84 | 81.90 $\pm$ 0.60
    MAML* [27] | 55.92 $\pm$ 0.95 | 72.09 $\pm$ 0.76
    Baseline++ [48] | 60.53 $\pm$ 0.83 | 79.34 $\pm$ 0.61
    DeepEMD [49] | 75.65 $\pm$ 0.83 | 88.69 $\pm$ 0.50
    FEAT [50] | 68.87 $\pm$ 0.22 | 82.90 $\pm$ 0.15
    DPGN [51] | 75.71 $\pm$ 0.47 | 91.48 $\pm$ 0.33
    MACO [46] | 60.76 | 74.96
    Multiple-semantics [52] | 76.1 | 82.9
    TPNet | 77.30 $\pm$ 0.86 | 94.20 $\pm$ 0.34

    a) Both 5-way 1-shot and 5-way 5-shot experimental settings are taken into consideration. The best results are presented in boldface. “*” indicates results reported by [48].

  • Table 3  

    Table 3  Comparison of the proposed TPNet model under various configurations on miniImageNet with 95% confidence intervals$^{\rm a)}$

    Model | PCL | EB | TWA | 1-shot accuracy (%) | 5-shot accuracy (%)
    0 | ✗ | ✗ | ✗ | 56.75 $\pm$ 0.89 | 77.22 $\pm$ 0.66
    1 | ✓ | ✗ | ✗ | 56.87 $\pm$ 0.92 | 78.25 $\pm$ 0.64
    2 | ✓ | ✓ | ✗ | 56.92 $\pm$ 0.90 | 78.62 $\pm$ 0.65
    3 | ✓ | ✗ | ✓ | 59.31 $\pm$ 0.99 | 79.21 $\pm$ 0.64
    4 | ✓ | ✓ | ✓ | 58.59 $\pm$ 0.91 | 78.86 $\pm$ 0.64

    a) Both 5-way 1-shot and 5-way 5-shot experimental settings are taken into consideration, and the best results are presented in boldface. “Model 0” is the baseline model. The model achieves the best performance under the third configuration (Model 3).

  • Table 4  

    Table 4  Comparison of the proposed TPNet model under various configurations on CUB with 95% confidence intervals$^{\rm a)}$

    Model | PCL | EB | TWA | 1-shot accuracy (%) | 5-shot accuracy (%)
    0 | ✗ | ✗ | ✗ | 74.81 $\pm$ 0.88 | 92.61 $\pm$ 0.35
    1 | ✓ | ✗ | ✗ | 75.61 $\pm$ 0.90 | 93.60 $\pm$ 0.36
    2 | ✓ | ✓ | ✗ | 75.69 $\pm$ 0.90 | 93.55 $\pm$ 0.36
    3 | ✓ | ✗ | ✓ | 77.30 $\pm$ 0.86 | 94.20 $\pm$ 0.34
    4 | ✓ | ✓ | ✓ | 76.40 $\pm$ 0.86 | 93.85 $\pm$ 0.38

    a) Both 5-way 1-shot and 5-way 5-shot experimental settings are taken into consideration, and the best results are presented in boldface. “Model 0” is the baseline model. The model achieves the best performance under the third configuration (Model 3).