
SCIENCE CHINA Information Sciences, Volume 64, Issue 2: 120101 (2021). https://doi.org/10.1007/s11432-020-3032-8

Triple discriminator generative adversarial network for zero-shot image classification

  • Received: Mar 6, 2020
  • Accepted: Jul 1, 2020
  • Published: Jan 20, 2021

Abstract


Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61771329, 61632018).


References

[1] Fu Y, Xiang T, Jiang Y G. Recent advances in zero-shot recognition: toward data-efficient understanding of visual content. IEEE Signal Process Mag, 2018, 35: 112-125

[2] Guo G, Wang H, Yan Y. Large margin deep embedding for aesthetic image classification. Sci China Inf Sci, 2020, 63: 119101

[3] Zhu X X, Anguelov D, Ramanan D. Capturing long-tail distributions of object subcategories. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014. 915--922

[4] Guo Y C, Ding G G, Han J G, et al. Synthesizing samples for zero-shot learning. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017. 1774--1780

[5] Ji Z, Sun Y, Yu Y. Attribute-guided network for cross-modal zero-shot hashing. IEEE Trans Neural Netw Learning Syst, 2020, 31: 321-330

[6] Long Y, Liu L, Shao L, et al. From zero-shot learning to conventional supervised classification: unseen visual data synthesis. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 1627--1636

[7] Yu Y L, Ji Z, Fu Y W, et al. Stacked semantics-guided attention model for fine-grained zero-shot learning. In: Proceedings of Advances in Neural Information Processing Systems, 2018. 5995--6004

[8] Akata Z, Perronnin F, Harchaoui Z, et al. Label embedding for attribute-based classification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013. 819--826

[9] Akata Z, Perronnin F, Harchaoui Z. Label-embedding for image classification. IEEE Trans Pattern Anal Mach Intell, 2016, 38: 1425-1438

[10] Changpinyo S, Chao W L, Gong B Q, et al. Synthesized classifiers for zero-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 5327--5336

[11] Wang X Y, Ji Q. A unified probabilistic approach modeling relationships between attributes and objects. In: Proceedings of IEEE International Conference on Computer Vision, 2013. 2120--2127

[12] Nguyen T D, Le T, Vu H, et al. Dual discriminator generative adversarial nets. In: Proceedings of Advances in Neural Information Processing Systems, 2017. 2670--2680

[13] Wah C, Branson S, Welinder P, et al. The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology, 2011

[14] van Horn G, Branson S, Farrell R, et al. Building a bird recognition app and large scale dataset with citizen scientists: the fine print in fine-grained dataset collection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015. 595--604

[15] Elhoseiny M, Saleh B, Elgammal A. Write a classifier: zero-shot learning using purely textual descriptions. In: Proceedings of IEEE International Conference on Computer Vision, 2013. 2584--2591

[16] Lei Ba J, Swersky K, Fidler S. Predicting deep zero-shot convolutional neural networks using textual descriptions. In: Proceedings of IEEE International Conference on Computer Vision, 2015. 4247--4255

[17] Reed S, Akata Z, Lee H, et al. Learning deep representations of fine-grained visual descriptions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 49--58

[18] Qiao R Z, Liu L Q, Shen C H, et al. Less is more: zero-shot learning from online textual documents with noise suppression. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 2249--2257

[19] Elhoseiny M, Zhu Y Z, Zhang H, et al. Link the head to the “beak”: zero shot learning from noisy text description at part precision. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017. 6288--6297

[20] Zhu J Y, Park T, Isola P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of IEEE International Conference on Computer Vision, 2017. 2223--2232

[21] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. In: Proceedings of Advances in Neural Information Processing Systems, 2014. 2672--2680

[22] Arjovsky M, Bottou L. Towards principled methods for training generative adversarial networks. arXiv preprint, 2017

[23] Gulrajani I, Ahmed F, Arjovsky M, et al. Improved training of Wasserstein GANs. In: Proceedings of Advances in Neural Information Processing Systems, 2017. 5767--5777

[24] Arjovsky M, Chintala S, Bottou L. Wasserstein generative adversarial networks. In: Proceedings of the 34th International Conference on Machine Learning, 2017. 214--223

[25] Xian Y Q, Lorenz T, Schiele B, et al. Feature generating networks for zero-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 5542--5551

[26] Schonfeld E, Ebrahimi S, Sinha S, et al. Generalized zero- and few-shot learning via aligned variational autoencoders. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019. 8247--8255

[27] Bucher M, Herbin S, Jurie F. Generating visual representations for zero-shot classification. In: Proceedings of IEEE International Conference on Computer Vision, 2017. 2666--2673

[28] Zhu Y Z, Elhoseiny M, Liu B C, et al. A generative adversarial approach for zero-shot learning from noisy texts. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018. 1004--1013

[29] Li Y J, Swersky K, Zemel R. Generative moment matching networks. In: Proceedings of International Conference on Machine Learning, 2015. 1718--1727

[30] Zhang H, Xu T, Elhoseiny M, et al. SPDA-CNN: unifying semantic part detection and abstraction for fine-grained recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 1143--1152

[31] Salton G, Buckley C. Term-weighting approaches in automatic text retrieval. Inf Process Manage, 1988, 24: 513-523

[32] Akata Z, Malinowski M, Fritz M, et al. Multi-cue zero-shot learning with strong supervision. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016. 59--68

[33] Romera-Paredes B, Torr P. An embarrassingly simple approach to zero-shot learning. In: Proceedings of International Conference on Machine Learning, 2015. 2152--2161

[34] Akata Z, Reed S, Walter D, et al. Evaluation of output embeddings for fine-grained image classification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015. 2927--2936

[35] Chao W L, Changpinyo S, Gong B, et al. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In: Proceedings of European Conference on Computer Vision, 2016. 52--68

[36] Ji Z, Xiong K, Pang Y. Video summarization with attention-based encoder-decoder networks. IEEE Trans Circuits Syst Video Technol, 2020, 30: 1709-1717

[37] Wang Z, Liu X, Lin J. Multi-attention based cross-domain beauty product image retrieval. Sci China Inf Sci, 2020, 63: 120112

  • Figure 1

    (Color online) Motivation behind this study. Suppose that the tiger, cat, horse, and lion classes belong to the training set, while the dog class belongs to the test set. The dog's text description is noisy; the discriminative part of the description is marked in red. The purpose of this study is to generate fake images of the corresponding test classes from their semantic textual descriptions. Each fake image is semantically associated with the real images of the seen classes, so that the related information is transferred to the unseen classes.

  • Figure 2

    (Color online) Overall framework of the proposed TDGAN model, where the arrows indicate data flow. It consists of two parts: the D2GAN network and the text reconstruction network. A minimal illustrative sketch of this two-branch design is given after the tables below.

  • Figure 3

    (Color online) Illustration of the D2GAN structure.

  • Figure 4

    (Color online) The proposed text reconstruction module.

  • Figure 5

    (Color online) The text reconstruction network focuses on important textual information, which is marked in red. This information includes colors (pink, black, red, and white), parts (breast, crest, and patch), and textures.

  • Figure 6

    (Color online) Visualization of the outputs of $G_1$ and $G_2$ in our TDGAN model, taking the two birds “pelagic cormorant” and “crested auklet” as examples. We observe that $G_1$ embodies the global information, while $G_2$ captures the fine-grained local information.

  • Figure 7

    (Color online) The seen-unseen precision curves for each algorithm on two benchmark datasets with two split settings. (a) CUB with SCS splitting; (b) CUB with SCE splitting; (c) NAB with SCS splitting; (d) NAB with SCE splitting.

  • Figure 8

    (Color online) The influence of different weight parameters $\lambda_{\rm SM}$ of SM entropy under SCE settings. (a) Synthetic-weight = 0.2; (b) Synthetic-weight = 0.5; (c) Synthetic-weight = 0.8.

  • Table 1  

    Table 1  Main notations

    Symbol                      Meaning
    $N$                         Number of instances
    $s$                         Number of seen categories
    $u$                         Number of unseen categories
    $V$                         Dimensionality of visual space
    $Q$                         Dimensionality of textual space
    $M$                         Dimensionality of noise
    $x \in \mathbb{R}^{V}$      Visual representation vector
    $t \in \mathbb{R}^{Q}$      Textual representation vector
    $y \in \mathbb{R}^{s+u}$    Label vector
    $z \in \mathbb{R}^{M}$      Noise representation vector
    $\theta$                    Generator network parameters
    $\omega_1$, $\omega_2$      Dual discriminator network parameters
  • Table 2  

    Table 2  The settings of CUB and NABirds datasets

    Dataset          SCS-split           SCE-split
                     Train     Test      Train     Test
    NABirds [14]     323       81        323       81
    CUB [13]         150       50        160       40
  • Table 3  

    Table 3  Top-1 average accuracy (%) of traditional ZSC for CUB and NABirds datasets under two split settings$^{\rm a)}$

    Method                     CUB                 NABirds
                               SCS       SCE       SCS       SCE
    MCZSL [32]                 34.7      –         –         –
    WAC$_{\rm linear}$ [15]    27.0      5.0       –         –
    WAC$_{\rm kernel}$ [15]    33.5      7.7       11.4      6.0
    ESZSL [33]                 28.5      7.4       24.3      6.3
    SJE [34]                   29.9      –         –         –
    ZSLNS [18]                 29.1      7.3       24.5      6.8
    SynC$_{\rm fast}$ [10]     28.0      8.6       18.4      3.8
    SynC$_{\rm OVO}$ [10]      12.5      5.9       –         –
    ZSLPP [19]                 37.2      9.7       30.3      8.1
    GAZSL [28]                 43.7      10.3      35.6      8.6
    TDGAN (ours)               44.2      12.5      36.7      9.6

    a)

  • Table 4  

    Table 4  Ablation studies (%) of different components of the method on both CUB and NABirds datasets$^{\rm a)}$

    Method             CUB                 NABirds
                       SCS       SCE       SCS       SCE
    GAZSL [28]         43.7      10.3      35.6      8.6
    D2ZSL              44.0      10.5      36.0      8.7
    TRZSL              43.8      11.5      35.8      9.0
    TDGAN (ours)       44.2      12.5      36.7      9.6
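
To make the two-branch layout described in the captions of Figures 2-4 concrete, the following is a minimal, hypothetical PyTorch-style sketch of a text-conditioned generator paired with two discriminators (a dual-discriminator branch in the spirit of D2GAN [12]) and a decoder that maps synthesized visual features back to the textual space (the text reconstruction branch). The class names, layer sizes, and the dimensions Q, M, and V below are illustrative assumptions rather than the authors' implementation; only the notation follows Table 1.

    # Hypothetical sketch, not the authors' released code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    Q, M, V = 1000, 100, 2048  # assumed text, noise, and visual feature dimensions

    class Generator(nn.Module):
        def __init__(self, q=Q, m=M, v=V, hidden=1024):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(q + m, hidden), nn.LeakyReLU(0.2),
                nn.Linear(hidden, v), nn.ReLU(),
            )

        def forward(self, t, z):
            # Concatenate the text embedding with noise and map it to visual space.
            return self.net(torch.cat([t, z], dim=1))

    class Discriminator(nn.Module):
        def __init__(self, v=V, hidden=1024):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(v, hidden), nn.LeakyReLU(0.2), nn.Linear(hidden, 1),
            )

        def forward(self, x):
            return self.net(x)  # scalar critic score per sample

    class TextDecoder(nn.Module):
        """Maps a (real or synthesized) visual feature back to the text space."""

        def __init__(self, v=V, q=Q, hidden=1024):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(v, hidden), nn.ReLU(), nn.Linear(hidden, q))

        def forward(self, x):
            return self.net(x)

    # One illustrative forward pass with random tensors standing in for real data.
    G, D1, D2, R = Generator(), Discriminator(), Discriminator(), TextDecoder()
    t = torch.randn(8, Q)   # batch of textual representation vectors
    z = torch.randn(8, M)   # noise vectors
    x_fake = G(t, z)        # synthesized visual features, shape (8, V)
    s1, s2 = D1(x_fake), D2(x_fake)        # scores from the two discriminators
    recon_loss = F.mse_loss(R(x_fake), t)  # text reconstruction penalty

In the full model, the two discriminators would be trained with complementary adversarial objectives as in D2GAN [12] (or a Wasserstein-style variant as in [23, 24]), and the reconstruction term would be weighted against the adversarial losses; those training details are omitted from this sketch.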