SCIENTIA SINICA Informationis, Volume 49, Issue 10: 1299-1320(2019) https://doi.org/10.1360/N112018-00312

A decadal survey of zero-shot image classification

  • Received: Mar 5, 2019
  • Accepted: Jun 3, 2019
  • Published: Oct 16, 2019

Abstract

Zero-shot image classification refers to learning a visual classifier for categories that have no training examples. It can effectively address problems in which labeled data for some classes are absent, and it has therefore attracted considerable attention in recent years. Approximately a decade has passed since the technique was first proposed. This paper systematically summarizes the research progress in this field over that decade. First, we introduce the significance and practical application value of zero-shot image classification. Next, we summarize the development of the research and the typical approaches in detail. We then comprehensively review existing datasets and evaluation metrics, together with the relationship between zero-shot image classification and related techniques. Finally, we analyze the research hot spots and open challenges that require further study and highlight future trends in this area.


Funded by

National Natural Science Foundation of China (61171329, 61632018)


References

[1] Larochelle H, Erhan D, Bengio Y. Zero-data learning of new tasks. In: Proceedings of AAAI Conference on Artificial Intelligence, Chicago, 2008. 646-651.

[2] Palatucci M, Pomerleau D, Hinton G E. Zero-shot learning with semantic output codes. In: Proceedings of Advances in Neural Information Processing Systems, Vancouver, 2009. 1410-1418.

[3] Lampert C H, Nickisch H, Harmeling S. Learning to detect unseen object classes by between-class attribute transfer. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Miami, 2009. 951-958.

[4] Rohrbach M, Stark M, Schiele B. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Colorado, 2011. 1641-1648.

[5] Habibian A, Mensink T, Snoek C G M. Video2vec Embeddings Recognize Events When Examples Are Scarce. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 2089-2103.

[6] Wang R G, Ding K, Yang J, et al. Image classification based on bag of visual words model with triangle constraint. Ruan Jian Xue Bao/Journal of Software, 2017, 28(7): 1847-1861 (in Chinese). DOI: 10.13328/j.cnki.jos.005069.

[7] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems, Lake Tahoe, 2012. 1097-1105.

[8] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521: 436-444.

[9] Biederman I. Recognition-by-components: a theory of human image understanding. Psychological Rev, 1987, 94: 115-147.

[10] Fei-Fei L, Fergus R, Perona P. A Bayesian approach to unsupervised one-shot learning of object categories. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Madison, 2003. 1134-1141.

[11] Ahsan U, Sun C, Hays J, et al. Complex event recognition from images with few training examples. In: Proceedings of IEEE Winter Conference on Applications of Computer Vision, Santa Rosa, 2017. 669-678.

[12] Socher R, Ganjoo M, Bastani H, et al. Zero-shot learning through cross-modal transfer. In: Proceedings of Advances in Neural Information Processing Systems, Lake Tahoe, 2013. 935-943.

[13] Ji Z, Yu Y L, Pang Y W. Manifold regularized cross-modal embedding for zero-shot learning. Inf Sci, 2017, 378: 48-58.

[14] Elliott D, Kiela D, Lazaridou A. Multimodal learning and reasoning. In: Proceedings of Annual Meeting of the Association for Computational Linguistics, Berlin, 2016.

[15] Zhang Y, Gong B, Shah M. Fast zero-shot image tagging. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 5985-5994.

[16] Yang Y, Luo Y D, Chen W L, et al. Zero-shot hashing via transferring supervised knowledge. In: Proceedings of ACM International Conference on Multimedia, Amsterdam, 2016. 1286-1295.

[17] Guo Y C, Ding G G, Han J G, et al. SitNet: discrete similarity transfer network for zero-shot hashing. In: Proceedings of International Joint Conference on Artificial Intelligence, Melbourne, 2017. 1767-1773.

[18] Liu W, Mei T, Zhang Y D, et al. Multi-task deep visual-semantic embedding for video thumbnail selection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 3707-3715.

[19] Xu B H, Fu Y W, Jiang Y G. Heterogeneous Knowledge Transfer in Video Emotion Recognition, Attribution and Summarization. IEEE Trans Affective Comput, 2018, 9: 255-270.

[20] Wang Z, Hu R M, Liang C. Zero-Shot Person Re-identification via Cross-View Consistency. IEEE Trans Multimedia, 2016, 18: 260-272.

[21] Teney D, Hengel A V D. Zero-shot visual question answering. arXiv preprint, 2016.

[22] Wang H, Liang X D, Zhang H, et al. ZM-Net: real-time zero-shot image manipulation network. arXiv preprint, 2017.

[23] Bansal A, Sikka K, Sharma G, et al. Zero-Shot Object Detection. arXiv preprint, 2018.

[24] Dauphin Y N, Tur G, Hakkani-Tür D, et al. Zero-shot learning for semantic utterance classification. In: Proceedings of International Conference on Learning Representations, Banff, 2014.

[25] Johnson M, Schuster M, Le Q V. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Trans Association Comput Linguistics, 2017, 5: 339-351.

[26] Shao L, Zhu F, Li X L. Transfer learning for visual categorization: a survey. IEEE Trans Neural Netw Learning Syst, 2015, 26: 1019-1034.

[27] Pan S J, Yang Q. A Survey on Transfer Learning. IEEE Trans Knowl Data Eng, 2010, 22: 1345-1359.

[28] Patel V M, Gopalan R, Li R. Visual Domain Adaptation: A survey of recent advances. IEEE Signal Process Mag, 2015, 32: 53-69.

[29] Deng L, Seltzer M L, Yu D, et al. Binary coding of speech spectrograms using a deep auto-encoder. In: Proceedings of Annual Conference of the International Speech Communication Association, Makuhari, 2010. 1692-1695.

[30] Boulanger-Lewandowski N, Bengio Y, Vincent P. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In: Proceedings of International Conference on Machine Learning, Edinburgh, 2012.

[31] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. In: Proceedings of Advances in Neural Information Processing Systems, Lake Tahoe, 2013. 3111-3119.

[32] Tzeng E, Hoffman J, Saenko K, et al. Adversarial discriminative domain adaptation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 2962-2971.

[33] Sun Y, Chen Y H, Wang X G, et al. Deep learning face representation by joint identification-verification. In: Proceedings of Advances in Neural Information Processing Systems, Montreal, 2014. 1988-1996.

[34] Wu P C, Hoi S C H, Xia H, et al. Online multimodal deep similarity learning with application to image retrieval. In: Proceedings of ACM International Conference on Multimedia, Barcelona, 2013. 153-162.

[35] Li H X, Li Y, Porikli F. DeepTrack: Learning Discriminative Feature Representations Online for Robust Visual Tracking. IEEE Trans Image Process, 2016, 25: 1834-1848.

[36] Liong V E, Lu J W, Tan Y P. Deep Coupled Metric Learning for Cross-Modal Matching. IEEE Trans Multimedia, 2017, 19: 1234-1244.

[37] Xian Y Q, Schiele B, Akata Z. Zero-shot learning - the good, the bad and the ugly. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 3077-3086.

[38] Fu Y W, Xiang T, Jiang Y G. Recent Advances in Zero-Shot Recognition: Toward Data-Efficient Understanding of Visual Content. IEEE Signal Process Mag, 2018, 35: 112-125.

[39] Wang W, Zheng V W, Yu H. A Survey of Zero-Shot Learning. ACM Trans Intell Syst Technol, 2019, 10: 1-37.

[40] Farhadi A, Endres I, Hoiem D, et al. Describing objects by their attributes. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Miami, 2009. 1778-1785.

[41] Yu F X, Cao L L, Feris R S, et al. Designing category-level attributes for discriminative visual recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Portland, 2013. 771-778.

[42] Parikh D, Grauman K. Relative attributes. In: Proceedings of IEEE International Conference on Computer Vision, Barcelona, 2011. 503-510.

[43] Sorokin A, Forsyth D. Utility data annotation with Amazon Mechanical Turk. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, 2008. 1-8.

[44] Liu J G, Kuipers B, Savarese S. Recognizing human actions by attributes. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Colorado, 2011. 3337-3344.

[45] Frome A, Corrado G S, Shlens J, et al. DeViSE: A deep visual-semantic embedding model. In: Proceedings of Advances in Neural Information Processing Systems, Lake Tahoe, 2013. 2121-2129.

[46] Akata Z, Reed S, Walter D, et al. Evaluation of output embeddings for fine-grained image classification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 2927-2936.

[47] Fu Y W, Hospedales T M, Xiang T. Transductive multi-view zero-shot learning. IEEE Trans Pattern Anal Mach Intell, 2015, 37: 2332-2345.

[48] Wang X S, Chen C, Cheng Y H. Zero-Shot Image Classification Based on Deep Feature Extraction. IEEE Trans Cogn Dev Syst, 2018, 10: 432-444.

[49] Pennington J, Socher R, Manning C D. GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, 2014. 1532-1543.

[50] Reed S, Akata Z, Lee H, et al. Learning deep representations of fine-grained visual descriptions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 49-58.

[51] Karessli N, Akata Z, Schiele B, et al. Gaze embeddings for zero-shot image classification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 6412-6421.

[52] Lampert C H, Nickisch H, Harmeling S. Attribute-based classification for zero-shot visual object categorization. IEEE Trans Pattern Anal Mach Intell, 2014, 36: 453-465.

[53] Jayaraman D, Grauman K. Zero-shot recognition with unreliable attributes. In: Proceedings of International Conference on Neural Information Processing Systems, Montreal, 2014. 3464-3472.

[54] Salton G, Buckley C. Term-weighting approaches in automatic text retrieval. Inf Processing Manage, 1988, 24: 513-523.

[55] Wah C, Branson S, Welinder P, et al. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

[56] Fu Y W, Sigal L. Semi-supervised vocabulary-informed learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 5337-5346.

[57] Lazaridou A, Bruni E, Baroni M. Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world. In: Proceedings of Annual Meeting of the Association for Computational Linguistics, Baltimore, 2014. 1403-1414.

[58] Fu Y W, Hospedales T M, Xiang T. Learning multimodal latent attributes. IEEE Trans Pattern Anal Mach Intell, 2014, 36: 303-316.

[59] Huang S, Elhoseiny M, Elgammal A, et al. Learning hypergraph-regularized attribute predictors. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 409-417.

[60] Jayaraman D, Sha F, Grauman K. Decorrelating semantic visual attributes by resisting the urge to share. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Columbus, 2014. 1629-1636.

[61] Akata Z, Perronnin F, Harchaoui Z, et al. Label-embedding for attribute-based classification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Portland, 2013. 819-826.

[62] Norouzi M, Mikolov T, Bengio S, et al. Zero-shot learning by convex combination of semantic embeddings. In: Proceedings of International Conference on Learning Representations, Banff, 2014.

[63] Gan C, Lin M, Yang Y, et al. Exploring semantic inter-class relationships (SIR) for zero-shot action recognition. In: Proceedings of AAAI Conference on Artificial Intelligence, Austin, 2015. 3769-3775.

[64] Xu X, Hospedales T, Gong S G. Semantic embedding space for zero-shot action recognition. In: Proceedings of IEEE International Conference on Image Processing, Quebec City, 2015. 63-67.

[65] Xian Y Q, Akata Z, Sharma G, et al. Latent Embeddings for Zero-Shot Classification. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 69-77.

[66] Fu Z Y, Xiang T A, Kodirov E, et al. Zero-shot object recognition by semantic manifold distance. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Boston, 2015. 2635-2644.

[67] Fu Y W, Yang Y X, Hospedales T, et al. Transductive multi-label zero-shot learning. In: Proceedings of British Machine Vision Conference, Swansea, 2015. 37: 2332-2345.

[68] Zhang L, Xiang T, Gong S G. Learning a deep embedding model for zero-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 3010-3019.

[69] Yu Y, Ji Z, Li X. Transductive Zero-Shot Learning With a Self-Training Dictionary Approach. IEEE Trans Cybern, 2018, 48: 2908-2919.

[70] Shojaee S M, Baghshah M. Semi-supervised zero-shot learning by a clustering-based approach. arXiv preprint, 2016.

[71] Shigeto Y, Suzuki I, Hara K, et al. Ridge regression, hubness, and zero-shot learning. In: Proceedings of Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Porto, 2015. 135-151.

[72] Changpinyo S, Chao W L, Gong B Q, et al. Synthesized classifiers for zero-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 5327-5336.

[73] Romera-Paredes B, Torr P H S. An embarrassingly simple approach to zero-shot learning. In: Proceedings of International Conference on Machine Learning, Lille, 2015. 2152-2161.

[74] Guo Y C, Ding G G, Jin X M, et al. Transductive zero-shot recognition via shared model space learning. In: Proceedings of AAAI Conference on Artificial Intelligence, Phoenix, 2016. 3-8.

[75] Yang Y X, Hospedales T. A unified perspective on multi-domain and multi-task learning. In: Proceedings of International Conference on Learning Representations, San Diego, 2015. 1-9.

[76] Ba J, Swersky K, Fidler S, et al. Predicting deep zero-shot convolutional neural networks using textual descriptions. In: Proceedings of IEEE International Conference on Computer Vision, Santiago, 2015. 4247-4255.

[77] Xian Y Q, Lorenz T, Schiele B, et al. Feature generating networks for zero-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 5542-5551.

[78] Long Y, Liu L, Shao L, et al. From zero-shot learning to conventional supervised classification: Unseen visual data synthesis. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 1627-1636.

[79] Long Y, Liu L, Shen F. Zero-Shot Learning Using Synthesised Unseen Visual Data with Diffusion Regularisation. IEEE Trans Pattern Anal Mach Intell, 2018, 40: 2498-2512.

[80] Zhu Y Z, Elhoseiny M, Liu B C, et al. A generative adversarial approach for zero-shot learning from noisy texts. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 1004-1013.

[81] Verma V K, Arora G, Mishra A, et al. Generalized zero-shot learning via synthesized examples. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 4281-4289.

[82] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. In: Proceedings of Advances in Neural Information Processing Systems, Montreal, 2014. 2672-2680.

[83] Felix R, Kumar V B G, Reid I, et al. Multi-modal cycle-consistent generalized zero-shot learning. In: Proceedings of European Conference on Computer Vision, Munich, 2018. 21-37.

[84] Shen T X, Lei T, Barzilay R, et al. Style transfer from non-parallel text by cross-alignment. In: Proceedings of Advances in Neural Information Processing Systems, Long Beach, 2017. 6830-6841.

[85] Liu M Y, Breuel T, Kautz J. Unsupervised image-to-image translation networks. In: Proceedings of Advances in Neural Information Processing Systems, Long Beach, 2017. 700-708.

[86] Patterson G, Hays J. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Providence, 2012. 2751-2758.

[87] Rohrbach M, Stark M, Szarvas G, et al. What helps where - and why? Semantic relatedness for knowledge transfer. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, 2010. 910-917.

[88] Zhang L, Wang P, Liu L, et al. Towards Effective Deep Embedding for Zero-Shot Learning. arXiv preprint, 2018.

[89] Arora G D, Verma V K, Mishra A, et al. Generalized zero-shot learning via synthesized examples. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 2018. 4281-4289.

[90] Wei Y C, Zhao Y, Lu C Y. Cross-Modal Retrieval With CNN Visual Features: A New Baseline. IEEE Trans Cybern, 2016: 1-12.

[91] Markou M, Singh S. Novelty detection: a review - part 1: statistical approaches. Signal Processing, 2003, 83: 2481-2497.

[92] Zhai S F, Chen Y, Lu W N, et al. Deep structured energy based models for anomaly detection. In: Proceedings of International Conference on Machine Learning, New York City, 2016. 19-24.

[93] Sharmanska V, Quadrianto N, Lampert C. Augmented attribute representations. In: Proceedings of European Conference on Computer Vision, Florence, 2012. 242-255.

[94] Kodirov E, Xiang T, Fu Z Y, et al. Unsupervised domain adaptation for zero-shot learning. In: Proceedings of IEEE International Conference on Computer Vision, Santiago, 2015. 2452-2460.

[95] Li Y, Wang D H, Hu H H, et al. Zero-shot recognition using dual visual-semantic mapping paths. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 5207-5215.

[96] Changpinyo S, Chao W L, Gong B, et al. Synthesized classifiers for zero-shot learning. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 5327-5336.

[97] Lazaridou A, Dinu G, Baroni M. Hubness and pollution: delving into cross-space mapping for zero-shot learning. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, 2015. 270-280.

[98] Dinu G, Lazaridou A, Baroni M. Improving zero-shot learning by mitigating the hubness problem. In: Proceedings of International Conference on Learning Representations, San Diego, 2015. 1-10.

[99] Low T, Borgelt C, Stober S, et al. The hubness phenomenon: fact or artifact. Towards Advanced Data Analysis by Combining Soft Computing and Statistics, Berlin, 2013. 267-278.

[100] Elhoseiny M, Liu J, Cheng H, et al. Zero-shot event detection by multimodal distributional semantic embedding of videos. In: Proceedings of AAAI Conference on Artificial Intelligence, Phoenix, 2016. 3478-3486.

[101] Chao W L, Changpinyo S, Gong B Q, et al. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In: Proceedings of European Conference on Computer Vision, Amsterdam, 2016. 52-68.

[102] Scheirer W J, de Rezende Rocha A, Sapkota A. Toward open set recognition. IEEE Trans Pattern Anal Mach Intell, 2013, 35: 1757-1772.

[103] Scheirer W J, Jain L P, Boult T E. Probability Models for Open Set Recognition. IEEE Trans Pattern Anal Mach Intell, 2014, 36: 2317-2324.

[104] Jain L P, Scheirer W J, Boult T E. Multi-class open set recognition using probability of inclusion. In: Proceedings of European Conference on Computer Vision, Zurich, 2014. 393-409.

[105] Bendale A, Boult T. Towards open set deep networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 2016. 1563-1572.

[106] Zhao B, Wu B T, Wu T F, et al. Zero-shot learning via revealing data distribution. arXiv preprint, 2017.

[107] Rudd E M, Jain L P, Scheirer W J. The Extreme Value Machine. IEEE Trans Pattern Anal Mach Intell, 2018, 40: 762-768.

[108] Fu Y W, Dong H Z, Ma Y, et al. Vocabulary-informed extreme value learning. arXiv preprint, 2017.

[109] Gan C, Yang Y, Zhu L C. Recognizing an Action Using Its Name: A Knowledge-Based Approach. Int J Comput Vis, 2016, 120: 61-77.

[110] Tsai Y H H, Huang L K, Salakhutdinov R. Learning robust visual-semantic embeddings. In: Proceedings of IEEE International Conference on Computer Vision, Venice, 2017. 3591-3600.

[111] Isele D, Rostami M, Eaton E. Using task features for zero-shot knowledge transfer in lifelong learning. In: Proceedings of International Joint Conference on Artificial Intelligence, New York, 2016. 1620-1626.

[112] Mnih V, Kavukcuoglu K, Silver D. Human-level control through deep reinforcement learning. Nature, 2015, 518: 529-533.

[113] Oh J, Singh S, Lee H, et al. Zero-shot task generalization with multi-task deep reinforcement learning. In: Proceedings of International Conference on Machine Learning, Sydney, 2017. 2661-2670.

[114] Higgins I, Pal A, Rusu A A, et al. DARLA: Improving zero-shot transfer in reinforcement learning. In: Proceedings of International Conference on Machine Learning, Sydney, 2017. 1480-1490.

[115] Liang X D, Lee L S Y, Xing E P. Deep variation-structured reinforcement learning for visual relationship and attribute detection. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, 2017. 848-857.

[116] Over P, Fiscus J, Sanders G, et al. TRECVID 2014: an overview of the goals, tasks, data, evaluation mechanisms and metrics. In: Proceedings of TRECVID, Orlando, 2014. 52.

[117] Chang X J, Yang Y, Hauptmann A G, et al. Semantic concept discovery for large-scale zero-shot event detection. In: Proceedings of International Joint Conference on Artificial Intelligence, Buenos Aires, 2015. 2234-2240.

[118] Xu X, Hospedales T, Gong S G. Transductive Zero-Shot Action Recognition by Word-Vector Embedding. Int J Comput Vis, 2017, 123: 309-333.

  • Figure 1

    (Color online) Difference between zero-shot classification and traditional classification task. (a) Zero-shot classification; (b) traditional object classification.

  • Figure 2

    (Color online) The development and trend of object classification

  • Figure 3

    (Color online) Graphical representation of (a) DAP and (b) IAP [3]

  • Figure 4

    The comparison between different categories of zero-shot image classification approaches

  • Figure 5

    (Color online) The technical frameworks of various categories of zero-shot image classification approaches. (a) Direct semantic predicting based; (b) embedding based; (c) visual data generation based.

  • Figure 6

    (Color online) Example images in AwA dataset [3]

  • Figure 7

    (Color online) An illustration of the domain shift problem in zero-shot image classification. (a) Visual space; (b) attribute space.

  • Table 1   The different kinds of auxiliary information adopted in zero-shot image classification
    Auxiliary information | Advantages | Disadvantages
    Human-defined (attribute based; non-attribute based) | High accuracy; strong interpretability | High cost for designing attributes; strong subjectivity
    Learning-based (label embedding based; textual embedding based) | Free of human annotation; more natural | Weak interpretability; influenced by noise
  • Table 2   Popular datasets in zero-shot image classification
    Dataset | Number of classes | Number of instances | Number of attributes | Annotation level | SoA
    AwA | 50 | 30475 | 85 | Per class | 85.3 [88]
    aPY | 32 | 15339 | 64 | Per image | 39.8 [45]
    CUB-200-2011 | 200 | 11788 | 312 | Per image | 67.8 [88]
    SUN-attribute | 717 | 14340 | 102 | Per image | 62.4 [88]
    ImageNet | 22000 | 15000000 | - | Per class | 25.4 [89]
  • Table 3   Difference between zero-shot classification and four related techniques
    Property | Cross-modal learning | Domain adaptation | One-shot learning | Anomaly detection | Zero-shot classification
    Cross-domain | $\times$ | $\surd$ | $\times$ | $\surd$ | $\surd$
    Cross-modal | $\surd$ | $\times$ | $\times$ | $\times$ | $\surd$
    Cross-class | $\times$ | $\times$ | $\times$ | $\surd$ | $\surd$
