SCIENTIA SINICA Informationis, Volume 47, Issue 8: 953(2017) https://doi.org/10.1360/N112017-00125

Survey of evaluation methods for dialogue systems

More info
  • Received Apr 6, 2017
  • Accepted Jun 21, 2017
  • Published Jul 24, 2017


This paper reviews the history of dialogue systems and the methods used to evaluate them. Evaluation methods are divided into two groups according to the type of system they target: task-oriented dialogue systems and open-domain dialogue systems. For each group, the paper surveys and summarizes the existing evaluation methods, analyzes their strengths and weaknesses, discusses what each method emphasizes, and presents recent research progress. For task-oriented dialogue systems, it introduces the recent research results of Steve Young and summarizes several widely used evaluation approaches. For open-domain chat systems, evaluation methods are examined from two angles, objective metrics and simulated human scoring, and the individual metrics and the research ideas behind them are analyzed. Finally, by summarizing and analyzing both classical evaluation methods and newer ones based on neural network models, the paper attempts to predict development trends in the evaluation of dialogue systems.

[1] Turing A M. Computing machinery and intelligence. Mind, 1950, 59: 433-460.

[2] Walker M A, Litman D J, Kamm C A, et al. PARADISE: a framework for evaluating spoken dialogue agents. In: Proceedings of the 8th Conference on European Chapter of the Association for Computational Linguistics, Madrid, 1997. 271-280.

[3] Rieser V, Lemon O. Learning and evaluation of dialogue strategies for new applications: empirical methods for optimization from small data sets. Computat Linguist, 2011, 37: 153-196.

[4] Larsen L B. Issues in the evaluation of spoken dialogue systems using objective and subjective measures. In: Proceedings of 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, St Thomas, 2003. 209-214.

[5] Yang Z J, Levow G, Meng H. Predicting user satisfaction in spoken dialog system evaluation with collaborative filtering. IEEE J Select Topics Signal Proc, 2012, 6: 971-981.

[6] Asri L E, Laroche R, Pietquin O. Task completion transfer learning for reward inference. In: Proceedings of AAAI Workshop on Machine Learning for Interactive Systems, Quebec, 2014. 38-43.

[7] Ultes S, Minker W. Quality-adaptive spoken dialogue initiative selection and implications on reward modelling. In: Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Prague, 2015. 374-383.

[8] Su P H, Gašić M, Mrkšić N, et al. On-line active reward learning for policy optimisation in spoken dialogue systems. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, 2016. 2431-2441.

[9] Young S, Gašić M, Thomson B. POMDP-based statistical spoken dialogue systems: a review. Proc IEEE, 2012, 101: 1160-1179.

[10] Hirschman L, Thompson H S. Overview of evaluation in speech and natural language processing. In: Survey of the State of the Art in Human Language Technology. New York: Cambridge University Press, 1997. 409-414.

[11] Watanabe T, Araki M, Doshita S. Evaluating dialogue strategies under communication errors using computer-to-computer simulation. IEICE Trans Inform Syst, 1998, E81-D: 1025-1033.

[12] Ai H, Weng F. User simulation as testing for spoken dialog systems. In: Proceedings of Annual Meeting of the Special Interest Group on Discourse and Dialogue, Columbus, 2008. 164-171.

[13] Schatzmann J. Statistical user and error modelling for spoken dialogue systems. Dissertation for Ph.D. Degree. Cambridge: University of Cambridge, 2008.

[14] Williams J. Applying POMDPs to dialog systems in the troubleshooting domain. In: Proceedings of the HLT/NAACL Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technology, New York, 2007. 1-8.

[15] Thomson B, Young S. Bayesian update of dialogue state: a POMDP framework for spoken dialogue systems. Comput Speech Language, 2010, 24: 562-588.

[16] Henderson J, Lemon O, Georgila K. Hybrid reinforcement supervised learning for dialogue policies from communicator data. In: Proceedings of the IJCAI Workshop on Knowledge and Reasoning in Practical Dialog Systems, Edinburgh, 2005. 68-75.

[17] Gašić M, Lefevre F, Jurčíček F, et al. Back-off action selection in summary space-based POMDP dialogue systems. In: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, Merano, 2009. 456-461.

[18] Jurčíček F, Keizer S, Gašić M, et al. Real user evaluation of spoken dialogue systems using Amazon Mechanical Turk. In: Proceedings of Interspeech Conference, Florence, 2011. 3061-3064.

[19] McGraw I, Lee C, Hetherington L, et al. Collecting voices from the cloud. In: Proceedings of the International Conference on Language Resources and Evaluation, Malta, 2010. 1576-1583.

[20] Gašić M, Jurčíček F, Thomson B, et al. On-line policy optimisation of spoken dialogue systems via live interaction with human subjects. In: Proceedings of IEEE Workshop on Automatic Speech Recognition & Understanding, Hawaii, 2011. 312-317.

[21] Black A, Burger S, Conkie A, et al. Spoken dialog challenge 2010: comparison of live and control test results. In: Proceedings of Annual Meeting of the Special Interest Group on Discourse and Dialogue, Portland, 2011. 2-7.

[22] Papineni K, Roukos S, Ward T, et al. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, 2002. 311-318.

[23] Banerjee S, Lavie A. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, 2005.

[24] Lin C Y. ROUGE: a package for automatic evaluation of summaries. In: Proceedings of the Workshop on Text Summarization Branches Out, Barcelona, 2004. 25-26.

[25] Rus V, Lintean M. A comparison of greedy and optimal assessment of natural language student input using word-to-word similarity metrics. In: Proceedings of the 7th Workshop on Building Educational Applications Using NLP, Stroudsburg, 2012. 157-162.

[26] Wieting J, Bansal M, Gimpel K, et al. Towards universal paraphrastic sentence embeddings. arXiv: 1511.08198.

[27] Forgues G, Pineau J, Larcheveque J M, et al. Bootstrapping dialog systems with word embeddings. In: Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue, Cambridge, 2004.

[28] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. In: Proceedings of International Conference on Neural Information Processing Systems, Lake Tahoe, 2013. 3111-3119.

[29] Charlin L, Pineau J. How NOT to evaluate your dialogue system: an empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv: 1603.08023v2.

[30] Kannan A, Vinyals O. Adversarial evaluation of dialogue models. arXiv: 1701.08198v1.

[31] Lowe R, Noseworthy M, Serban I V, et al. Towards an automatic Turing test: learning to evaluate dialogue responses. 2017. In press.

[32] Lowe R, Serban I V, Noseworthy M, et al. On the evaluation of dialogue systems with next utterance classification. In: Proceedings of Annual Meeting of the Special Interest Group on Discourse and Dialogue, Los Angeles, 2016.

[33] Lowe R, Pow N, Serban I V, et al. The Ubuntu dialogue corpus: a large dataset for research in unstructured multi-turn dialogue systems. In: Proceedings of Annual Meeting of the Special Interest Group on Discourse and Dialogue, Prague, 2015.

[34] Ritter A, Cherry C, Dolan B. Unsupervised modeling of Twitter conversations. In: Proceedings of Annual Conference on North American Chapter of the Association for Computational Linguistics (NAACL), Los Angeles, 2010. 172-180.

[35] Banchs R E. Movie-DiC: a movie dialogue corpus for research and development. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Cincinnati, 2012.
