SCIENCE CHINA Information Sciences, Volume 59, Issue 9: 092102(2016) https://doi.org/10.1007/s11432-015-0906-9

A novel unsupervised method for new word extraction

  • Received Nov 8, 2015
  • Accepted Jan 6, 2016
  • Published Aug 11, 2016

Abstract

New words can benefit many NLP tasks, such as sentence chunking and sentiment analysis. However, automatic new word extraction is challenging because new words usually follow no fixed language pattern and may even be existing words used with new meanings. To tackle these problems, this paper proposes a novel method for extracting new words. It considers domain specificity and combines it with multiple kinds of statistical language knowledge. First, we apply a filtering algorithm to obtain a candidate list of new words. Then, we use the statistical language knowledge to extract the top-ranked new words. Experimental results show that the proposed method extracts a large number of new words from both Chinese and English corpora and notably outperforms state-of-the-art methods. Moreover, we demonstrate that the method increases the accuracy of Chinese word segmentation by 10% on a corpus containing new words.
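The two-stage pipeline described in the abstract (candidate filtering, then statistical ranking) can be illustrated with a minimal sketch. The abstract does not name the filtering rule or the statistics used, so everything here is an assumption made for illustration: a frequency threshold plus a known-word lexicon as the filter, pointwise mutual information as the ranking statistic, and the hypothetical function names `candidate_bigrams`, `rank_by_pmi`, and `extract_new_words`. It is not the authors' method.

```python
import math
from collections import Counter


def candidate_bigrams(tokens, lexicon, min_count=2):
    """Stage 1 (assumed filter): keep adjacent token pairs that occur at
    least min_count times and whose concatenation is not a known word."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return {
        pair: count
        for pair, count in pairs.items()
        if count >= min_count and "".join(pair) not in lexicon
    }


def rank_by_pmi(tokens, candidates):
    """Stage 2 (assumed statistic): score each candidate by pointwise
    mutual information, log(p(x, y) / (p(x) * p(y))), highest first."""
    unigrams = Counter(tokens)
    n_tokens = len(tokens)
    n_pairs = max(n_tokens - 1, 1)
    scored = []
    for (x, y), count in candidates.items():
        p_xy = count / n_pairs
        p_x = unigrams[x] / n_tokens
        p_y = unigrams[y] / n_tokens
        scored.append((x + y, math.log(p_xy / (p_x * p_y))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


def extract_new_words(tokens, lexicon, min_count=2, top_k=10):
    """Run both stages and return the top_k (word, score) pairs."""
    candidates = candidate_bigrams(tokens, lexicon, min_count)
    return rank_by_pmi(tokens, candidates)[:top_k]
```

Here `tokens` is a list of single characters, as would suit unsegmented Chinese text; for English, the tokens would be words and the concatenation step would need a separator. Domain specificity, which the abstract also mentions, is not modeled in this sketch.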


Funded by

National Natural Science Foundation of China (61201351)

National High Technology Research and Development Program of China (863 Program) (2015AA015404)

National Natural Science Foundation of China (61402036)

State Key Program of National Natural Science of China (61132009)


Acknowledgment

This work was supported by the State Key Program of National Natural Science of China (Grant No. 61132009), the National High Technology Research and Development Program of China (863 Program) (Grant No. 2015AA015404), and the National Natural Science Foundation of China (Grant Nos. 61201351, 61402036).

