SCIENTIA SINICA Informationis, Volume 48, Issue 11: 1546-1557(2018) https://doi.org/10.1360/N112018-00210

Disease name recognition based on syntactic and semantic features

  • Received: Aug 10, 2018
  • Accepted: Sep 19, 2018
  • Published: Nov 14, 2018


Biomedical entity recognition (covering genes, proteins, chemicals, diseases, and similar entity types) is the foundation of biomedical text mining and plays a significant role in extracting relations between biomedical entities and in constructing biomedical knowledge bases. To address shortcomings of current disease name recognition systems, this paper proposes a set of new syntactic and semantic features. The syntactic features comprise chunk and dependency information; the semantic features comprise the abbreviation form of a disease name, its dictionary entry form, and hyponymy relationships between disease concepts. Experiments on the NCBI disease corpus show that a CRF model combined with these syntactic and semantic features significantly improves on the state-of-the-art performance of disease entity recognition, achieving an F1 score of 85.3%.
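
The dictionary entry feature mentioned above can be pictured with a minimal sketch. It assumes a greedy longest-match scan over the tokens, which is an illustrative choice and not necessarily the matching strategy used in the paper. The function name `sbieo_dictionary_tags` and the two-entry toy dictionary (standing in for MEDIC) are hypothetical; the example sentence and the `S_`/`B_`/`E_` tag names follow Table 7.

```python
from typing import List, Set, Tuple


def sbieo_dictionary_tags(
    tokens: List[str],
    names: Set[Tuple[str, ...]],
    label: str = "Disease",
) -> List[str]:
    """Tag tokens with S/B/I/E/O by greedy longest match against a name dictionary.

    Assumption: exact, case-sensitive token matching; real systems usually
    normalize case and punctuation first.
    """
    max_len = max((len(name) for name in names), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shrink.
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + span]) in names:
                matched = span
                break
        if matched == 0:
            i += 1
            continue
        if matched == 1:
            tags[i] = f"S_{label}"  # single-token entry
        else:
            tags[i] = f"B_{label}"  # first token of the entry
            for j in range(i + 1, i + matched - 1):
                tags[j] = f"I_{label}"  # interior tokens
            tags[i + matched - 1] = f"E_{label}"  # last token of the entry
        i += matched
    return tags


# Toy dictionary standing in for MEDIC (hypothetical entries for illustration).
TOY_DICTIONARY = {("AS",), ("Rheumatic", "Disease")}

sentence = (
    "Twins with AS were identified from the Royal National Hospital "
    "for Rheumatic Disease database"
).split()
tags = sbieo_dictionary_tags(sentence, TOY_DICTIONARY)
for token, tag in zip(sentence, tags):
    print(token, tag)
```

Each tag would then be supplied to the CRF as one more per-token feature alongside the lexical and syntactic ones.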


[1] Song M, Yu H, Han W S. Developing a hybrid dictionary-based bio-entity recognition technique. BMC Med Inform Decis Mak, 2015, 15: S9

[2] McCray A T, Srinivasan S, Browne A C. Lexical methods for managing variation in biomedical terminologies. In: Proceedings of the Annual Symposium on Computer Application in Medical Care, 1994. 235

[3] Bunescu R, Ge R, Kate R J. Comparative experiments on learning information extractors for proteins and their interactions. Artificial Intelligence Med, 2005, 33: 139-155

[4] Wang H, Zhao T, Tan H, et al. Biomedical Named Entity Recognition Based on Classifiers Ensemble. Int J Comput Sci Appl, 2008, 5: 1-11

[5] Leaman R, Wei C H, Lu Z. tmChem: a high performance approach for chemical named entity recognition and normalization. J Cheminf, 2015, 7: S3

[6] Wei C H, Kao H Y, Lu Z. GNormPlus: an integrative approach for tagging genes, gene families, and protein domains. Biomed Res Int, 2015, 2015: 918710

[7] Cortes C, Vapnik V. Support-vector networks. Machine Learn, 1995, 20: 273-297

[8] Rabiner L R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE, 1989, 77: 257-286

[9] Ratnaparkhi A. A simple introduction to maximum entropy models for natural language processing. IRCS Technical Reports Series, 1997, 81: 1-14

[10] Lafferty J, McCallum A, Pereira F C N. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the 18th International Conference on Machine Learning, San Francisco, 2001. 282-289

[11] Leaman R, Gonzalez G. BANNER: an executable survey of advances in biomedical named entity recognition. Biocomputing, 2008, 13: 652-663

[12] Leaman R, Islamaj Dogan R, Lu Z. DNorm: disease name normalization with pairwise learning to rank. Bioinformatics, 2013, 29: 2909-2917

[13] Doğan R I, Leaman R, Lu Z. NCBI disease corpus: a resource for disease name recognition and concept normalization. J BioMed Inf, 2014, 47: 1-10

[14] Leaman R, Lu Z. TaggerOne: joint named entity recognition and normalization with semi-Markov Models. Bioinformatics, 2016, 32: 2839-2846

[15] Lou Y, Zhang Y, Qian T, et al. A transition-based joint model for disease named entity recognition and normalization. Bioinformatics, 2017, 33: 2363-2371

[16] Yao L, Liu H, Liu Y, et al. Biomedical named entity recognition based on deep neutral network. Corpus, 2015, 8: 279-288

[17] Luo L, Yang Z, Yang P, et al. An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics, 2017, 1: 8

[18] Xu K, Zhou Z, Hao T, et al. A bidirectional LSTM and conditional random fields approach to medical named entity recognition. In: Proceedings of International Conference on Advanced Intelligent Systems and Informatics. Berlin: Springer, 2017. 355-365

[19] Sahu S K, Anand A. Recurrent neural network models for disease name recognition using domain invariant features. 2016. arXiv

[20] Mikolov T, Karafiát M, Burget L, et al. Recurrent neural network based language model. In: Proceedings of the 11th Annual Conference of the International Speech Communication Association, 2010

[21] Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Computation, 1997, 9: 1735-1780

[22] Santos C D, Zadrozny B. Learning character-level representations for part-of-speech tagging. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014. 1818-1826

[23] Dang T H, Le H Q, Nguyen T M, et al. D3NER: biomedical named entity recognition using CRF-biLSTM improved with fine-tuned embeddings of various linguistic information. Bioinformatics, 2018, 1: 8

[24] Miyao Y, Saetre R, Sagae K, et al. Task-oriented evaluation of syntactic parsers and their representations. In: Proceedings of ACL-08: HLT, 2008. 46-54

[25] Sohn S, Comeau D C, Kim W. Abbreviation definition identification based on automatic precision estimates. BMC BioInf, 2008, 9: 402

  • Figure 1. Dependency parse tree

  • Table 1. Lexical features for disease name recognition
    Word | POS | Stem | Lemma | Affix | Context
    Rheumatic | JJ | rheumat | Rheumatic | R, Rh, Rhe, c, ic, tic | for, Diseases
    Diseases | NN | diseas | Disease | D, Di, Dis, s, es, ses | Rheumatic, database
  • Table 2. Chunk features
    Feature name | Feature type | Feature meaning
    Chunk | SBIEO | Chunking tag of the current word
  • Table 3. Dependency features
    Feature name | Feature type | Feature meaning
    Dependency | String | Headword of the current word
  • Table 4. Syntactic feature set
    Word | Chunk feature | Dependency feature
    Twins | S-NP | be
    with | S-PP | Twin
    AS | S-NP | with
    were | B-VP | be
    identified | E-VP | be
    from | S-PP | identify
  • Table 5. Abbreviation feature set
    Feature name | Feature type | Feature meaning
    IS_ABB | Binary | Whether the current word is an abbreviation
    IS_SF | Binary | Whether the short form of the abbreviation is the name of a disease
    IS_LF | Binary | Whether the long form of the abbreviation is the name of a disease
    Headword | String | Headword of the long form in the abbreviation
  • Table 6. The feature set of semantic code
    Feature name | Feature type | Feature meaning
    MEDIC | SBIEO | Whether the current word sequence appears in MEDIC
    MeSH | SBIE+NONE | Classification code of the current word sequence in MeSH tree
  • Table 7. Semantic feature set
    Word | Abbreviation features | Semantic code features | Tag
    Twins | 0, 0, 0, None | O, S-M01.438.873 | O
    with | 0, 0, 0, None | O, NONE | O
    AS | 1, 1, 1, spondylitis | S_Disease, NONE | S_Disease
    were | 0, 0, 0, None | O, NONE | O
    identified | 0, 0, 0, None | O, NONE | O
    from | 0, 0, 0, None | O, NONE | O
    the | 0, 0, 0, None | O, NONE | O
    Royal | 0, 0, 0, None | O, NONE | O
    National | 0, 0, 0, None | O, NONE | O
    Hospital | 0, 0, 0, None | O, NONE | O
    for | 0, 0, 0, None | O, NONE | O
    Rheumatic | 0, 0, 0, None | B_Disease, B-C17.300.775 | B_Disease
    Disease | 0, 0, 0, None | E_Disease, E-C17.300.775 | E_Disease
    database | 0, 0, 0, None | O, S-V02.300 | O
  • Table 8. NCBI dataset
    Statistics | Train set | Dev. set | Test set
    Abstracts | 593 | 100 | 100
    Sentences | 5818 | 958 | 1080
    Entities | 5145 | 787 | 960
  • Table 9. Performance contributions of syntactic and semantic feature sets
    Features | $P$ (%) | $R$ (%) | $F1$ (%)
    Baseline | 86.16 | 78.44 | 82.12
    +Chunk | 86.10 | 79.38 | 82.60
    +Dependency | 87.05 | 79.79 | 83.26
    +Abbreviation | 87.35 | 81.25 | 84.19
    +Semantic_codes | 87.43 | 83.33 | 85.33
  • Table 10. Performance contributions of abbreviation features
    Features | $P$ (%) | $R$ (%) | $F1$ (%)
    Baseline+SYN | 87.05 | 79.79 | 83.26
    +IS_ABB | 87.02 | 81.04 | 83.93
    +IS_SF | 86.27 | 81.15 | 83.63
    +IS_LF | 86.98 | 80.73 | 83.74
    +Headword | 87.35 | 81.25 | 84.19
  • Table 11. Performance contributions of semantic features
    Features | $P$ (%) | $R$ (%) | $F1$ (%)
    Baseline+SYN+ABB | 87.35 | 81.25 | 84.19
    +MEDIC | 87.23 | 83.23 | 85.18
    +MeSH | 87.43 | 83.33 | 85.33
  • Table 12. Performance comparison with the state-of-the-art systems
    Systems | Methods | $P$ (%) | $R$ (%) | $F1$ (%)
    BANNER [11] | CRF | 82.2 | 77.5 | 79.8
    TaggerOne [14] | Joint Learning | 85.1 | 80.8 | 82.9
    Lou et al. [15] | Joint Learning | 90.7 | 74.9 | 82.1
    Sahu et al. [19] | Bi-LSTM | 84.9 | 74.1 | 79.1
    D3NER [23] | Bi-LSTM-CRF | 85.0 | 83.8 | 84.4
    Ours | CRF | 87.4 | 83.3 | 85.3
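
The $F1$ columns in the tables above are consistent with the usual harmonic mean of precision and recall, $F1 = 2PR/(P+R)$. The following sketch recomputes a few rows as a sanity check. It assumes that standard definition, since the page itself does not state the formula, and the helper name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from Tables 9 and 11.
reported = {
    "Baseline": (86.16, 78.44, 82.12),
    "+Chunk": (86.10, 79.38, 82.60),
    "+Dependency": (87.05, 79.79, 83.26),
    "+MEDIC": (87.23, 83.23, 85.18),
    "+MeSH": (87.43, 83.33, 85.33),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed {f1_score(p, r):.2f}, reported {f1:.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, which is why the recall gains from the abbreviation and semantic code features move F1 more than the small changes in precision do.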
