
SCIENTIA SINICA Informationis, Volume 50, Issue 7: 1003-1018 (2020) https://doi.org/10.1360/SSI-2019-0243

Attribute identification for Chinese inter-personal relation knowledge graph based on encyclopedic text

  • Received: Nov 1, 2019
  • Accepted: Mar 23, 2020
  • Published: Jul 13, 2020

Abstract

An accurate and rich inter-personal relation knowledge graph (KG) not only gives a clear picture of persons and the connections among them but also provides knowledge support for intelligent service systems. At present, most KGs are built from encyclopedia tabular data. In this article, we describe how to make full use of encyclopedic text to build a high-quality inter-personal relation KG. To address missing and erroneous attributes in tabular data, we propose a method that combines pattern matching with deep learning models to extract attribute information from text for attribute identification. The experimental results show that our method effectively improves the coverage and accuracy of KGs.


Funded by

National Natural Science Foundation of China (61525205, 61876115)



  • Figure 1

    Flow chart of inter-personal relation knowledge graph construction

  • Figure 2

    (Color online) Expressions related to “teacher”

  • Figure 3

    (Color online) Part of the inter-personal relation tree

  • Figure 4

    (Color online) Char vector representation

  • Figure 5

    (Color online) Feature extraction of Five-Stroke coding based on CNN

  • Figure 6

    Flow chart of alias recognition module

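  • Table 1   Extraction result of entity “Yao Ming”
    Entry ID | Entry name | InfoBox: 中文名 (Chinese name) | InfoBox: 外文名 (foreign name) | … | InfoBox: 生肖 (Chinese zodiac) | InfoBox: 星座 (zodiac sign) | Entry label
    28 | 姚明 | 姚明 | Yao Ming | … |  | 处女座 (Virgo) | 运动员, 篮球, 体育人物 (athlete, basketball, sports figure)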
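  • Table 2   Common attributes & labels of persons
    Attribute | Frequency | Entry label | Frequency
    国籍 (nationality) | 735080 | 人物 (person) | 781199
    职业 (occupation) | 466398 | 政治人物 (political figure) | 167416
    民族 (ethnicity) | 421832 | 学者 (scholar) | 102575
    性别 (gender) | 314391 | 体育人物 (sports figure) | 97445
    毕业院校 (alma mater) | 228019 | 娱乐人物 (entertainment figure) | 71672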
  • Table 3   Diversity of “date of birth”
    Source | Frequency
    出生日期 (date of birth) | 62038
    出生时间 (time of birth) | 21462
    出生年月 (year and month of birth) | 12227
    生日 (birthday) | 8552
  • Table 4   Examples of attribute value normalization
    Type of value | Original format | Normalized format
    Time type | 1992.9.9/1992-9-9/1992年9月9日 | 1992-09-09
    Number type | 182 cm/182 CM/182 公分 | 182 厘米
    Number type | 56 kg/56 Kg/56 公斤 | 56 千克
    Number type | 1.82 m/1.82 M | 1.82 米
    String type | 1米82/一米八二 | 1.82 米
    String type | 「红楼梦」/【红楼梦】 | 《红楼梦》
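The time-type and number-type rows of Table 4 can be illustrated with a minimal Python sketch. The function names `normalize_date` and `normalize_quantity`, the regular expressions, and the unit map are hypothetical and chosen only to reproduce the example rows; they are not the authors' implementation, and the string-type rows (such as 一米八二) are not covered.

```python
import re

# Hypothetical helpers that reproduce the example rows of Table 4.
# Accepts 1992.9.9, 1992-9-9 and 1992年9月9日.
_DATE = re.compile(r"^(\d{4})[.\-年](\d{1,2})[.\-月](\d{1,2})日?$")

def normalize_date(value: str) -> str:
    """Rewrite a recognized date as zero-padded YYYY-MM-DD."""
    m = _DATE.match(value.strip())
    if m is None:
        return value  # leave unrecognized formats untouched
    year, month, day = m.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"

# Map each unit spelling (lower-cased) to the normalized Chinese unit.
_UNITS = {"cm": "厘米", "公分": "厘米", "kg": "千克", "公斤": "千克", "m": "米"}
_NUMBER = re.compile(r"^(\d+(?:\.\d+)?)\s*(cm|公分|kg|公斤|m)$", re.IGNORECASE)

def normalize_quantity(value: str) -> str:
    """Rewrite a number plus unit with the normalized unit name."""
    m = _NUMBER.match(value.strip())
    if m is None:
        return value  # leave unrecognized formats untouched
    number, unit = m.groups()
    return f"{number} {_UNITS[unit.lower()]}"

if __name__ == "__main__":
    for raw in ["1992.9.9", "1992年9月9日"]:
        print(raw, "->", normalize_date(raw))
    for raw in ["182 CM", "56 公斤", "1.82 M"]:
        print(raw, "->", normalize_quantity(raw))
```

Returning the input unchanged for unrecognized values keeps the sketch conservative: a value is rewritten only when it matches one of the listed formats.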
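  • Table 5   Example of entry introduction text
    Entry ID | Entry name | Entry introduction text
    28 | 姚明 | 姚明(Yao Ming), 1980年9月12日出生于上海市徐汇区, 祖籍江苏省苏州市吴江区震泽镇, 前中国职业篮球运动员… (Yao Ming, born on September 12, 1980 in Xuhui District, Shanghai, with ancestral home in Zhenze Town, Wujiang District, Suzhou, Jiangsu Province, is a former Chinese professional basketball player…)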
  • Table 6   Some pattern examples for identifying the value of “height”
    Pattern | Example
    [1-2]{1}[0-9]{1,2}(cm|CM|Cm|cM|厘米|公分) | 183 cm/183 CM/183 公分
    [1-2]{1}(\.[0-9]*)?(米|公尺|m)([0-9]*)? | 1米83/1.83 m/1米/1 m
    (([一二]|[1-2]){1})(米)([一二三四五六七八九]{1,2}|[0-9]*) | 一米八/1米八三/一米83/1米83
    ([一二三四五六七八九])(尺|英尺)([一二三四五六七八九]{1}(寸|英寸)?)? | 一英尺三英寸/三尺八
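As a rough illustration of how patterns like the first two in Table 6 might be applied, the Python sketch below extracts a height in centimetres from free text. It assumes the quantifiers lost in extraction were `{1}` and `{1,2}`, adds an optional `\s*` before the unit so that examples such as "183 cm" match, and uses a hypothetical function name `extract_height_cm`. It is an adaptation for illustration, not the authors' rule set, and it ignores the Chinese-numeral and 尺/寸 patterns.

```python
import re
from typing import Optional

# Adapted from the first Table 6 pattern; "\s*" is an added assumption.
HEIGHT_CM = re.compile(r"([1-2][0-9]{1,2})\s*(?:cm|CM|Cm|cM|厘米|公分)")
# Adapted from the second pattern: "1.83 m", "1米83", "1米", "1 m".
HEIGHT_M = re.compile(r"([1-2])(?:\.([0-9]+))?\s*(?:米|公尺|m)([0-9]*)")

def extract_height_cm(text: str) -> Optional[int]:
    """Return the first height found in text, in centimetres, or None."""
    m = HEIGHT_CM.search(text)
    if m:
        return int(m.group(1))
    m = HEIGHT_M.search(text)
    if m:
        metres, decimal, trailing = m.groups()
        # "1.83 m" carries the digits after the dot, "1米83" after the unit.
        digits = decimal or trailing or ""
        # Pad a single digit ("1.8 m") to two and ignore anything beyond two.
        centi = int(digits.ljust(2, "0")[:2]) if digits else 0
        return int(metres) * 100 + centi
    return None

if __name__ == "__main__":
    for sample in ["183 cm", "183 公分", "1米83", "1.83 m", "1米"]:
        print(sample, "->", extract_height_cm(sample))
```

Trying the centimetre pattern first avoids misreading the "m" in "cm" as a metre unit.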
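  • Table 7   Common sources of “alias”
    Source | Frequency
    中文名 (Chinese name) | 1015465
    外文名 (foreign name) | 163707
    别名 (alias) | 74988
    本名 (original name) | 43595
    字号 (courtesy name and art name) | 22960
    其他名称 (other names) | 9451
    英文名 (English name) | 9010
    别称 (alternative name) | 7585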
  • Table 8   Some examples of alias value
    Entry name | Alias string | Result after cleaning
    亚利克斯・维加 | 阿丽夏・维加 、 Alex 、 Lex | 阿丽夏・维加, Alex, Lex
    默森 | 全名: Paul Mohsen | Paul Mohsen
    水莲寺璐珈 | 水莲寺璐珈 (水莲寺流歌) | 水莲寺流歌
    观月小鸟 | Mizuki Kotori (罗马音) Tori Meadows (英文名) | Mizuki Kotori, Tori Meadows
    窦士镛 | 字晓湘号警凡 | 晓湘, 警凡
  • Table 9   Pattern extraction of general aliases
    Entry name | Alias | Entry intro
    李白 | 诗仙 | S S S 李白 , 唐 代 伟 大 浪 漫 主 义 诗 人 , 人 们 称 之 为 诗仙 . E E E
  • Table 10   Pattern extraction with entities/aliases nearby
    Entry name | Alias | Entry intro
    泰颉 | 朱德其, 三友轩主 | S S S 泰颉 本 名 朱德其 , 号 三友轩主 , 男 , 1 9 4 0 年 1 月 生 . E E E
  • Table 11   Some examples of alias pattern
    Pattern | Frequency
    , 原 名 ## , 1 9 | 1504
    清 字 ## , 江 苏 | 335
    \entity , 字 ## , 生 卒 | 225
    名 \alias , ## 等 . E | 62
  • Table 12   Experimental results of different models on distantly supervised data
    Model | Precision (%) | Recall (%) | $F_1$ score (%)
    CRF | 75.07 | 80.07 | 77.49
    LSTM-CRF | 80.03 | 80.79 | 80.41
    IDCNN-CRF | 80.13 | 81.40 | 80.76
    FS-BiLSTM-CRF | 83.95 | 81.04 | 82.47
  • Table 13   Experimental results of rule-based attribute recognition
    Attribute | Range | Sampling accuracy (500) (%)
    身高 (height) | Numbers, letters and Chinese | 99.20
    体重 (weight) | Numbers, letters and Chinese | 96.20
    三围 (body measurements) | Numbers, letters | 95.00
    星座 (zodiac sign) | Chinese | 96.00
    血型 (blood type) | Letters, Chinese | 97.40
    政治面貌 (political affiliation) | Chinese | 98.60
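  • Table 14   Experimental results of model-based attribute recognition
    Attribute | Range | Category | Total number | Sampling accuracy (500) (%)
    出生日期 (date of birth) | Numbers, Chinese | A | 627454 | 99.00
    出生日期 (date of birth) | Numbers, Chinese | B | 114467 | 99.00
    出生日期 (date of birth) | Numbers, Chinese | C | 72348 | 100.00
    出生日期 (date of birth) | Numbers, Chinese | D | 32856 | 96.20 93.40
    运动项目 (sport) | Numbers, Chinese | A | 64773 | 96.80
    运动项目 (sport) | Numbers, Chinese | B | 56743 | 95.80
    运动项目 (sport) | Numbers, Chinese | C | 41918 | 99.80
    运动项目 (sport) | Numbers, Chinese | D | 716 | 96.40 95.60
    出生地 (birthplace) | Chinese | A | 62474 | 97.20
    出生地 (birthplace) | Chinese | B | 113837 | 93.80
    出生地 (birthplace) | Chinese | C | 31327 | 99.00
    出生地 (birthplace) | Chinese | D | 20889 | 97.00 87.20
    学位 (academic degree) | Chinese | A | 51374 | 93.40
    学位 (academic degree) | Chinese | B | 218215 | 91.00
    学位 (academic degree) | Chinese | C | 30153 | 97.40
    学位 (academic degree) | Chinese | D | 21221 | 87.40 84.60
    毕业院校 (alma mater) | Chinese, English | A | 234951 | 94.20
    毕业院校 (alma mater) | Chinese, English | B | 87901 | 96.80
    毕业院校 (alma mater) | Chinese, English | C | 10620 | 99.60
    毕业院校 (alma mater) | Chinese, English | D | 7475 | 92.80 89.20
    民族 (ethnicity) | Chinese | A | 396757 | 94.60
    民族 (ethnicity) | Chinese | B | 25389 | 93.00
    民族 (ethnicity) | Chinese | C | 14567 | 99.80
    民族 (ethnicity) | Chinese | D | 2362 | 93.20 88.60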
  • Table 15   Coverage comparison before and after attribute completion
    Attribute | Original coverage (%) | New coverage (%) | Δ (%)
    学位 (academic degree) | 4.61 | 24.21 | +19.60
    出生地 (birthplace) | 5.61 | 15.83 | +10.22
    出生日期 (date of birth) | 56.36 | 66.05 | +9.69
    毕业院校 (alma mater) | 21.10 | 28.78 | +7.68
    运动项目 (sport) | 5.81 | 10.21 | +4.40
    政治面貌 (political affiliation) | 2.80 | 4.45 | +1.65
    民族 (ethnicity) | 35.63 | 37.02 | +1.39
    身高 (height) | 7.93 | 8.32 | +0.39
    体重 (weight) | 5.80 | 6.06 | +0.26
    三围 (body measurements) | 0.27 | 0.51 | +0.24
    星座 (zodiac sign) | 2.58 | 2.65 | +0.07
    血型 (blood type) | 0.02 | 0.03 | +0.01
  • Table 16   Experimental results of alias extraction
    Model | Precision (%) | Recall (%) | $F_1$ score (%)
    MaxEnt | 98.07 | 38.93 | 55.73
    CNN | 94.59 | 44.52 | 60.55
    LSTM | 95.85 | 47.07 | 63.14

Copyright 2020 China Science Publishing & Media Ltd. All rights reserved.
