
SCIENTIA SINICA Informationis, Volume 50, Issue 6: 845-861 (2020) https://doi.org/10.1360/SSI-2019-0291

Entity summarization with high readability and low redundancy

  • Received: Dec 30, 2019
  • Accepted: Apr 9, 2020
  • Published: Jun 8, 2020

Abstract

The development of the World Wide Web has triggered substantial growth of knowledge graphs (KG), and research into using KGs for intelligent applications has increased significantly. A KG describes facts about entities using RDF triples, and an entity description may contain a large number of triples. In applications where entity information is presented directly, entity summarization is required to prevent user information overload and to fit the presentation capacity. Here, the task is to select the most representative subset of triples from the rich entity description. In this paper, we propose an innovative entity summarization method, which we refer to as ESSTER, to generate summaries with both high readability and low redundancy. The proposed method combines structural and textual features. The importance of a triple is measured based on its structural features in the KG; the text readability of a triple is measured based on n-grams in a text corpus; and the redundancy in a set of triples is measured by logical reasoning, numeric comparison, and text similarity. Entity summarization is modeled by combining these three measures and solved as a combinatorial optimization problem. We conducted experiments and compared the proposed method to six existing methods on two publicly available datasets of manually labeled summaries. Experimental results demonstrate that the proposed method achieves state-of-the-art results.
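The selection step described above resembles a 0/1 knapsack problem with pairwise (quadratic) interaction terms [13, 14]. As a rough illustration only, the sketch below combines per-triple importance and readability scores with pairwise redundancy penalties and fills the summary greedily; the function names, the weights `alpha`/`beta`, and the greedy strategy are simplifying assumptions, not the paper's exact objective or solver:

```python
def summarize(triples, importance, readability, redundancy, k, alpha=0.5, beta=0.5):
    """Greedily pick up to k triples, maximizing a quadratic-knapsack-style
    objective: sum of (alpha*importance + beta*readability) over chosen
    triples, minus the redundancy between every chosen pair.

    `importance` and `readability` map a triple to a score; `redundancy`
    maps an (ordered) pair of triples to their overlap, defaulting to 0.
    """
    chosen = []
    candidates = list(triples)
    while candidates and len(chosen) < k:
        def gain(t):
            base = alpha * importance[t] + beta * readability[t]
            penalty = sum(redundancy.get((min(t, s), max(t, s)), 0.0)
                          for s in chosen)
            return base - penalty
        best = max(candidates, key=gain)
        chosen.append(best)
        candidates.remove(best)
    return chosen
```

Greedy selection is only an approximation here; the quadratic knapsack problem is NP-hard, and stronger heuristics such as GRASP and tabu search [14] are typically used for exact-quality solutions.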


Funded by

National Key R&D Program of China (2018YFB1004300)

National Natural Science Foundation of China (61772264)


References

[1] Zhang L, Zhang Y, Chen Y. Summarizing highly structured documents for effective search interaction. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2012. 145--154.

[2] Yan J, Wang Y, Gao M, et al. Context-aware entity summarization. In: Proceedings of the 17th International Conference on Web-Age Information Management, Part I. Switzerland: Springer, 2016. 517--529.

[3] Hasibi F, Balog K, Bratsberg S E. Dynamic factual summaries for entity cards. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2017. 773--782.

[4] Cheng G, Xu D, Qu Y. Summarizing entity descriptions for effective and efficient human-centered entity linking. In: Proceedings of the 24th International Conference on World Wide Web. New York: ACM, 2015. 184--194.

[5] Cheng G, Xu D, Qu Y. C3D+P: a summarization method for interactive entity resolution. J Web Semantics, 2015, 35: 203-213.

[6] Cheng G, Tran T, Qu Y. RELIN: relatedness and informativeness-based centrality for entity summarization. In: Proceedings of the 10th International Semantic Web Conference, Part I. Berlin: Springer, 2011. 114--129.

[7] Sydow M, Pikuła M, Schenkel R. The notion of diversity in graphical entity summarisation on semantic knowledge graphs. J Intell Inf Syst, 2013, 41: 109-149.

[8] Gunaratna K, Thirunarayan K, Sheth A P. FACES: diversity-aware entity summarization using incremental hierarchical conceptual clustering. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. California: AAAI Press, 2015. 116--122.

[9] Gunaratna K, Thirunarayan K, Sheth A P, et al. Gleaning types for literals in RDF triples with application to entity summarization. In: Proceedings of the 13th European Semantic Web Conference. Switzerland: Springer, 2016. 85--100.

[10] Xu D, Zheng L, Qu Y. CD at ENSEC 2016: generating characteristic and diverse entity summaries. In: Proceedings of the 2nd International Workshop on Summarizing and Presenting Entities and Ontologies. CEUR-WS.org, 2016.

[11] Thalhammer A, Lasierra N, Rettinger A. LinkSUM: using link analysis to summarize entity data. In: Proceedings of the 16th International Conference on Web Engineering. Switzerland: Springer, 2016. 244--261.

[12] Stoilos G, Stamou G B, Kollias S D. A string metric for ontology alignment. In: Proceedings of the 4th International Semantic Web Conference. Berlin: Springer, 2005. 624--637.

[13] Pisinger D. The quadratic knapsack problem - a survey. Discrete Appl Math, 2007, 155: 623-648.

[14] Yang Z, Wang G, Chu F. An effective GRASP and tabu search for the 0-1 quadratic knapsack problem. Comput Operations Res, 2013, 40: 1176-1185.

[15] Thalhammer A, Toma I, Roa-Valverde A J, et al. Leveraging usage data for Linked Data movie entity summarization. 2012. arXiv.

[16] Li Y, Zhao L. A common property and special property entity summarization approach based on statistical distribution. In: Proceedings of the 2nd International Workshop on Summarizing and Presenting Entities and Ontologies. CEUR-WS.org, 2016.

[17] Gottschalk S, Demidova E. EventKG - the hub of event knowledge on the web - and biographical timeline generation. Semantic Web, 2019, 10: 1039-1070.

[18] Tonon A, Catasta M, Prokofyev R. Contextualized ranking of entity types based on knowledge graphs. J Web Semantics, 2016, 37-38: 170-183.

[19] Thalhammer A, Rettinger A. Browsing DBpedia entities with summaries. In: Proceedings of the 11th European Semantic Web Conference. Switzerland: Springer, 2014. 511--515.

[20] Sparck Jones K. Automatic summarising: the state of the art. Inf Process Manage, 2007, 43: 1449-1481.

[21] Mihalcea R, Tarau P. TextRank: bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2004. 404--411.

[22] Zhou Q, Yang N, Wei F, et al. Neural document summarization by jointly learning to score and select sentences. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2018. 654--663.

[23] Carbonell J G, Goldstein J. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 1998. 335--336.

[24] Ganesan K, Zhai C, Viegas E. Micropinion generation: an unsupervised approach to generating ultra-concise summaries of opinions. In: Proceedings of the 21st World Wide Web Conference. New York: ACM, 2012. 869--878.

[25] Liu Y, Safavi T, Dighe A, et al. Graph summarization methods and applications: a survey. ACM Comput Surv, 2018, 51(3): 62:1-62:34.

  • Figure 1

An example of a knowledge graph (ovals and rectangles represent entities/classes and literals, respectively)

  • Figure 2

(Color online) Cumulative distribution of Pearson correlation coefficient between weights and ideal importance scores. (a) $W_{\rm struct}$; (b) $W_{\rm text}$
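Figure 2 evaluates how well each weighting scheme tracks the human-labeled importance scores via the Pearson correlation coefficient. As a reminder of the statistic being plotted (a minimal stdlib sketch of the standard formula, not the paper's evaluation code):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists:
    covariance divided by the product of the standard deviations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A coefficient close to 1 means the computed weights rank triples almost identically to the ideal importance scores.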

  • Figure 3

    Summaries generated by entity summarizers for entity Hagar Wilde along with their F-measure scores.
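Figure 3 (and Tables 1 and 2) score summaries by F-measure against a manually labeled ideal summary. Under the usual set-overlap definition (an assumption of the standard metric, not code from the paper), the score is computed as:

```python
def f_measure(generated, ideal):
    """F1 score between a generated summary and an ideal (gold) summary,
    both given as sets of triples."""
    overlap = len(generated & ideal)
    if overlap == 0:
        return 0.0
    precision = overlap / len(generated)
    recall = overlap / len(ideal)
    return 2 * precision * recall / (precision + recall)
```

For example, a 5-triple summary sharing 3 triples with a 5-triple gold summary scores precision = recall = F = 0.6.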

  •   

    Algorithm 1 Redundancy

    Require: $t_i, t_j \in {\rm desc}(e)$;

    Output: ${\rm ovlp}(t_i, t_j)$;

    if ${\rm prop}(t_i) = \texttt{rdf:type}$ and ${\rm prop}(t_j) = \texttt{rdf:type}$ and ($\texttt{subClassOf}({\rm val}(t_i), {\rm val}(t_j))$ or $\texttt{subClassOf}({\rm val}(t_j), {\rm val}(t_i))$) then
      ${\rm ovlp}(t_i, t_j) \Leftarrow 1$;
    else if ${\rm val}(t_i) = {\rm val}(t_j)$ and ($\texttt{subPropertyOf}({\rm prop}(t_i), {\rm prop}(t_j))$ or $\texttt{subPropertyOf}({\rm prop}(t_j), {\rm prop}(t_i))$) then
      ${\rm ovlp}(t_i, t_j) \Leftarrow 1$;
    else
      ${\rm sim}_p(t_i, t_j) = \texttt{ISub}({\rm prop}(t_i), {\rm prop}(t_j))$;
      if $\texttt{isNumber}({\rm val}(t_i))$ and $\texttt{isNumber}({\rm val}(t_j))$ then
        if ${\rm val}(t_i) = {\rm val}(t_j)$ then
          ${\rm sim}_v(t_i, t_j) \Leftarrow 1$;
        else if ${\rm val}(t_i) \cdot {\rm val}(t_j) \leq 0$ then
          ${\rm sim}_v(t_i, t_j) \Leftarrow -1$;
        else
          ${\rm sim}_v(t_i, t_j) \Leftarrow \frac{\min\{\|{\rm val}(t_i)\|, \|{\rm val}(t_j)\|\}}{\max\{\|{\rm val}(t_i)\|, \|{\rm val}(t_j)\|\}}$;
        end if
      else
        ${\rm sim}_v(t_i, t_j) \Leftarrow \texttt{ISub}({\rm val}(t_i), {\rm val}(t_j))$;
      end if
      ${\rm ovlp}(t_i, t_j) \Leftarrow \max\{{\rm sim}_p(t_i, t_j), {\rm sim}_v(t_i, t_j), 0\}$;
    end if
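Algorithm 1 can be sketched in runnable form as follows. Two stand-ins are assumptions for illustration: difflib's SequenceMatcher substitutes for the ISub string metric [12], and the subClassOf/subPropertyOf checks are stubbed with small hypothetical lookup tables instead of real KG reasoning. Triples are reduced to (property, value) pairs, since both describe the same entity:

```python
from difflib import SequenceMatcher

# Hypothetical schema knowledge; in the paper these checks are answered by
# logical reasoning over the KG's rdfs:subClassOf / rdfs:subPropertyOf axioms.
SUBCLASS = {("Actress", "Person")}
SUBPROPERTY = {("birthName", "name")}

def sub_class_of(a, b):
    return (a, b) in SUBCLASS

def sub_property_of(a, b):
    return (a, b) in SUBPROPERTY

def is_number(v):
    try:
        float(v)
        return True
    except (TypeError, ValueError):
        return False

def isub(a, b):
    # Stand-in for the ISub metric of Stoilos et al. [12].
    return SequenceMatcher(None, a, b).ratio()

def ovlp(ti, tj):
    """Redundancy of two triples, each given as a (property, value) pair."""
    (pi, vi), (pj, vj) = ti, tj
    # Fully redundant: one type subsumes the other.
    if pi == "rdf:type" == pj and (sub_class_of(vi, vj) or sub_class_of(vj, vi)):
        return 1.0
    # Fully redundant: same value, one property subsumes the other.
    if vi == vj and (sub_property_of(pi, pj) or sub_property_of(pj, pi)):
        return 1.0
    sim_p = isub(pi, pj)
    if is_number(vi) and is_number(vj):
        x, y = float(vi), float(vj)
        if x == y:
            sim_v = 1.0
        elif x * y <= 0:
            sim_v = -1.0  # opposite signs: treated as non-redundant
        else:
            sim_v = min(abs(x), abs(y)) / max(abs(x), abs(y))
    else:
        sim_v = isub(vi, vj)
    return max(sim_p, sim_v, 0.0)
```

The final max(...) clamps the score to [0, 1]: the −1 assigned to opposite-sign numbers never drives redundancy below the property-name similarity.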

  • Table 1   F-measure of entity summarizers$^{\rm a)}$
    ESBM-D ESBM-L FED
    RELIN 0.242 (0.120) 0.203 (0.125) 0.127 (0.085)
    DIVERSUM 0.249 (0.136) 0.207 (0.127) 0.112 (0.078)
    FACES 0.270 (0.144) 0.169 (0.085) 0.145 (0.089)
    FACES-E 0.280 (0.142) 0.313 (0.116) 0.145 (0.089)
    CD 0.283 (0.134) 0.217 (0.101) 0.136 (0.076)
    LinkSUM 0.287 (0.132) 0.140 (0.101) 0.239 (0.121)
    ESSTER 0.305 (0.132)$^{\blacktriangle\blacktriangle\blacktriangle\vartriangle\circ\circ}$ 0.347 (0.077)$^{\blacktriangle\blacktriangle\blacktriangle\vartriangle\blacktriangle\blacktriangle}$ 0.229 (0.118)$^{\blacktriangle\blacktriangle\blacktriangle\blacktriangle\blacktriangle\circ}$

a) Significant improvements of ESSTER over each baseline (one marker per baseline, in table order) are indicated by $\blacktriangle$ ($p<0.01$) and $\vartriangle$ ($p<0.05$). Insignificant differences are indicated by $\circ$.

  • Table 2   F-measure of ablation study
    ESBM-D ESBM-L FED
    Mean diff $p$ Mean diff $p$ Mean diff $p$
    ESSTER 0.305 0.347 0.229
    ESSTER-S 0.264 $-$0.041 0.000 0.247 $-$0.101 0.000 0.140 $-$0.089 0.000
    ESSTER-T 0.298 $-$0.007 0.489 0.305 $-$0.042 0.001 0.218 $-$0.011 0.167
    ESSTER-R 0.222 $-$0.083 0.000 0.325 $-$0.022 0.025 0.211 $-$0.019 0.042
  • Table 3   Top-10 properties with highest or lowest readability in each dataset
    ESBM-D ESBM-L FED
    Highest Lowest Highest Lowest Highest Lowest
    time draft year made link source other population blank1 title
    long debut team subject filmid before timezone DST
    order IMDB id country story contributor after computing media
    number type of tennis surface date film story contributor years demonym
    course silver medalist language music contributor order sovereignty type
    name UTC offset type director directorid name languages2 type
    subject NRHP reference number page director name state cctId
    added route type abbreviation title director nytimes id parts location country
    result serving railway line writer actor name ground państwo
    position binomial authority performance actor Netflix id country flaglink
  • Table 4   Redundancy of entity description, ideal entity summary and summaries generated by entity summarizers
    ESBM-D ESBM-L FED
    Desc 203.69 431.99 140.78
    Ideal 1.04 1.84 0.89
    RELIN 3.04 3.45 2.22
    DIVERSUM 0.39 1.29 1.64
    FACES 0.75 0.30 1.05
    FACES-E 1.29 0.76 1.05
    CD 0.02 0.00 0.00
    LinkSUM 2.45 4.72 1.47
    ESSTER 0.02 1.17 1.69

Copyright 2020 CHINA SCIENCE PUBLISHING & MEDIA LTD. All rights reserved.
