SCIENTIA SINICA Informationis, Volume 50 , Issue 6 : 845-861(2020) https://doi.org/10.1360/SSI-2019-0291

Entity summarization with high readability and low redundancy

  • Received Dec 30, 2019
  • Accepted Apr 9, 2020
  • Published Jun 8, 2020


The development of the World Wide Web has triggered substantial growth of knowledge graphs (KGs), and research into using KGs for intelligent applications has increased significantly. A KG describes facts about entities using RDF triples, and an entity description may contain a large number of triples. In applications where entity information is presented directly, entity summarization is required to prevent user information overload and to fit the presentation capacity. Here, the task is to select the most representative subset of triples from the rich entity description. In this paper, we propose an innovative entity summarization method, which we refer to as ESSTER, to generate summaries with both high readability and low redundancy. The proposed method combines structural and textual features. The importance of a triple is measured based on its structural features in the KG. The text readability of a triple is measured based on n-grams in a text corpus, and redundancy in a set of triples is measured by logical reasoning, numeric comparison, and text similarity. Entity summarization is modeled by combining these three measures and solved as a combinatorial optimization problem. We conducted experiments and compared the proposed method to six existing methods on two publicly available datasets of manually labeled summaries. Experimental results demonstrate that the proposed method achieves state-of-the-art results.
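The idea of selecting a subset of triples that balances per-triple quality against pairwise redundancy can be illustrated with a small sketch. This is not the ESSTER formulation or its solver: it assumes each triple already has a single precomputed weight (standing in for importance and readability), a pairwise overlap in [0, 1] (standing in for redundancy), and a hypothetical trade-off parameter `penalty`, and it uses a simple greedy heuristic. The names `greedy_summary`, `weights`, `overlap`, and `penalty`, as well as the toy data, are assumptions made for illustration only.

```python
from typing import Callable, Dict, List


def greedy_summary(
    weights: Dict[str, float],
    overlap: Callable[[str, str], float],
    k: int,
    penalty: float = 1.0,
) -> List[str]:
    """Pick up to k triples, trading off weight against pairwise overlap.

    Greedy heuristic (an assumption, not the paper's solver): at each step,
    add the triple with the largest marginal gain, where the gain is its
    weight minus `penalty` times its total overlap with the triples that
    have already been selected.
    """
    selected: List[str] = []
    candidates = set(weights)
    while candidates and len(selected) < k:
        def gain(t: str) -> float:
            return weights[t] - penalty * sum(overlap(t, s) for s in selected)

        # Sorting first gives a deterministic tie-break on the identifier.
        best = max(sorted(candidates), key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (hypothetical): three triples describing one entity.
toy_weights = {"type:Person": 0.9, "type:Writer": 0.8, "birthYear:1905": 0.5}


def toy_overlap(a: str, b: str) -> float:
    # Treat two type-like triples as fully redundant, everything else as disjoint.
    return 1.0 if a.startswith("type:") and b.startswith("type:") else 0.0


summary = greedy_summary(toy_weights, toy_overlap, k=2)
```

In this toy run the second type triple is skipped despite its higher weight, because its overlap with the first selected triple cancels its gain. An exact combinatorial solver could find better subsets than a greedy pass; the sketch only shows how the objective rewards quality and penalizes redundancy.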

  • Figure 1

    An example of a knowledge graph (ovals and rectangles represent entities/classes and literals, respectively)

  • Figure 2

    (Color online) Cumulative distribution of the Pearson correlation coefficient between weights and ideal importance scores. (a) $W_{\rm struct}$; (b) $W_{\rm text}$

  • Figure 3

    Summaries generated by entity summarizers for the entity Hagar Wilde, along with their F-measure scores.


    Algorithm 1 Redundancy

    if ${\rm prop}(t_i)=\texttt{rdf:type}$ and ${\rm prop}(t_j)=\texttt{rdf:type}$ and ($\texttt{subClassOf}({\rm val}(t_i), {\rm val}(t_j))$ or $\texttt{subClassOf}({\rm val}(t_j), {\rm val}(t_i))$) then
        ${\rm ovlp}(t_i, t_j) \Leftarrow 1$;
    else if ${\rm val}(t_i)={\rm val}(t_j)$ and ($\texttt{subPropertyOf}({\rm prop}(t_i), {\rm prop}(t_j))$ or $\texttt{subPropertyOf}({\rm prop}(t_j), {\rm prop}(t_i))$) then
        ...
    ...
        if $\texttt{isNumber}({\rm val}(t_i))$ and $\texttt{isNumber}({\rm val}(t_j))$ then
            if ${\rm val}(t_i)={\rm val}(t_j)$ then
                ...
            end if
        ...
        end if
    ...
    end if
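The two logical-reasoning conditions of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Triple` type, the function name `logical_overlap`, and the representation of the schema as sets of (sub, super) pairs that are already transitively closed are all hypothetical. The value assigned in the sub-property branch is assumed to be 1, mirroring the sub-class branch. The numeric-comparison and text-similarity cases are not covered, so the function returns `None` when neither logical condition applies.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Triple(NamedTuple):
    prop: str  # property of the triple
    val: str   # value of the triple


RDF_TYPE = "rdf:type"
Pairs = Set[Tuple[str, str]]  # (sub, super) pairs, assumed transitively closed


def logical_overlap(
    ti: Triple, tj: Triple, sub_class_of: Pairs, sub_property_of: Pairs
) -> Optional[int]:
    """Return 1 if one triple logically subsumes the other, else None."""
    # Branch 1: both triples are rdf:type and the classes are related.
    if ti.prop == RDF_TYPE and tj.prop == RDF_TYPE and (
        (ti.val, tj.val) in sub_class_of or (tj.val, ti.val) in sub_class_of
    ):
        return 1
    # Branch 2: same value and the properties are related.
    # Returning 1 here is an assumption (see the lead-in).
    if ti.val == tj.val and (
        (ti.prop, tj.prop) in sub_property_of
        or (tj.prop, ti.prop) in sub_property_of
    ):
        return 1
    # Numeric and textual cases are outside this sketch.
    return None


# Toy schema with made-up identifiers in a hypothetical "ex:" namespace.
classes: Pairs = {("ex:Writer", "ex:Person")}
properties: Pairs = {("ex:birthPlace", "ex:place")}

a = Triple(RDF_TYPE, "ex:Writer")
b = Triple(RDF_TYPE, "ex:Person")
c = Triple("ex:birthPlace", "ex:SomeCity")
d = Triple("ex:place", "ex:SomeCity")
```

With this toy schema, `logical_overlap(a, b, classes, properties)` and `logical_overlap(c, d, classes, properties)` both return 1, in either argument order, while an unrelated pair such as `(a, c)` returns `None`.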

  • Table 1   F-measure of entity summarizers$^{\rm a)}$
    RELIN      0.242 (0.120)   0.203 (0.125)   0.127 (0.085)
    DIVERSUM   0.249 (0.136)   0.207 (0.127)   0.112 (0.078)
    FACES      0.270 (0.144)   0.169 (0.085)   0.145 (0.089)
    FACES-E    0.280 (0.142)   0.313 (0.116)   0.145 (0.089)
    CD         0.283 (0.134)   0.217 (0.101)   0.136 (0.076)
    LinkSUM    0.287 (0.132)   0.140 (0.101)   0.239 (0.121)
    ESSTER     0.305 (0.132) $^{\blacktriangle\blacktriangle\blacktriangle\vartriangle\circ\circ}$   0.347 (0.077) $^{\blacktriangle\blacktriangle\blacktriangle\vartriangle\blacktriangle\blacktriangle}$   0.229 (0.118) $^{\blacktriangle\blacktriangle\blacktriangle\blacktriangle\blacktriangle\circ}$

    a) Significant improvements of ESSTER over each baseline are indicated by $\blacktriangle$ ($p<0.01$) and $\vartriangle$ ($p<0.05$); insignificant differences are indicated by $\circ$. The six markers after each ESSTER score follow the row order of the baselines.
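For readers unfamiliar with the metric, the following sketch shows one common way to score a generated summary against manually labeled ideal summaries. It assumes the F-measure is the harmonic mean of precision and recall over shared triples, averaged over the ideal summaries of an entity; the exact evaluation protocol behind the tables is not given here, and the function names `f_measure` and `mean_f_measure` are hypothetical.

```python
from typing import Iterable, Set


def f_measure(summary: Set[str], ideal: Set[str]) -> float:
    """Harmonic mean of precision and recall over shared triples."""
    shared = len(summary & ideal)
    if shared == 0:
        return 0.0
    precision = shared / len(summary)
    recall = shared / len(ideal)
    return 2 * precision * recall / (precision + recall)


def mean_f_measure(summary: Set[str], ideals: Iterable[Set[str]]) -> float:
    """Average F-measure of one summary over several ideal summaries."""
    ideal_list = list(ideals)
    return sum(f_measure(summary, g) for g in ideal_list) / len(ideal_list)


# Toy example: triples are represented by opaque identifiers.
generated = {"t1", "t2", "t3", "t4", "t5"}
ideal_a = {"t1", "t2", "t6", "t7", "t8"}   # shares 2 of 5 triples
ideal_b = {"t1", "t2", "t3", "t4", "t5"}   # identical to the generated summary
score = mean_f_measure(generated, [ideal_a, ideal_b])
```

When the generated and ideal summaries have the same size, as in the toy example, precision, recall, and F-measure coincide.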

  • Table 2   F-measure of ablation study
                Mean    diff       $p$      Mean    diff       $p$      Mean    diff       $p$
    ESSTER      0.305                       0.347                       0.229
    ESSTER-S    0.264   $-0.041$   0.000    0.247   $-0.101$   0.000    0.140   $-0.089$   0.000
    ESSTER-T    0.298   $-0.007$   0.489    0.305   $-0.042$   0.001    0.218   $-0.011$   0.167
    ESSTER-R    0.222   $-0.083$   0.000    0.325   $-0.022$   0.025    0.211   $-0.019$   0.042
  • Table 3   Top-10 properties with highest or lowest readability in each dataset
    Highest    Lowest                     Highest        Lowest                    Highest    Lowest
    time       draft year                 made           link source               other      population blank1 title
    long       debut team                 subject        filmid                    before     timezone DST
    order      IMDB id                    country        story contributor         after      computing media
    number     type of tennis surface     date           film story contributor    years      demonym
    course     siler medalist             language       music contributor         order      sovereignty type
    name       UTC offset                 type           director directorid       name       languages2 type
    subject    NRHP reference number      page           director name             state      cctId
    added      route type abbreviation    title          director nytimes id       parts      location country
    result     serving railway line       writer         actor name                ground     państwo
    position   bionomial authority        performance    actor Netflix id          country    flaglink
  • Table 4   Redundancy of entity description, ideal entity summary, and summaries generated by entity summarizers
    Desc       203.69   431.99   140.78
    Ideal        1.04     1.84     0.89
    RELIN        3.04     3.45     2.22
    DIVERSUM     0.39     1.29     1.64
    FACES        0.75     0.30     1.05
    FACES-E      1.29     0.76     1.05
    CD           0.02     0.00     0.00
    LinkSUM      2.45     4.72     1.47
    ESSTER       0.02     1.17     1.69

