
SCIENTIA SINICA Informationis, Volume 47, Issue 8: 1078(2017) https://doi.org/10.1360/N112016-00305

Online Web news extraction via tag path feature weighted by text block density

More info
  • Received: Apr 5, 2017
  • Accepted: Jun 8, 2017
  • Published: Aug 16, 2017

Abstract

Web news extraction underpins many “big data” and “big knowledge” applications and remains an open research problem. Tag paths and text block density are currently two of the most effective features for this task. The tag path feature separates content from noise well at the level of the whole webpage, but it has difficulty recognizing noise inside a content block or content inside a noise block. The text block density feature recognizes high-density content blocks well, but it is not robust enough. To address these problems, we propose a Web information extraction model, referred to as CEDP, that combines the tag path feature with the text block density feature. We design a tag path feature weighted by text block density to exploit the merits of both features. Based on this weighted tag path feature, we also design a Web news extraction method, CEDP-NLTD. CEDP-NLTD is a fast, general-purpose, training-free, online Web news extraction algorithm suited to extracting heterogeneous Web news across sources, styles, and languages in the big data environment of the Web. Experiments on public datasets such as CleanEval show that CEDP-NLTD achieves better performance than the state-of-the-art CETR, CETD, CEPR, and CEPF methods, and better performance than CEDP-TD, CEDP-CTD, and CEDP-DSum, which are derived from CEDP by using, respectively, each of the three block density features of CETD.


Funded by

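National Key Research and Development Program of China (2016YFB1000901)

Innovative Research Team Development Program of the Ministry of Education of China (IRT13059)

National Natural Science Foundation of China (61273297, 61673152)

China Scholarship Fund (201506695019)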



  • Figure 1

    Average extraction time of each algorithm

  • Table 1   $F_{\rm score}$ (%) of each algorithm on each source (the best results are marked in bold)
    DataSet CETR CEPR-AT CETD-QT CETD-Jsoup CEPF CEDP-TD CEDP-CTD CEDP-DSum CEDP-NLTD
    CleanEval-En 88.30 75.33 83.89 91.49 88.39 88.38 66.10 66.78 88.24
    CleanEval-Zh 83.36 75.65 N/A 87.69 86.86 87.04 74.13 74.07 86.94
    NY Post 58.19 81.02 82.78 83.97 90.04 89.00 89.29 90.19 89.36
    Freep 70.36 86.00 74.41 74.65 87.79 88.97 90.18 90.70 90.16
    Suntimes 82.20 85.90 90.37 90.07 94.08 94.41 95.03 95.12 94.58
    Techweb 74.56 88.86 77.86 77.35 90.70 91.00 89.53 90.70 91.08
    Tribune 89.83 90.32 N/A 95.47 95.21 94.86 95.08 94.90 94.93
    Nytimes 91.14 86.91 96.26 96.25 92.31 92.07 91.09 90.50 92.16
    BBC 72.76 80.13 89.45 84.85 89.53 90.26 90.56 90.87 90.65
    Reuters 71.73 84.26 N/A 77.67 94.40 94.24 78.09 79.22 94.35
    Yahoo 82.06 84.96 89.85 85.88 89.33 90.91 90.42 91.75 91.10
    Sina 73.99 90.63 N/A 89.44 96.92 97.36 97.33 97.45 97.34
    People 86.23 85.32 82.40 82.14 89.27 95.08 95.10 94.84 95.22
    163 38.28 88.56 N/A 53.15 79.84 80.44 80.63 80.16 80.57
    Xinhua 83.32 81.24 91.18 90.57 95.08 94.84 94.82 94.92 94.82
    TunxunWb 79.36 17.75 86.28 86.90 83.72 87.99 86.34 86.71 87.17
    SinaWb 57.99 18.88 N/A 77.01 79.38 85.56 82.57 79.90 85.88
    SohuWb 87.16 92.32 N/A 87.57 93.51 92.22 92.28 92.38 92.27
    Average 76.16 77.45 N/A 84.01 89.80 90.81 87.70 87.84 90.93
  • Algorithm 1 CEDP

    Require: web page ${\rm wp}$, ${\rm densityF}(\cdot)$, ${\rm pathF}(\cdot)$, ${\rm combine}(\cdot,\cdot,\cdot)$, ${\rm thresh}(\cdot)$, ${\rm smoothing}(\cdot)$;

    Output: text content ${\rm content}$;

    Parse ${\rm wp}$ to obtain the parse tree $T_{{\rm wp}}$; ${\rm content} \gets \text{``''}$;

    ${\rm nts}\gets \langle (v_1, c(v_1)),(v_2, c(v_2)),\ldots,(v_n, c(v_n)) \rangle$; // compute the canonical text node sequence ${\rm nts}$ of (node, content) pairs of $T_{{\rm wp}}$

    for $i = 1$ to $n$ do

        $f(v_i) \gets {\rm combine}(v_i,{\rm densityF}(\cdot),{\rm pathF}(\cdot))$;

    end for

    ${\rm nfts} \gets \langle(f(v_1), c(v_1)),(f(v_2), c(v_2)),\ldots,(f(v_n), c(v_n)) \rangle$;

    ${\rm s\_nfts} \gets {\rm smoothing}({\rm nfts})$, and denote ${\rm s\_nfts}$ as $\langle (sf(v_1), c(v_1)),(sf(v_2), c(v_2)),\ldots,(sf(v_n), c(v_n)) \rangle$;

    for $i = 1$ to $n$ do

        if $sf(v_i) \ge {\rm thresh}(T_{{\rm wp}})$ then

            ${\rm content} \gets {\rm content} + c(v_i)$;

        end if

    end for

    output ${\rm content}$.
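The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the concrete feature functions (`density_f` as text length, `path_f` as average text length per tag path), the product used in `combine`, the (1, 2, 1) smoothing kernel, and the mean threshold are all hypothetical stand-ins chosen to keep the example short. Here the threshold is computed from the smoothed feature values rather than from the parse tree.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag
SKIP = {"script", "style"}                           # non-visible text

class TextNodeCollector(HTMLParser):
    """Collects (tag path, text) pairs in document order."""
    def __init__(self):
        super().__init__()
        self.stack, self.nodes = [], []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:  # tolerate unclosed inner tags
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (set(self.stack) & SKIP):
            self.nodes.append(("/".join(self.stack), text))

def density_f(nts):
    # Assumed stand-in for a text block density feature: text length per node.
    return [len(text) for _, text in nts]

def path_f(nts):
    # Assumed stand-in for a tag path feature: average text length per tag path.
    chars, count = defaultdict(int), defaultdict(int)
    for path, text in nts:
        chars[path] += len(text)
        count[path] += 1
    return [chars[path] / count[path] for path, _ in nts]

def combine(density, path_value):
    # Assumed weighting: the path feature scaled by the density feature.
    return density * path_value

def smoothing(values, kernel=(1, 2, 1)):
    # Weighted moving average, renormalized at the sequence boundaries.
    r, out = len(kernel) // 2, []
    for i in range(len(values)):
        num = den = 0.0
        for k, w in enumerate(kernel):
            j = i + k - r
            if 0 <= j < len(values):
                num, den = num + w * values[j], den + w
        out.append(num / den)
    return out

def thresh(values):
    # Assumed threshold: mean of the smoothed feature values.
    return sum(values) / len(values) if values else 0.0

def cedp(wp, density_f, path_f, combine, thresh, smoothing):
    parser = TextNodeCollector()
    parser.feed(wp)
    nts = parser.nodes                                  # (tag path, content)
    d, p = density_f(nts), path_f(nts)
    f = [combine(d[i], p[i]) for i in range(len(nts))]  # per-node feature
    sf = smoothing(f)
    t = thresh(sf)
    return "\n".join(c for (_, c), s in zip(nts, sf) if s >= t)

page = """<html><body>
<div><a href="/">Home</a><a href="/s">Sports</a><a href="/w">World</a></div>
<div><p>The city council approved the new transit plan on Tuesday after a long debate.</p>
<p>Officials said construction is expected to begin next spring and last two years.</p>
<p>Residents can submit comments on the project through the city website until May.</p></div>
<div><a href="/about">About</a><a href="/c">Contact</a></div>
</body></html>"""

content = cedp(page, density_f, path_f, combine, thresh, smoothing)
print(content)  # on this toy page: the three paragraphs, without the link text
```

Passing the feature, smoothing, and threshold functions as parameters mirrors the Require line of Algorithm 1, so variants such as CEDP-TD or CEDP-NLTD would differ only in which functions are plugged in.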

  • Table 2   Precision (%) of each algorithm on each source (the best results are marked in bold)
    DataSet CETR CEPR-AT CETD-QT CETD-Jsoup CEPF CEDP-TD CEDP-CTD CEDP-DSum CEDP-NLTD
    CleanEval-En 89.42 95.96 75.16 89.86 92.93 93.39 70.80 70.07 93.70
    CleanEval-Zh 79.60 85.23 N/A 85.18 88.37 88.82 80.32 79.89 88.90
    NY Post 42.68 98.28 72.91 74.76 92.10 90.80 91.38 93.24 91.42
    Freep 57.38 86.75 59.52 59.92 80.66 83.07 85.67 85.65 85.26
    Suntimes 70.82 98.10 83.01 82.64 94.73 95.28 96.04 96.24 95.44
    Techweb 60.29 94.04 63.88 63.36 86.96 87.22 87.38 87.22 87.39
    Tribune 84.99 99.43 N/A 92.49 96.32 96.26 96.64 96.34 96.31
    Nytimes 88.98 99.73 96.53 95.86 96.13 96.85 96.93 96.37 96.91
    BBC 60.11 97.83 81.93 74.57 91.75 93.60 94.80 95.07 94.60
    Reuters 58.48 98.13 N/A 63.65 95.41 95.05 79.30 81.04 95.27
    Yahoo 72.67 97.83 83.63 76.62 93.04 93.29 93.40 94.70 93.41
    Sina 59.09 98.57 N/A 81.93 96.29 97.64 97.72 97.74 97.65
    People 77.04 95.11 70.21 69.74 84.47 96.14 97.13 95.02 96.65
    163 24.40 99.29 N/A 37.72 73.59 74.47 74.90 74.45 74.66
    Xinhua 72.10 94.72 84.39 83.28 96.26 96.51 96.60 96.49 96.59
    TunxunWb 66.37 94.25 88.89 88.40 99.36 98.87 98.87 98.87 98.87
    SinaWb 41.81 86.84 N/A 64.43 66.90 77.86 85.34 67.65 80.74
    SohuWb 81.00 96.00 N/A 78.31 98.59 97.51 97.71 98.00 97.69
    Average 65.96 95.34 N/A 75.71 90.21 91.81 90.05 89.11 92.30
  • Table 3   Recall (%) of each algorithm on each source (the best results are marked in bold)
    DataSet CETR CEPR-AT CETD-QT CETD-Jsoup CEPF CEDP-TD CEDP-CTD CEDP-DSum CEDP-NLTD
    CleanEval-En 87.20 68.00 94.93 93.18 84.28 83.88 61.98 63.78 83.38
    CleanEval-Zh 87.48 72.67 N/A 90.35 85.40 85.33 68.82 69.05 85.07
    NY Post 91.40 71.43 95.75 95.77 88.07 87.28 87.30 87.33 87.39
    Freep 90.93 90.46 99.22 98.98 96.29 95.76 95.19 96.38 95.65
    Suntimes 97.95 78.25 99.18 98.95 93.44 93.55 94.04 94.03 93.74
    Techweb 97.68 87.31 99.67 99.27 94.77 95.13 91.79 94.47 95.10
    Tribune 95.24 83.62 N/A 98.64 94.12 93.51 93.56 93.50 93.58
    Nytimes 93.40 78.29 96.00 96.64 88.78 87.74 85.91 85.30 87.85
    BBC 92.17 70.88 98.49 98.41 87.41 87.15 86.67 87.03 87.01
    Reuters 92.75 77.01 N/A 99.61 93.41 93.44 76.91 77.49 93.44
    Yahoo 94.22 78.86 97.08 97.70 85.91 88.66 87.62 88.98 88.91
    Sina 98.94 85.54 N/A 98.47 97.55 97.09 96.93 97.16 97.04
    People 97.89 79.12 99.70 99.91 94.64 94.04 93.16 94.65 93.83
    163 88.76 81.46 N/A 89.93 87.24 87.45 87.31 86.81 87.49
    Xinhua 98.68 73.94 99.15 99.25 93.93 93.22 93.12 93.41 93.11
    TunxunWb 98.66 10.25 83.81 85.44 72.34 79.27 76.63 77.21 77.95
    SinaWb 94.56 11.10 N/A 95.70 97.61 94.94 79.98 97.59 91.72
    SohuWb 94.33 89.53 N/A 99.30 88.92 87.47 87.42 87.37 87.42
    Average 94.01 71.54 N/A 96.42 90.23 90.27 85.80 87.31 89.98
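When reading Tables 1 to 3 together, it may help to check how the three metrics relate. The sketch below assumes the standard definition of the F-score as the harmonic mean of precision and recall; the page itself does not state the formula, so this is an assumption, and the helper name `f_score` is hypothetical. The input values are copied from the CleanEval-En rows of Tables 2 and 3.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# CETR on CleanEval-En: precision 89.42 (Table 2), recall 87.20 (Table 3).
cetr_f = f_score(89.42, 87.20)    # about 88.30, the value listed in Table 1

# CEPR-AT on CleanEval-En: precision 95.96 (Table 2), recall 68.00 (Table 3).
cepr_f = f_score(95.96, 68.00)    # about 79.60, whereas Table 1 lists 75.33

print(round(cetr_f, 2), round(cepr_f, 2))
```

The second case shows that a table entry need not equal the harmonic mean of the corresponding precision and recall entries. One plausible explanation, which is an assumption here, is that each metric is computed per page and then averaged over the pages of a source.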
