logo

SCIENTIA SINICA Informationis, Volume 50, Issue 5: 649-661(2020) https://doi.org/10.1360/SSI-2019-0172

Multi-source relational data fusion

More info
  • ReceivedAug 16, 2019
  • AcceptedSep 30, 2019
  • PublishedApr 23, 2020

Abstract

Focusing on the problem of relational data fusion in the environment with “information isolated island”, this paper presents a multi-sources relational data fusion (MSF) framework. The framework consists of three components: schema matching, entity alignment, and entity fusion. Based on the Hungarian algorithm, we propose an alignment discovery mechanism for the attributes alignment among multi-sources relational data. By extracting the multi-dimensional features of attribute values, we efficiently realized schema matching of multi-sources relational data. To link the tuple pairs from multi-source data, we introduced the diversity sampling strategy and the entity feature extraction approach. These can effectively improve the performance of entity alignment. Finally, linked entities are fused to provide a unified view of data analysis. To verify the usefulness and efficiency of the proposed methods, we implemented a fusion system called Data Puzzle, which is verified with the real public multi-field data. Experimental results demonstrate that the proposed methods can fuse multi-source relational data efficiently with high recall and precision.


Funded by

北京市科技计划(Z171100005117002)

国家自然科学基金(61972403,61872315,61661146001)


References

[1] Doan A H, Halevy A, Ives Z. Principles of data integration. Waltham: Elsevier, 2012. 19--58. Google Scholar

[2] Song X Y, Wang Y H. Data Integration and Application Integration. Beijing: China Water Power Press, 2008. Google Scholar

[3] Pandey Y, Bansal S. Safety check: a semantic web application for emergency management. In: Proceedings of the International Workshop on Semantic Big Data, New York, 2017. 1--6. Google Scholar

[4] Qin Y B, Feng L, Chen Y P, et al. “Intelligent Court” data fusion analysis and integrated application. Big Data, 2019, 3: 35--46. Google Scholar

[5] Du X Y, Chen Y G, Lu W, et al. Data wrangling: a key technique of data governance. Big Data Res, 2019, 5: 13--22. Google Scholar

[6] Alwan A A, Nordin A, Alzeber M, et al. A Survey of schema matching research using database schemas and instances. Int J Adv Computer Sci Applicat, 2017, 8: 102--111. Google Scholar

[7] Bernstein P A, Madhavan J, Rahm E. Generic schema matching, ten years later. In: Proceedings of the 37th International Conference on Very Large Data Bases, Seattle, 2011. 695--701. Google Scholar

[8] Munir S, Khan F, Riaz M A. An instance-based schema matching between opaque database schemas. In: Proceedings of the 4th International Conference on Engineering Technology and Technopreneuship, Kuala Lumpur, 2014. 177--182. Google Scholar

[9] Do H H, Erhard R, COMA: a system for flexible combination of schema matching approaches. In: Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, 2002. 610--621. Google Scholar

[10] Zhao H, Ram S. Combining schema and instance information for integrating heterogeneous data sources. Data Knowledge Eng, 2007, 61: 281-303 CrossRef Google Scholar

[11] Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In: Proceedings of the 18th International Conference on Data Engineering, Washington, 2002. 117--128. Google Scholar

[12] Alzeber M, Alwan A A, Nordin A, et al. An empirical comparative study of instance-based schema matching. Int J Electr Electron Syst, 2018, 10: 1266-1277. Google Scholar

[13] Li Y, Liu D B, Zhang W M. A method for automatic schema matching using characteristic of data distribution. Computer science, 2005, 32: 85--87. Google Scholar

[14] Mehdi O A, Ibrahim H, Affendey L S. An approach for instance based schema matching with google similarity and regular expression. Int Arab J Inf Technol, 2017, 14(5): 755-763. Google Scholar

[15] Cilibrasi R L, Vitanyi P M B. The Google Similarity Distance. IEEE Trans Knowl Data Eng, 2007, 19: 370-383 CrossRef Google Scholar

[16] Nozaki K, Hochin T, Nomiya H. Semantic schema matching for string attribute with word vectors and its evaluation. Int J Netw Distrib Comput, 2019, 7(3): 100-106. Google Scholar

[17] Mikolov T, Chen K, Corrado G, et al. Efficient estimation of word representations in vector space. In: Proceedings of the International Conference on Learning Representations, Scottsdale, 2013. 1--12. Google Scholar

[18] Kuhn H W. The Hungarian method for the assignment problem. Naval Res Logistics, 1955, 2: 83-97 CrossRef Google Scholar

[19] Zhao C, He Y Y. Auto-EM: end-to-end fuzzy entity-matching using pre-trained deep models and transfer learning. In: Proceedings of the World Wide Web Conference, San Francisco, 2019. 2413--2424. Google Scholar

[20] Dong X L, Rekatsinas T. Data integration and machine learning: a natural synergy. In: Proceedings of the 2018 International Conference on Management of Data, New York, 2018. 1645--1650. Google Scholar

[21] Dunn H L. Record Linkage. Am J Public Health Nations Health, 1946, 36: 1412-1416 CrossRef Google Scholar

[22] Christen P. Development and user experiences of an open source data cleaning, deduplication and record linkage system. SIGKDD Explor Newsl, 2009, 11: 39-48 CrossRef Google Scholar

[23] Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, 2002. 269--278. Google Scholar

[24] Dong X, Gabrilovich E, Heitz G, et al. Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, 2014. 601--610. Google Scholar

[25] Nickel M, Murphy K, Tresp V. A Review of Relational Machine Learning for Knowledge Graphs. Proc IEEE, 2016, 104: 11-33 CrossRef Google Scholar

[26] Xiao C, Wang W, Lin X. Ed-Join. Proc VLDB Endow, 2008, 1: 933-944 CrossRef Google Scholar

[27] Jiang Y, Li G, Feng J. String similarity joins. Proc VLDB Endow, 2014, 7: 625-636 CrossRef Google Scholar

[28] Singh R, Meduri V, Elmagarmid A, et al. Generating concise entity matching rules. In: Proceedings of the 2017 ACM International Conference on Management of Data, New York, 2017. 1635--1638. Google Scholar

[29] Fan W, Jia X, Li J. Reasoning about record matching rules. Proc VLDB Endow, 2009, 2: 407-418 CrossRef Google Scholar

[30] Arasu A, Gotz M, Kaushik R. On active learning of record matching packages. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, 2010. 783--794. Google Scholar

[31] Singla P, Domingos P. Entity resolution with markov logic. In: Proceedings of the 6th International Conference on Data Mining, Hong Kong, 2006. 572--582. Google Scholar

[32] Zhuang Y, Li G, Zhong Z, et al. Hike: a hybrid human-machine method for entity alignment in large-scale knowledge bases. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 2017. 1917--1926. Google Scholar

[33] Mudgal S, Li H, Rekatsinas T, et al. Deep learning for entity matching: A design space exploration. In: Proceedings of the 2018 International Conference on Management of Data, Houston, 2018. 19--34. Google Scholar

[34] Konda P, Naughton J, Prasad S. Magellan. Proc VLDB Endow, 2016, 9: 1197-1208 CrossRef Google Scholar

[35] Pang J A. Research on web text feature extraction method and its development. Inf Stud: Theor Ap, 2006, 3: 338--340+367. Google Scholar

  • Figure 1

    (Color online) Multi-source relational data fusion framework

  • Figure 2

    Time comparison of schema matching

  • Figure 3

    (Color online) Evaluation of entity alignment results. (a) Precision; (b) recall; (c) F1-score

  • Figure 4

    (Color online) Multi-source relational data fusion platform

  • Table 1   Feature extraction
    Features of string type Features of int/float type
    ItemLetterAverage ItemAverage
    LetterVariationCoefficient ItemVariationCoefficient
    LetterProbabliity ItemProbability
    NormalizedLetterEntropy
  •   

    Algorithm 1 Schema matching algorithm based on Hungarian

    Input:$n$ tables to be aligned: tbs;

    Output:Schema matching results: bestmatches_list;

    1: function multimatch(tbs)

    2: Feature_list $\Leftarrow$ feature_extraction(tbs);//Extract feature

    3: foridx1,idx2 initertools.combinations(range(len(Feature list)), 2)

    4: DisMatrix $\Leftarrow$ Combination(Feature_lis[idx1], Feature_lis[idx2]);//Generate distance matrix

    5: bestmatches $\Leftarrow$ Hungarian(DisMatrix);//Find the best matches with Hungarian

    6: bestmatches list.append(bestmatches);

    7: end for

    8: return bestmatches list.

    9: end function

  •   

    Algorithm 2 Entity matching algorithm based on Magellan

    Input:Two tables to be aligned: tb1, tb2; Matched attribute pairs: schema_correlations;

    Output:fusion_results;

    1: function:Magellan(tb1, tb2, schema correlations);

    2: data1,data2 $\Leftarrow$ ( loaddata(tb1, tb2);

    3: C $\Leftarrow$ ( overlap coefficient_join(data1, data2, schema correlations);//blocking

    4: S $\Leftarrow$ ( diversity sample(C);//sampling

    5: Label $\Leftarrow$ ( load labeled sample(labeled path); // Label candidate pairs in S with 0/1 manually, which means non-matched/matched respectively and load it;

    6: F $\Leftarrow$ ( feature extract(Labeled);

    7: Train, Test $\Leftarrow$ ( split(F);

    8: for match in matchers; //decision tree, random forest, support vector machine $\cdots$

    9: model, result $\Leftarrow$ ( cross validation(match, Train);//training

    10: bestmodel, bestresult $\Leftarrow$ ( compare(model, result);

    11: end for

    12: predictions $\Leftarrow$ ( best model.predict(Test);//evaluation

    13: eval result $\Leftarrow$ ( eval matches(predictions);

    14: candset feature $\Leftarrow$ ( feature extract(C);

    15: alignment results $\Leftarrow$ ( best model.predict(candset feature);//prediction

    16: fusion results $\Leftarrow$ ( formalize(alignment results);//data collation and specification

    17: return fusion results.

    18: end function

  • Table 2   Evaluation of schema matching results
    Evaluating indicator 2 tables 3 tables 4 tables 5 tables 6 tables 7 tables
    Recall (%) 77.2 77.2 77.2 77.1 77.1 77.0
    Precision (%) 27.2 27.2 27.1 27.1 27.1 27.1
    Evaluating indicator 8 tables 9 tables 10 tables 11 tables 12 tables Average
    Recall (%) 76.9 76.8 76.7 76.5 76.4 77.1
    Precision (%) 27.1 27.1 27.0 27.0 27.0 27.1
  • Table 3   Schema matching results of single-source and multi-source data
    Fusion type Source table1 Source table2 Match results Profit
    Single-source l_水务局行政许可 r_水务局行政处罚 l_行政相对人名称 r_行政相对人名称 0.7019
    l_统一社会信用代码 r_统一社会信用代码 0.5454
    l_保健食品生产单位证件信息 r_保健食品企业标准备案信息 l_名称 r_企业名称 0.8501
    Multi-source l_水务局行政许可 r_企业信息 l_统一社会信用代码 r_creditcode 0.8067
    l_行政相对人名称 r_name 0.7079
    l_食品生产企业 r_标准化达标信息 l_企业名称 r_企业名称 0.7169
  • Table 4   Fusion results of multi-source data
    水务局行政许可&水务局行政处罚
    Single-source l_统一社会信用代码 l_行政相对人名称 r_统一社会信用代码 r_行政相对人名称 Incomplete match reasons
    91110105M****MYD5M 北京伯佳诊所有限公司 91110105M****MYD5M 北京嘉树广渠诊所有限公司 Changed company name
    4****289-0 北京市建筑工程研究院有限责任公司 911101084****2890C 北京市建筑工程研究院有限责任公司 Inconsistent entries of credit code
    91110105****4838H 北京招商局物业管理有限公司招商局大厦 911101057****4838H 北京招商局物业管理有限公司 Incorrect entries of credit code and inconsistent entries of company name
    保健食品生产单位证件信息&保健食品企业标准备案信息
    l_名称 r_企业名称 Incomplete match reasons
    北京中研万通科技有限责任公司保健食品厂 北京中研万通科技中心 Inconsistent entries of company name
    华润紫竹药业有限公司 北京紫竹药业有限公司
    Multi-source 水务局行政许可&企业信息
    l_统一社会信用代码 l_行政相对人名称 r_creditCode r_name Incomplete match reasons
    911101067****5204Q 北京八方达客运有限责任公司保修分公司 911101067****5204Q 北京公共交通控股(集团)有限公司第二保修分公司 Changed company name
    91110006****2434H 东方艺术大厦有限公司 911100006****2434H 东方艺术大厦有限公司 Inconsistent entries of credit code
    911101056****9797T 北京湄洲三苏大酒楼有限公司 911101056****3797T 北京湄州三苏大酒楼有限公司 Incorrect entries of credit code and inconsistent entries of company name
    食品生产企业&标准化达标信息
    l_企业名称 r_企业名称 Incomplete match reasons
    北京航天农业生物科技有限公司 北京首诚航天农业生物科技有限公司 Inconsistent entries of company name
    宝健中国有限公司 宝健中国日用品有限公司

Copyright 2020 Science China Press Co., Ltd. 《中国科学》杂志社有限责任公司 版权所有

京ICP备18024590号-1       京公网安备11010102003388号