logo

SCIENCE CHINA Information Sciences, Volume 63 , Issue 7 : 172001(2020) https://doi.org/10.1007/s11432-020-2917-1

Unifying logic rules and machine learning for entity enhancing

More info
  • ReceivedApr 1, 2020
  • AcceptedApr 16, 2020
  • PublishedJun 8, 2020

Abstract

This paper proposes a notion of entity enhancing, which unifies entity resolution and conflict resolution, to identify tuples that refer to the same real-world entity and at the same time, correct semantic inconsistencies. We propose to unify rule-based and machine learning (ML) methods for entity enhancing, by embedding ML classifiers as predicates in logic rules. We model entity enhancing by extending the chase. We show that the chase warrants correctness justification and the Church-Rosser property. Moreover, we settle fundamental problems associated with entity enhancing, including the enhancing, consistency, satisfiability, and implication problems, ranging from mathsf NPxspace-complete and mathsf coNPxspace-complete to $\Pi_2^p$-complete. Taken together, these provide a new theoretical framework for unifying entity resolution and conflict resolution.


Acknowledgment

This work was supported in part by Shenzhen Institute of Computing Sciences, Beijing Advanced Innovation Center for Big Data and Brain Computing (Beihang University), Royal Society Wolfson Research Merit Award (Grant No. WRM/R1/180014), European Research Council (Grant No. 652976), Engineering and Physical Sciences Research Council (Grant No. EP/M025268/1).


References

[1] Wikibon. A comprehensive list of big data statistics, 2012. http://wikibon.org/blog/big-data-statistics/. Google Scholar

[2] Fan W, Gao H, Jia X. Dynamic constraints for record matching. VLDB J, 2011, 20: 495-520 CrossRef Google Scholar

[3] Bertossi L, Kolahi S, Lakshmanan L V S. Data Cleaning and Query Answering with Matching Dependencies and Matching Functions. Theor Comput Syst, 2013, 52: 441-482 CrossRef Google Scholar

[4] Bhattacharya I, Getoor L. Collective entity resolution in relational data. ACM Trans Knowl Discov Data, 2007, 1: 5 CrossRef Google Scholar

[5] Arasu A, Ré C, Suciu D. Large-scale deduplication with constraints using Dedupalog. In: Proceedings of the 25th International Conference on Data Engineering, 2009. Google Scholar

[6] Mudgal S, Li H, Rekatsinas T, et al. Deep learning for entity matching: a design space exploration. In: Proceedings of International Conference on Management of Data, 2018. Google Scholar

[7] Arasu A, Götz M, Kaushik R. On active learning of record matching packages. In: Proceedings of International Conference on Management of Data, 2010. Google Scholar

[8] Fan W, Geerts F, Jia X. Conditional functional dependencies for capturing data inconsistencies. ACM Trans Database Syst, 2008, 33: 1-48 CrossRef Google Scholar

[9] Golab L, Karloff H, Korn F, et al. On generating near-optimal tableaux for conditional functional dependencies. In: Proceedings of the VLDB Endowment, 2008. Google Scholar

[10] Fan W, Geerts F, Tang N. Conflict resolution with data currency and consistency. J Data Inf Qual, 2014, 5: 1-37 CrossRef Google Scholar

[11] Arenas M, Bertossi L, Chomicki J. Consistent query answers in inconsistent databases. In: Proceedings of Symposium on Principles of Database Systems, 1999. Google Scholar

[12] Chu X, Ilyas I F, Papotti P. Holistic data cleaning: putting violations into context. In: Proceedings of IEEE International Conference on Data Engineering, 2013. Google Scholar

[13] Chiticariu L, Li Y Y, Reiss F R. Rule-based information extraction is dead Long live rule-based information extraction systems In: Proceedings of Empirical Methods in Natural Language Processing, 2013. Google Scholar

[14] Fan W F, Li J Z, Ma S, et al. Interaction between record matching and data repairing. In: Proceedings of International Conference on Management of Data, 2011. Google Scholar

[15] Dong X, Halevy A, Madhavan J. Reference reconciliation in complex information spaces. In: Proceedings of International Conference on Management of Data, 2005. Google Scholar

[16] Whang S E, Benjelloun O, Garcia-Molina H. Generic entity resolution with negative rules. VLDB J, 2009, 18: 1261-1277 CrossRef Google Scholar

[17] Sadri F, Ullman J D. The interaction between functional dependencies and template dependencies. In: Proceedings of International Conference on Management of Data, 1980. Google Scholar

[18] Bahmani Z, Bertossi L, Vasiloglou N. ERBlox : Combining matching dependencies with machine learning for entity resolution. Int J Approximate Reasoning, 2017, 83: 118-141 CrossRef Google Scholar

[19] Whang S E, Garcia-Molina H. Joint entity resolution on multiple datasets. VLDB J, 2013, 22: 773-795 CrossRef Google Scholar

[20] Verroios V, Garcia-Molina H, Papakonstantinou Y. Waldo: an adaptive human interface for crowd entity resolution. In: Proceedings of International Conference on Management of Data, 2017. Google Scholar

[21] Firmani D, Saha B, Srivastava D. Online entity resolution using an Oracle. Proc VLDB Endow, 2016, 9: 384-395 CrossRef Google Scholar

[22] Ebraheem M, Thirumuruganathan S, Joty S, et al. Distributed representations of tuples for entity resolution. In: Proceedings of Very Large Data Bases, 2018. Google Scholar

[23] Qian K, Popa L, Sen P. Active learning for large-scale entity resolution. In: Proceedings of Conference on Information and Knowledge Management, 2017. Google Scholar

[24] Zhang D X, Guo L, He X N, et al. A graph-theoretic fusion framework for unsupervised entity resolution. In: Proceedings of the 34th International Conference on Data Engineering, 2018. Google Scholar

[25] Yakout M, Elmagarmid A K, Neville J, et al. Guided data repair. In: Proceedings of Very Large Data Bases, 2011. Google Scholar

[26] He J, Veltri E, Santoro D, et al. Interactive and deterministic data cleaning. In: Proceedings of International Conference on Management of Data, 2016. Google Scholar

[27] Assadi A, Milo T, Novgorodov S. Dance: data cleaning with constraints and experts. In: Proceedings of International Conference on Data Engineering, 2017. Google Scholar

[28] Guo S T, Dong X L, Srivastava D, et al. Record linkage with uniqueness constraints and erroneous values. In: Proceedings of Very Large Data Bases, 2010. Google Scholar

[29] Fan W, Li J, Ma S. Towards certain fixes with editing rules and master data. VLDB J, 2012, 21: 213-238 CrossRef Google Scholar

[30] Fan W, Lu P, Tian C. Deducing certain fixes to graphs. Proc VLDB Endow, 2019, 12: 752-765 CrossRef Google Scholar

[31] Yakout M, Berti-Équille L, Elmagarmid A K. Don't be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In: Proceedings of International Conference on Management of Data, 2013. 553--564. Google Scholar

[32] Abiteboul S, Hull R, Vianu V. Foundations of Databases. Addison-Wesley, 1995. Google Scholar

[33] Aires J P, Meneguzzi F. Norm conflict identification using deep learning. In: Proceedings of International Conference on Autonomous Agents and Multiagent Systems, 2017. 194--207. Google Scholar

[34] Sycara K P. Machine learning for intelligent support of conflict resolution. Decision Support Syst, 1993, 10: 121-136 CrossRef Google Scholar

[35] Loshin D. Master Data Management. San Francisco: Knowledge Integrity Inc., 2009. Google Scholar

[36] Chandra A K, Merlin P M. Optimal implementation of conjunctive queries in relational data bases. In: Proceedings of Symposium on the Theory of Computing, 1977. Google Scholar

[37] Aggarwal C C. Data Classification: Algorithms and Applications. Boca Raton: CRC Press, 2014. Google Scholar

[38] Fan W F, Geerts F. Foundations of Data Quality Management. San Rafael: Morgan & Claypool Publishers, 2012. Google Scholar

[39] Klug A. On conjunctive queries containing inequalities. J ACM, 1988, 35: 146-160 CrossRef Google Scholar

[40] Baudinet M, Chomicki J, Wolper P. Constraint-Generating Dependencies. J Comput Syst Sci, 1999, 59: 94-115 CrossRef Google Scholar

[41] Beeri C, Bernstein P A. Computational problems related to the design of normal form relational schemas. ACM Trans Database Syst, 1979, 4: 30-59 CrossRef Google Scholar

[42] Rutenburg V. Complexity of generalized graph coloring. In: Proceedings of International Symposium on Mathematical Foundations of Computer Science, 1986. Google Scholar

[43] Schaefer M, Umans C. Completeness in the polynomial-time hierarchy: a compendium. 2002. http://ovid.cs.depaul.edu/documents/phcom.pdf. Google Scholar

  • Table 1  

    Table 1Relation $D_1$ of schema mathsf productxspace

    mathsf tidxspace ${~{\mathsf{~id}}}\xspace$ ${~{\mathsf{~seller}}}\xspace^*$ ${~{\mathsf{~description}}}\xspace^*$ mathsf weightxspace ${~{\mathsf{~type}}}\xspace^*$ mathsf pricexspace ${~{\mathsf{~online\_year}}}\xspace^*$
    $t_1$ $p_1$ $s_1$ MacBook Air $13''$, mid 2019, 1.6 GH Intel i5, 128 GB SSD 1.3 kg computer 8.3K $2019$
    $t_2$ $p_2$ $s_1$ MacBook MVFK2CH/A, 128 GB $1.3~\text{kg}^*$ computer $8.3\text{K}^*$ $2019$
    $t_3$ $p_3$ $s_2$ MacbookAir8,1, i5, 8 GB SDRAM, 128 GB SSD 9.5 kg computer 7.3K $2018$
    $t_4$ $p_4$ $s_1$ MacBook MREE2CH/A, Intel i5, 8 GB LPDDR3, 128 GB $1.3~~\text{kg}^*$ computer 9.6K $2018$
    $t_5$ $p_5$ $s_2$ iMac MRQY2CH/A $27''$, 1 TB 9.5 kg computer 12.9K $2019$
  • Table 2  

    Table 2Relation $D_2$ of schema mathsf shopxspace

    mathsf tidxspace ${~{\mathsf{~id}}}\xspace$ mathsf namexspace ${~{\mathsf{~date\_created}}}\xspace^*$ ${~{\mathsf{~on\_sale\_product}}}\xspace^*$
    $t_6$ $s_1$ Apple official $2015/1/1$ $p_1$
    $t_7$ $s_2$ Apple $2015/1/1$ $p_2$
  • Table 3  

    Table 3Relation $D_3$ of schema mathsf deliveryxspace

    mathsf tidxspace ${~{\mathsf{~id}}}\xspace$ ${~{\mathsf{~item}}}\xspace^*$ ${~{\mathsf{~quantity}}}\xspace^*$ ${~{\mathsf{~city}}}\xspace^*$ mathsf feexspace
    $t_8$ $d_1$ $p_3$5K Beijing $1\text{K}^*$
    $t_9$ $d_2$ $p_5$5K Beijing 3K
  • Table 4  

    Table 4Complexity for entity enhancing and reasoning about mathsf REEsxspace

    Entity enhancing Consistency Satisfiability Implication
    mathsf NPxspace-complete (Theorem 4.2) mathsf coNPxspace-complete (Theorem 4.3) mathsf NPxspace-complete (Theorem 4.5)$\Pi_2^p$-complete (Theorem 4.8)

Copyright 2020 Science China Press Co., Ltd. 《中国科学》杂志社有限责任公司 版权所有

京ICP备17057255号       京公网安备11010102003388号