Different but the Same? An Event-Driven Approach to Determine Probabilities of Data Duplication
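We propose an event-driven, probability-based approach that detects data duplicates by analyzing the patterns of the events that cause them. It outperforms existing approaches on multiple datasets, and the learned models can be transferred between datasets of the same domain.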
The importance of data quality assessment in general, and of duplicate detection in particular, has been recognized in both research and practice. Duplicates are known to cause critical problems in many domains, including customer relationship management, data management and data warehousing, fraud detection, production, and healthcare. Such duplicates are caused by events that are typically associated with data patterns in the records. For example, duplicates in customer databases caused by the event “relocation of a customer” characteristically exhibit dissimilar values for address-related features, while the values of the other features (e.g., name-related features) tend to be highly similar. Analyzing duplicate-related events and recognizing such patterns therefore seems particularly promising for duplicate detection. However, existing approaches do not take advantage of this: they neither consider events nor recognize the associated data patterns when detecting potential duplicates. In this paper, we introduce events as causes of duplicates and, on this basis, propose a novel probability-based approach for duplicate detection. Our approach assigns a probability of being a duplicate to each analyzed pair of records while avoiding restrictive methodological assumptions of existing approaches. The evaluation on seven different datasets shows that our approach is able to outperform existing approaches and that the learned models can be transferred between datasets of the same domain without further labeling or repeated model learning.
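To make the "relocation of a customer" example concrete, the following minimal sketch computes a per-feature similarity vector for a pair of records and checks whether it shows the pattern described above (highly similar name features, dissimilar address features). It is an illustration only, not the probability model proposed in the paper: the function names, the feature names, the thresholds `high` and `low`, the sample records, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_vector(r1: dict, r2: dict, features: list) -> dict:
    """Feature-wise similarities for one pair of records."""
    return {f: similarity(r1[f], r2[f]) for f in features}


def matches_relocation_pattern(sims, name_features, address_features,
                               high=0.9, low=0.5):
    """Hypothetical rule: all name features are highly similar and
    at least one address feature is clearly dissimilar."""
    names_similar = all(sims[f] >= high for f in name_features)
    address_differs = any(sims[f] <= low for f in address_features)
    return names_similar and address_differs


# Hypothetical feature names and records, invented for illustration.
NAME_FEATURES = ["first_name", "last_name"]
ADDRESS_FEATURES = ["street", "city"]
ALL_FEATURES = NAME_FEATURES + ADDRESS_FEATURES

before_move = {"first_name": "Anna", "last_name": "Schmidt",
               "street": "12 Oak Lane", "city": "Springfield"}
after_move = {"first_name": "Anna", "last_name": "Schmidt",
              "street": "481 Harbor Road", "city": "Riverton"}
other_person = {"first_name": "Boris", "last_name": "Keller",
                "street": "12 Oak Lane", "city": "Springfield"}

sims_moved = similarity_vector(before_move, after_move, ALL_FEATURES)
sims_other = similarity_vector(before_move, other_person, ALL_FEATURES)

moved_flag = matches_relocation_pattern(
    sims_moved, NAME_FEATURES, ADDRESS_FEATURES)
other_flag = matches_relocation_pattern(
    sims_other, NAME_FEATURES, ADDRESS_FEATURES)
```

In this sketch the pair `before_move` / `after_move` shows the relocation pattern, while the pair with `other_person` does not. A fixed-threshold rule like this only flags the pattern; the approach described in the abstract instead assigns each pair a probability of being a duplicate, which this example does not attempt to reproduce.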