A Framework for Reconciling Attribute Values from Multiple Data Sources

Management Science · 2007
Cited by 23
Renmin University of China A+ · FT50 · UTD24 · ABS 4*

Summary

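To address inconsistent attribute values recorded for the same entity in different databases, the paper proposes a framework based on probability distributions and error-cost minimization that helps select the best value and integrate multiple data sources.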

Abstract

Because of the heterogeneous nature of different data sources, data integration is often one of the most challenging tasks in managing modern information systems. While the existing literature has focused on problems such as schema integration and entity identification, it has largely overlooked a basic question: When an attribute value for a real-world entity is recorded differently in different databases, how should the “best” value be chosen from the set of possible values? This paper provides an answer to this question. We first show how a probability distribution over a set of possible values can be derived. We then demonstrate how these probabilities can be used to solve a given decision problem by minimizing the total cost of type I, type II, and misrepresentation errors. Finally, we propose a framework for integrating multiple data sources when a single “best” value has to be chosen and stored for every attribute of an entity.
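
The idea of using a probability distribution over candidate values to pick a single stored value can be illustrated with a small sketch. This is not the paper's procedure: it assumes the distribution is already given, it models only a misrepresentation cost (the cost of storing one value when another is true), and it leaves out type I and type II errors, which depend on the decision problem at hand. The function name `choose_best_value`, the cost function, and the city data are all hypothetical.

```python
def choose_best_value(probabilities, misrepresentation_cost):
    """Pick the candidate value with the lowest expected misrepresentation cost.

    probabilities: dict mapping each candidate value to the probability
        that it is the true value (assumed to be given and to sum to 1).
    misrepresentation_cost: hypothetical function (stored, true) -> cost
        of storing `stored` when `true` is the real value.
    Returns a (value, expected_cost) tuple.
    """
    best_value, best_cost = None, float("inf")
    for stored in probabilities:
        # Expected cost of storing this candidate, summed over every
        # other value that might actually be the true one.
        expected = sum(
            p * misrepresentation_cost(stored, true)
            for true, p in probabilities.items()
            if true != stored
        )
        if expected < best_cost:
            best_value, best_cost = stored, expected
    return best_value, best_cost


# Illustrative data only: three databases disagree on a customer's city.
city_probabilities = {"Boston": 0.6, "Austin": 0.3, "Denver": 0.1}

# With a uniform cost, the most probable value is chosen.
value, cost = choose_best_value(city_probabilities, lambda stored, true: 1.0)
print(value, cost)
```

With a uniform cost the choice reduces to the most probable value. With asymmetric costs a less probable value can win, which is why a cost-based criterion can differ from simply taking the mode.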

Keywords: data integration · attribute value conflict · probability distribution · decision cost