How Well Do Automated Linking Methods Perform? Lessons from US Historical Data

Journal of Economic Literature · 2020
Cited by 165
Journal rankings: Renmin University of China A · ABS 4

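Summary

The paper evaluates record linking algorithms commonly used with US historical data. It finds that 15 to 37 percent of algorithmic links are errors and that false links systematically affect analytical results, for example attenuating estimates of the intergenerational income elasticity by up to 20 percent.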

Abstract

This paper reviews the literature in historical record linkage in the U.S. and examines the performance of widely-used record linking algorithms and common variations in their assumptions. We use two high-quality, hand-linked datasets and one synthetic ground truth to examine the direct effects of linking algorithms on data quality. We find that (1) no algorithm (including hand-linking) consistently produces representative samples; (2) 15 to 37 percent of links chosen by widely-used algorithms are classified as errors by trained human reviewers; and (3) false links are systematically related to baseline sample characteristics, showing that some algorithms may induce systematic measurement error into analyses. A case study shows that the combined effects of (1)-(3) attenuate estimates of the intergenerational income elasticity by up to 20 percent, and common variations in algorithm assumptions result in greater attenuation. As current practice moves to automate linking and increase link rates, these results highlight the important potential consequences of linking errors on inferences with linked data. We conclude with constructive suggestions for reducing linking errors and directions for future research.
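
To illustrate why false links can attenuate an estimated elasticity, the following is a minimal simulation sketch, not the paper's method or data. It assumes a purely random false-link process (the abstract reports that false links are in fact systematically related to sample characteristics), and the names `simulate`, `beta`, and `false_link_rate`, along with all numeric values, are hypothetical choices made for illustration.

```python
import random


def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var


def simulate(n=50000, beta=0.4, false_link_rate=0.2, seed=0):
    """Compare the slope under true links with the slope under noisy links.

    Assumption (illustrative only): a false link pairs a father with a
    son drawn uniformly at random from the whole sample.
    """
    rng = random.Random(seed)
    # Standardized log incomes; beta plays the role of the elasticity.
    father = [rng.gauss(0.0, 1.0) for _ in range(n)]
    son = [beta * f + rng.gauss(0.0, 1.0) for f in father]
    # With probability false_link_rate, replace the true son with a random one.
    linked_son = [
        son[rng.randrange(n)] if rng.random() < false_link_rate else s
        for s in son
    ]
    return ols_slope(father, son), ols_slope(father, linked_son)


true_slope, linked_slope = simulate()
print(f"slope with true links:  {true_slope:.3f}")
print(f"slope with false links: {linked_slope:.3f}")
print(f"ratio:                  {linked_slope / true_slope:.3f}")
```

Under this random-mismatch assumption, the slope estimated from the linked sample shrinks toward zero by a factor of roughly one minus the false-link rate, because mismatched pairs contribute no covariance between father and son. Errors that correlate with sample characteristics, as the abstract describes, can bias estimates in less predictable ways.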

Keywords: automated linking methods; historical data linkage; linking errors; data quality