Fast Bayesian Record Linkage With Record-Specific Disagreement Parameters

Journal of Business & Economic Statistics · 2021
Cited by 1
Renmin University of China AABS 4

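Summary

The paper proposes a Bayesian method for automated probabilistic record linkage. In a matching of military recruitment data to the 1900 US Census, it recovers more than 50% more true matches than comparable methods while remaining computationally feasible.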

Abstract

Researchers are often interested in linking individuals between two datasets that lack a common unique identifier. Matching procedures often struggle to match records with common names, birthplaces or other field values. Computational feasibility is also a challenge, particularly when linking large datasets. We develop a Bayesian method for automated probabilistic record linkage and show it recovers more than 50% more true matches, holding accuracy constant, than comparable methods in a matching of military recruitment data to the 1900 US Census for which expert-labelled matches are available. Our approach, which builds on a recent state-of-the-art Bayesian method, refines the modelling of comparison data, allowing disagreement probability parameters conditional on non-match status to be record-specific in the smaller of the two datasets. This flexibility significantly improves matching when many records share common field values. We show that our method is computationally feasible in practice, despite the added complexity, with an R/C++ implementation that achieves significant improvement in speed over comparable recent methods. We also suggest a lightweight method for treatment of very common names and show how to estimate true positive rate and positive predictive value when true match status is unavailable.
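
To make the idea of record-specific parameters concrete, here is a minimal sketch in Python. It is not the paper's Bayesian model or its R/C++ implementation. It swaps in a simple Fellegi-Sunter style agreement weight, and every name and number in it (`record_specific_u`, `agreement_weight`, the toy name lists, the value `m = 0.95`) is a hypothetical chosen for illustration. For each record in the smaller file, it estimates the probability that a non-matching record in the larger file agrees on the name by chance (the complement of the disagreement probability the abstract refers to), then shows that agreement on a common name carries far less evidence than agreement on a rare one.

```python
import math
from collections import Counter


def record_specific_u(names_a, names_b):
    """Chance-agreement probability for each record in the smaller file A.

    Assumption for this sketch: the share of file-B records carrying the
    same name approximates the probability that a non-match agrees on it.
    A floor of one count avoids a zero probability for unseen names.
    """
    counts_b = Counter(names_b)
    n_b = len(names_b)
    return [max(counts_b[name], 1) / n_b for name in names_a]


def agreement_weight(m, u):
    """Log-likelihood ratio for observing agreement on a field.

    m: probability of agreement given a true match (assumed value).
    u: probability of agreement given a non-match (record-specific here).
    """
    return math.log(m / u)


# Toy data: one very common name and one rare name (hypothetical).
names_a = ["John Smith", "Zebulon Pike"]
names_b = ["John Smith"] * 50 + ["Zebulon Pike"] + ["Mary Jones"] * 49

m = 0.95  # assumed agreement probability among true matches
u = record_specific_u(names_a, names_b)
weights = [agreement_weight(m, u_i) for u_i in u]

for name, u_i, w in zip(names_a, u, weights):
    print(f"{name}: u = {u_i:.2f}, agreement weight = {w:.2f}")
```

With a single shared u for all records, both names would receive the same weight. Letting u vary by record is what allows a model to discount agreement on "John Smith" while still treating agreement on a rare name as strong evidence.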

Bayesian record linkage · Record-level disagreement parameters · Probabilistic matching · Large-scale data matching