Addressing sample selection bias for machine learning methods

Journal of Applied Econometrics · 2024
Cited by: 1
Renmin University (人大) AABS 3

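Overview

This paper studies how to adjust machine learning methods when the training sample and the prediction sample differ on unobserved dimensions. It proposes two control function approaches that remove the effects of selection bias before training, and uses simulations and election data to show that they reduce prediction error.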

Abstract

We study approaches for adjusting machine learning methods when the training sample differs from the prediction sample on unobserved dimensions. The machine learning literature predominantly assumes selection only on observed dimensions. Common approaches are to weight or include variables that influence selection as solutions to selection on observables. Simulation results show that selection on unobservables increases mean squared prediction error using popular machine‐learning algorithms. Common machine learning practices such as weighting or including variables that influence selection into the training or prediction sample often worsen sample selection bias. We propose two control function approaches that remove the effects of selection bias before training and find that they reduce mean‐squared prediction error in simulations. We apply these approaches to predicting the vote share of the incumbent in gubernatorial elections using previously observed re‐election bids. We find that ignoring selection on unobservables leads to substantially higher predicted vote shares for the incumbent than when the control function approach is used.
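
To make the idea of removing selection bias before training concrete, the following is a minimal sketch of a textbook Heckman-type control function under joint normality. It is an illustration only: the abstract does not give the form of the paper's two approaches, and the symbols $s$, $z$, $\gamma$, $u$, $f$, $\varepsilon$, $\rho$, $\sigma_\varepsilon$ and $\tilde{y}$ are assumptions introduced here.

```latex
% Assumed setup (not taken from the paper):
% selection rule and outcome equation with jointly normal errors
\begin{align*}
  s &= \mathbf{1}\{ z'\gamma + u > 0 \}, \qquad
  y = f(x) + \varepsilon, \\
  (u, \varepsilon) &\sim \mathcal{N}\!\left(0,
    \begin{pmatrix} 1 & \rho\sigma_\varepsilon \\
                    \rho\sigma_\varepsilon & \sigma_\varepsilon^{2}
    \end{pmatrix}\right).
\end{align*}

% Conditional mean in the selected (training) sample:
% the extra term is the control function
\begin{align*}
  \mathbb{E}[\, y \mid x, z, s = 1 \,]
    &= f(x) + \rho\sigma_\varepsilon\, \lambda(z'\gamma), \qquad
  \lambda(a) = \frac{\phi(a)}{\Phi(a)}.
\end{align*}

% One possible pre-training adjustment: subtract the estimated
% control function from the outcome, then train on (x, \tilde{y})
\begin{align*}
  \tilde{y} &= y - \widehat{\rho\sigma_\varepsilon}\,
               \lambda(z'\hat{\gamma}).
\end{align*}
```

Here $\phi$ and $\Phi$ are the standard normal density and distribution function, and $\hat{\gamma}$ would typically come from a first-stage probit of $s$ on $z$. When $\rho = 0$ there is no selection on unobservables and the adjustment vanishes, which matches the abstract's point that the problem arises specifically from the unobserved dimension.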

Keywords: sample selection bias · machine learning · control function approach · prediction error