Online Learning with Sample Selection Bias
研究了推荐系统中因用户不点击导致结果被截断而产生的选择偏差问题,提出了样本选择赌博机算法,结合赫克曼两步估计和乐观原则,实现了接近最优的遗憾率,并在真实捐赠数据上验证了有效性。
Personalized recommendation systems often face the challenge of making optimal decisions when user preferences are unknown, and outcomes are only observed if users engage with the platform (e.g., clicking a recommendation). In their paper, “Online Learning with Sample Selection Bias,” Singhvi and Singhvi study this problem in the context of sequential decision making, where the censoring of outcomes leads to selection bias. Ignoring this bias results in suboptimal recommendations and linear regret, even for well-performing existing learning algorithms. To address this, they propose the sample selection bandit (SSB) algorithm, which combines Heckman’s two-step estimator with the “optimism under uncertainty” principle. The authors also demonstrate that SSB achieves a near-optimal regret rate. Extensive numerical experiments using synthetic and real-world donation data confirm that SSB significantly outperforms existing algorithms, effectively addressing selection bias while improving recommendations and outcomes in practical settings.