Debiasing ML- or AI-Generated Regressors in Partially Linear Models
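To address the measurement-error bias that arises when AI-generated variables are used as regressors, new estimators are proposed that need only a small human-annotated subsample to achieve unbiased and efficient estimation in partially linear models, and that are applicable to experimental systems at platforms such as Tencent and Amazon.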
Organizations increasingly use machine learning (ML) and artificial intelligence (AI), including large language models, to generate variables for regression models that inform business and policy decisions. For example, practitioners may use AI to predict review sentiment, ad aesthetics, or emotional expressions, and then estimate their causal effects on outcomes such as sales or engagement. However, because AI predictions are imperfect, directly using these AI-generated variables as regressors introduces measurement error that can systematically bias causal estimates, potentially leading to over- or underinvestment in business strategies. We develop new estimators that correct this bias in partially linear regression models, which are widely deployed in experimental systems at major platforms, including Tencent, Amazon AWS, and Microsoft. Our approach requires only a small human-annotated subsample alongside the large AI-labeled data set to achieve unbiased and efficient estimation. We demonstrate that our methods work with both traditional ML algorithms and LLM-based predictions. Our framework can be directly integrated into existing analytics and experimental systems, enabling practitioners to leverage the scalability of AI-generated data while maintaining reliable causal conclusions. This work also has implications for AI fairness, as our approach can help correct biases from any source in AI predictions.
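To make the measurement-error problem concrete, the following is a minimal simulation sketch, not the estimators developed in this work. It assumes a partially linear model Y = theta * X + g(Z) + noise with a binary regressor X, an AI label X_hat that flips X with a fixed probability, and a cubic polynomial in Z as a stand-in for g. The correction shown is a generic moment adjustment that uses the labeled subsample to fix the cross-product moments. The function names, sample sizes, and error rate are all illustrative assumptions, and no attempt is made at the efficiency that the proposed estimators target.

```python
import numpy as np

def design(x, z):
    """Regressor of interest plus a cubic polynomial basis in z (stand-in for g)."""
    return np.column_stack([x, np.ones_like(z), z, z**2, z**3])

def simulate(n_total=200_000, n_labeled=5_000, theta=2.0, flip_prob=0.1, seed=0):
    """Hypothetical data: binary X, AI label X_hat that flips X with prob flip_prob."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n_total)
    x = (rng.random(n_total) < 1.0 / (1.0 + np.exp(-z))).astype(float)
    flip = rng.random(n_total) < flip_prob
    x_hat = np.where(flip, 1.0 - x, x)
    y = theta * x + np.sin(z) + rng.normal(size=n_total)
    labeled = np.zeros(n_total, dtype=bool)
    labeled[:n_labeled] = True  # rows where the human label X is observed
    return y, x, x_hat, z, labeled

def naive_ols(y, x_hat, z):
    """Plug the AI label straight into the regression (biased under label error)."""
    coef, *_ = np.linalg.lstsq(design(x_hat, z), y, rcond=None)
    return coef[0]

def moment_corrected(y, x, x_hat, z, labeled):
    """Correct the AI-label moments using the human-labeled subsample.

    Illustrative assumption: E[W W'] and E[W Y] are estimated from all AI-labeled
    rows, then shifted by the average discrepancy seen on the labeled rows.
    """
    w_hat = design(x_hat, z)
    w_hat_lab = w_hat[labeled]
    w_true_lab = design(x[labeled], z[labeled])  # true X is used only here
    y_lab = y[labeled]
    n_all, n_lab = len(y), int(labeled.sum())
    gram = w_hat.T @ w_hat / n_all + (w_true_lab.T @ w_true_lab - w_hat_lab.T @ w_hat_lab) / n_lab
    cross = w_hat.T @ y / n_all + (w_true_lab - w_hat_lab).T @ y_lab / n_lab
    return np.linalg.solve(gram, cross)[0]

y, x, x_hat, z, labeled = simulate()
theta_naive = naive_ols(y, x_hat, z)
theta_corrected = moment_corrected(y, x, x_hat, z, labeled)
print(f"true theta = 2.0, naive = {theta_naive:.3f}, corrected = {theta_corrected:.3f}")
```

Under these assumptions, the naive coefficient is pulled toward zero, while the corrected one should land near the true value, at the cost of extra variance from the small labeled subsample.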