A Robust Optimization Approach to Reliable Statistical Inference with Variables Generated by Machine Learning

Information Systems Research · 2025

Cited by 1

Renmin University A · FT50 · UTD24 · ABS 4*

Aaron Schecter · University of Georgia
Weifeng Li · University of Georgia

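Summary

Proposes a robust optimization method that helps analysts who work with machine learning-generated variables reduce the distortion that prediction errors introduce into statistical inference. The method improves the reliability of hypothesis tests and applies a further correction using a small amount of high-quality labeled data. It is applicable to marketing, operations, public policy, and related fields.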

Abstract

Organizations increasingly use machine learning to turn text, images, and other unstructured data into variables that inform decisions and research. But, because machine learning predictions are never perfect, the resulting data can contain errors that quietly distort statistical analyses, sometimes leading to incorrect conclusions about what truly drives important outcomes. This study introduces a robust optimization approach that helps analysts and decision makers draw more reliable insights when working with machine learning–generated data. The method is designed to strengthen the signal of real effects, reducing the influence of noisy or imperfect predictions, resulting in more trustworthy hypothesis tests and fewer missed or misleading findings. The approach also includes a simple correction step that uses a small amount of high-quality labeled data—such as a subset of manually reviewed cases—to further improve accuracy. Across simulations and a real-world example using Amazon reviews, the method consistently delivers more dependable results than common alternatives. For professionals who rely on machine learning in areas such as marketing, operations, public policy, or risk management, this framework offers a practical, transparent way to ensure that conclusions remain sound even when data sources are imperfect.

Machine learning · Statistical inference · Robust optimization · Causal inference · Data quality