On the (Mis)Use of Machine Learning With Panel Data

Oxford Bulletin of Economics and Statistics · 2025
Cited by 1 · Top 10% of articles in the same journal and year
Renmin University of China (RUC) AABS 3

Summary

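The paper gives the first systematic assessment of data leakage when machine learning is applied to panel data. It shows that ignoring the cross-sectional and time structure of the data leads to overstated performance, and it offers practical guidelines. It serves as a caution for economists who use panel data for prediction.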

Abstract

We provide the first systematic assessment of data leakage issues in the use of machine learning on panel data. Our organising framework clarifies why neglecting the cross-sectional and longitudinal structure of these data leads to hard-to-detect data leakage, inflated out-of-sample performance, and an inadvertent overestimation of the real-world usefulness and applicability of machine learning models. We then offer empirical guidelines for practitioners to ensure the correct implementation of supervised machine learning in panel data environments. An empirical application, using data from over 3000 U.S. counties spanning 2000 to 2019 and focused on income prediction, illustrates the practical relevance of these points across nearly 500 models for both classification and regression tasks.
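
To make the splitting issue concrete, the following is a minimal sketch, not the authors' procedure. It builds a toy panel of (unit, year) keys and compares a naive random split with a split by time and a split by unit. All function names, the toy panel size, and the cutoff year are assumptions chosen for illustration; the two counts reported are simple proxies for the two ways test rows can overlap with training data.

```python
import random

def make_panel(n_units=6, years=range(2000, 2010)):
    """Return a toy panel as a list of (unit_id, year) row keys (hypothetical data)."""
    return [(u, y) for u in range(n_units) for y in years]

def random_split(rows, test_frac=0.3, seed=0):
    """Naive i.i.d. split that ignores both panel dimensions."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def temporal_split(rows, cutoff_year):
    """Train on years up to cutoff_year, test on strictly later years."""
    train = [r for r in rows if r[1] <= cutoff_year]
    test = [r for r in rows if r[1] > cutoff_year]
    return train, test

def unit_holdout_split(rows, test_units):
    """Hold out whole units so that no test unit appears in training."""
    held_out = set(test_units)
    train = [r for r in rows if r[0] not in held_out]
    test = [r for r in rows if r[0] in held_out]
    return train, test

def overlap_report(train, test):
    """Count test rows that share a unit with training, and test rows
    dated no later than the latest training year."""
    train_units = {u for u, _ in train}
    last_train_year = max(y for _, y in train)
    return {
        "test_rows": len(test),
        "shared_unit": sum(1 for u, _ in test if u in train_units),
        "not_after_train": sum(1 for _, y in test if y <= last_train_year),
    }

panel = make_panel()
naive = overlap_report(*random_split(panel))
by_time = overlap_report(*temporal_split(panel, cutoff_year=2006))
by_unit = overlap_report(*unit_holdout_split(panel, test_units=[4, 5]))

for name, report in [("random", naive), ("by time", by_time), ("by unit", by_unit)]:
    print(name, report)
```

In this toy setup the split by time leaves no test row dated within the training period, and the split by unit leaves no test unit in the training set, but each addresses only one dimension. Which dimension matters depends on whether the model is meant to forecast future periods, generalise to unseen units, or both.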

Keywords: data leakage, panel data, machine learning, supervised learning