Statistical inference in the presence of imputed survey data through regression trees and random forests
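Studies how to carry out point and variance estimation for a population mean after missing survey values have been imputed with regression trees and random forests, and uses simulation to evaluate bias, efficiency, and confidence interval coverage.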
Abstract
Item nonresponse in surveys is usually handled through some form of imputation. In recent years, imputation through machine learning procedures has attracted considerable attention in national statistical offices. However, little is known about the theoretical properties of the resulting point estimators in a survey setting. In this article, we study regression trees and random forests, which provide flexible tools for obtaining imputed values. In a high-dimensional framework that allows the number of predictors to diverge, we lay out a set of conditions for establishing the mean square consistency of regression-tree and random-forest imputed estimators of a finite population mean. We propose a novel variance estimator based on a K-fold cross-validation procedure. The proposed point and variance estimators are assessed through a simulation study in terms of bias, efficiency, and coverage rate of normal-based confidence intervals. Finally, the choice of hyperparameters involved in random forest algorithms is investigated through theoretical and empirical work.
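As a rough illustration of the kind of estimator the abstract refers to, the sketch below fits a single regression tree on the respondents and uses its predictions to fill in the nonrespondents before computing a weighted estimate of the population mean. It is a minimal sketch and not the authors' procedure. The form of the estimator (weighted sum of observed and imputed values divided by the population size), the single predictor, the unweighted tree fit, and all names (`fit_tree`, `predict`, `imputed_mean`, `min_leaf`, `pop_size`) are assumptions made for illustration. The variance estimator is not sketched because the abstract does not describe its form.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def fit_tree(xs, ys, min_leaf=5):
    """Fit a one-predictor regression tree by recursive least-squares splits.

    Returns a float (leaf mean) or a tuple (threshold, left, right).
    Hypothetical helper: survey weights are ignored when fitting.
    """
    n = len(ys)
    if n < 2 * min_leaf or min(ys) == max(ys):
        return mean(ys)
    pairs = sorted(zip(xs, ys))
    best = None
    for cut in range(min_leaf, n - min_leaf + 1):
        if pairs[cut - 1][0] == pairs[cut][0]:
            continue  # cannot split between tied predictor values
        left = [y for _, y in pairs[:cut]]
        right = [y for _, y in pairs[cut:]]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            best = (score, cut)
    if best is None:
        return mean(ys)
    cut = best[1]
    threshold = (pairs[cut - 1][0] + pairs[cut][0]) / 2
    left, right = pairs[:cut], pairs[cut:]
    return (
        threshold,
        fit_tree([x for x, _ in left], [y for _, y in left], min_leaf),
        fit_tree([x for x, _ in right], [y for _, y in right], min_leaf),
    )


def predict(tree, x):
    """Walk down the tree until a leaf mean is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def imputed_mean(xs, ys, weights, pop_size, min_leaf=5):
    """Tree-imputed estimate of a population mean (assumed form).

    ys[k] is None for a nonrespondent; its value is replaced by the
    prediction of a tree fitted on the respondents only.
    """
    resp = [(x, y) for x, y in zip(xs, ys) if y is not None]
    tree = fit_tree([x for x, _ in resp], [y for _, y in resp], min_leaf)
    total = 0.0
    for x, y, w in zip(xs, ys, weights):
        total += w * (y if y is not None else predict(tree, x))
    return total / pop_size


# Toy sample of 8 units, each representing 2 population units (made-up data).
xs = [10, 20, 30, 40, 60, 70, 80, 90]
ys = [1.0, 1.0, 1.0, None, None, 3.0, 3.0, 3.0]
weights = [2.0] * 8
estimate = imputed_mean(xs, ys, weights, pop_size=16, min_leaf=1)
print(estimate)
```

A random forest version would average the predictions of many such trees, each grown on a resample of the respondents. The minimum leaf size used here is one example of the hyperparameters whose choice the abstract mentions.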