Inference for big data assisted by small area methods: an application on sustainable development goals sensitivity of enterprises in Italy
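This paper proposes a doubly robust estimation method that combines a web-scraped nonprobability sample with a probability sample to estimate the sensitivity of enterprises to the sustainable development goals in each Italian province, and validates the method through a Monte Carlo simulation.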
Abstract In this study, we propose a new method for estimating, at the provincial level, the sensitivity of enterprises in Italy to the United Nations' sustainable development goals. Because the Italian National Institute of Statistics does not survey this quantity, the estimates rely on web-scraped data, which form a nonprobability sample. The proposed method uses a probability sample to reduce the selection bias of estimates obtained from the nonprobability sample in the context of small area estimation. The two samples are integrated through a doubly robust estimator that combines (i) propensity weighting, to improve the representativeness of the nonprobability sample, and (ii) a statistical model, to predict values for the units that are not in the nonprobability sample. A bootstrap procedure for estimating the variance is also proposed. To validate the method, a Monte Carlo simulation was performed. The results show that the proposed method corrects the bias of the nonprobability sample while maintaining a good level of estimate reliability.
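As a rough illustration of how propensity weighting and model prediction can be combined, the sketch below computes a generic doubly robust estimate of a mean. It is not the estimator developed in the paper: the function name `doubly_robust_mean`, its arguments, and the ratio-type (Hájek-style) normalisation are assumptions made for illustration, and the propensities and model predictions are taken as already estimated. The small area structure and the bootstrap variance step are not covered.

```python
import numpy as np


def doubly_robust_mean(y_np, pred_np, prop_np, pred_p, design_w_p):
    """Generic doubly robust estimate of a mean (illustrative sketch only).

    y_np       : outcomes observed in the nonprobability sample
    pred_np    : model predictions for the nonprobability units
    prop_np    : estimated participation propensities for those units
    pred_p     : model predictions for the probability-sample units
    design_w_p : design weights of the probability-sample units
    """
    y_np = np.asarray(y_np, dtype=float)
    pred_np = np.asarray(pred_np, dtype=float)
    prop_np = np.asarray(prop_np, dtype=float)
    pred_p = np.asarray(pred_p, dtype=float)
    design_w_p = np.asarray(design_w_p, dtype=float)

    # (i) Propensity weighting: inverse estimated propensities act as
    # pseudo-weights for the nonprobability sample.
    w_np = 1.0 / prop_np
    # Weighted mean of model residuals; this corrects the prediction term
    # when the outcome model is misspecified.
    residual_term = np.sum(w_np * (y_np - pred_np)) / np.sum(w_np)

    # (ii) Model prediction: design-weighted mean of the predictions over
    # the probability sample, standing in for the whole population.
    prediction_term = np.sum(design_w_p * pred_p) / np.sum(design_w_p)

    return residual_term + prediction_term
```

When the outcome model predicts the nonprobability units exactly, the residual term vanishes and the estimate reduces to the design-weighted mean of the predictions; when the predictions carry no information (all zero), it reduces to the propensity-weighted mean of the observed outcomes.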