Correcting Misclassification Bias in Regression Models with Variables Generated via Data Mining

Information Systems Research · 2021

Cited by 20

Renmin University of China A · FT50 · UTD24 · ABS 4*

Mengke Qiao · University of Science and Technology of China
Ke‐Wei Huang · National University of Singapore

Summary

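The paper studies what happens when variables built with data mining methods are used in regression analysis: classification errors make the coefficient estimates inconsistent. It proposes correction formulas that yield consistent and most precise estimators.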

Abstract

There is a surge of interest in social science studies in applying data mining methods to construct variables for regression analysis. For example, text classification was applied to classify whether the review is subjective or objective. The derived review subjectivity was used as an independent variable in the regression to examine its impact on review helpfulness. In the classification phase of these studies, researchers need to subjectively choose a classification performance metric for optimization. No matter which performance metric is chosen, the constructed variable still includes classification error because the variable cannot be classified perfectly. The misclassification of constructed variables will lead to inconsistent estimators of regression coefficients in the following phase. To correct the estimation inconsistency, we summarize and modify existing proofs in econometrics to derive theoretical formulas of consistent estimators in generalized linear models. The main implication of our theoretical result is that the inconsistency can be corrected by theoretical formulas, even when the classification accuracy is poor. Therefore, we propose that a classification algorithm should be tuned to minimize the standard error of the focal coefficient derived based on the corrected formula. As a result, researchers derive a consistent and most precise estimator in generalized linear models.
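The inconsistency described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's estimator. It covers only the textbook special case of a simple linear regression on one binary regressor, with misclassification assumed independent of the outcome given the true label. In that case the naive slope converges to the true coefficient multiplied by (PPV + NPV - 1), where PPV and NPV are the classifier's positive and negative predictive values. All numbers (sample size, error rates, the size of the hand-labeled validation subset) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta_true = 2.0  # assumed true coefficient, for illustration only

# True binary variable (e.g. "review is subjective"), unobserved in practice.
x = rng.random(n) < 0.4

# Classifier output w: flips 15% of true positives and 10% of true negatives
# (assumed error rates, independent of the outcome given x).
flip = np.where(x, rng.random(n) < 0.15, rng.random(n) < 0.10)
w = np.where(flip, ~x, x)

# Outcome generated from the TRUE variable.
y = 1.0 + beta_true * x + rng.normal(size=n)

def ols_slope(regressor, outcome):
    """Slope of a simple linear regression of outcome on one regressor."""
    r = regressor.astype(float)
    return np.cov(r, outcome)[0, 1] / np.var(r, ddof=1)

# Naive estimate: regress y on the misclassified variable.
naive = ols_slope(w, y)

# Estimate PPV and NPV from a labeled validation subset
# (assumed here: the first 20,000 rows were hand-labeled).
xv, wv = x[:20_000], w[:20_000]
ppv = xv[wv].mean()        # P(x = 1 | w = 1)
npv = (~xv[~wv]).mean()    # P(x = 0 | w = 0)

# Corrected estimate: undo the attenuation factor.
corrected = naive / (ppv + npv - 1.0)

print(f"naive={naive:.3f}  corrected={corrected:.3f}  true={beta_true}")
```

Under these assumed error rates the attenuation factor is 0.75, so the naive slope is pulled toward zero while the corrected slope lands close to the true value. The correction stays valid even when the classifier is fairly inaccurate, as long as PPV + NPV - 1 is not close to zero. This is the same qualitative point the abstract makes for generalized linear models.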

Econometrics · Machine learning · Regression analysis · Data mining · Text classification
