Text Preprocessing for Text Mining in Organizational Research: Review and Recommendations

ORGANIZATIONAL RESEARCH METHODS · 2020
Cited by 349
Renmin University A · ABS 4

Summary

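Reviews computational linguistics and organizational text mining research to provide empirically grounded recommendations for text preprocessing decisions. The recommendations account for the type of text mining, the research question, and the data set's characteristics, and the article emphasizes transparent reporting to promote reproducibility.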

Abstract

Recent advances in text mining have provided new methods for capitalizing on the voluminous natural language text data created by organizations, their employees, and their customers. Although often overlooked, decisions made during text preprocessing affect whether the content and/or style of language are captured, the statistical power of subsequent analyses, and the validity of insights derived from text mining. Past methodological articles have described the general process of obtaining and analyzing text data, but recommendations for preprocessing text data were inconsistent. Furthermore, primary studies use and report different preprocessing techniques. To address this, we conduct two complementary reviews of computational linguistics and organizational text mining research to provide empirically grounded text preprocessing decision-making recommendations that account for the type of text mining conducted (i.e., open or closed vocabulary), the research question under investigation, and the data set’s characteristics (i.e., corpus size and average document length). Notably, deviations from these recommendations will be appropriate and, at times, necessary due to the unique characteristics of one’s text data. We also provide recommendations for reporting text mining to promote transparency and reproducibility.

Organizational research · Text mining · Natural language processing · Research methods