Special Considerations for the Acquisition and Wrangling of Big Data

ORGANIZATIONAL RESEARCH METHODS · 2017

Cited by 46

Renmin University of China journal list: A- · ABS: 4

Michael T. Braun · University of South Florida (corresponding author)
Goran Kuljanin · DePaul University
Richard P. DeShon · Michigan State University

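Summary

Written for organizational science researchers, the article describes the distinctive challenges of acquiring and wrangling big data (such as web scraping, splitting files into chunks, and handling multiple file formats) and provides practical guidance and R code examples to help ensure data quality and reproducible results.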

Abstract

Organizational scientists must capitalize on the big data revolution to better understand the nomothetic, idiographic, multilevel, and/or dynamic processes that make up today’s workplace. Simultaneously, researchers must collect high-quality data and be careful, diligent, and deliberate during data wrangling and data analysis so that all results can be replicated and all inferences are appropriate. Unfortunately, big data create many uncommon challenges during data acquisition and data wrangling that must be considered and overcome to fulfill the promise and potential of big data. Specifically, during acquisition, organizational scientists must become familiar with concepts like web scraping and databases, determine how to divide big data files into manageable chunks for cleaning and analysis, all while ensuring not to violate data usage rules and regulations. Likewise, once acquired, to effectively wrangle data so that they are ready for analysis researchers must be able to handle multiple file formats and data encoding standards, utilize a variety of software to visualize and diagnose data structure, and be adept at using functions and algorithms to determine variable structure and evaluate records and variables for missing or erroneous information. The current article provides a concise definition of big data and addresses each of these novel challenges and concepts related to big data acquisition and wrangling, specifically focusing on providing guidance and recommendations. Finally, a detailed big data example, team development using play-by-play basketball data, is provided. Each step of the process of scraping the data from the web as well as wrangling the multilevel big data into tidy data form is discussed, accompanied by a supplemental R file that contains all of the code necessary for researchers to replicate the described procedure.

Keywords: organizational science · big data · data science · data cleaning
