From Text to Insight: Leveraging Large Language Models for Performance Evaluation in Management
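This study tests whether large language models can reliably supplement human judgment in evaluating text-based task performance. Analyzing 744 knowledge-based performance outputs, it finds that advanced AI models correlate more strongly with expert consensus (r = 0.62) than aggregated human ratings do (r = 0.50), and that newer models are more resistant to bias.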
ABSTRACT This study examines whether Large Language Models (LLMs) can serve as reliable supplements to human judgment in evaluating text-based task performance. Across two studies analyzing 744 knowledge-based performance outputs, we compare ratings from multiple LLMs (GPT-4, GPT-5, o3, Claude Sonnet 4, DeepSeek v3) against ratings from human evaluators (both individual and aggregated), with external expert consensus serving as the validity benchmark for both. This multi-model design shows that the LLMs evaluate as well as or better than human raters, with newer models performing best. Measured against the external expert panels, advanced AI models achieve correlations of up to r = 0.62 with expert consensus, surpassing aggregated human ratings (r = 0.50). The AI systems are also more consistent than human evaluators, but they differ in bias resistance: newer models show minimal susceptibility to halo effects, while earlier models are more vulnerable (GPT-4 declining 35.6%). Our findings validate LLMs as reliable supplements to human evaluation, establish external benchmarking protocols, and provide evidence-based guidance for selecting models according to evaluation requirements and bias resistance needs.
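The validity comparison described in the abstract can be illustrated with a minimal sketch, assuming each rater source is scored by its Pearson correlation with the expert consensus and that human ratings are aggregated by a simple per-item mean. The function names (`pearson_r`, `aggregate`) and all numbers below are hypothetical toy values for illustration only; they are not the study's data, and the abstract does not state which aggregation rule or software the authors used.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def aggregate(ratings_by_rater):
    """Average several raters' scores item by item (assumed aggregation rule)."""
    return [mean(item_scores) for item_scores in zip(*ratings_by_rater)]


# Toy data (hypothetical): six outputs scored on a 1-5 scale.
expert_consensus = [3.0, 4.5, 2.0, 5.0, 3.5, 1.5]  # validity benchmark
human_raters = [
    [3, 4, 3, 5, 2, 2],
    [2, 5, 2, 4, 4, 1],
    [4, 4, 1, 5, 3, 3],
]
llm_ratings = [3, 5, 2, 5, 4, 1]

human_aggregate = aggregate(human_raters)
r_human = pearson_r(human_aggregate, expert_consensus)
r_llm = pearson_r(llm_ratings, expert_consensus)

print(f"aggregated humans vs experts: r = {r_human:.2f}")
print(f"LLM vs experts:               r = {r_llm:.2f}")
```

Both rater sources are scored against the same external benchmark, so the two correlations are directly comparable; that shared benchmark is the design choice the abstract emphasizes.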