Scoring employment interviews with large language models: Evaluation design components, validity investigations, and best practice recommendations.

Journal of Applied Psychology · 2026

Journal rankings: Renmin University A+ · FT50 · ABS 4*

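Kayden Stockdale · Virginia Polytechnic Institute and State University
Louis Hickman · Virginia Polytechnic Institute and State University
Siyi Liu · Virginia Polytechnic Institute and State University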

Summary

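This study uses large language models to score employment interviews. It finds that ensembles of larger, newer models have psychometric properties comparable or superior to those of supervised machine learning models and single human raters. Organizations should nonetheless adopt them with caution, and the authors offer best practice recommendations.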

Abstract

= 144). We then investigated the LLM scores' intrarater reliabilities, test-retest correlations, convergent, discriminant, and criterion evidence of validity, group differences, and measurement bias. We compared this evidence, when possible, to the same evidence for human raters and supervised machine learning models. The results suggest that ensembles of larger, newer LLMs using prompts with detailed construct information hold potential for scoring employment interviews with psychometric properties comparable to or superior to supervised machine learning models and single human raters. We detail the reasons that organizations may want to be cautious in adopting LLMs for scoring high-stakes open-ended assessments, but since organizations have already begun adopting them, we also offer best practice recommendations.

Human resource management · Psychometrics · Artificial intelligence applications · Interview assessment