Jagged competencies: Measuring the reliability of generative AI in academic research

JOURNAL OF BUSINESS RESEARCH · 2025
Cited by 3
Renmin University of China (RUC) A-ABS 3

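Summary

The study examines the consistency, accuracy, and their interaction for three large language models (ChatGPT, Llama, and Mistral) given the same prompts on the same data corpus over fifteen weeks. It finds significant differences in reliability, but also shows that the models can behave deterministically under specific constraints, and it offers guidance for management scholars on using LLMs responsibly.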

Abstract

Large Language Models (LLMs) are increasingly viewed as a valuable tool for academic research. While LLMs have some benefits, a ‘crisis of replicability’ in management scholarship militates against unrestrained use. In this paper we investigate the reproducibility of LLM analyses. We analyze three LLMs (ChatGPT, Llama, and Mistral) over fifteen weeks, testing consistency, accuracy, and their interaction using the same prompts on the same data corpus. While our results demonstrate significant variations in reliability and consistency across the three LLMs, we also show that LLMs can exhibit deterministic and reliable behavior under specific, well-defined constraints. We argue that replicable LLM-based research will rely on understanding and validating the task-specific operational boundaries of the LLM. To ensure the responsible integration of LLMs into management research, we highlight a need for robust frameworks, transparency, ethical guidelines, and ongoing evaluation. We conclude with actionable guidance for management researchers.

Management · Research methods · Artificial intelligence · Academic research