Does language bias GenAI academic evaluation in humanities and social sciences? A mixed‐methods study based on Chinese‐language HSS papers

Journal of the Association for Information Science and Technology (JASIST) · 2026

Citations: 0 · Top 9% among papers in the same journal and year

ABS 3

Yu Zhu (corresponding author)
Yujie Jia
Yumeng Zhu (corresponding author)
Jiyuan Ye

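Summary

This study examines whether GPT-4o and DeepSeek-V3 show language bias when evaluating papers in Chinese and in English. It finds that the two models are biased in opposite directions and that their scores are decoupled from the reasons they give for them.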

Abstract

As generative AI (GenAI) systems are increasingly deployed in cross‐language research evaluation, whether GenAI evaluates multilingual scholarship without language‐induced bias remains unclear. This study examines language bias patterns in GenAI evaluation of humanities and social sciences (HSS) research across models and disciplines. Using a within‐subjects design, 1150 expert‐selected papers from 23 disciplines were evaluated by GPT‐4o and DeepSeek‐V3 in Chinese and English. Results reveal opposite language biases depending on model type: GPT‐4o favors English (Cohen's d = 1.10), while DeepSeek‐V3 favors Chinese (Cohen's d = −0.87), persisting across all disciplines. Thematic analysis reveals a systematic decoupling between scores and evaluative reasoning: both models generate more critical comments for English papers, yet arrive at opposite scores through different rhetorical strategies: GPT‐4o tends to moderate its positive assessments of Chinese papers while DeepSeek‐V3 amplifies them. This decoupling suggests that bias is embedded in the multi‐layered pathways through which models generate and aggregate evaluations. This study provides controlled evidence that language bias in GenAI evaluation is bidirectional and model‐dependent, with scores not directly reflecting evaluative justifications. The findings have implications for designing fairer multilingual academic evaluation systems and for understanding the limitations of GenAI as scholarly evaluation infrastructure.

Generative AI · Academic evaluation · Language bias · Humanities and social sciences · Mixed methods
