Quality Control for Crowd Workers and for Language Models: A Framework for Free-Text Response Evaluation with No Ground Truth

Information Systems Research · 2025

Cited by: 0

Journal rankings: Renmin University of China A · FT50 · UTD24 · ABS 4*

Inbal Yahav · Tel Aviv University
Anat Goldstein · Ariel University
Tomer Geva · Tel Aviv University
Shahar Meir · Tel Aviv University
Onn Shehory · Bar-Ilan University

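Summary

The paper proposes the AQER framework, which aggregates multiple responses to the same question into a synthetic correct answer. This allows the quality of free-text answers from large language models and crowd workers to be evaluated without ground truth, helping managers select models, monitor performance, and manage crowdsourcing quality.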

Abstract

As businesses increasingly rely on large language models (LLMs) for tasks such as customer service and information retrieval, ensuring the accuracy of their responses is a critical challenge. Traditional verification is costly, slow, and often requires scarce domain experts. We introduce the automated quality evaluation based on textual responses (AQER) framework, a novel, cost-effective method to assess the correctness of free-text answers from both LLMs and human workers without needing preexisting correct answers. AQER works by intelligently aggregating multiple responses to the same question, leveraging the wisdom of the crowd to create a reliable synthetic correct answer, followed by an iterative procedure that accounts for response quality cues. AQER obtains state-of-the-art performance compared with existing automated response evaluation baselines. For managers AQER offers a scalable, data-driven method to (i) evaluate and select the best performing LLMs for specific organizational needs and use cases, (ii) continuously monitor artificial intelligence (AI) performance to ensure reliability and accountability across different model versions, and (iii) manage the quality of crowd workers essential for high-quality AI training and validation. AQER, thus, offers a robust mechanism for improving model performance and mitigating the significant financial and reputational risks associated with deploying untrustworthy generative AI technologies.
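
The aggregate-then-iterate idea described in the abstract can be illustrated with a toy sketch. This is not the AQER algorithm: the abstract does not specify the similarity measure, the aggregation rule, or the quality cues, so everything below is an assumption made for illustration. The sketch uses bag-of-words cosine similarity, builds a leave-one-out "synthetic answer" per question as a quality-weighted sum of the other respondents' word counts, and treats a respondent's mean agreement score as the only quality cue. The function name `consensus_scores` and the example data are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consensus_scores(responses, n_iter=10):
    """Toy consensus scoring (illustrative assumption, not the paper's method).

    responses: {question_id: {respondent_id: free-text answer}}
    Returns (quality per respondent, score per question per respondent).
    """
    vecs = {
        q: {r: Counter(text.lower().split()) for r, text in answers.items()}
        for q, answers in responses.items()
    }
    respondents = sorted({r for answers in responses.values() for r in answers})
    quality = {r: 1.0 for r in respondents}  # start with equal trust
    scores = {}
    for _ in range(n_iter):
        scores = {}
        for q, answers in vecs.items():
            scores[q] = {}
            for r, vec in answers.items():
                # Synthetic answer: quality-weighted sum of everyone else's words.
                synthetic = Counter()
                for other, other_vec in answers.items():
                    if other == r:
                        continue
                    for token, count in other_vec.items():
                        synthetic[token] += quality[other] * count
                scores[q][r] = cosine(vec, synthetic)
        # Re-estimate each respondent's quality as their mean agreement score.
        for r in respondents:
            own = [scores[q][r] for q in scores if r in scores[q]]
            quality[r] = sum(own) / len(own)
    return quality, scores


# Hypothetical data: respondents "a" and "b" agree, "c" is an outlier.
example = {
    "q1": {
        "a": "paris is the capital of france",
        "b": "the capital of france is paris",
        "c": "berlin",
    },
    "q2": {
        "a": "water boils at 100 degrees celsius",
        "b": "water boils at 100 celsius",
        "c": "it freezes at zero",
    },
}
quality, scores = consensus_scores(example)
```

In this sketch the respondents may be crowd workers or different LLMs. Respondents who consistently agree with the weighted consensus gain influence over the synthetic answer in later iterations, which is one simple way to make the aggregation depend on estimated response quality. Note the limitation of any such toy scheme: a majority that is consistently wrong would still be scored as correct.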

Keywords: quality control · crowdsourcing · large language models · text evaluation · AI management
