Quality Control for Crowd Workers and for Language Models: A Framework for Free-Text Response Evaluation with No Ground Truth
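Proposes the AQER framework, which aggregates multiple responses to generate a synthetic correct answer, so that the quality of free-text responses from large language models and crowd workers can be evaluated without ground-truth answers, helping managers select models, monitor performance, and manage crowdsourcing quality.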
As businesses increasingly rely on large language models (LLMs) for tasks such as customer service and information retrieval, ensuring the accuracy of their responses is a critical challenge. Traditional verification is costly, slow, and often requires scarce domain experts. We introduce the automated quality evaluation based on textual responses (AQER) framework, a novel, cost-effective method to assess the correctness of free-text answers from both LLMs and human workers without needing preexisting correct answers. AQER works by aggregating multiple responses to the same question, leveraging the wisdom of the crowd to create a reliable synthetic correct answer, and then applying an iterative procedure that accounts for response quality cues. AQER achieves state-of-the-art performance relative to existing automated response evaluation baselines. For managers, AQER offers a scalable, data-driven method to (i) evaluate and select the best-performing LLMs for specific organizational needs and use cases, (ii) continuously monitor artificial intelligence (AI) performance to ensure reliability and accountability across different model versions, and (iii) manage the quality of crowd workers essential for high-quality AI training and validation. AQER thus offers a robust mechanism for improving model performance and mitigating the significant financial and reputational risks associated with deploying untrustworthy generative AI technologies.
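To make the general idea concrete, the following is a minimal sketch of agreement-based evaluation without ground truth. It is not the AQER procedure itself, whose details are not given here. It assumes a crude token-overlap (Jaccard) similarity in place of a real text-similarity measure, treats agreement with other sources as the only quality cue, and picks the best-supported response as a stand-in for the synthetic correct answer. The names `jaccard`, `estimate_quality`, and the sample data are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; an assumed stand-in for a real text-similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def estimate_quality(responses, n_iter=10):
    """Illustrative helper (hypothetical, not the AQER algorithm).

    responses: {question_id: {source_id: answer_text}}
    Returns (source_weights, synthetic_answers).
    """
    sources = sorted({s for answers in responses.values() for s in answers})
    weights = {s: 1.0 for s in sources}  # start by trusting every source equally

    for _ in range(n_iter):
        totals = defaultdict(float)
        counts = defaultdict(int)
        for answers in responses.values():
            for s, text in answers.items():
                # Agreement of this response with the others, weighted by their current quality
                others = [(weights[o], jaccard(text, t)) for o, t in answers.items() if o != s]
                denom = sum(w for w, _ in others)
                if denom > 0:
                    totals[s] += sum(w * sim for w, sim in others) / denom
                    counts[s] += 1
        # A source's new weight is its mean weighted agreement across the questions it answered
        weights = {s: (totals[s] / counts[s] if counts[s] else 0.0) for s in sources}

    # Stand-in for the synthetic answer: the response best supported by the weighted crowd
    synthetic = {}
    for q, answers in responses.items():
        def support(s):
            return sum(weights[o] * jaccard(answers[s], t) for o, t in answers.items() if o != s)
        synthetic[q] = answers[max(sorted(answers), key=support)]
    return weights, synthetic


# Hypothetical sample data: two models and one crowd worker answering two questions
responses = {
    "q1": {
        "model_a": "Paris is the capital of France",
        "model_b": "The capital of France is Paris",
        "worker_c": "Lyon",
    },
    "q2": {
        "model_a": "Water boils at 100 degrees Celsius",
        "model_b": "Water boils at 100 degrees Celsius at sea level",
        "worker_c": "It boils at 50 degrees",
    },
}
weights, synthetic = estimate_quality(responses)
```

In this toy setup, a source that repeatedly disagrees with the others ends up with a lower weight, and its answers contribute less to the choice of the reference answer in later iterations. A practical system would likely replace the token-overlap measure with a semantic similarity model and generate, rather than select, the reference answer.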