Do Humans and GAI See Eye to Eye? Implications of LLM Scoring Volatility in Supplier Evaluations
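This study compares generative AI with human experts in government supplier evaluation. It finds that AI is stable and aligned with humans when assessing compliance, but highly volatile when assessing competitive signals, where human involvement is needed.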
ABSTRACT This study compares Generative Artificial Intelligence (GAI) with human procurement professionals on supplier evaluation tasks. Using Structural Topic Modeling (STM) on 123 government supplier bids from 31 projects solicited by the State of Ohio between January 2023 and December 2024, we compare evaluations from three reasoning models (o3, Grok‐3‐Mini, DeepSeek R1‐0528) against those of human evaluators. Adopting a signaling theory perspective, we find an asymmetry in signal processing between GAI and human evaluators. GAI demonstrates high consistency and strong human alignment when evaluating compliance signals (e.g., technical specifications), which makes it suitable for qualification screening. However, GAI exhibits high scoring volatility with competitive signals (e.g., value‐add propositions), indicating that human judgment remains critical for assessing differentiation. We also find that the number of bidders influences signal composition, with compliance signals more prevalent in less competitive solicitations. The findings suggest a two‐stage evaluation framework in which GAI handles compliance screening and humans focus on competitive assessment. GAI scoring volatility can serve as a canary in the coal mine, identifying when human oversight is necessary.
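To make the idea of volatility as a trigger for human oversight concrete, the following is a minimal sketch, not the study's actual procedure. It assumes that each bid is scored several times by a model, that volatility is measured as the sample standard deviation of those repeated scores, and that bids above a cutoff are routed to a human evaluator. The function names, the example scores, and the threshold of 5.0 are all hypothetical; the abstract does not specify how volatility is computed or thresholded.

```python
from statistics import mean, stdev


def score_volatility(scores):
    """Sample standard deviation of repeated scores for one bid.

    Assumption: volatility is operationalized as a standard deviation;
    the study may use a different measure.
    """
    return stdev(scores) if len(scores) >= 2 else 0.0


def route_bids(repeated_scores, threshold=5.0):
    """Flag bids whose repeated scores are too unstable to trust unreviewed.

    repeated_scores: dict mapping a bid id to a list of scores from
    repeated model runs. The threshold is a hypothetical cutoff.
    """
    routing = {}
    for bid_id, scores in repeated_scores.items():
        vol = score_volatility(scores)
        routing[bid_id] = {
            "mean_score": mean(scores),
            "volatility": vol,
            "needs_human_review": vol > threshold,
        }
    return routing


# Illustrative, made-up scores: bid_A is stable, bid_B is volatile.
example = {
    "bid_A": [82, 83, 81, 82],
    "bid_B": [60, 85, 72, 90],
}
result = route_bids(example)
for bid_id, info in result.items():
    print(bid_id, round(info["volatility"], 2), info["needs_human_review"])
```

In this sketch the stable bid would pass through automated screening, while the volatile one would be handed to a human evaluator, mirroring the two-stage framework described in the abstract.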