Strategies in the multi-armed bandit

Experimental Economics · 2025

Cited by 1

Renmin University (RUC) A-ABS 3

Stanton Hudja · Illinois Institute of Technology
D. C. Woods (corresponding author)

Summary

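The paper uses an experiment to analyze individual behavior in multi-armed bandit problems. It finds that most subjects are best fit by a probabilistic “win-stay lose-shift” strategy or by reinforcement learning, but that subjects' behavior violates the assumptions of the former. The authors therefore design two new strategies and verify their robustness.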

Abstract

This paper analyzes individual behavior in multi-armed bandit problems. We use a between-subjects experiment to implement four bandit problems that vary based on the horizon (indefinite or finite) and the number of bandit arms (two or three). We analyze commonly suggested strategies and find that an overwhelming majority of subjects are best fit by either a probabilistic “win-stay lose-shift” strategy or reinforcement learning. However, we show that subjects violate the assumptions of the probabilistic win-stay lose-shift strategy as switching depends on more than the previous outcome. We design two new “biased” strategies that adapt either reinforcement learning or myopic quantal response by incorporating a bias toward choosing the previous arm. We find that a majority of subjects are best fit by one of these two strategies but also find heterogeneity in subjects’ best-fitting strategies. We show that the performance of our biased strategies is robust to adapting popular strategies from other literatures (e.g., EWA and I-SAW) and using different selection criteria. Additionally, we find that our biased strategies best fit a majority of subjects when analyzing a new treatment with a new set of subjects.
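
To make the probabilistic win-stay lose-shift idea concrete, the following is a minimal Python sketch, not the authors' specification. It assumes a common textbook parameterization: after a win the subject stays with probability `p_stay_win`, after a loss the subject shifts with probability `p_shift_loss`, and a shift picks uniformly among the other arms. The function name, both parameter names, and the uniform first choice are illustrative assumptions and do not come from the paper.

```python
import random


def pwsls_choice(prev_arm, prev_win, n_arms, p_stay_win, p_shift_loss, rng):
    """Pick the next arm under a probabilistic win-stay lose-shift rule.

    Hypothetical parameterization (an assumption, not taken from the paper):
    the choice depends only on the previous arm and whether it paid off.
    """
    if prev_arm is None:
        # No history yet: assume a uniform first choice.
        return rng.randrange(n_arms)

    # Probability of repeating the previous arm, given the last outcome.
    p_stay = p_stay_win if prev_win else 1.0 - p_shift_loss

    if rng.random() < p_stay:
        return prev_arm

    # Shift: choose uniformly among the arms other than the previous one.
    others = [arm for arm in range(n_arms) if arm != prev_arm]
    return rng.choice(others)


rng = random.Random(0)
next_arm = pwsls_choice(prev_arm=0, prev_win=True, n_arms=3,
                        p_stay_win=0.9, p_shift_loss=0.6, rng=rng)
```

Because the rule conditions only on the most recent arm and outcome, any dependence of switching on earlier history, which is what the abstract reports, falls outside what this form can capture.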

Keywords: multi-armed bandit problem; win-stay lose-shift strategy; reinforcement learning; individual decision-making
