Reinforcement learning in a prisoner's dilemma
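Studies the limiting behavior of reinforcement learning algorithms such as stateless Q-learning in a prisoner's dilemma, showing how the learning rate and the game payoffs determine whether players learn to cooperate or defect, with implications for algorithmic collusion.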
I characterize the outcomes of a class of model-free reinforcement learning algorithms, such as stateless Q-learning, in a prisoner's dilemma. The behavior is studied in the limit as players stop experimenting after sufficiently exploring their options. A closed-form relationship between the learning rate and the game payoffs reveals whether the players will learn to cooperate or defect. The findings have implications for algorithmic collusion and also apply to asymmetric learners with different experimentation rules.
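To make the setting concrete, the following is a minimal sketch of two stateless Q-learners playing a repeated prisoner's dilemma. It is an illustration under stated assumptions, not the paper's model or its closed-form result. The payoff values, the epsilon-greedy rule with linear decay, the zero initial Q-values, and the names `PAYOFFS`, `choose`, `greedy`, and `simulate` are all hypothetical choices made here.

```python
import random

# Hypothetical payoffs (not from the source), ordered T > R > P > S.
# Each entry maps (action of player 1, action of player 2) to (payoff 1, payoff 2).
PAYOFFS = {
    ("C", "C"): (3.0, 3.0),
    ("C", "D"): (0.0, 5.0),
    ("D", "C"): (5.0, 0.0),
    ("D", "D"): (1.0, 1.0),
}
ACTIONS = ("C", "D")


def greedy(q):
    """Return the action with the highest Q-value (ties go to the first listed)."""
    return max(ACTIONS, key=lambda a: q[a])


def choose(q, epsilon, rng):
    """Epsilon-greedy choice: experiment with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return greedy(q)


def simulate(alpha, rounds=5000, seed=0):
    """Run two stateless Q-learners with learning rate alpha.

    Experimentation decays linearly to zero (an assumed schedule), so the
    final rounds are played without exploration.
    """
    rng = random.Random(seed)
    q1 = {a: 0.0 for a in ACTIONS}
    q2 = {a: 0.0 for a in ACTIONS}
    for t in range(rounds):
        epsilon = max(0.0, 1.0 - t / (0.8 * rounds))
        a1 = choose(q1, epsilon, rng)
        a2 = choose(q2, epsilon, rng)
        r1, r2 = PAYOFFS[(a1, a2)]
        # Stateless update: move the chosen action's value toward the realized payoff.
        q1[a1] += alpha * (r1 - q1[a1])
        q2[a2] += alpha * (r2 - q2[a2])
    return greedy(q1), greedy(q2), q1, q2


if __name__ == "__main__":
    for alpha in (0.05, 0.5):
        a1, a2, _, _ = simulate(alpha)
        print(f"alpha={alpha}: final greedy actions = ({a1}, {a2})")
```

Running the script prints the greedy action each learner ends with for two learning rates. Which outcome appears depends on the assumed payoffs, decay schedule, and seed, so the sketch is only a way to experiment with the dependence on the learning rate, not a reproduction of the characterization above.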