Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback
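For strongly monotone games in which each player observes only its own payoff (not its gradient), the paper proposes a bandit learning algorithm built on self-concordant barrier functions. The algorithm simultaneously achieves optimal single-agent regret and the optimal last-iterate convergence rate to the Nash equilibrium in multi-agent learning, and numerical experiments confirm its effectiveness.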
How can players learn and adapt in an unknown game without knowing the game’s dynamics? In “Doubly Optimal No-Regret Online Learning in Strongly Monotone Games with Bandit Feedback,” Ba, Lin, Zhang, and Zhou study no-regret learning in games where each player observes only its own reward, which is determined by all players’ current joint action, and never the gradient of that reward. Focusing on smooth and strongly monotone games, they introduce a bandit learning algorithm based on self-concordant barrier functions. The algorithm is doubly optimal: it achieves the optimal regret in the single-agent setting and the optimal last-iterate convergence rate to the Nash equilibrium when all players run it. The work improves on previous methods, and numerical results in several applications demonstrate the algorithm’s effectiveness.
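
To make the setting concrete, here is a minimal sketch of barrier-based bandit learning in a toy two-player game. It is not the authors' algorithm. It only illustrates the ingredients named above: each player sees a scalar reward, builds a single-point gradient estimate from a random perturbation, and takes a mirror-ascent step with a self-concordant barrier. Everything specific is an assumption made for illustration: the Cournot-style payoff `x_i * (1.5 - x_1 - x_2)` on the action set [0, 1], the barrier `-log(x) - log(1 - x)`, the step size `1 / (2 * sqrt(t))`, the shrinking perturbation radius, and all function names.

```python
import math
import random

def reward(i, x):
    """Scalar bandit feedback for player i (assumed toy payoff)."""
    return x[i] * (1.5 - x[0] - x[1])

def barrier_grad(x):
    # Derivative of R(x) = -log(x) - log(1 - x) on (0, 1).
    return -1.0 / x + 1.0 / (1.0 - x)

def barrier_hess(x):
    # Second derivative of the same barrier.
    return 1.0 / x ** 2 + 1.0 / (1.0 - x) ** 2

def barrier_grad_inv(g):
    # Closed-form solution of R'(x) = g; the result lies in (0, 1).
    return 0.5 + g / (2.0 * (math.sqrt(g * g + 4.0) + 2.0))

def run(T=2000, seed=0, start=(0.2, 0.8)):
    """Run T rounds; return the base iterates and the actions actually played."""
    rng = random.Random(seed)
    x = list(start)
    base, played = [], []
    for t in range(1, T + 1):
        eta = 1.0 / (2.0 * math.sqrt(t))  # assumed step-size schedule
        # Perturbation radius: inside the barrier's Dikin interval, so the
        # played action stays feasible; the extra term shrinks it over time.
        radius = [1.0 / math.sqrt(barrier_hess(xi) + eta * (t + 1)) for xi in x]
        z = [rng.choice((-1.0, 1.0)) for _ in x]  # random direction in 1-D
        # Clamp only as a guard against floating-point rounding.
        x_hat = [min(1.0, max(0.0, xi + r * zi))
                 for xi, r, zi in zip(x, radius, z)]
        # Each player observes only its own reward at the joint played action.
        payoffs = [reward(i, x_hat) for i in range(2)]
        # Single-point gradient estimate: reward times direction over radius.
        grad_est = [u * zi / r for u, zi, r in zip(payoffs, z, radius)]
        # Mirror-ascent step: R'(x_next) = R'(x) + eta * grad_est.
        x = [barrier_grad_inv(barrier_grad(xi) + eta * g)
             for xi, g in zip(x, grad_est)]
        base.append(tuple(x))
        played.append(tuple(x_hat))
    return base, played

if __name__ == "__main__":
    base, _ = run()
    print("final base iterate:", base[-1])
```

In this toy game the negated payoff gradients have Jacobian [[2, 1], [1, 2]], which is positive definite, so the game is strongly monotone and its unique Nash equilibrium is (0.5, 0.5). The base iterates should drift toward that point, but the sketch makes no attempt to reproduce the rates established in the paper.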