Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization

Operations Research · 2021

Cited by 49 · Top 8% among papers in the same journal and year

Journal rankings: Renmin University of China list A · FT50 · UTD24 · ABS 4*

Shicong Cen · Carnegie Mellon University
Chen Cheng · Stanford University
Yuxin Chen · Princeton University
Yuting Wei · University of Pennsylvania
Yuejie Chi · Carnegie Mellon University

Summary

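This paper studies the nonasymptotic convergence of entropy-regularized natural policy gradient methods under softmax parameterization. It proves that the algorithm converges at a linear rate that is independent of the dimension of the state-action space, and that it remains stable when policy evaluation is inexact.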

Abstract

Preconditioning and Regularization Enable Faster Reinforcement Learning

Natural policy gradient (NPG) methods, in conjunction with entropy regularization to encourage exploration, are among the most popular policy optimization algorithms in contemporary reinforcement learning. Despite the empirical success, the theoretical underpinnings for NPG methods remain severely limited. In “Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization”, Cen, Cheng, Chen, Wei, and Chi develop nonasymptotic convergence guarantees for entropy-regularized NPG methods under softmax parameterization, focusing on tabular discounted Markov decision processes. Assuming access to exact policy evaluation, the authors demonstrate that the algorithm converges linearly at an astonishing rate that is independent of the dimension of the state-action space. Moreover, the algorithm is provably stable vis-à-vis inexactness of policy evaluation. Accommodating a wide range of learning rates, this convergence result highlights the role of preconditioning and regularization in enabling fast convergence.
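To make the setting concrete, here is a minimal Python sketch of entropy-regularized NPG on a small random tabular MDP with exact policy evaluation. The abstract does not state the update rule, so the multiplicative form used below, π_new(a|s) ∝ π(a|s)^(1 − ητ/(1 − γ)) · exp(η Q_τ(s,a)/(1 − γ)) with learning rate 0 < η ≤ (1 − γ)/τ, is an assumption based on how this method is commonly written under softmax parameterization. All function names (`soft_q_eval`, `npg_update`, `soft_optimal_q`, `run_npg`, `random_mdp`) and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def logsumexp(x, axis):
    # Numerically stable log-sum-exp along one axis.
    m = x.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(x - m).sum(axis=axis))

def soft_q_eval(P, r, log_pi, gamma, tau):
    # Exact entropy-regularized policy evaluation:
    # V = (I - gamma * P_pi)^{-1} r_pi, with r_pi including the entropy bonus.
    pi = np.exp(log_pi)
    r_pi = (pi * (r - tau * log_pi)).sum(axis=1)
    P_pi = np.einsum("sa,sat->st", pi, P)
    v = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
    return r + gamma * P @ v  # soft Q-function, shape (states, actions)

def npg_update(log_pi, q, eta, gamma, tau):
    # Assumed update (see lead-in), carried out in log space and renormalized.
    z = (1.0 - eta * tau / (1.0 - gamma)) * log_pi + eta * q / (1.0 - gamma)
    return z - logsumexp(z, axis=1)[:, None]

def soft_optimal_q(P, r, gamma, tau, iters=2000):
    # Reference solution via soft value iteration (a gamma-contraction).
    q = np.zeros_like(r)
    for _ in range(iters):
        q = r + gamma * P @ (tau * logsumexp(q / tau, axis=1))
    return q

def run_npg(P, r, gamma, tau, eta, iters):
    n_s, n_a = r.shape
    log_pi = np.full((n_s, n_a), -np.log(n_a))  # start from the uniform policy
    q_star = soft_optimal_q(P, r, gamma, tau)
    errors = []
    for _ in range(iters):
        q = soft_q_eval(P, r, log_pi, gamma, tau)
        errors.append(np.abs(q_star - q).max())  # sup-norm gap to optimal soft Q
        log_pi = npg_update(log_pi, q, eta, gamma, tau)
    return np.exp(log_pi), errors

def random_mdp(n_s, n_a, seed):
    # Hypothetical test instance: random transitions, rewards in [0, 1).
    rng = np.random.default_rng(seed)
    P = rng.random((n_s, n_a, n_s))
    P /= P.sum(axis=2, keepdims=True)
    return P, rng.random((n_s, n_a))

if __name__ == "__main__":
    gamma, tau = 0.9, 0.5
    P, r = random_mdp(5, 3, seed=0)
    _, errors = run_npg(P, r, gamma, tau, eta=(1 - gamma) / tau, iters=50)
    print(errors[::10])
```

Plotting `errors` on a log scale should give a roughly straight line, which is what linear convergence looks like in practice. Changing the number of states or actions in `random_mdp` is a simple way to probe the claim that the rate does not depend on the size of the state-action space, though a toy experiment like this is only an illustration and not a substitute for the paper's analysis.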

Keywords: reinforcement learning · policy optimization · Markov decision processes · convergence analysis
