Softmax策略梯度方法可能需要指数时间才能收敛

Softmax policy gradient methods can take exponential time to converge

Mathematical Programming · 2023

被引 8

ABS 4

Gen Li
Yuting Wei
Yuejie Chi
Yuxin Chen 通讯

中文导读

该论文证明，即使使用精确梯度，Softmax策略梯度方法在表格型马尔可夫决策过程中也可能需要指数级迭代次数才能收敛，这取决于状态空间大小和有效视界。

Abstract

Abstract The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $$\gamma $$ <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mi>γ</mml:mi></mml:math> -discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $${\mathcal {S}}$$ <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mi>S</mml:mi></mml:math> and the effective horizon $$\frac{1}{1-\gamma }$$ <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>γ</mml:mi></mml:mrow></mml:mfrac></mml:math> , both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize $$\eta $$ <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mi>η</mml:mi></mml:math> can take $$\begin{aligned} \frac{1}{\eta } |{\mathcal {S}}|^{2^{\Omega \big (\frac{1}{1-\gamma }\big )}} ~\text {iterations} \end{aligned}$$ <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mtable><mml:mtr><mml:mtd><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mi>η</mml:mi></mml:mfrac><mml:msup><mml:mrow><mml:mo>|</mml:mo><mml:mi>S</mml:mi><mml:mo>|</mml:mo></mml:mrow><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mi>Ω</mml:mi><mml:mrow><mml:mo>(</mml:mo></mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>γ</mml:mi></mml:mrow></mml:mfrac><mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:msup><mml:mspace/><mml:mtext>iterations</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math> to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully-constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.

强化学习策略梯度马尔可夫决策过程收敛性分析

阅读原文 ↗