平均奖励马尔可夫决策过程的随机一阶方法

Stochastic First-Order Methods for Average-Reward Markov Decision Processes

Mathematics of Operations Research · 2024

被引 2

ABS 3

Tianjiao Li
Feiyang Wu
Guanghui Lan

中文导读

研究了平均奖励马尔可夫决策过程的策略优化与评估，提出随机策略镜像下降和方差缩减时间差分方法，给出收敛保证和样本复杂度结果，适用于大状态动作空间。

Abstract

We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy optimization and policy evaluation. Compared with intensive research efforts in finite sample analysis of policy gradient methods for discounted MDPs, existing studies on policy gradient methods for AMDPs mostly focus on regret bounds under restrictive assumptions, and they often lack guarantees on the overall sample complexities. Toward this end, we develop an average-reward stochastic policy mirror descent method for solving AMDPs with and without regularizers and provide convergence guarantees in terms of the long-term average reward. For policy evaluation, existing on-policy methods suffer from suboptimal convergence rates as well as failure in handling insufficiently random policies due to the lack of exploration in the action space. To remedy these issues, we develop a variance-reduced temporal difference (VRTD) method with linear function approximation for randomized policies along with optimal convergence guarantees, and design an exploratory VRTD method that resolves the exploration issue and provides comparable convergence guarantees. By combining the policy evaluation and policy optimization parts, we establish sample complexity results for solving AMDPs under both generative and Markovian noise models. It is worth noting that when linear function approximation is utilized, our algorithm only needs to update in the low-dimensional parameter space, and thus can handle MDPs with large state and action spaces. Funding: T. Li and G. Lan were supported in part by the National Science Foundation [Grant CCF-1909298] and the Office of Navel Research [Grant N00014-20-1-2089].

强化学习马尔可夫决策过程随机优化策略梯度时间差分学习

阅读原文 ↗