On the Convergence of Modified Policy Iteration in Risk-Sensitive Exponential Cost Markov Decision Processes

Operations Research · 2025

Cited by: 0

Journal rankings: Renmin University of China list A · FT50 · UTD24 · ABS 4*

Yashaswini Murthy · California Institute of Technology
Mehrdad Moharrami · University of Iowa
R. Srikant · University of Illinois Urbana-Champaign

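Summary

This paper studies the convergence of the modified policy iteration algorithm in risk-sensitive Markov decision processes. It proves that the algorithm converges and establishes finite-time guarantees, providing a theoretical foundation for reinforcement learning methods that balance computational efficiency with robustness.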

Abstract

Balancing Risk and Robustness in Dynamic Decision Making

Many real systems, such as networks, finance, and safety-critical autonomy, must hedge against rare but costly events. Risk-sensitive control formalizes this idea by optimizing an exponential cost objective that prioritizes reliability over just average performance. Classical dynamic programming methods such as value iteration and policy iteration are well understood in this risk-sensitive setting. However, modified policy iteration (MPI), which combines the strengths of both through partial policy evaluation, has lacked any theoretical understanding. This paper addresses this gap. It analyzes MPI for risk-sensitive Markov decision processes governed by a multiplicative Bellman equation, develops normalization and contraction tools suited to this setting, and proves both convergence and finite-time guarantees. The results provide a principled foundation for algorithms that combine computational efficiency with robustness, supporting the development of reinforcement learning methods that emphasize long-term reliability.
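
For orientation, the multiplicative Bellman equation and the MPI update that the abstract refers to are commonly written as below for a finite MDP under the average exponential-cost criterion. This is a sketch of the standard textbook form, not the paper's own statement. The symbols (risk parameter \alpha, per-stage cost c, transition kernel P, partial-evaluation depth m) and the normalization by a reference state i_0 are assumptions chosen for illustration, and the paper's normalization may differ.

```latex
% Assumed notation: finite state space S, action sets A(i), cost c(i,a),
% transition kernel P(j | i,a), risk-sensitivity parameter \alpha > 0.
%
% Multiplicative Bellman equation: find \lambda > 0 and h > 0 such that
\[
  \lambda \, h(i) \;=\; \min_{a \in A(i)} \; e^{\alpha c(i,a)}
  \sum_{j \in S} P(j \mid i,a)\, h(j), \qquad i \in S.
\]
% Under suitable irreducibility conditions, (1/\alpha) \log \lambda is the
% optimal risk-sensitive average cost.
%
% Operator for a stationary policy \mu, and the optimal operator:
\[
  (T_\mu V)(i) = e^{\alpha c(i,\mu(i))} \sum_{j \in S} P(j \mid i,\mu(i))\, V(j),
  \qquad
  (T V)(i) = \min_{\mu} \, (T_\mu V)(i).
\]
% One MPI step with m partial-evaluation sweeps. Dividing by the value at a
% reference state i_0 is an illustrative normalization (an assumption here).
\[
  \mu_{k+1} \in \arg\min_{\mu} T_\mu V_k,
  \qquad
  V_{k+1} = \frac{T_{\mu_{k+1}}^{\,m} V_k}{\bigl(T_{\mu_{k+1}}^{\,m} V_k\bigr)(i_0)}.
\]
```

With m = 1 the update reduces to normalized value iteration, and as m grows it approaches full policy evaluation, which is the sense in which MPI sits between value iteration and policy iteration.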

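Keywords: Markov decision processes · dynamic programming · reinforcement learning · risk-sensitive control · exponential cost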