On the Convergence of Modified Policy Iteration in Risk-Sensitive Exponential Cost Markov Decision Processes
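This paper studies the convergence of the modified policy iteration algorithm in risk-sensitive Markov decision processes, proves its convergence together with finite-time guarantees, and provides a theoretical foundation for reinforcement learning methods that balance computational efficiency and robustness.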
Balancing Risk and Robustness in Dynamic Decision Making
Many real systems, such as networks, finance, and safety-critical autonomy, must hedge against rare but costly events. Risk-sensitive control formalizes this idea by optimizing an exponential cost objective that prioritizes reliability over average performance alone. Classical dynamic programming methods such as value iteration and policy iteration are well understood in this risk-sensitive setting. However, modified policy iteration (MPI), which combines the strengths of both through partial policy evaluation, has so far lacked a theoretical analysis. This paper addresses that gap. It analyzes MPI for risk-sensitive Markov decision processes governed by a multiplicative Bellman equation, develops normalization and contraction tools suited to this setting, and proves both convergence and finite-time guarantees. The results provide a principled foundation for algorithms that combine computational efficiency with robustness, supporting the development of reinforcement learning methods that emphasize long-term reliability.
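To make the idea of partial policy evaluation under a multiplicative Bellman equation concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a finite MDP, an average-cost exponential criterion, and a simple normalization that rescales the value vector so a reference state has value 1. The function name `risk_sensitive_mpi` and the parameters `P`, `c`, `theta`, `m`, `iters`, and `ref` are all hypothetical choices made for illustration; the abstract does not specify the normalization or the stopping rule.

```python
import math


def risk_sensitive_mpi(P, c, theta, m=5, iters=200, ref=0):
    """Illustrative MPI for a multiplicative Bellman equation (assumed form).

    P[a][s][t] : probability of moving from state s to t under action a
    c[s][a]    : one-step cost of action a in state s
    theta      : risk-sensitivity parameter (> 0 means risk-averse)
    m          : number of partial evaluation sweeps per improvement step
    ref        : reference state used for normalization (an assumption)
    """
    n_actions, n_states = len(P), len(P[0])

    def backup(h, s, a):
        # Multiplicative backup: exp(theta * c(s, a)) * sum_t P(t | s, a) * h(t).
        expected = sum(P[a][s][t] * h[t] for t in range(n_states))
        return math.exp(theta * c[s][a]) * expected

    def greedy(h):
        # Policy improvement: pick the action with the smallest backup per state.
        return [
            min(range(n_actions), key=lambda a: backup(h, s, a))
            for s in range(n_states)
        ]

    h = [1.0] * n_states
    for _ in range(iters):
        policy = greedy(h)
        # Partial policy evaluation: only m sweeps under the fixed policy.
        for _ in range(m):
            h = [backup(h, s, policy[s]) for s in range(n_states)]
            scale = h[ref]
            h = [x / scale for x in h]  # normalize to avoid blow-up or decay

    policy = greedy(h)
    # Growth-rate estimate at the reference state; log(rho) / theta is the
    # corresponding risk-sensitive average cost under the assumed criterion.
    rho = backup(h, ref, policy[ref]) / h[ref]
    return h, policy, rho
```

In this sketch, `m=1` behaves like a normalized value iteration, while large `m` approaches full policy evaluation, which is the trade-off MPI is meant to interpolate. Whether the iteration converges for a given problem depends on conditions such as those the paper establishes; the sketch itself carries no guarantee.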