Beyond Discounted Returns: Robust Markov Decision Processes with Average and Blackwell Optimality

Operations Research · 2026

Cited by 0

Renmin University A · FT50 · UTD24 · ABS 4*

Julien Grand-Clément · HEC Paris
Marek Petrik · University of New Hampshire
Nicolas Vieille · HEC Paris

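Summary

This paper studies robust Markov decision processes under the average return and Blackwell optimality criteria. It finds that average optimal policies can be chosen stationary and deterministic in sa-rectangular models, whereas in s-rectangular models they may not exist or may be non-stationary, and it discusses algorithms and the connection to stochastic games.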

Abstract

Novel Insights on Robust Markov Decision Processes with Average Reward and Blackwell Optimality Criteria

Robust Markov decision processes (RMDPs) have been studied extensively when the objective is the discounted return, but little is known for average optimality and Blackwell optimality. We show that average optimal policies can be chosen stationary and deterministic for sa-rectangular RMDPs, but perhaps surprisingly, we show that for s-rectangular RMDPs average optimal policies may not exist, and if they do exist, they may not be stationary. We also study Blackwell optimality for sa-rectangular RMDPs, where we show that approximately Blackwell optimal policies always exist, although exact Blackwell optimal policies may not exist. We provide a general sufficient condition for their existence. We then discuss the connection between average and Blackwell optimality, and we describe several algorithms to compute the optimal average return. Interestingly, our approach leverages the connections between RMDPs and stochastic games. Overall, our paper emphasizes the superior practical properties of distance-based sa-rectangular models over s-rectangular models for average and Blackwell optimality.

Keywords: robust Markov decision processes · average optimality · Blackwell optimality · stochastic games
