Beyond Discounted Returns: Robust Markov Decision Processes with Average and Blackwell Optimality
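Studies robust Markov decision processes under the average return and Blackwell optimality criteria. It finds that average optimal policies can be chosen stationary under the sa-rectangular model, whereas under the s-rectangular model they may not exist or may be non-stationary, and it discusses algorithms and the connection to stochastic games.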
Novel Insights on Robust Markov Decision Processes with Average Reward and Blackwell Optimality Criteria

Robust Markov decision processes (RMDPs) have been studied extensively when the objective is the discounted return, but little is known about average optimality and Blackwell optimality. We show that average optimal policies can be chosen stationary and deterministic for sa-rectangular RMDPs. Perhaps surprisingly, for s-rectangular RMDPs average optimal policies may not exist, and when they do exist, they may not be stationary. We also study Blackwell optimality for sa-rectangular RMDPs: approximately Blackwell optimal policies always exist, although exact Blackwell optimal policies may not, and we provide a general sufficient condition for the existence of exact ones. We then discuss the connection between average and Blackwell optimality, and we describe several algorithms to compute the optimal average return. Interestingly, our approach leverages the connections between RMDPs and stochastic games. Overall, our paper emphasizes the superior practical properties of distance-based sa-rectangular models over s-rectangular models for average and Blackwell optimality.
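
To make the sa-rectangular setting concrete, here is a minimal sketch in Python. It is not one of the algorithms described in the paper. It assumes, purely for illustration, that each state-action pair has a finite set of candidate transition distributions, and it runs plain discounted robust value iteration with a discount factor close to 1. Rescaling the result by (1 - gamma) is used only as a rough surrogate for the average return. The function name `robust_value_iteration` and the toy model are hypothetical.

```python
import numpy as np


def robust_value_iteration(rewards, kernels, gamma, iters=2000):
    """Discounted robust value iteration under sa-rectangularity.

    rewards[s][a]  : immediate reward for action a in state s.
    kernels[s][a]  : list of candidate next-state distributions (assumed
                     finite here); the adversary picks one independently
                     for every (s, a) pair, which is what sa-rectangular means.
    """
    n_states = len(rewards)
    v = np.zeros(n_states)
    for _ in range(iters):
        new_v = np.empty(n_states)
        for s in range(n_states):
            # Agent maximizes over actions; adversary minimizes over kernels.
            new_v[s] = max(
                rewards[s][a]
                + gamma * min(float(np.dot(p, v)) for p in kernels[s][a])
                for a in range(len(rewards[s]))
            )
        v = new_v
    return v


# Hypothetical toy model: state 1 is absorbing with reward 0.
rewards = [
    [1.0, 0.5],  # state 0: risky action (reward 1), safe action (reward 0.5)
    [0.0],       # state 1: single action, no reward
]
kernels = [
    [
        [np.array([1.0, 0.0]), np.array([0.5, 0.5])],  # risky: may leak to state 1
        [np.array([1.0, 0.0])],                        # safe: always stays in state 0
    ],
    [
        [np.array([0.0, 1.0])],
    ],
]

gamma = 0.99
v = robust_value_iteration(rewards, kernels, gamma)
scaled = (1.0 - gamma) * v  # rough surrogate for the robust average return
```

In this toy model the adversary can push the risky action toward the absorbing zero-reward state, so the robust choice in state 0 is the safe action, and `scaled[0]` settles near its per-step reward of 0.5.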