Quantile Markov Decision Processes

Operations Research · 2021

Cited by 9

Journal rankings: Renmin University of China A · FT50 · UTD24 · ABS 4*

Xiaocheng Li · Stanford University
Huaiyang Zhong · Stanford University
Margaret L. Brandeau · Stanford University

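Summary

The paper studies Markov decision processes whose objective is a quantile of the cumulative reward rather than its expected value. It proposes a dynamic programming-based algorithm for computing the optimal policy, extends the approach to a conditional value-at-risk objective, and validates the model's practical usefulness on an HIV treatment initiation problem.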

Abstract

The goal of a traditional Markov decision process (MDP) is to maximize expected cumulative reward over a defined horizon (possibly infinite). In many applications, however, a decision maker may be interested in optimizing a specific quantile of the cumulative reward instead of its expectation. In this paper we consider the problem of optimizing the quantiles of the cumulative rewards of a Markov decision process (MDP), which we refer to as a quantile Markov decision process (QMDP). We provide analytical results characterizing the optimal QMDP value function and present a dynamic programming-based algorithm to solve for the optimal policy. The algorithm also extends to the MDP problem with a conditional value-at-risk (CVaR) objective. We illustrate the practical relevance of our model by evaluating it on an HIV treatment initiation problem, where patients aim to balance the potential benefits and risks of the treatment.

Keywords: Markov decision process · quantile optimization · dynamic programming · risk management · medical decision making
