Information-directed policy sampling for episodic Bayesian Markov decision processes

IISE Transactions · 2024

Cited by: 1

ABS 3

Victoria Diaz
Archis Ghate (corresponding author)

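Summary

For finite-stage Markov decision processes under incomplete information, the paper proposes the Information-Directed Policy Sampling framework, which balances exploration and exploitation by minimizing a convex information ratio, and derives a regret bound that is independent of the sizes of the state and action spaces.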

Abstract

We consider finite-stage Markov Decision Processes (MDPs) under incomplete information, where the decision-maker only knows that the true transition probability and reward matrices belong to given, finite sets. The decision-maker interacts with the system over a finite number of episodes. The first episode begins with a probabilistic belief about the true probability and reward matrices. This belief is updated at the end of each episode using observed events. The goal is to maximize the expected total reward earned over all episodes. In the resulting model-based episodic Bayesian MDP, it suffices to only consider (the known) policies that are optimal to each one of the possible probability and reward matrices. Nevertheless, the decision-maker should execute policies that provide information about the true probabilities and rewards (exploration), but also exploit this knowledge to increase rewards. We propose a framework called Information-Directed Policy Sampling (IDPS). In each episode, the decision-maker balances the exploitation-exploration trade-off by executing a randomized policy that minimizes a so-called convex information ratio. We derive a regret bound that is independent of state- and action-space cardinalities when the set of matrices is exogenously determined. Numerical experiments show IDPS outperforming a state-of-the-art approach called Posterior Sampling.
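The belief update and the Posterior Sampling baseline mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and it does not implement IDPS or its convex information ratio, whose definition the abstract does not give. The names `update_belief` and `posterior_sample`, the dictionary layout of the candidate models, and the restriction to transition observations only (rewards are ignored) are all assumptions made for illustration.

```python
import random


def update_belief(belief, models, trajectory):
    """Bayes update of a belief over a finite set of candidate transition models.

    belief:     list of prior probabilities, one per candidate model
    models:     list of dicts mapping (stage, state, action) -> {next_state: prob}
                (hypothetical layout, chosen for this sketch)
    trajectory: list of (stage, state, action, next_state) tuples observed
                during one episode
    """
    weights = []
    for prior, model in zip(belief, models):
        # Likelihood of the whole episode under this candidate model.
        likelihood = 1.0
        for stage, state, action, next_state in trajectory:
            likelihood *= model[(stage, state, action)].get(next_state, 0.0)
        weights.append(prior * likelihood)

    total = sum(weights)
    if total == 0.0:
        raise ValueError("trajectory has zero probability under every candidate model")
    # Normalize so the posterior sums to one.
    return [w / total for w in weights]


def posterior_sample(belief, rng=random):
    """Posterior Sampling step: draw the index of one candidate model from the belief.

    The decision-maker would then execute, for the next episode, the known
    policy that is optimal for the sampled model.
    """
    return rng.choices(range(len(belief)), weights=belief, k=1)[0]


# Two candidate models for a one-stage problem with a single state-action pair.
models = [
    {(0, "s", "a"): {"s": 0.9, "t": 0.1}},
    {(0, "s", "a"): {"s": 0.5, "t": 0.5}},
]
prior = [0.5, 0.5]
posterior = update_belief(prior, models, [(0, "s", "a", "s")])
sampled_index = posterior_sample(posterior)
```

In this sketch, observing the transition to `"s"` shifts belief toward the first model, because that model assigns the observation a higher probability. IDPS, as the abstract describes it, would replace the sampling step with a randomized policy chosen to minimize the convex information ratio.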

Keywords: Bayesian decision-making; Markov decision processes; reinforcement learning; exploration-exploitation trade-off
