学习部分可观测马尔可夫决策过程的次线性遗憾

Sublinear regret for learning POMDPs

Production and Operations Management · 2022

被引 4

人大 AFT50UTD24ABS 4

Yi Xiong · 香港中文大学
Xuefeng Gao · 香港中文大学
Xiang Zhou · 香港中文大学
Ningyuan Chen · 多伦多大学

中文导读

研究了部分可观测马尔可夫决策过程的模型无关强化学习，提出一种算法，在无限时间平均奖励最优策略的oracle下，实现了O(T^(2/3)√logT)的次线性遗憾界，是首个对一般POMDP达到此结果的算法。

Abstract

We study the model‐based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method‐of‐moments estimations for hidden Markov models, the belief error control in POMDPs and upper confidence bound methods for online learning. We establish a regret bound of <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline" overflow="scroll"> <mml:semantics definitionURL="" encoding=""> <mml:mrow> <mml:mi>O</mml:mi> <mml:mo stretchy="false">(</mml:mo> <mml:msup> <mml:mi>T</mml:mi> <mml:mrow> <mml:mn>2</mml:mn> <mml:mo>/</mml:mo> <mml:mn>3</mml:mn> </mml:mrow> </mml:msup> <mml:msqrt> <mml:mrow> <mml:mi>log</mml:mi> <mml:mi>T</mml:mi> </mml:mrow> </mml:msqrt> <mml:mo stretchy="false">)</mml:mo> </mml:mrow> <mml:annotation encoding="">$O(T^{2/3}\sqrt {\log T})$</mml:annotation> </mml:semantics> </mml:math> for the proposed learning algorithm where T is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.

强化学习部分可观测马尔可夫决策过程在线学习遗憾分析

阅读原文 ↗