Sublinear regret for learning POMDPs
研究了部分可观测马尔可夫决策过程的模型无关强化学习,提出一种算法,在无限时间平均奖励最优策略的oracle下,实现了O(T^(2/3)√logT)的次线性遗憾界,是首个对一般POMDP达到此结果的算法。
We study the model‐based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method‐of‐moments estimations for hidden Markov models, the belief error control in POMDPs and upper confidence bound methods for online learning. We establish a regret bound of <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML" display="inline" overflow="scroll"> <mml:semantics definitionURL="" encoding=""> <mml:mrow> <mml:mi>O</mml:mi> <mml:mo stretchy="false">(</mml:mo> <mml:msup> <mml:mi>T</mml:mi> <mml:mrow> <mml:mn>2</mml:mn> <mml:mo>/</mml:mo> <mml:mn>3</mml:mn> </mml:mrow> </mml:msup> <mml:msqrt> <mml:mrow> <mml:mi>log</mml:mi> <mml:mi>T</mml:mi> </mml:mrow> </mml:msqrt> <mml:mo stretchy="false">)</mml:mo> </mml:mrow> <mml:annotation encoding="">$O(T^{2/3}\sqrt {\log T})$</mml:annotation> </mml:semantics> </mml:math> for the proposed learning algorithm where T is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.