Square-Root Regret Bounds for Continuous-Time Episodic Markov Decision Processes

Mathematics of Operations Research · 2025
Cited by: 1
ABS 3

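Overview

This paper studies reinforcement learning for continuous-time Markov decision processes in the finite-horizon episodic setting. It proposes a learning algorithm based on value iteration and upper confidence bounds, derives upper and lower bounds on the regret of square-root order in the number of episodes, and validates the algorithm's performance through simulation.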

Abstract

We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting. In contrast to discrete-time MDPs, the intertransition times of a continuous-time MDP are exponentially distributed with rate parameters depending on the state–action pair at each transition. We present a learning algorithm based on the methods of value iteration and upper confidence bound. We derive an upper bound on the worst-case expected regret for the proposed algorithm and establish a worst-case lower bound; both bounds are of the order of the square root of the number of episodes. Finally, we conduct simulation experiments to illustrate the performance of our algorithm.

Funding: X. Gao is supported by the Hong Kong Research Grants Council [Grants 14201421, 14212522, 14200123]. X. Zhou gratefully acknowledges financial support through the Nie Center for Intelligent Asset Management at Columbia.
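To make the model in the abstract concrete, the following is a minimal sketch of simulating one finite-horizon episode of a continuous-time MDP, in which the holding time in each state is exponentially distributed with a rate that depends on the current state–action pair. It is not the paper's learning algorithm. The two-state model (`RATES`, `TRANS`, `REWARDS`), the function name `simulate_episode`, and the choice to accrue reward at a constant rate during each holding period are all assumptions made for illustration.

```python
import random

# Hypothetical toy model: 2 states, 2 actions (illustrative values only).
RATES = {0: {0: 1.0, 1: 2.0}, 1: {0: 0.5, 1: 1.5}}        # exit rate per (s, a)
TRANS = {0: {0: {1: 1.0}, 1: {1: 1.0}},                    # next-state probabilities
         1: {0: {0: 1.0}, 1: {0: 1.0}}}
REWARDS = {0: {0: 1.0, 1: 0.5}, 1: {0: 0.0, 1: 2.0}}       # assumed reward *rates*


def simulate_episode(policy, horizon, s0, rng):
    """Simulate one episode on [0, horizon]; return (total_reward, path).

    path is a list of (start_time, state, action, time_spent) tuples.
    """
    t, s, total, path = 0.0, s0, 0.0, []
    while True:
        a = policy(s, t)                      # action chosen at the transition epoch
        tau = rng.expovariate(RATES[s][a])    # holding time ~ Exp(rate(s, a))
        stay = min(tau, horizon - t)          # truncate at the end of the episode
        total += REWARDS[s][a] * stay         # assumption: reward accrues at a rate
        path.append((t, s, a, stay))
        if t + tau >= horizon:
            return total, path
        t += tau
        nxt = TRANS[s][a]
        s = rng.choices(list(nxt), weights=list(nxt.values()))[0]


def always_zero(state, time):
    """Hypothetical stationary policy that always plays action 0."""
    return 0
```

Calling `simulate_episode(always_zero, horizon=5.0, s0=0, rng=random.Random(0))` returns the accumulated reward together with the sampled path. Unlike the discrete-time case, the number of transitions per episode is random, while the time spent across all visited states always sums to the horizon.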

Keywords: reinforcement learning; continuous-time Markov decision processes; regret bounds; value iteration