Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning

Management Science · 1999

Cited by 215

Renmin University of China A+ · FT50 · UTD24 · ABS 4*

Tapas K. Das · University of South Florida
Abhijit Gosavi · University of South Florida
Sridhar Mahadevan · Michigan State University
Nicholas Marchalleck

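Summary

The paper proposes SMART, a model-free reinforcement learning algorithm for solving semi-Markov decision problems under the average-reward criterion, and validates it on the problem of finding the optimal preventive maintenance schedule for a production inventory system.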

Abstract

A large class of problems of sequential decision making under uncertainty, of which the underlying probability structure is a Markov process, can be modeled as stochastic dynamic programs (referred to, in general, as Markov decision problems or MDPs). However, the computational complexity of the classical MDP algorithms, such as value iteration and policy iteration, is prohibitive and can grow intractably with the size of the problem and its related data. Furthermore, these techniques require for each action the one step transition probability and reward matrices, and obtaining these is often unrealistic for large and complex systems. Recently, there has been much interest in a simulation-based stochastic approximation framework called reinforcement learning (RL), for computing near optimal policies for MDPs. RL has been successfully applied to very large problems, such as elevator scheduling, and dynamic channel allocation of cellular telephone systems. In this paper, we extend RL to a more general class of decision tasks that are referred to as semi-Markov decision problems (SMDPs). In particular, we focus on SMDPs under the average-reward criterion. We present a new model-free RL algorithm called SMART (Semi-Markov Average Reward Technique). We present a detailed study of this algorithm on a combinatorially large problem of determining the optimal preventive maintenance schedule of a production inventory system. Numerical results from both the theoretical model and the RL algorithm are presented and compared.
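
To make the idea of a model-free, average-reward update for an SMDP concrete, here is a minimal sketch in Python. It is not the authors' implementation. The update rule follows the commonly cited form of SMART: the immediate reward is offset by the average-reward estimate `rho` multiplied by the sojourn time `tau`, and `rho` is re-estimated only after non-exploratory actions. The two-state machine environment `toy_machine_step`, its rewards and transition times, and the decay schedules for `alpha` and `epsilon` are all hypothetical and chosen only for illustration.

```python
import random


def toy_machine_step(state, action, rng):
    """Hypothetical 2-state SMDP (an assumption, not taken from the paper).

    state:  0 = machine in good condition, 1 = machine worn
    action: 0 = produce, 1 = maintain
    Returns (next_state, reward, sojourn_time).
    """
    if action == 1:
        # Maintenance costs money, takes some time, and resets the machine.
        return 0, -2.0, rng.uniform(0.5, 1.5)
    if state == 0:
        # Producing on a good machine may wear it out.
        next_state = 1 if rng.random() < 0.3 else 0
        return next_state, 5.0, rng.uniform(1.0, 2.0)
    if rng.random() < 0.5:
        # Producing on a worn machine may cause a failure and a slow repair.
        return 0, -20.0, rng.uniform(4.0, 6.0)
    return 1, 3.0, rng.uniform(1.0, 2.0)


def smart_sketch(step, n_states, n_actions, n_steps=20000, seed=0):
    """SMART-style tabular learning; returns (q_table, rho_estimate)."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    total_reward = 0.0  # cumulative reward over non-exploratory steps
    total_time = 0.0    # cumulative sojourn time over the same steps
    rho = 0.0           # running estimate of the average reward rate
    state = 0
    for k in range(n_steps):
        # Illustrative decay schedules (assumed, not the paper's).
        alpha = 0.1 / (1.0 + k / 5000.0)
        epsilon = 0.2 / (1.0 + k / 5000.0)

        greedy = max(range(n_actions), key=lambda a: q[state][a])
        explore = rng.random() < epsilon
        # An exploratory draw may coincide with the greedy action;
        # it is still treated as exploratory here.
        action = rng.randrange(n_actions) if explore else greedy

        next_state, reward, tau = step(state, action, rng)

        # Average-reward SMDP target: reward minus rho * sojourn time,
        # plus the best action value in the next state.
        target = reward - rho * tau + max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])

        if not explore:
            total_reward += reward
            total_time += tau
            rho = total_reward / total_time

        state = next_state
    return q, rho
```

A call such as `q, rho = smart_sketch(toy_machine_step, n_states=2, n_actions=2)` returns the learned action values and the estimated reward rate. No transition probabilities or reward matrices are passed in; the learner sees only sampled transitions, which is what "model-free" refers to in the abstract. The sketch makes no convergence claim, and the policy it learns on this toy problem depends on the assumed schedules.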

Keywords: semi-Markov decision problems · average reward · reinforcement learning · SMART algorithm
