A Maximum Divergence Approach to Optimal Policy in Deep Reinforcement Learning

IEEE Transactions on Cybernetics · 2021

Cited by 21

ABS 3

Zhiyou Yang
Hong Qu
Mingsheng Fu
Wang Hu
Yongze Zhao

Summary

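This work studies Markov decision processes with divergence maximization and proposes a divergence actor-critic algorithm. By explicitly learning the intrinsic information in state transitions, the algorithm obtains a multimodal stochastic policy and improves performance and robustness in complex environments.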

Abstract

Model-free reinforcement learning algorithms based on entropy regularization have achieved good performance in control tasks. These algorithms use an entropy-regularization term on the policy to learn a stochastic policy. This work offers a new perspective: it explicitly learns a representation of the intrinsic information in state transitions to obtain a multimodal stochastic policy that handles the tradeoff between exploration and exploitation. We study a class of Markov decision processes (MDPs) with divergence maximization, called divergence MDPs. The goal of a divergence MDP is to find an optimal stochastic policy that maximizes the sum of the expected discounted total reward and a divergence term, where the divergence function learns the implicit information of the state transition. It can therefore yield stochastic policies that improve both robustness and performance in high-dimensional continuous settings. Under this framework, we derive the optimality equations and develop a divergence actor-critic (DivAC) algorithm, based on the divergence policy iteration method, to address large-scale continuous problems. Experimental results show that, compared with other methods, our approach achieves better performance and robustness, particularly in complex environments. The code for DivAC is available at https://github.com/yzyvl/DivAC.

Reinforcement learning · Markov decision process · Artificial intelligence · Machine learning
