Policy Gradient From Demonstration and Curiosity

IEEE Transactions on Cybernetics · 2022

Cited by 13

ABS 3

Jie Chen
Wenjun Xu

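Summary

The paper proposes an integrated policy gradient algorithm that adds two intrinsic reward terms, a Jensen-Shannon divergence and an uncertainty estimate, to improve exploration efficiency and learning performance in sparse-reward environments using only a small number of expert demonstrations.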

Abstract

With reinforcement learning, an agent can learn complex behaviors from high-level abstractions of the task. However, exploration and reward shaping remain challenging for existing methods, especially in scenarios where extrinsic feedback is sparse. Expert demonstrations have been investigated to solve these difficulties, but a tremendous number of high-quality demonstrations are usually required. In this work, an integrated policy gradient algorithm is proposed to boost exploration and facilitate intrinsic reward learning from only a limited number of demonstrations. This is achieved by reformulating the original reward function with two additional terms: the first measures the Jensen-Shannon divergence between the current policy and the expert's demonstrations, and the second estimates the agent's uncertainty about the environment. The algorithm was evaluated on a range of simulated tasks with sparse extrinsic reward signals, with only a limited number of demonstrated trajectories provided for each task. It showed superior exploration efficiency and high average return in all tasks. Furthermore, the agent could imitate the expert's behavior while sustaining a high return.
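To make the reward reformulation concrete, here is a minimal sketch of how an extrinsic reward might be combined with a demonstration term and a curiosity term. It is an illustration under stated assumptions, not the paper's formulation: the abstract does not give the exact form, signs, or weights, so the function `shaped_reward`, the weights `w_demo` and `w_curiosity`, and the use of small discrete distributions are all hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing. The caller must ensure that
    q_i > 0 wherever p_i > 0 (true for the mixture used below).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def shaped_reward(extrinsic, policy_dist, expert_dist, uncertainty,
                  w_demo=1.0, w_curiosity=0.1):
    """Hypothetical shaped reward (assumed form, not taken from the paper).

    - demo term: penalizes divergence from the expert's action distribution
    - curiosity term: rewards states where the agent is uncertain
    """
    demo_term = -js_divergence(policy_dist, expert_dist)
    return extrinsic + w_demo * demo_term + w_curiosity * uncertainty


if __name__ == "__main__":
    policy = [0.7, 0.2, 0.1]   # assumed action probabilities of the agent
    expert = [0.6, 0.3, 0.1]   # assumed empirical expert action frequencies
    print(shaped_reward(0.0, policy, expert, uncertainty=0.5))
```

In this sketch the extrinsic reward can be zero almost everywhere, as in a sparse-reward task, and the two added terms still provide a learning signal: one pulls the policy toward the demonstrations, the other pushes it toward poorly understood states. How the uncertainty value is estimated is left open here.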

Reinforcement learning · Exploration strategy · Imitation learning · Intrinsic reward