Policy Gradient From Demonstration and Curiosity
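An integrated policy gradient algorithm is proposed that adds two intrinsic reward terms, a Jensen-Shannon divergence term and an uncertainty estimate, to improve exploration efficiency and learning performance in sparse-reward environments using only a small number of expert demonstrations.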
With reinforcement learning, an agent can learn complex behaviors from high-level abstractions of the task. However, exploration and reward shaping remain challenging for existing methods, especially in scenarios where extrinsic feedback is sparse. Expert demonstrations have been investigated as a way to address these difficulties, but a very large number of high-quality demonstrations is usually required. In this work, an integrated policy gradient algorithm is proposed to boost exploration and facilitate intrinsic reward learning from only a limited number of demonstrations. This is achieved by reformulating the original reward function with two additional terms: the first measures the Jensen-Shannon divergence between the current policy and the expert's demonstrations, and the second estimates the agent's uncertainty about the environment. The presented algorithm was evaluated on a range of simulated tasks with sparse extrinsic reward signals, where only a limited number of demonstrated trajectories was provided for each task. Superior exploration efficiency and high average return were demonstrated in all tasks. Furthermore, the agent was found to imitate the expert's behavior while sustaining a high return.
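The reward reformulation described above can be illustrated with a minimal sketch. The abstract does not give the exact form of either term, so everything below is an assumption made for illustration: the additive combination, the weights `w_demo` and `w_cur`, the use of discrete action distributions for the Jensen-Shannon term, and the use of ensemble disagreement as the uncertainty estimate. The function names are hypothetical and are not taken from the paper.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalize so the inputs are valid distributions
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # Sum only where a > 0; the mixture b is strictly positive there.
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def ensemble_uncertainty(predictions):
    """Assumed uncertainty proxy: mean variance across an ensemble's
    next-state predictions, given as an array of shape (K, state_dim)."""
    predictions = np.asarray(predictions, dtype=float)
    return float(predictions.var(axis=0).mean())


def shaped_reward(r_ext, policy_dist, expert_dist, predictions,
                  w_demo=1.0, w_cur=1.0):
    """Hypothetical reshaped reward: extrinsic reward, minus a penalty for
    diverging from the expert, plus a bonus for visiting uncertain states."""
    demo_term = -js_divergence(policy_dist, expert_dist)
    curiosity_term = ensemble_uncertainty(predictions)
    return r_ext + w_demo * demo_term + w_cur * curiosity_term


if __name__ == "__main__":
    # Sparse setting: no extrinsic reward at this step.
    policy = [0.7, 0.2, 0.1]   # current policy's action probabilities
    expert = [0.6, 0.3, 0.1]   # empirical expert action frequencies
    preds = [[0.10, 0.50], [0.12, 0.45], [0.08, 0.55]]  # 3 ensemble members
    print(shaped_reward(0.0, policy, expert, preds))
```

In this sketch the demonstration term is largest (zero) when the policy matches the expert, and the curiosity term shrinks as the ensemble members agree, so the agent still receives a learning signal when the extrinsic reward is zero. In practice the divergence over trajectories would typically be estimated rather than computed in closed form; the closed-form discrete version is used here only to keep the example self-contained.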