Self-Supervised Imitation for Offline Reinforcement Learning With Hindsight Relabeling

IEEE Transactions on Systems, Man, and Cybernetics: Systems · 2023
Cited by 6
ABS 3

Summary

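Proposes an offline reinforcement learning algorithm that combines hindsight relabeling with supervised regression to predict actions without oracle information from the environment. The method performs well on sparse-reward and continuous control tasks.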

Abstract

Reinforcement learning (RL) requires a lot of interactions with the environment, which is usually expensive or dangerous in real-world tasks. To address this problem, offline RL considers learning policies from fixed datasets, which is promising in utilizing large-scale datasets, but still suffers from the unstable estimation for out-of-distribution data. Recent developments in RL via supervised learning methods offer an alternative to learning effective policies from suboptimal datasets while relying on oracle information from the environment. In this article, we present an offline RL algorithm that combines hindsight relabeling and supervised regression to predict actions without oracle information. We use hindsight relabeling on the original dataset and learn a command generator and command-conditional policies in a supervised manner, where the command represents the desired return or goal location according to the corresponding task. Theoretically, we illustrate that our method optimizes the lower bound of the goal-conditional RL objective. Empirically, our method achieves competitive performance in comparison with existing approaches in the sparse reward setting and favorable performance in continuous control tasks.
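
The following is a minimal sketch of the general idea of hindsight relabeling followed by supervised regression, not the authors' implementation. It assumes scalar states and actions, uses the undiscounted return-to-go as the command, and substitutes a linear regressor trained by gradient descent for whatever model the paper uses. The command generator described in the abstract is omitted. All function names and the toy trajectory are hypothetical.

```python
def returns_to_go(rewards):
    """Return achieved from each step to the end of the trajectory."""
    total, out = 0.0, []
    for r in reversed(rewards):
        total += r
        out.append(total)
    return out[::-1]


def relabel(trajectory):
    """Hindsight relabeling: pair each (state, action) with the return
    that was actually obtained afterwards, and use it as the command.

    trajectory: list of (state, action, reward) tuples with scalar entries.
    Returns a list of ((state, command), action) supervised samples.
    """
    commands = returns_to_go([r for _, _, r in trajectory])
    return [((s, g), a) for (s, a, _), g in zip(trajectory, commands)]


def predict(params, state, command):
    """Linear command-conditional policy (an assumption of this sketch)."""
    w_s, w_g, b = params
    return w_s * state + w_g * command + b


def mse(params, samples):
    """Mean squared error between predicted and logged actions."""
    return sum((predict(params, s, g) - a) ** 2 for (s, g), a in samples) / len(samples)


def fit_linear_policy(samples, lr=0.05, epochs=2000):
    """Supervised regression of actions on (state, command) by gradient descent."""
    w_s = w_g = b = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_s = grad_g = grad_b = 0.0
        for (s, g), a in samples:
            err = predict((w_s, w_g, b), s, g) - a
            grad_s += err * s
            grad_g += err * g
            grad_b += err
        w_s -= lr * grad_s / n
        w_g -= lr * grad_g / n
        b -= lr * grad_b / n
    return w_s, w_g, b


# Toy logged trajectory (hypothetical data): (state, action, reward).
trajectory = [(0.0, 0.5, 1.0), (1.0, 0.1, 0.0), (2.0, -0.2, 2.0)]
samples = relabel(trajectory)
params = fit_linear_policy(samples)
```

At evaluation time, such a policy would be queried with `predict(params, state, command)`, where the command is a desired return. In the paper that command comes from a learned command generator rather than from the environment, which is the part this sketch leaves out.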

Reinforcement learning · Offline learning · Self-supervised learning · Imitation learning