TeViR: Text-to-Video Reward With Diffusion Models for Efficient Reinforcement Learning
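Proposes TeViR, a method that uses a pretrained text-to-video diffusion model to generate dense rewards. By comparing the predicted image sequence with current observations, it improves the sample efficiency of reinforcement learning and outperforms traditional methods on 13 simulated and real-world robotic tasks.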
Developing scalable and generalizable reward engineering for reinforcement learning (RL) is crucial for creating general-purpose agents, especially in the challenging domain of robotic manipulation. While recent advances in reward engineering with vision–language models (VLMs) have shown promise, the sparse rewards they produce significantly limit sample efficiency. This article introduces text-to-video reward (TeViR), a novel method that leverages a pretrained text-to-video diffusion model to generate dense rewards by comparing the predicted image sequence with current observations. Experimental results across 13 simulated and real-world robotic tasks demonstrate that TeViR outperforms both traditional methods that rely on sparse rewards and other state-of-the-art (SOTA) methods, achieving better sample efficiency and performance without ground-truth environmental rewards. TeViR’s ability to efficiently guide agents in complex environments highlights its potential to advance RL applications in robotic manipulation.
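
The idea of turning a predicted image sequence into a dense reward can be illustrated with a small sketch. This is not the reward defined in the paper, which the abstract does not specify. It assumes that each predicted frame and the current observation have already been encoded as feature vectors, and it combines two hypothetical terms: how far along the predicted sequence the closest frame lies, and how closely the observation matches that frame. The names `l2_distance` and `dense_reward`, and the additive combination, are illustrative assumptions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dense_reward(predicted_frames, observation):
    """Illustrative dense reward from a predicted frame sequence.

    predicted_frames: list of feature vectors, one per generated frame
                      (assumed to come from a text-to-video model).
    observation:      feature vector of the current observation.
    Returns (reward, index of the closest predicted frame).
    """
    if not predicted_frames:
        raise ValueError("predicted_frames must not be empty")

    distances = [l2_distance(f, observation) for f in predicted_frames]
    # Index of the predicted frame closest to the observation (first one on ties).
    nearest = min(range(len(distances)), key=distances.__getitem__)

    # Progress term (assumption): position of the closest frame in the sequence, in [0, 1].
    progress = nearest / max(len(predicted_frames) - 1, 1)
    # Similarity term (assumption): 1 for an exact match, decaying with distance.
    similarity = math.exp(-distances[nearest])

    return progress + similarity, nearest


if __name__ == "__main__":
    # Toy 2-D "features" standing in for encoded video frames.
    frames = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
    print(dense_reward(frames, [1.0, 0.0]))
```

Unlike a sparse success signal, a reward of this shape changes at every step as the observation moves along the predicted sequence, which is the property the abstract credits for the improved sample efficiency.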