PCDT: Pessimistic Critic Decision Transformer for Offline Reinforcement Learning

IEEE Transactions on Systems, Man, and Cybernetics: Systems · 2025

Cited by 1

ABS 3

Xuesong Wang
Hengrui Zhang
Jiazhi Zhang
C. L. Philip Chen
Yuhu Cheng

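Summary

The paper proposes the Pessimistic Critic Decision Transformer (PCDT), which uses sequence importance sampling to penalize actions that deviate from behavior sequences and integrates Q-values into the policy update, achieving the highest normalized scores on sparse-reward and long-horizon tasks.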

Abstract

Decision transformer (DT), as a conditional sequence modeling (CSM) approach, learns the action distribution for each state using historical information, such as trajectory returns, offering a supervised learning paradigm for offline reinforcement learning (Offline RL). However, because DT concentrates solely on an individual trajectory with high returns-to-go, it neglects the potential for constructing optimal trajectories by combining sequences of different actions. In other words, traditional DT lacks the trajectory stitching capability. To address this concern, a novel pessimistic critic decision transformer (PCDT) for Offline RL is proposed. Our approach begins by pretraining a standard DT to explicitly capture behavior sequences. Next, we apply sequence importance sampling to penalize actions that significantly deviate from these behavior sequences, thereby constructing a pessimistic critic. Finally, Q-values are integrated into the policy update process, enabling the learned policy to approximate the behavior policy while favoring actions associated with the highest Q-value. Theoretical analysis shows that the sequence importance sampling in PCDT establishes a pessimistic lower bound, while the value optimality ensures that PCDT is capable of learning the optimal policy. Results on the D4RL benchmark tasks and ablation studies show that PCDT inherits the strengths of actor–critic (AC) and CSM methods, achieving the highest normalized scores on challenging sparse-reward and long-horizon tasks. Our code is available at https://github.com/Henry0132/PCDT.
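Two ideas in the abstract can be illustrated in a few lines: the returns-to-go signal that a DT conditions on, and a policy objective that stays close to the dataset action while preferring high Q-values. The sketch below is only an illustration under stated assumptions. The function names `returns_to_go` and `q_regularized_loss`, the weight `alpha`, and the squared-error form of the behavior term are hypothetical choices made here; they are not taken from the paper or its repository, and the sequence importance sampling penalty is not modeled.

```python
from typing import Callable, List, Sequence


def returns_to_go(rewards: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Suffix sums of rewards: rtg[t] = r[t] + gamma * rtg[t + 1]."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards so each step reuses the later sum.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def q_regularized_loss(
    pred_action: Sequence[float],
    data_action: Sequence[float],
    q_fn: Callable[[object, Sequence[float]], float],
    state: object,
    alpha: float = 1.0,
) -> float:
    """Hypothetical objective: behavior-cloning error minus a weighted Q-value.

    Assumption: a squared-error term keeps the policy near the dataset
    action, and subtracting alpha * Q rewards actions the critic rates
    highly. The paper's actual loss may differ.
    """
    bc_error = sum((p - d) ** 2 for p, d in zip(pred_action, data_action))
    bc_error /= len(pred_action)
    return bc_error - alpha * q_fn(state, pred_action)


if __name__ == "__main__":
    # Sparse reward: only the final step pays out, yet every earlier
    # step carries the same return-to-go.
    print(returns_to_go([0.0, 0.0, 1.0]))
```

In a sparse-reward trajectory such as the one in the usage line, every step before the reward shares the same return-to-go, which is one reason return conditioning alone gives little guidance on which intermediate actions to recombine.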

Reinforcement learning · Offline reinforcement learning · Decision transformer · Sequence modeling