WToE: Learning When to Explore in Multiagent Reinforcement Learning

IEEE Transactions on Cybernetics · 2023
Cited by 10
ABS 3

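Summary

The paper proposes WToE, a method that uses variational inference to learn when agents should explore in nonstationary environments. Exploration is triggered by the divergence between a short-term inferred policy and the current policy. The method comes with a theoretical guarantee of Q-value convergence, and experiments on grid examples, particle environments, and the MAgent game show that it outperforms existing methods.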

Abstract

Existing multiagent exploration works focus on how to explore in fully cooperative tasks, which is insufficient in environments with nonstationarity induced by agent interactions. To tackle this issue, we propose When to Explore (WToE), a simple yet effective variational exploration method that learns when to explore in nonstationary environments. WToE employs an interaction-oriented adaptive exploration mechanism to adapt to environmental changes. We first propose a novel graphical model that uses a latent random variable to model the step-level environmental change resulting from interaction effects. Leveraging this graphical model, we employ the supervised variational auto-encoder (VAE) framework to derive a short-term inferred policy from historical trajectories to deal with the nonstationarity. Finally, agents engage in exploration when the short-term inferred policy diverges from the current actor policy. The proposed approach theoretically guarantees the convergence of the Q-value function. In our experiments, we validate our exploration mechanism in grid examples, multiagent particle environments, and the battle game of the MAgent environment. The results demonstrate the superiority of WToE over multiple baselines and existing exploration methods, such as MAEXQ, NoisyNets, EITI, and PR2.
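The exploration trigger described in the abstract (explore when the short-term inferred policy diverges from the current actor policy) can be illustrated with a minimal sketch. The abstract does not specify the divergence measure, the threshold, or how the exploratory action is chosen, so everything below is an assumption made for illustration: KL divergence over discrete action distributions, a fixed hypothetical `threshold`, and uniform random exploration. The function names are hypothetical and do not come from the paper.

```python
import math
import random


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same action set.

    Assumption: KL is used as the divergence; the abstract does not name one.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def should_explore(inferred_policy, actor_policy, threshold=0.1):
    """Return True when the two policies diverge by more than `threshold`.

    Assumption: a fixed scalar threshold; its value here is arbitrary.
    """
    return kl_divergence(inferred_policy, actor_policy) > threshold


def select_action(inferred_policy, actor_policy, rng, threshold=0.1):
    """Pick an action index and report whether exploration was triggered.

    Assumption: exploration means a uniformly random action; otherwise the
    agent acts greedily with respect to the actor policy.
    """
    n_actions = len(actor_policy)
    if should_explore(inferred_policy, actor_policy, threshold):
        return rng.randrange(n_actions), True
    greedy = max(range(n_actions), key=lambda a: actor_policy[a])
    return greedy, False


if __name__ == "__main__":
    rng = random.Random(0)
    # Policies agree: the agent exploits the actor policy.
    print(select_action([0.2, 0.8], [0.2, 0.8], rng))
    # Policies disagree strongly: exploration is triggered.
    print(select_action([0.9, 0.1], [0.1, 0.9], rng))
```

In this sketch the inferred policy is passed in as a plain probability vector; in the method described above it would be produced by the VAE from historical trajectories, which is outside the scope of this example.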

Multiagent reinforcement learning · Exploration strategy · Nonstationary environments · Variational inference