Policy Gradient Adaptive Critic Designs for Model-Free Optimal Tracking Control With Experience Replay

IEEE Transactions on Systems, Man, and Cybernetics: Systems · 2021

Cited by 84

ABS 3

Mingduo Lin
Bo Zhao
Derong Liu

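Overview

For discrete-time nonlinear systems, the paper proposes a model-free optimal tracking controller that combines policy gradient adaptive critic designs with experience replay. The tracking problem is converted into a regulation problem, the network weights are updated in a data-driven manner, and closed-loop stability is guaranteed.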

Abstract

A model-free optimal tracking controller is designed for discrete-time nonlinear systems through policy gradient adaptive critic designs (PGACDs) with experience replay (ER). By using system transformation, optimal tracking control problems are converted into optimal regulation problems. An off-policy PGACD algorithm is developed to minimize the iterative $Q$-function and improve the tracking control performance. The proposed method is realized based on the critic network and the actor network (AN), which are applied to approximate the iterative $Q$-function and the iterative control policy, respectively. Then, the policy gradient technique is introduced to derive a novel weight updating law of the AN explicitly by using measured system data only. The convergence of the iteration is established through theoretical analysis, and the uniform ultimate boundedness is demonstrated for the closed-loop system under the PGACD-based controller by using Lyapunov’s direct method. To guarantee the stability and increase the data usage efficiency of the learning process, an ER-based learning framework is designed to improve the realizability of the proposed method. Finally, simulation results of two examples are provided to demonstrate the performance of the off-policy PGACD algorithm.
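To make the two ingredients named in the abstract more concrete, the following is a minimal Python sketch of an experience replay buffer feeding a policy-gradient actor update. It is not the paper's weight updating law. Everything in it is an assumption chosen for illustration: the scalar tracking error `e`, the scalar control `v`, the linear actor `v = -K e`, the quadratic critic features, the hand-picked critic weights `w`, and the names `ReplayBuffer`, `actor_step`, and `run_demo`. The critic is held fixed here, so only the actor half of an actor-critic loop is shown.

```python
from collections import deque

import numpy as np


class ReplayBuffer:
    """Fixed-capacity store of (error, control, next_error) samples."""

    def __init__(self, capacity, seed=0):
        self.data = deque(maxlen=capacity)  # oldest samples are evicted first
        self.rng = np.random.default_rng(seed)

    def add(self, e, v, e_next):
        self.data.append((e, v, e_next))

    def sample(self, batch_size):
        # Uniform sampling with replacement from the stored data.
        idx = self.rng.integers(0, len(self.data), size=batch_size)
        return [self.data[i] for i in idx]


def q_value(w, e, v):
    # Assumed critic: Q(e, v) = w0*e^2 + w1*e*v + w2*v^2.
    return w[0] * e * e + w[1] * e * v + w[2] * v * v


def dq_dv(w, e, v):
    # Partial derivative of the assumed critic with respect to the control.
    return w[1] * e + 2.0 * w[2] * v


def actor_step(K, w, batch, lr):
    # Chain rule: dQ/dK = dQ/dv * dv/dK, with v = -K*e so dv/dK = -e.
    grads = [dq_dv(w, e, -K * e) * (-e) for e, _, _ in batch]
    return K - lr * float(np.mean(grads))


def run_demo(steps=200, batch_size=32, lr=0.2, seed=0):
    rng = np.random.default_rng(seed)
    buf = ReplayBuffer(capacity=500, seed=seed)
    # Off-policy data: random errors and random exploratory controls on a
    # hypothetical error system e_next = 0.8*e + 0.5*v.
    for _ in range(500):
        e, v = rng.uniform(-1.0, 1.0, size=2)
        buf.add(e, v, 0.8 * e + 0.5 * v)
    w = np.array([1.0, 1.0, 2.0])  # hand-picked critic weights, not learned
    K = 0.0
    for _ in range(steps):
        K = actor_step(K, w, buf.sample(batch_size), lr)
    return K
```

With this fixed critic, the average of `Q(e, -K e)` over any batch is a quadratic in `K` that is minimized at `K = w[1] / (2 * w[2])`, so the replayed gradient steps should settle near that value. A full scheme would also train the critic from the stored `(e, v, e_next)` tuples, which this sketch leaves out.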

Optimal control · Adaptive critic designs · Nonlinear systems · Reinforcement learning · Tracking control
