Learning-Based Control Policy and Regret Analysis for Online Quadratic Optimization With Asymmetric Information Structure
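For dynamic systems with an asymmetric information structure, an online learning control policy is proposed; regret analysis is used to measure the performance loss, and the regret is proved to be sublinear and bounded by O(ln T).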
In this article, we propose a learning approach to analyze dynamic systems with an asymmetric information structure. Instead of adopting a game-theoretic setting, we investigate an online quadratic optimization problem driven by system noises with unknown statistics. Due to information asymmetry, neither the classic Kalman filter nor optimal control strategies can be applied to such systems. It is therefore necessary and beneficial to develop an admissible approach that learns the noise statistics as time goes forward. Motivated by online convex optimization (OCO) theory, we introduce the notion of regret, defined as the cumulative performance loss, that is, the difference between the optimal cost when the statistics are known offline and the optimal cost when the statistics are unknown and must be learned online. By utilizing dynamic programming and the linear minimum mean square unbiased estimate (LMMSUE), we propose a new type of online state-feedback control policy and characterize the behavior of the regret in a finite-time regime. The regret is shown to be sublinear and bounded by O(ln T). Moreover, we address an online optimization problem with an output-feedback control policy and propose a heuristic online control policy for it.
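As a rough illustration of why a logarithmic bound implies sublinear regret, the relation can be sketched as follows. The symbols here are assumptions introduced only for this sketch and are not the article's notation: J_T^on denotes the cumulative cost of the online policy over horizon T, J_T^off the optimal cumulative cost when the noise statistics are known, and C a constant independent of T.

```latex
% Notation assumed for illustration only (not taken from the article):
%   J_T^{on}  : cumulative cost of the online policy (statistics unknown)
%   J_T^{off} : optimal cumulative cost with known statistics
%   C         : a constant that does not depend on the horizon T
\begin{align}
  R_T &= J_T^{\mathrm{on}} - J_T^{\mathrm{off}}, \\
  R_T &\le C \ln T
  \quad\Longrightarrow\quad
  \frac{R_T}{T} \le \frac{C \ln T}{T} \xrightarrow[T \to \infty]{} 0 .
\end{align}
```

Under these assumptions, the average per-step performance loss vanishes as the horizon grows, which is what sublinear regret means.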