Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

Operations Research · 2023

Cited by 23

Renmin University journal list A · FT50 · UTD24 · ABS 4*

Gen Li · University of Pennsylvania
Changxiao Cai · University of Pennsylvania
Yuxin Chen · University of Pennsylvania
Yuting Wei · University of Pennsylvania
Yuejie Chi · Carnegie Mellon University

Summary

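This paper studies the sample efficiency of Q-learning in reinforcement learning. It proves that Q-learning is minimax optimal when there is only a single action but strictly suboptimal when there are multiple actions, and it exposes the negative impact of overestimation.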

Abstract

This paper investigates a model-free algorithm of broad interest in reinforcement learning, namely, Q-learning. Whereas substantial progress had been made toward understanding the sample efficiency of Q-learning in recent years, it remained largely unclear whether Q-learning is sample-optimal and how to sharpen the sample complexity analysis of Q-learning. In this paper, we settle these questions: (1) When there is only a single action, we show that Q-learning (or, equivalently, TD learning) is provably minimax optimal. (2) When there are at least two actions, our theory unveils the strict suboptimality of Q-learning and rigorizes the negative impact of overestimation in Q-learning. Our theory accommodates both the synchronous case (i.e., the case in which independent samples are drawn) and the asynchronous case (i.e., the case in which one only has access to a single Markovian trajectory).
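For readers unfamiliar with the algorithm under study, the synchronous setting described in the abstract can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the function names, the tabular representation of the transition kernel `P` and reward table `r`, and the particular step-size schedule are all assumptions made for this example.

```python
import random


def rescaled_linear_step(gamma):
    """Hypothetical step-size schedule eta_t = 1 / (1 + (1 - gamma) * t), chosen for illustration."""
    return lambda t: 1.0 / (1.0 + (1.0 - gamma) * t)


def synchronous_q_learning(P, r, gamma, num_iters, step_size, seed=0):
    """Tabular synchronous Q-learning with a generative model.

    P[s][a] is a list of next-state probabilities, r[s][a] is a reward in [0, 1],
    gamma is the discount factor, and step_size(t) returns the learning rate at iteration t.
    """
    rng = random.Random(seed)
    num_states = len(P)
    num_actions = len(P[0])
    states = list(range(num_states))
    Q = [[0.0] * num_actions for _ in range(num_states)]

    for t in range(1, num_iters + 1):
        eta = step_size(t)
        # Greedy value of the previous iterate. With two or more actions, this max
        # over noisy estimates is where overestimation enters.
        V = [max(row) for row in Q]
        for s in range(num_states):
            for a in range(num_actions):
                # Synchronous case: one independent next-state sample for every (s, a) pair.
                s_next = rng.choices(states, weights=P[s][a])[0]
                target = r[s][a] + gamma * V[s_next]
                Q[s][a] = (1.0 - eta) * Q[s][a] + eta * target
    return Q
```

With a single action per state, the `max` is taken over one entry and the update reduces to TD learning for policy evaluation, which is the case the abstract identifies as minimax optimal. The asynchronous case would instead update only the state-action pair visited along a single Markovian trajectory at each step.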

reinforcement learning · sample complexity · Q-learning · Markov decision process · minimax optimality
