Approximation Benefits of Policy Gradient Methods with Aggregated States

Management Science · 2023

Cited by 3

Journal rankings: Renmin University of China A+ · FT50 · UTD24 · ABS 4*

Daniel Russo · Columbia University (corresponding author)

Summary

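This paper studies state-aggregated representations and shows that policy gradient methods are more robust to approximation error than approximate policy iteration and approximate value iteration: the per-period regret of the policy gradient method is bounded by the aggregation error ϵ.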

Abstract

Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state-aggregated representations, in which the state space is partitioned and either the policy or value function approximation is held constant over partitions. This paper shows a policy gradient method converges to a policy whose regret per period is bounded by ϵ, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as ϵ/(1−γ), where γ is a discount factor. Faced with inherent approximation error, methods that locally optimize the true decision objective can be far more robust. This paper was accepted by Hamid Nazerzadeh, data science. Supplemental Material: Data are available at https://doi.org/10.1287/mnsc.2023.4788.
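
To make the quantity ϵ in the abstract concrete, here is a minimal Python sketch that computes the largest gap between state-action values of states sharing a partition. It is an illustration under stated assumptions, not the paper's code: the function name `aggregation_error`, the toy Q-table, and the partition labels are all hypothetical. The sketch also assumes the comparison is made per action across states in the same block and for a single fixed Q-table, which may be narrower than the paper's formal definition.

```python
from collections import defaultdict


def aggregation_error(q, partition):
    """Largest spread of Q-values within any partition block.

    q:         list of rows, q[s][a] = state-action value (hypothetical toy data)
    partition: list where partition[s] is the block label of state s
    Assumption: values are compared per action across states in the same block.
    """
    # Group the Q-value rows of states by their block label.
    blocks = defaultdict(list)
    for state, block in enumerate(partition):
        blocks[block].append(q[state])

    eps = 0.0
    for rows in blocks.values():
        # zip(*rows) yields, for each action, its values across the block's states.
        for action_values in zip(*rows):
            eps = max(eps, max(action_values) - min(action_values))
    return eps


# Toy example: states 0 and 1 share a block, state 2 is alone.
q = [[1.0, 0.5],
     [1.2, 0.4],
     [3.0, 2.0]]
partition = [0, 0, 1]
eps = aggregation_error(q, partition)
```

In this toy example the only nonzero spread comes from the block holding states 0 and 1, so a finer partition that separated those two states would drive the sketch's ϵ to zero.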

Keywords: policy gradient · state aggregation · approximate policy iteration · approximation error
