Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes

Operations Research · 2023

Cited by 23

Journal rankings: Renmin University A · FT50 · UTD24 · ABS 4*

Andrew Bennett · Cornell University
Nathan Kallus · Cornell University

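Summary

Observational data used in offline reinforcement learning can be confounded by unobserved factors. To address this, the paper extends the proximal causal inference framework to partially observed Markov decision processes and proposes proximal reinforcement learning, a method for identifying and estimating the value of a target policy.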

Abstract

In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors, inducing confounding and biasing estimates derived assuming a perfect Markov decision process (MDP) model. In “Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes,” A. Bennett and N. Kallus tackle this by considering off-policy evaluation in a partially observed MDP (POMDP). Specifically, they consider estimating the value of a given target policy in an unknown POMDP, given observations of trajectories generated by a different and unknown policy, which may depend on the unobserved states. They consider both when the target policy value can be identified from the observed data and, given identification, how best to estimate it. Both these problems are addressed by extending the framework of proximal causal inference to POMDP settings, using sequences of so-called bridge functions. This results in a novel framework for off-policy evaluation in POMDPs that they term proximal reinforcement learning, which they validate in various empirical settings.
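
The confounding concern raised in the abstract can be illustrated with a minimal one-step sketch. This is not the authors' method and does not involve bridge functions; it only shows why an estimate that ignores an unobserved state can be biased. Every name and number below (`p_u`, `p_a1_given_u`, `mean_reward`, the probabilities, and the reward model) is a hypothetical assumption chosen for illustration.

```python
# Toy one-step model; all quantities are illustrative assumptions,
# not taken from the paper.
p_u = {0: 0.5, 1: 0.5}            # P(U = u); U is the unobserved state
p_a1_given_u = {0: 0.1, 1: 0.9}   # behavior policy P(A = 1 | U = u)


def mean_reward(u, a):
    """E[R | U = u, A = a] in this toy model."""
    return u + 0.5 * a


def true_value_always_treat():
    """Value of the target policy that always plays A = 1."""
    return sum(p_u[u] * mean_reward(u, 1) for u in p_u)


def naive_value_always_treat():
    """E[R | A = 1] under the behavior policy, i.e. the estimate
    obtained by treating the logged data as unconfounded."""
    joint = {u: p_u[u] * p_a1_given_u[u] for u in p_u}  # P(U = u, A = 1)
    p_a1 = sum(joint.values())                          # P(A = 1)
    return sum(joint[u] / p_a1 * mean_reward(u, 1) for u in p_u)


true_v = true_value_always_treat()
naive_v = naive_value_always_treat()
print(f"true value  = {true_v:.2f}")
print(f"naive value = {naive_v:.2f}")
```

In this toy model the naive estimate (about 1.40) overstates the true value (1.00), because the behavior policy plays A = 1 mostly when the unobserved state is already favorable.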

Keywords: reinforcement learning · partially observed Markov decision process · off-policy evaluation · causal inference
