A counterexample and a corrective to the vector extension of the Bellman equations of a Markov decision process
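Using a counterexample, this paper shows that White's extension of the Bellman equations to Markov decision processes with vector rewards does not hold under general conditions. It gives a sufficient condition for the equations to hold and shows that their solutions are sets of Pareto efficient policy returns.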
Abstract. Under the expected total reward criterion, the optimal value of a finite-horizon Markov decision process can be determined by solving the Bellman equations. White extended these equations to processes with vector rewards. Using a counterexample, we show that the assumptions underlying this extension fail to guarantee its validity. Analysis of the counterexample enables us to articulate a sufficient condition for White’s functional equations to be valid. The condition is shown to hold when the policy space has been refined to include a special class of non-Markovian policies, when the dynamics of the model are deterministic, and when the decision-making horizon does not exceed two time steps. The paper demonstrates that, in general, the solutions to White’s equations are sets of Pareto efficient policy returns over the refined policy space. Our results are illustrated with an example.
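To make the object of study concrete, the following is a minimal sketch of a set-valued backward recursion of the kind the abstract refers to. The abstract does not state White's equations, so the form used here is an assumption: at each stage and state, take every action, combine its immediate reward vector with one continuation vector per successor state weighted by the transition probabilities, and keep only the non-dominated results. All names (`pareto_filter`, `vector_backward_recursion`, the toy model) are hypothetical and chosen for illustration only.

```python
from itertools import product


def pareto_filter(vectors):
    """Keep the vectors that no other vector dominates (maximization)."""
    vecs = set(vectors)

    def dominates(u, v):
        # u dominates v if it is at least as good everywhere and differs from v.
        return u != v and all(a >= b for a, b in zip(u, v))

    return {v for v in vecs if not any(dominates(u, v) for u in vecs)}


def vector_backward_recursion(states, actions, P, R, horizon):
    """Assumed set-valued recursion; not taken verbatim from the paper.

    P[s][a] maps next_state -> probability, R[s][a] is a reward tuple.
    Returns U with U[t][s] a set of reward vectors and U[horizon][s] = {0}.
    """
    dim = len(next(iter(R[states[0]].values())))
    U = {horizon: {s: {(0.0,) * dim} for s in states}}
    for t in range(horizon - 1, -1, -1):
        U[t] = {}
        for s in states:
            candidates = set()
            for a in actions:
                succ = list(P[s][a].items())
                # Choose one continuation vector for each successor state.
                for choice in product(*(U[t + 1][j] for j, _ in succ)):
                    total = tuple(
                        R[s][a][k]
                        + sum(p * u[k] for (_, p), u in zip(succ, choice))
                        for k in range(dim)
                    )
                    candidates.add(total)
            U[t][s] = pareto_filter(candidates)
    return U


# Hypothetical toy model: deterministic dynamics, two objectives, horizon 2.
states = ["s0", "s1"]
actions = ["a", "b"]
P = {s: {a: {"s1": 1.0} for a in actions} for s in states}
R = {s: {"a": (1.0, 0.0), "b": (0.0, 1.0)} for s in states}

U = vector_backward_recursion(states, actions, P, R, horizon=2)
print(sorted(U[0]["s0"]))
```

The toy model is deterministic with a two-step horizon, which is among the cases the abstract lists as satisfying the sufficient condition. For stochastic models with longer horizons, the abstract's counterexample indicates that the sets produced by such a recursion need not coincide with the Pareto efficient returns of Markovian policies.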