Maximal Objectives in the Multiarmed Bandit with Applications
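Studies a new objective in the multiarmed bandit problem, maximizing the highest total reward across arms; derives theoretical lower bounds and designs an adaptive policy, with applications to ensuring an adequate supply of high-quality market entities on online platforms.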
In several applications of the stochastic multiarmed bandit problem, the traditional objective of maximizing the expected total reward can be inappropriate. In this paper, we study a new objective in the classic setup. Given K arms, instead of maximizing the expected total reward from T pulls (the traditional “sum” objective), we consider the vector of total rewards earned from each of the K arms at the end of T pulls and aim to maximize the expected highest total reward across arms (the “max” objective). For this objective, we show that any policy must incur an instance-dependent asymptotic regret of [Formula: see text] (with a higher instance-dependent constant compared with the traditional objective) and a worst-case regret of [Formula: see text]. We then design an adaptive explore-then-commit policy that explores using appropriately tuned confidence bounds on the mean reward and stops exploring according to an adaptive criterion; the policy adapts to the problem difficulty and simultaneously achieves both bounds (up to logarithmic factors). We also generalize our algorithmic insights to the problem of maximizing the expected value of the average total reward of the top m arms with the highest total rewards. Our numerical experiments demonstrate the efficacy of our policies compared with several natural alternatives in practical parameter regimes. We discuss applications of these new objectives to the problem of conditioning an adequate supply of value-providing market entities (workers/sellers/service providers) in online platforms and marketplaces.

This paper was accepted by Vivek Farias, data science.

Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2022.00801.
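
To make the difference between the “sum” and “max” objectives concrete, the following is a minimal sketch rather than the paper's policy. It assumes Bernoulli rewards and uses a plain explore-then-commit rule with a fixed exploration length, whereas the abstract describes an adaptive stopping criterion. All function and parameter names (`run_policy`, `explore_per_arm`, and so on) are hypothetical and chosen for illustration only.

```python
import random


def run_policy(means, horizon, explore_per_arm, rng):
    """Fixed-length explore-then-commit on Bernoulli arms (illustrative only).

    Returns the vector of total rewards earned from each arm after
    `horizon` pulls. The paper's policy stops exploring adaptively;
    this sketch uses a fixed exploration budget instead.
    """
    k = len(means)
    totals = [0.0] * k  # total reward earned from each arm
    pulls = [0] * k     # number of pulls of each arm

    def pull(arm):
        reward = 1.0 if rng.random() < means[arm] else 0.0
        totals[arm] += reward
        pulls[arm] += 1

    # Exploration phase: round-robin over the arms, capped at the horizon.
    explore_steps = min(horizon, explore_per_arm * k)
    for t in range(explore_steps):
        pull(t % k)

    # Commit phase: pull the arm with the best empirical mean so far.
    def empirical_mean(arm):
        return totals[arm] / pulls[arm] if pulls[arm] else 0.0

    best = max(range(k), key=empirical_mean)
    for _ in range(horizon - explore_steps):
        pull(best)
    return totals


def sum_objective(totals):
    """Traditional objective: total reward summed over all arms."""
    return sum(totals)


def max_objective(totals):
    """'Max' objective: the highest total reward earned by any single arm."""
    return max(totals)


def top_m_objective(totals, m):
    """Generalization: average total reward of the m highest-earning arms."""
    return sum(sorted(totals, reverse=True)[:m]) / m


if __name__ == "__main__":
    rng = random.Random(0)
    totals = run_policy([0.3, 0.5, 0.7], horizon=1000, explore_per_arm=50, rng=rng)
    print("per-arm totals:", totals)
    print("sum objective:", sum_objective(totals))
    print("max objective:", max_objective(totals))
    print("top-2 objective:", top_m_objective(totals, 2))
```

In this sketch, reward collected on arms other than the highest-earning one counts toward the sum objective but not toward the max objective, so the max objective never exceeds the sum objective on any single run.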