The Price of Incentivizing Exploration: A Characterization via Thompson Sampling and Sample Complexity
研究了多臂老虎机中算法如何通过信息不对称激励自利代理人探索,发现汤普森采样在初始数据充足时是激励相容的,性能损失仅限于收集样本的初始轮次,并刻画了样本复杂度与代理人信念及臂数的关系。
We consider “incentivized exploration”: a version of multiarmed bandits where the choice of arms is controlled by self-interested agents. The algorithm can only issue recommendations and needs to incentivize the agents to explore, even though they prefer to exploit. The algorithm controls the flow of information and relies on information asymmetry to create incentives. This line of work, motivated by misaligned incentives in recommendation systems, has received considerable attention in the “economics and computation” community. We focus on the “price of incentives”: the loss in performance, broadly construed, incurred for the sake of incentive compatibility. We prove that Thompson sampling, a standard bandit algorithm, is incentive compatible if initialized with sufficiently many data points. The performance loss because of incentives is, therefore, limited to the initial rounds when these data points are collected. The problem is largely reduced to that of sample complexity. How many rounds are needed to collect even one sample of each arm? We zoom in on this question, which is perhaps the most basic question one could ask about incentivized exploration. We characterize the dependence on agents' beliefs and the number of arms (which was essentially ignored in prior work), providing matching upper and lower bounds. Typically, the optimal sample complexity is polynomial in the number of arms and exponential in the “strength of beliefs.”