多项式逻辑斯谛情境赌博机的一个易处理在线学习算法

A tractable online learning algorithm for the multinomial logit contextual bandit

European Journal of Operational Research · 2023

被引 2

ABS 4

Priyank Agrawal 通讯
Theja Tulabandhula
Vashist Avadhanula

中文导读

针对多项式逻辑斯谛情境赌博机问题，提出一种乐观算法，将遗憾上界从O(κd√T)改进为O(d√T+κ)，并通过凸松弛实现易处理的决策，数值实验验证了算法对κ值的鲁棒性。

Abstract

In this paper, we consider the contextual variant of the MNL-Bandit problem. More specifically, we consider a dynamic set optimization problem, where a decision-maker offers a subset (assortment) of products to a consumer and observes the response in every round. Consumers purchase products to maximize their utility. We assume that a set of attributes describe the products, and the mean utility of a product is linear in the values of these attributes. We model consumer choice behavior using the widely used Multinomial Logit (MNL) model and consider the decision makers problem of dynamically learning the model parameters while optimizing cumulative revenue over the selling horizon T . Though this problem has recently attracted considerable attention, many existing methods often involve solving an intractable non-convex optimization problem. Their theoretical performance guarantees depend on a problem-dependent parameter which could be prohibitively large. In particular, current algorithms for this problem have regret bounded by O ( κ d T ) , where κ is a problem-dependent constant that may have an exponential dependency on the number of attributes, d . In this paper, we propose an optimistic algorithm and show that the regret is bounded by O ( d T + κ ) , significantly improving the performance over existing methods. Further, we propose a convex relaxation of the optimization step, which allows for tractable decision-making while retaining the favorable regret guarantee. We also demonstrate that our algorithm has robust performance for varying κ values through numerical experiments.

机器学习运筹优化消费者选择模型在线学习

阅读原文 ↗