Online Learning and Decision Making Under Generalized Linear Model with High-Dimensional Data
提出一种基于极小极大凹惩罚的广义线性模型多臂老虎机算法,在高维数据环境下实现最优累积遗憾,并通过两步加权Lasso方法估计参数,实验表明该算法优于基准方法。
We propose a minimax concave penalized multiarmed bandit algorithm under the generalized linear model (G-MCP-Bandit) for decision-makers facing high-dimensional data in an online learning and decision-making environment. We demonstrate that in the data-rich regime, the G-MCP-Bandit algorithm attains the optimal cumulative regret in the sample size dimension and a tight bound in the covariate dimension and the significant covariate dimension. In the data-poor regime, the G-MCP-Bandit algorithm maintains a tight regret upper bound. In addition, we develop a local linear approximation method, the two-step weighted Lasso procedure, to identify the minimax concave penalty (MCP) estimator for the G-MCP-Bandit algorithm when samples are not independent and identically distributed. Under this procedure, the MCP estimator can match the oracle estimator with high probability and converge to the true parameters at the optimal convergence rate. Finally, through experiments based on both synthetic and real data sets, we show that the G-MCP-Bandit algorithm outperforms other benchmarking algorithms in terms of cumulative regret and that the benefits of the G-MCP-Bandit algorithm increase in the data’s sparsity level and the size of the decision set. This paper was accepted by J. George Shanthikumar, big data analytics. Funding: This work was supported by the National Natural Science Foundation of China (NSFC) [Grants W2441021, 72371172, 72342023, and 71929101] and the National Science Foundation (NSF) [Grant DMS 1820702]. Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2022.01557 .