

A statistical view of column subset selection

Journal of the Royal Statistical Society. Series B: Statistical Methodology · 2025
Cited by: 1
ABS 4

Summary

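This paper proves that column subset selection, as formulated in computer science, is equivalent to principal variable selection in statistics, and that both can be viewed as maximum-likelihood estimation under a semi-parametric model, which yields new methods for variable selection in high-dimensional data.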

Abstract

We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as column subset selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of principal variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum-likelihood estimation within a certain semi-parametric model. Within this model, we establish suitable conditions under which the CSS estimate is consistent in high dimensions, specifically in the proportional asymptotic regime where the number of variables over the sample size converges to a constant. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
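As a rough illustration of point (1), the CSS reconstruction error depends on the data only through the covariance (or Gram) matrix, so a subset can be searched for without access to the raw observations. The sketch below uses a simple greedy heuristic, which is an assumption made here for illustration and not necessarily the procedure developed in the paper; the function name `greedy_css_from_cov` and the tolerance `tol` are hypothetical.

```python
import numpy as np


def greedy_css_from_cov(sigma, k, tol=1e-12):
    """Greedily select up to k variables using only a covariance matrix.

    Illustrative heuristic (not taken from the paper): at each step, add the
    variable that most reduces the trace of the residual covariance, i.e. the
    total variance left unexplained by regressing all variables on the subset.
    """
    resid = np.array(sigma, dtype=float, copy=True)
    p = resid.shape[0]
    available = np.ones(p, dtype=bool)
    selected = []
    for _ in range(k):
        diag = np.diag(resid)
        # Only consider unselected variables with non-negligible residual variance.
        ok = available & (diag > tol)
        if not ok.any():
            break
        # Reduction in trace(resid) if variable j is added: sum_i resid[i, j]^2 / resid[j, j].
        gain = np.full(p, -np.inf)
        gain[ok] = np.sum(resid[:, ok] ** 2, axis=0) / diag[ok]
        j = int(np.argmax(gain))
        selected.append(j)
        available[j] = False
        # Schur-complement update: partial out variable j from the residual covariance.
        resid -= np.outer(resid[:, j], resid[j, :]) / resid[j, j]
    return selected, float(np.trace(resid))


# Toy covariance in which the third variable is the sum of the first two.
sigma = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 1.0, 2.0]])
selected, residual = greedy_css_from_cov(sigma, k=2)
```

In this toy case two variables already span all three, so the residual variance after selecting two is zero. Greedy selection carries no optimality guarantee in general; it is shown only because it makes the summary-statistics structure of the objective explicit.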

Statistics · Computer science · Dimensionality reduction · Variable selection · High-dimensional data analysis