Discussion of ‘Multi-scale Fisher’s independence test for multivariate dependence’

Biometrika · 2022

Antonin Schrab (corresponding author)
Wittawat Jitkrittum
Zoltán Szabó
Dino Sejdinović
Arthur Gretton

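Summary

This paper discusses the multi-scale Fisher's independence test (MultiFIT) proposed by Gorsky & Ma (2022), compares it with kernel-based independence tests such as HSIC, and assesses the strengths and weaknesses of each method in terms of computational efficiency and test power.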

Abstract

We read with interest the work of Gorsky & Ma (2022) on statistical dependence testing using a multi-scale Fisher’s independence test, MultiFIT. The procedure consists of first transforming the data to map to the unit ball, then performing univariate Fisher’s exact tests of independence on a collection of $2 \times 2$ contingency tables, and finally correcting for the use of multiple testing. The collection is obtained using a divide-and-conquer approach with a coarse-to-fine procedure: the unit ball is partitioned into cuboids, and $2 \times 2$ contingency tables of counts of samples in the cuboids are tested; the cuboids with small associated $p$-values are then further partitioned at finer resolutions and tested again. This approach has a number of advantages; chief among them are that the test is multivariate, the computational cost is in general $O(n \log n)$ as a function of the sample size $n$, and the test threshold is exact at any sample size, not an asymptotic limit. The problem of computationally efficient, linear-time dependence testing is important to address, and a number of approaches have been proposed in the machine learning and statistics literature. In the present discussion, we will provide brief descriptions of some of these approaches, enumerating their advantages and disadvantages in comparison with MultiFIT. We will evaluate the performance in terms of power of all tests on both synthetic and real-world data.

We begin by placing the statistics against which Gorsky & Ma (2022) compared, namely the Brownian distance covariance and its generalizations (Székely & Rizzo, 2009; Lyons, 2013), within the broader framework of kernel-based independence testing. Sejdinovic et al. (2013, Theorem 24) established that the generalized distance covariance is an instance of a Hilbert–Schmidt independence criterion, HSIC (Gretton et al., 2005), which is the Hilbert–Schmidt norm of a covariance operator between features of $X$ and features of $Y$ in respective reproducing kernel Hilbert spaces. When these reproducing kernel Hilbert spaces are sufficiently rich, i.e., characteristic (Sriperumbudur et al., 2010), then the HSIC is zero if and only if $X$ and $Y$ are independent (Gretton, 2015; Szabó & Sriperumbudur, 2018). This is always the case for exponentiated quadratic kernels, and it is also true for the family of distance-induced kernels that define the Brownian distance covariance, subject to appropriate moment conditions (Sejdinovic et al., 2013, Proposition 29 and Remark 31). Statistical tests of independence using the HSIC have been proposed by Gretton et al. (2007), Chwialkowski & Gretton (2014) and Chwialkowski et al. (2014), with computational cost $O(n^2)$. In the next section, we describe approaches developed from these kernel statistics to yield greater computational efficiency and improved power.
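
To make the quadratic-time HSIC statistic discussed above concrete, here is a minimal Python sketch of the biased empirical estimator, $\mathrm{tr}(KHLH)/n^2$ with $H = I - n^{-1}\mathbf{1}\mathbf{1}^\top$, using exponentiated quadratic kernels. Several choices are assumptions made for illustration and are not taken from the cited papers: the function names, the median-heuristic bandwidth, and the use of a permutation procedure to calibrate the test (swapped in here for simplicity in place of any specific null approximation).

```python
import numpy as np


def gaussian_gram(z, bandwidth=None):
    """Gram matrix of the exponentiated quadratic kernel on the rows of z."""
    z = np.asarray(z, dtype=float)
    z = z.reshape(len(z), -1)  # treat a 1-D input as n samples of dimension 1
    sq_norms = np.sum(z ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * z @ z.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against rounding below zero
    if bandwidth is None:
        # Assumption: median heuristic (median pairwise distance) as bandwidth.
        positive = sq_dists[sq_dists > 0]
        bandwidth = np.sqrt(np.median(positive)) if positive.size else 1.0
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def hsic_biased(K, L):
    """Biased HSIC estimate tr(K H L H) / n^2, computed in O(n^2) time."""
    n = K.shape[0]
    # Double-centre K, i.e. form H K H without building H explicitly.
    Kc = K - K.mean(axis=0, keepdims=True) - K.mean(axis=1, keepdims=True) + K.mean()
    # For symmetric matrices, tr(Kc L) equals the sum of elementwise products.
    return np.sum(Kc * L) / n ** 2


def hsic_permutation_test(x, y, num_permutations=200, seed=0):
    """Return the HSIC statistic and a permutation p-value (illustrative only)."""
    rng = np.random.default_rng(seed)
    K = gaussian_gram(x)
    L = gaussian_gram(y)
    observed = hsic_biased(K, L)
    n = K.shape[0]
    exceed = 0
    for _ in range(num_permutations):
        perm = rng.permutation(n)
        # Permuting the y-samples breaks any dependence between x and y.
        if hsic_biased(K, L[np.ix_(perm, perm)]) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (num_permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    x = rng.normal(size=(100, 1))
    y = x ** 2 + 0.1 * rng.normal(size=(100, 1))  # nonlinear dependence on x
    stat, p_value = hsic_permutation_test(x, y)
    print(f"HSIC = {stat:.5f}, permutation p-value = {p_value:.4f}")
```

Each evaluation of the statistic touches every entry of the two $n \times n$ Gram matrices, which is where the $O(n^2)$ cost comes from; the permutation loop multiplies that cost by the number of permutations.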

Statistics · Multivariate statistics · Independence testing · Kernel methods · Computational efficiency