Junhui Cai, Dan Yang, Linda Zhao and Wu Zhu's contribution to the Discussion of ‘Vintage Factor Analysis with Varimax Performs Statistical Inference’ by Rohe & Zeng

Journal of the Royal Statistical Society. Series B: Statistical Methodology · 2023

Cited by: 0

ABS 4

Dan Yang
Wu Zhu
Junhui Cai (corresponding author)
Linda Zhao

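Summary

Proposes hierarchical vintage sparse PCA (Hvsp), which combines hierarchical clustering with vsp to identify hierarchically structured factors/communities and important nodes in networks, and demonstrates an application to statistics citation network data.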

Abstract

We congratulate the authors on their excellent article (Rohe & Zeng, 2022). Factors and communities in networks are often hierarchically structured, as demonstrated by the academic bibliometrics example in the paper. To (1) identify factors/communities with hierarchical structure and (2) identify the important individuals/nodes within these factors/communities, we propose hierarchical vintage sparse PCA (Hvsp), which accounts for the hierarchical structure while taking advantage of vsp’s capability of performing statistical inference.

Hvsp combines the ideas of hierarchical clustering and vsp. Specifically, Hvsp follows a top-down hierarchical partitioning: it recursively applies vsp with dimension k = 2 to split the nodes into two communities, eventually producing a binary tree. Compared with existing hierarchical clustering methodologies in network analysis, Hvsp not only explores the hierarchical structure but also inherits vsp’s advantages in computation and interpretability. In addition, the rotated principal component provides an importance score for each individual/node in its corresponding factor/community, analogous to the popular eigenvector centrality measure in network analysis. The detailed algorithm is described in Algorithm 1.

To gauge the performance of Hvsp in community detection, we adopt the binary tree stochastic block model (BTSBM), which captures a binary tree community structure (Li et al., 2020). We first use a toy example of a four-cluster balanced BTSBM (Figure 1) to provide insight, comparing singular value decomposition (SVD) vs. vsp with dimension k = 4 (Figure 2) and hierarchical community detection (HCD) (Li et al., 2020) vs. Hvsp (Figure 3). As expected, we observe radial streaks in the pairs of principal components in Figure 2a, and the Varimax rotation aligns the streaks with the coordinate axes in Figure 2b. However, neither accounts for the hierarchical structure.
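
For concreteness, the toy BTSBM could be simulated along the following lines. This is a minimal sketch, not the authors' code. It assumes that p0, p1 and p2 apply to pairs of nodes that share a Layer 3 community, only a Layer 2 mega-community, or only the root, respectively; that self-loops are excluded; and that "scale accordingly" means multiplying all probabilities by one common factor to reach the target expected average degree. The function names `btsbm_probabilities` and `sample_graph` are illustrative only.

```python
import numpy as np

def btsbm_probabilities(n_per_block=512, depth=2, p=(1.0, 0.3, 0.09), avg_degree=50.0):
    """Edge-probability matrix of a balanced BTSBM with 2**depth leaf communities."""
    block = np.repeat(np.arange(2 ** depth), n_per_block)  # leaf label of each node
    n = block.size
    # dist[i, j] = number of tree levels at which i and j sit in different groups
    # (0: same community, 1: same mega-community only, 2: different mega-communities).
    dist = np.zeros((n, n), dtype=int)
    for level in range(depth):
        dist += (block[:, None] >> level) != (block[None, :] >> level)
    P = np.asarray(p, dtype=float)[dist]
    np.fill_diagonal(P, 0.0)                  # assumption: no self-loops
    P *= avg_degree / P.sum(axis=1).mean()    # assumption: one common scale factor
    return P, block

def sample_graph(P, rng):
    """Draw a symmetric 0/1 adjacency matrix with independent Bernoulli edges."""
    upper = np.triu(rng.random(P.shape) < P, k=1)
    return (upper | upper.T).astype(int)
```

With its defaults, `btsbm_probabilities()` uses the stated toy settings (2,048 nodes, p0 = 1, p1 = 0.3, p2 = 0.09, expected average degree 50); under the assumptions above the common scale factor works out to roughly 0.066. A graph is then drawn with `sample_graph(P, np.random.default_rng(seed))` for any chosen seed.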
On the other hand, HCD and Hvsp split the communities layer by layer. In addition, Hvsp aligns the principal components with the coordinate axes, which provides a measure of the importance/centrality of each node in each community at different levels. We further compare the clustering performance of HCD, Hvsp, and vsp using normalized mutual information (NMI) (Yao, 2003), varying the number of communities and the average degree of nodes, in Figure 4. HCD and Hvsp perform similarly, while vsp falls behind.

Figure 1. A four-cluster binary tree stochastic block model (BTSBM) (Li et al., 2020). The binary tree has three layers: Layer 1 includes all nodes, Layer 2 splits them into two mega-communities {0, 1}, and Layer 3 further splits them into four communities {00, 01, 10, 11}. Each colour corresponds to one community in Layer 3. In Layer 2, the mega-community {0} includes {00, 01} (red and purple) and the mega-community {1} includes {10, 11} (green and teal). Edges between nodes within the same community/mega-community are assumed to be independent Bernoulli with probability p0, p1, or p2 depending on the layer. It is most natural to assume that the communities are assortative, p0 > p1 > p2, so that the communities are more closely connected as the hierarchical tree goes deeper; or, vice versa, dis-assortative, where p0 < p1 < p2. In the toy example, we generate a balanced four-cluster BTSBM with 2,048 nodes, where each mega-community at Layer 2 has 1,024 nodes and each community at Layer 3 has 512 nodes. We let p0 = 1, p1 = 0.3, and p2 = 0.09 and scale accordingly so that the expected average degree of nodes is 50.

Figure 2. Scatter plots of pairs of principal components by SVD (a) and pairs of Varimax-rotated components (b). The colour corresponds to each community at Layer 3 in Figure 1. The radial streaks appear in (a), while the Varimax rotation aligns the streaks with the coordinate axes in (b), providing a sparse representation. However, neither provides a hierarchical structure.

Figure 3. Scatter plots of pairs of principal components by SVD and pairs of Varimax-rotated components. The rows correspond to hierarchical community detection (HCD-sign) (Li et al., 2020), which first performs SVD with dimension k = 2 and then assigns labels based on the sign of the second component, and to the proposed Hvsp; the columns correspond to the first split among all nodes (Layer 1) and to the splits of mega-community {0} into {00, 01} and of mega-community {1} into {10, 11} (Layer 2). The colour corresponds to each community at Layer 3 in Figure 1. HCD and Hvsp split the communities layer by layer and reveal the hierarchical structure. In addition, Hvsp aligns the principal components with the coordinate axes so as to provide a sparse representation. The rotated components further provide a measure of the importance (importance score) of each node in each community, as suggested by Rohe & Zeng (2022). Furthermore, the importance score can be provided layer by layer.

Figure 4. The normalized mutual information (NMI) (Yao, 2003) between the true labels and the estimated labels obtained by HCD-sign, Hvsp, and vsp, varying the number of communities and the average degree of nodes. The simulation setup follows Section 4.1 in Li et al. (2020). A larger NMI suggests better clustering performance. HCD-sign and Hvsp perform similarly, while vsp falls behind. We compare the performance with more metrics at https://github.com/cccfran/Hvsp-paper.

Finally, we apply Hvsp to the three-core of the largest connected component of a statistics citation network (2003–2012) (Ji & Jin, 2016; Li et al., 2020). Figure 5 shows the hierarchical communities, whose labels are based on the research interests of the ten statisticians with the highest scores within each community in Table 1. Hvsp clusters related communities together, and the communities become more refined as the hierarchical tree goes deeper.
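
As a reminder of how the NMI used in the simulation comparison can be computed from two label vectors, here is a small sketch. The text does not say which normalisation is used, so dividing the mutual information by the arithmetic mean of the two entropies is an assumption, and the function name `nmi` is illustrative.

```python
from collections import Counter
from math import log

def nmi(true_labels, est_labels):
    """Normalized mutual information between two labelings of the same nodes."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, est_labels))   # counts of label pairs
    n_true = Counter(true_labels)
    n_est = Counter(est_labels)
    # Mutual information from the empirical joint and marginal frequencies.
    mi = sum(c / n * log(c * n / (n_true[a] * n_est[b]))
             for (a, b), c in joint.items())
    h_true = -sum(c / n * log(c / n) for c in n_true.values())
    h_est = -sum(c / n * log(c / n) for c in n_est.values())
    if h_true == 0 and h_est == 0:
        return 1.0                                  # both labelings are a single cluster
    # Assumption: normalise by the arithmetic mean of the two entropies.
    return 2 * mi / (h_true + h_est)
```

The score is invariant to relabelling of clusters, so the arbitrary "0"/"1" names produced by a recursive splitting method do not need to be matched to the true community names before comparison.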
Figure 5. The dendrogram of 11 communities of the three-core of the statistics citation network from 2003 to 2012, obtained by Hvsp using edge cross-validation (ECV) as a stopping rule (Li et al., 2020). Research areas are manually labelled based on the research interests of the 10 statisticians with the highest importance scores in Table 1; each label is followed by the community size in parentheses. The labelling can be made algorithmic, for example by using the ‘best feature function’ bff (Wang & Rohe, 2016). We provide the clustering result obtained with the nonbacktracking method (Le & Levina, 2015) as the stopping rule at https://github.com/cccfran/Hvsp-paper.

Table 1. The 15 statisticians with the highest importance scores in each community of the 2003–2012 citation network.

Algorithm 1. The hierarchical vintage sparse PCA (Hvsp) algorithm.

We include a GitHub repository in our manuscript that provides all the code and data.
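
The body of Algorithm 1 is not reproduced in this text, so the following is only a rough sketch of the recursive splitting as it is described in words in the abstract, not the authors' implementation. Several choices are assumptions: the Varimax rotation for two columns uses Kaiser's closed-form planar angle instead of a general iterative routine; each node is assigned to the rotated component on which it has the larger absolute loading, and that absolute loading is used as its importance score; a placeholder stopping rule (`max_depth`, `min_size`) stands in for ECV or the nonbacktracking method; and vsp's optional centering and sign adjustment are omitted. `hcd_sign_split` follows the verbal description of HCD-sign for comparison. All function names are illustrative.

```python
import numpy as np

def varimax_2d(U):
    """Varimax-rotate an n x 2 matrix using Kaiser's closed-form planar angle."""
    x, y = U[:, 0], U[:, 1]
    n = len(x)
    u, v = x**2 - y**2, 2 * x * y
    A, B = u.sum(), v.sum()
    C, D = (u**2 - v**2).sum(), 2 * (u * v).sum()
    phi = 0.25 * np.arctan2(D - 2 * A * B / n, C - (A**2 - B**2) / n)
    R = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])
    return U @ R

def split_once(A):
    """One vsp-style step with k = 2: labels in {0, 1} and a score per node."""
    n = A.shape[0]
    U = np.linalg.svd(A)[0][:, :2]         # two leading left singular vectors
    Z = np.sqrt(n) * varimax_2d(U)         # rotated, rescaled components
    labels = np.abs(Z).argmax(axis=1)      # assumption: dominant axis = community
    scores = np.abs(Z)[np.arange(n), labels]
    return labels, scores

def hcd_sign_split(A):
    """HCD-sign style split: sign of the second singular vector."""
    return (np.linalg.svd(A)[0][:, 1] > 0).astype(int)

def hvsp(A, nodes=None, path="", max_depth=2, min_size=5):
    """Top-down recursive splitting; returns {node index: binary path label}."""
    if nodes is None:
        nodes = np.arange(A.shape[0])
    if len(path) == max_depth or len(nodes) < 2 * min_size:
        return {int(i): path for i in nodes}
    labels, _ = split_once(A[np.ix_(nodes, nodes)])
    left, right = nodes[labels == 0], nodes[labels == 1]
    if min(len(left), len(right)) < min_size:   # placeholder stopping rule
        return {int(i): path for i in nodes}
    out = hvsp(A, left, path + "0", max_depth, min_size)
    out.update(hvsp(A, right, path + "1", max_depth, min_size))
    return out
```

Calling `hvsp(A)` on a symmetric adjacency matrix returns leaf labels such as "00" or "10", where each character records one binary split; which side is called "0" or "1" is arbitrary. Restricting the closed-form rotation to two columns keeps the sketch short, which is enough here because every split uses k = 2.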

Keywords: network analysis, factor analysis, community detection, hierarchical clustering, statistical inference
