Fraud Detection by Integrating Multisource Heterogeneous Presence-Only Data

INFORMS Journal on Computing · 2024
Cited by: 0
Journal rankings: Renmin University (人大) B · UTD24 · ABS 3

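Summary

The paper proposes an integrative PU learning method (I-PU) that pools multisource heterogeneous positive-unlabeled data. It penalizes group differences to automatically identify the cluster structure of coefficients and uses bilevel selection to detect sparse structure. The estimator is proven to have the oracle property, and simulations and real data show that it outperforms direct data merging or separate modeling in variable selection, parameter estimation, and prediction.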

Abstract

In credit fraud detection practice, certain fraudulent transactions often evade detection because of the hidden nature of fraudulent behavior. To address this issue, a growing number of financial institutions have adopted positive-unlabeled (PU) learning techniques. However, most of these methods are designed for a single data set and do not account for the heterogeneity that arises when data are collected from different sources. In this paper, we propose an integrative PU learning method (I-PU) for pooling information from multiple heterogeneous PU data sets. We develop a novel approach that penalizes group differences to explicitly and automatically identify the cluster structures of coefficients across data sets, thus offering a plausible interpretation of heterogeneity. Furthermore, we apply a bilevel selection method to detect sparse structure at both the group level and the within-group level. Theoretically, we show that the proposed estimator has the oracle property. Computationally, we design an expectation-maximization (EM) algorithm framework and propose an alternating direction method of multipliers (ADMM) algorithm to solve it. Simulation results show that the proposed method has better numerical performance in terms of variable selection, parameter estimation, and prediction ability. Finally, a real-world application showcases the effectiveness of our method in identifying distinct coefficient clusters and its superior prediction performance compared with direct data merging or separate modeling. This result also offers valuable insights for financial institutions in developing targeted fraud detection systems.

History: Accepted by Ram Ramesh, Area Editor for Data Science & Machine Learning.
Funding: This work was supported by the National Natural Science Foundation of China [Grants 72071169, 72231005, 72233002, and 72471169], the Fundamental Research Funds for the Central Universities of China [Grant 20720231060], the National Social Science Fund of China [Grant 21&ZD146], and the Shuimu Tsinghua Scholar Program.

Supplemental Material: The software that supports the findings of this study is available within the paper and its Supplemental Information (https://pubsonline.informs.org/doi/suppl/10.1287/ijoc.2023.0366) as well as from the IJOC GitHub software repository (https://github.com/INFORMSJoC/2023.0366). The complete IJOC Software and Data Repository is available at https://informsjoc.github.io/.
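
As a rough illustration of what "penalizing group differences" can mean, the sketch below computes a pairwise fusion penalty over per-source coefficient vectors and then groups sources whose coefficients coincide. It is not the authors' implementation, which is in the repository linked above. The abstract does not state the exact penalty, so the plain L2 norm used here is an assumption, and the function names, the tuning value `lam`, and the toy coefficients are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def l2_distance(a, b):
    """Euclidean distance between two coefficient vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_fusion_penalty(coefs, lam):
    """Sum of lam * ||beta_m - beta_k||_2 over all pairs of data sources.

    Assumption: a plain L2 norm on differences; the paper's penalty may differ.
    """
    return sum(lam * l2_distance(b_m, b_k) for b_m, b_k in combinations(coefs, 2))


def coefficient_clusters(coefs, tol=1e-8):
    """Group data sources whose coefficient vectors coincide (up to tol)."""
    clusters = []
    for idx, beta in enumerate(coefs):
        for cluster in clusters:
            # Compare against the first member of each existing cluster.
            if l2_distance(beta, coefs[cluster[0]]) <= tol:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


# Three hypothetical data sources; the first two share the same coefficients.
coefs = [
    [1.0, 0.0, -2.0],
    [1.0, 0.0, -2.0],
    [4.0, 0.0, 2.0],
]
penalty = pairwise_fusion_penalty(coefs, lam=0.5)
clusters = coefficient_clusters(coefs)
print(penalty, clusters)
```

Sources with identical coefficients add nothing to the penalty, so minimizing a loss plus such a term pushes similar sources toward a shared coefficient vector. The resulting groups are what can be read as clusters of data sets.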

Keywords: credit fraud detection · positive-unlabeled learning · data mining · machine learning