An Optimization Framework With Imbalance-Ratio and Position for Imbalanced Noisy Classification
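To address imbalanced classification and label noise, this work proposes an optimization framework, OIRP, which builds two optimization models to determine the optimal number and positions of new samples. It improves the performance of existing oversampling algorithms, and experiments verify its effectiveness.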
Imbalanced classification and label noise are two common problems in machine learning and data mining, and both are widespread in real-world datasets. Data oversampling is one potential solution, but current oversampling techniques have inherent limitations: traditional oversamplers cannot adequately determine how many new samples to generate, and they cannot mitigate the impact of noise on the new samples. To solve these two problems theoretically, we propose an optimization framework for imbalance-ratio and position (OIRP). OIRP constructs two optimization models that use information on feature distribution, class ratio, sample quantity, and dimensionality to determine the optimal number of new samples to generate and their optimal positions. Both models are shown to have optimal solutions. OIRP is used to enhance existing oversampling algorithms; it is adaptive, parameter-free, and universally applicable. The experiments cover dozens of datasets with different sizes, imbalance ratios, and noise rates, and include comparisons with classical and state-of-the-art (SOTA) algorithms. The experimental results show that OIRP significantly improves existing oversamplers. The code, datasets, and experimental results of this work are available at https://github.com/adsl305480885/OIRP-binary-Oversampling.
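
To make the two decisions concrete, the sketch below shows a conventional baseline oversampler, not OIRP itself. It is a simplified, SMOTE-like random interpolation without nearest-neighbor search, and every name in it (`balance_count`, `interpolate_oversample`, the toy data) is hypothetical. The quantity of new samples is fixed by the common "generate until the classes are balanced" heuristic, and each position is a random point on the segment between two minority samples. These are the two choices that OIRP instead determines through its optimization models.

```python
import random


def balance_count(n_majority, n_minority):
    """Conventional heuristic: generate enough samples to equalize class sizes."""
    return max(0, n_majority - n_minority)


def interpolate_oversample(minority, n_new, rng=None):
    """Create n_new synthetic points by linear interpolation between
    randomly chosen pairs of minority samples (simplified, no k-NN step)."""
    rng = rng or random.Random(0)
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # two distinct minority samples
        t = rng.random()                 # random position along the segment a -> b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


# Toy 2-D data (illustrative only): 6 majority samples, 3 minority samples.
majority = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.2]]
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]

n_new = balance_count(len(majority), len(minority))  # quantity: fixed by heuristic
new_points = interpolate_oversample(minority, n_new, random.Random(42))  # position: random
```

In this baseline, neither the count nor the interpolation position takes label noise or the feature distribution into account, which is the gap the abstract describes.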