A p-median based approach to constrained clustering
研究了将p中位数模型与实例级约束(必须连接和不能连接)结合进行约束聚类,提出拉格朗日松弛算法和变量约简技术,在基准和家庭用电数据集上高效获得高质量解,且相比k均值聚类,约束p中位数聚类在能耗变异性、簇大小分布和收入分层方面表现更优。
Abstract Constrained clustering is a semi-supervised method for dividing multi-attribute datasets into groups of similar elements that incorporates restrictions on how data points should or should not be clustered together. In this study, we investigate the application of the p-median model coupled with instance-level constraints, specifically must-link and cannot-link constraint, for performing constrained clustering. A Lagrangian relaxation procedure is proposed to efficiently solve large problem instances. To further speed up solution times, a simple but highly effective variable reduction technique is devised for identifying a small set of potential cluster centers. We test our modeling framework and solution approach on a set of benchmark datasets and several real-world household electricity consumption datasets. We find that high-quality solutions with small optimality gaps can be obtained with moderate computational effort. In addition, analysis of the largest household electricity consumption dataset reveals that constrained p-median clusters have less extreme variability with respect to average energy usage, a more even spread of cluster sizes, and much less overlap between high- and low-income households within the same clusters compared to classic k-means clustering.