Online Semi-Supervised Classification on Multilabel Evolving High-Dimensional Text Streams
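This work proposes OSMTS, an online semi-supervised classification algorithm that uses a small number of labeled instances to dynamically maintain term-subspace micro-clusters for each label, predicts with a nonparametric Dirichlet model, and handles both gradual and abrupt concept drift. Experiments compare it against 12 algorithms on 9 datasets in terms of classification performance, runtime, and memory consumption.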
The multilabel learning task aims to predict all classes associated with a given example simultaneously. The task becomes more challenging when data arrive as a stream, since it then requires an algorithm that adapts to concept drift and is both robust and fast. In this article, we present an online semi-supervised classification algorithm (OSMTS) for multilabel text streams. By leveraging a few labeled instances, OSMTS dynamically maintains the subspace of terms for each label with a set of evolving micro-clusters. For multilabel classification, the $k$ nearest micro-clusters are used for prediction with a nonparametric Dirichlet model. To handle gradual concept drift in the term space, a triangular time function is adopted to calculate the difference between a term's arrival time and the cluster life span. Abrupt concept drift is handled by two procedures: 1) deleting outdated micro-clusters by means of an exponential decay function and 2) creating new micro-clusters by adopting the Chinese restaurant process based on the Dirichlet process. The experimental study compares OSMTS with 12 state-of-the-art algorithms on nine datasets in terms of classification performance, runtime, and memory consumption.
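The two procedures for abrupt concept drift can be illustrated with a minimal sketch. It is not the OSMTS implementation: the function names, the dictionary layout of a micro-cluster, the decay form `exp(-rate * elapsed)`, and the parameters `decay_rate`, `min_weight`, and `alpha` are all assumptions made for illustration. Only the general ideas (exponential decay for deleting outdated micro-clusters, and the Chinese restaurant process for opening new ones) come from the abstract.

```python
import math


def decayed_weight(weight, last_update, now, decay_rate=0.01):
    """Weight of a micro-cluster after exponential decay (assumed form)."""
    return weight * math.exp(-decay_rate * (now - last_update))


def prune_outdated(clusters, now, decay_rate=0.01, min_weight=0.1):
    """Keep only the micro-clusters whose decayed weight is at least min_weight.

    Each cluster is a dict with hypothetical keys "weight" and "last_update".
    """
    return [
        c for c in clusters
        if decayed_weight(c["weight"], c["last_update"], now, decay_rate) >= min_weight
    ]


def crp_probabilities(cluster_sizes, alpha=1.0):
    """Chinese restaurant process prior over cluster assignments.

    An existing cluster i is chosen with probability n_i / (n + alpha);
    a new cluster is opened with probability alpha / (n + alpha),
    where n is the total number of instances assigned so far.
    """
    total = sum(cluster_sizes) + alpha
    existing = [n / total for n in cluster_sizes]
    new = alpha / total
    return existing, new


if __name__ == "__main__":
    clusters = [
        {"weight": 1.0, "last_update": 0},   # not updated for a long time
        {"weight": 1.0, "last_update": 99},  # recently updated
    ]
    kept = prune_outdated(clusters, now=100, decay_rate=0.05, min_weight=0.1)
    print(len(kept))

    existing, new = crp_probabilities([3, 1], alpha=1.0)
    print(existing, new)
```

In this sketch, a larger `alpha` makes opening a new micro-cluster more likely, while a larger `decay_rate` removes stale micro-clusters sooner. A full classifier would combine the CRP prior with a likelihood over terms before deciding whether to create a cluster.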