A Bayesian Semisupervised Approach to Keyword Extraction with Only Positive and Unlabeled Data

INFORMS Journal on Computing · 2023

Cited by 3

Journal rankings: Renmin University of China list B · UTD24 · ABS 3

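Qiang Ling · University of Science and Technology of China
Guanshen Wang · Southern Methodist University
Yichen Cheng · Georgia State University
Yusen Xia · Georgia State University
Xinlei Wang · University of Texas at Arlington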

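Summary

The paper proposes a Bayesian semisupervised probabilistic model that uses graph-based information and a small number of known keywords, performs inference with Markov chain Monte Carlo algorithms, and controls the number of selected keywords through the false discovery rate. On benchmark data it outperforms existing methods.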

Abstract

In the era of big data, people benefit from the existence of tremendous amounts of information. However, availability of said information may pose great challenges. For instance, one big challenge is how to extract useful yet succinct information in an automated fashion. As one of the first few efforts, keyword extraction methods summarize an article by identifying a list of keywords. Many existing keyword extraction methods focus on the unsupervised setting, with all keywords assumed unknown. In reality, a (small) subset of the keywords may be available for a particular article. To use such information, we propose a rigorous probabilistic model based on a semisupervised setup. Our method incorporates the graph-based information of an article into a Bayesian framework via an informative prior so that our model facilitates formal statistical inference, which is often absent from existing methods. To overcome the difficulty arising from high-dimensional posterior sampling, we develop two Markov chain Monte Carlo algorithms based on Gibbs samplers and compare their performance using benchmark data. We use a false discovery rate (FDR)-based approach for selecting the number of keywords, whereas the existing methods use ad hoc threshold values. Our numerical results show that the proposed method compared favorably with state-of-the-art methods for keyword extraction.

History: Accepted by Ramaswamy Ramesh, Area Editor for Data Science and Machine Learning.

Supplemental Material: The software that supports the findings of this study is available within the paper and its Supplemental Information (https://pubsonline.informs.org/doi/suppl/10.1287/ijoc.2023.1283) as well as from the IJOC GitHub software repository (https://github.com/INFORMSJoC/2021.0234) at (http://dx.doi.org/10.5281/zenodo.7348935).
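To illustrate how an FDR-based rule can replace an ad hoc threshold, the following is a minimal sketch of a generic Bayesian FDR selection step. It is not taken from the paper or its repository and may differ from the authors' procedure. The function name `select_by_bayesian_fdr`, the target level `alpha`, and the example probabilities are all assumptions made for illustration. The input is assumed to be a vector of posterior probabilities that each candidate word is a keyword, for example the average of indicator draws from an MCMC run.

```python
import numpy as np

def select_by_bayesian_fdr(post_prob, alpha=0.1):
    """Return indices of the largest top-ranked set whose estimated FDR <= alpha.

    post_prob[i] is an assumed posterior probability that candidate i is a
    keyword. The estimated FDR of a set is the mean of (1 - post_prob) over it.
    """
    post_prob = np.asarray(post_prob, dtype=float)
    order = np.argsort(-post_prob)        # most probable candidates first
    local_fdr = 1.0 - post_prob[order]    # P(not a keyword | data), ascending
    running_fdr = np.cumsum(local_fdr) / np.arange(1, len(order) + 1)
    # running_fdr is nondecreasing, so count how many prefixes stay <= alpha
    k = int(np.searchsorted(running_fdr, alpha, side="right"))
    return order[:k]

# Hypothetical posterior probabilities for five candidate words
probs = [0.99, 0.20, 0.95, 0.60, 0.90]
selected = select_by_bayesian_fdr(probs, alpha=0.1)
```

With this rule the number of keywords is determined by the chosen error level rather than fixed in advance: raising `alpha` admits more candidates, lowering it admits fewer.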

Keywords: keyword extraction · Bayesian methods · semisupervised learning · Markov chain Monte Carlo
