Local and global topics in text modeling of web pages nested in web sites

Computational Statistics and Data Analysis · 2022
Cited by: 2
ABS 3

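Summary

For the hierarchical structure of web pages nested in web sites, the paper proposes a topic model with a hierarchical prior that distinguishes global topics from local topics, identifies the web site each local topic belongs to, and applies the model to analyzing topic coverage on health web sites.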

Abstract

Topic models assert that documents are distributions over latent topics and that latent topics are distributions over words. A nested document collection has documents nested inside a higher-order structure, such as articles nested in journals, podcasts within authors, or web pages nested in web sites. In a single collection of documents, topics are global, that is, shared across all documents. For web pages nested in web sites, topic frequencies likely vary across web sites, and within a web site they almost certainly vary from web page to web page. A hierarchical prior for topic frequencies models this structure with a global topic distribution, web site topic distributions varying around the global topic distribution, and web page topic distributions varying around the web site topic distribution. Web pages in one United States local health department web site often contain local geographic and news topics not found on web pages of other local health department web sites. For web pages nested in web sites, some topics are therefore likely to be local topics, unique to an individual web site. Regular topic models ignore the nesting structure; they may identify local topics but can neither label those topics as local nor identify the corresponding web site owner. Explicitly modeling local topics identifies the owning web site and identifies the topic as local. In US health web site data, topic coverage is defined at the web site level after removing local topic words from pages. Hierarchical local topic models can be used to study how well health topics are covered.
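The three-level prior described in the abstract can be illustrated with a small generative sketch. This is not the paper's model: the abstract does not give the parameterization, so the Dirichlet form, the function name `sample_nested_topic_weights`, and the concentration parameters `site_conc` and `page_conc` are all assumptions made for illustration, and local topics are left out entirely.

```python
import numpy as np

def sample_nested_topic_weights(n_topics, pages_per_site,
                                site_conc=50.0, page_conc=10.0, seed=0):
    """Draw global, web-site, and web-page topic distributions.

    Assumed (hypothetical) form: each level is a Dirichlet draw centered
    on the level above; a larger concentration means less variation.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-3  # keeps every Dirichlet parameter strictly positive

    # Global topic distribution shared by the whole collection.
    global_dist = rng.dirichlet(np.ones(n_topics))

    site_dists, page_dists = [], []
    for n_pages in pages_per_site:
        # Web site distribution varies around the global distribution.
        site = rng.dirichlet(site_conc * global_dist + eps)
        # Web page distributions vary around their own site's distribution.
        pages = rng.dirichlet(page_conc * site + eps, size=n_pages)
        site_dists.append(site)
        page_dists.append(pages)

    return global_dist, np.array(site_dists), page_dists

# Two web sites with 3 and 2 pages, 4 topics.
g, sites, pages = sample_nested_topic_weights(4, [3, 2])
print(g.round(3))
print(sites.round(3))
print(pages[0].round(3))
```

Under these assumptions, raising `site_conc` pulls every web site toward the global distribution, while lowering `page_conc` lets pages within a site differ more from one another.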

Keywords: text modeling, topic models, web page analysis, information retrieval