Word-Level Maximum Mean Discrepancy Regularization for Word Embedding

Journal of the American Statistical Association · 2025

Cited by 0

ABS 4

Youqian Gao
Ben Dai (corresponding author)

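Summary

This article proposes a word-level maximum mean discrepancy regularization method that prevents overfitting by preserving the distribution discrepancies within word embedding vectors, improving the robustness and generalization of natural language processing models.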

Abstract

The technique of word embedding is widely used in natural language processing (NLP) to represent words as numerical vectors in textual datasets. However, the estimation of word embedding may suffer from severe overfitting due to the huge variety of words. To address the issue, this article proposes a novel regularization framework that recognizes and accounts for the “word-level distribution discrepancy,” a common phenomenon in a range of NLP tasks where word distributions are noticeably disparate under different labels. The proposed regularization, referred to as word-level MMD (wMMD), is a variant of maximum mean discrepancy (MMD) that serves a specific purpose: to enhance or preserve the distribution discrepancies within word embedding numerical vectors and thus prevent overfitting. Our theoretical analysis illustrates that wMMD can effectively operate as a dimension reduction technique for word embedding, thereby significantly improving the robustness and generalization of NLP models. The numerical effectiveness of wMMD and its variants is demonstrated in various simulated examples and on the CE-T1 and BBC News datasets with state-of-the-art NLP deep learning architectures.
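
For readers unfamiliar with MMD, the following is a minimal sketch of the standard (biased) squared-MMD estimate with a Gaussian kernel, applied to two small sets of embedding vectors. It illustrates generic MMD only, not the wMMD regularizer itself, whose exact form the abstract does not give. The function names, the bandwidth value, and the toy 2-dimensional embeddings are all assumptions made for illustration.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))

# Hypothetical 2-d embeddings of words observed under two different labels.
emb_label_0 = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
emb_label_1 = [[2.0, 1.9], [1.8, 2.1], [2.1, 2.0]]

same = mmd_squared(emb_label_0, emb_label_0)  # identical samples
diff = mmd_squared(emb_label_0, emb_label_1)  # well-separated samples
print(same, diff)
```

The estimate is zero when both samples are identical and grows as the two sets of embeddings separate, which is the kind of label-wise discrepancy the abstract describes. How such a quantity enters the training objective is not stated in the abstract, so the sketch stops at computing it.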

Natural language processing · Word embedding · Regularization · Overfitting · Deep learning
