Multimodal Weibull Variational Autoencoder for Jointly Modeling Image-Text Data

IEEE Transactions on Cybernetics · 2021

Cited by 6

ABS 3

Chaojie Wang
Bo Chen
Zhengjue Wang
Penghui Wang
Sucheng Xiao
Hao Zhang

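Summary

Proposes a multimodal Weibull variational autoencoder that combines a Poisson gamma belief network with Weibull variational inference to quickly extract interpretable multimodal latent representations for tasks such as missing modality imputation and multimodal retrieval.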

Abstract

For multimodal representation learning, traditional black-box approaches often fall short of extracting interpretable multilayer hidden structures, which help visualize the connections between different modalities at multiple semantic levels. To extract interpretable multimodal latent representations and visualize the hierarchical semantic relationships between different modalities, we build on deep topic models and develop a novel multimodal Poisson gamma belief network (mPGBN) that tightly couples the observations of different modalities by imposing sparse connections between their modality-specific hidden layers. To avoid the time-consuming Gibbs sampler that traditional topic models use at test time, we construct a Weibull-based variational inference network (encoder) that directly maps observations to their latent representations, and combine it with the mPGBN (decoder). The result is a novel multimodal Weibull variational autoencoder (MWVAE), which is fast in out-of-sample prediction and can handle large-scale multimodal datasets. Qualitative evaluations on bimodal data consisting of image-text pairs show that MWVAE successfully extracts expressive multimodal latent representations for downstream tasks such as missing modality imputation and multimodal retrieval. Extensive quantitative results further demonstrate that both MWVAE and its supervised extension, sMWVAE, achieve state-of-the-art performance on various multimodal benchmarks.
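
One reason a Weibull distribution is convenient in a variational encoder is that its samples can be written as a deterministic transform of uniform noise, by inverting the Weibull CDF. The abstract does not spell out the sampling scheme, so the following is only a minimal sketch under that assumption. The function names and the values of `k` and `lam` are hypothetical and chosen for illustration; in an encoder of this kind, the shape and scale would presumably be produced by a network from the observations.

```python
import math
import random


def sample_weibull(k, lam, rng):
    """Draw one Weibull(shape=k, scale=lam) sample by inverting the CDF.

    With eps ~ Uniform[0, 1), the value lam * (-ln(1 - eps)) ** (1 / k)
    follows a Weibull(k, lam) distribution. The sample is a deterministic
    function of (k, lam) and the noise eps.
    """
    eps = rng.random()
    return lam * (-math.log(1.0 - eps)) ** (1.0 / k)


def weibull_mean(k, lam):
    """Closed-form mean of Weibull(k, lam): lam * Gamma(1 + 1/k)."""
    return lam * math.gamma(1.0 + 1.0 / k)


# Hypothetical shape and scale, used only to exercise the sampler.
rng = random.Random(0)
k, lam = 2.0, 1.5
draws = [sample_weibull(k, lam, rng) for _ in range(20000)]
empirical_mean = sum(draws) / len(draws)
```

Comparing `empirical_mean` with `weibull_mean(k, lam)` is a quick sanity check that the transform produces the intended distribution. Samples are nonnegative by construction, which matches the nonnegative latent variables used in gamma-based topic models.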

Multimodal learning · Variational autoencoder · Topic model · Representation learning · Image-text modeling
