So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual Question Answering

IEEE Transactions on Systems, Man, and Cybernetics: Systems · 2023

Cited by 3

ABS 3

Wenbo Zheng
Lan Yan
Fei-Yue Wang

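Summary

The paper proposes a multimodal graph reasoning model that builds semantic graphs for three modalities (scene texts, questions, and images) and uses multihead self-attention to capture the relationships between them, significantly improving accuracy on text-based visual question answering.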

Abstract

While texts related to images convey fundamental messages for scene understanding and reasoning, text-based visual question answering tasks concentrate on visual questions that require reading texts from images. However, most current methods add multimodal features that are independently extracted from a given image into a reasoning model without considering their inter- and intra-relationships according to three modalities (i.e., scene texts, questions, and images). To this end, we propose a novel text-based visual question answering model, multimodal graph reasoning. Our model first extracts intramodality relationships by taking the representations from identical modalities as semantic graphs. Then, we present graph multihead self-attention, which boosts each graph representation through graph-by-graph aggregation to capture the intermodality relationship. It is a case of “so many heads, so many wits” in the sense that as more semantic graphs are involved in this process, each graph representation becomes more effective. Finally, these representations are reprojected, and we perform answer prediction with their outputs. The experimental results demonstrate that our approach realizes substantially better performance compared with other state-of-the-art models.
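The abstract describes graph multihead self-attention only at a high level, so the following is a minimal NumPy sketch of the general idea rather than the authors' implementation. It assumes each modality graph is given as a matrix of node features with a shared dimension, pools the nodes of all three graphs, and runs standard scaled dot-product multihead self-attention over the pooled set so that every node can aggregate from every graph. The function name, feature sizes, node counts, and random weights are all hypothetical, and the intramodality graph construction and the answer prediction step are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multihead_self_attention(nodes, w_q, w_k, w_v, w_o, num_heads):
    """Standard multihead self-attention over a set of node features.

    nodes: (n, d) array; d must be divisible by num_heads.
    Returns the updated (n, d) features and the (num_heads, n, n) weights.
    """
    n, d = nodes.shape
    d_head = d // num_heads

    def split_heads(x):
        # (n, d) -> (num_heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(nodes @ w_q)
    k = split_heads(nodes @ w_k)
    v = split_heads(nodes @ w_v)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)

    # Aggregate values per head, then merge heads and reproject.
    mixed = attn @ v
    merged = mixed.transpose(1, 0, 2).reshape(n, d)
    return merged @ w_o, attn


rng = np.random.default_rng(0)
d, num_heads = 8, 2

# Hypothetical node features for the three modality graphs.
graphs = {
    "scene_text": rng.normal(size=(4, d)),
    "question": rng.normal(size=(5, d)),
    "image": rng.normal(size=(6, d)),
}

# Pool all nodes so attention can cross graph boundaries (an assumption).
all_nodes = np.concatenate(list(graphs.values()), axis=0)
w_q, w_k, w_v, w_o = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))

updated, attn = multihead_self_attention(all_nodes, w_q, w_k, w_v, w_o, num_heads)

# Split the updated features back into one matrix per graph.
split_points = np.cumsum([g.shape[0] for g in graphs.values()])[:-1]
updated_graphs = dict(zip(graphs, np.split(updated, split_points, axis=0)))
```

In this sketch, pooling the nodes is what lets a scene-text node attend to question and image nodes. A model that also needs to respect intramodality edges would typically add a mask or bias to `scores`, which is left out here.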

Computer Science · Visual Question Answering · Multimodal Reasoning · Graph Neural Networks · Natural Language Processing
