AIA-Net: Adaptive Interactive Attention Network for Text–Audio Emotion Recognition
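Proposes AIA-Net, which treats text as the primary modality and audio as the auxiliary modality, uses adaptive interactive attention weights to focus on the effective acoustic features, and outperforms existing methods on three benchmark datasets.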
Emotion recognition based on text-audio modalities is a core technology for transforming a graphical user interface into a voice user interface, and it plays a vital role in natural human-computer interaction systems. Mainstream multimodal learning research has designed various fusion strategies to learn intermodality interactions, but it rarely considers that not all modalities play equal roles in emotion recognition. The main challenge in multimodal emotion recognition is therefore how to design effective fusion algorithms around a primary-auxiliary structure. To address this problem, this article proposes an adaptive interactive attention network (AIA-Net). In AIA-Net, text is treated as the primary modality and audio as the auxiliary modality. AIA-Net adapts to textual and acoustic features of different dimensions and learns their dynamic interactive relations flexibly. These relations are encoded as interactive attention weights that focus on the acoustic features that are effective for textual emotional representations, so that acoustic emotional information adaptively assists the textual emotional representation. Moreover, multiple collaborative learning (co-learning) layers in AIA-Net perform repeated multimodal interactions and evolve the emotional representations bottom-up through the network. Experimental results on three benchmark datasets show that the proposed method outperforms state-of-the-art methods.
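The idea of encoding text-audio relations as attention weights can be illustrated with a minimal sketch. This is not the AIA-Net architecture itself: it assumes a generic scaled dot-product cross-attention in which text tokens query audio frames, and the function name `interactive_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the feature sizes, and the residual fusion are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def interactive_attention(text, audio, w_q, w_k, w_v):
    """Text (primary) attends to audio (auxiliary).

    Assumed generic cross-attention, not the paper's exact formulation.
    text:  (T_text, d_text), audio: (T_audio, d_audio)
    """
    q = text @ w_q    # project text to the shared size d
    k = audio @ w_k   # project audio to the shared size d
    v = audio @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_text, T_audio)
    weights = softmax(scores, axis=-1)        # one distribution per text token
    fused = q + weights @ v                   # audio assists the text representation
    return fused, weights

# Hypothetical feature sizes: 12 text tokens of width 768, 50 audio frames of width 40.
rng = np.random.default_rng(0)
text = rng.normal(size=(12, 768))
audio = rng.normal(size=(50, 40))
d = 64
w_q = rng.normal(size=(768, d)) / np.sqrt(768)
w_k = rng.normal(size=(40, d)) / np.sqrt(40)
w_v = rng.normal(size=(40, d)) / np.sqrt(40)

fused, weights = interactive_attention(text, audio, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Because each modality has its own projection, the two inputs may have different feature widths and sequence lengths. Keeping the output aligned with the text tokens reflects the primary-auxiliary split, and stacking several such steps would loosely mirror the multiple co-learning layers described above.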