Hybrid Embedding SAM-Guided Feedback Network for RGB–Thermal Urban Scene Parsing
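Proposes a hybrid embedding feedback network built on the SAM framework that improves the accuracy and robustness of RGB-thermal urban street scene segmentation through modal structural alignment and cross-architecture knowledge transfer, raising mean accuracy by about 5% on multiple datasets.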
In multimodal semantic segmentation of urban street scenes, existing methods do not model intermodal structural alignment or semantic cooperation between architectures, which leaves the fused feature representations insufficient. To address this issue, this article proposes a novel structural optimization network: a hybrid embedding segment anything model (SAM) guided feedback network (GFNet). The network is based on the SAM framework and achieves multimodal structural alignment by adapting the semantic prior (SP) extractor through module-level fine-tuning of the image encoder. Furthermore, this article proposes a cross-architecture knowledge transfer (CAKT) mechanism that injects the structural awareness capability of SAM into the backbone features of each layer, achieving dual optimization of alignment and enhancement. To address intermodal heterogeneity and semantic conflict, this article combines complementary fusion at different frequencies with a cross-modal similarity enhancement strategy to achieve fine-grained semantic fusion and consistency modeling, supplemented by a dual-supervised constraint mechanism that improves modal independence and robustness. On several challenging datasets, mean accuracy (mAcc) is improved by about 5%, and GFNet demonstrates superior segmentation performance and robustness compared with existing methods. Our code will be released to the public at https://github.com/WBangG/GFNet
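The abstract reports its gain in mAcc without defining the metric. As a minimal sketch, assuming the usual convention in semantic segmentation (mAcc is the per-class pixel accuracy, i.e. recall, averaged over the classes present in the ground truth), it could be computed as follows. The function names and the toy labels are hypothetical and are not taken from the GFNet code.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels: rows index the ground-truth class, columns the prediction."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_class_accuracy(cm: List[List[int]]) -> float:
    """Average per-class recall over classes that occur in the ground truth.

    Assumption: classes with no ground-truth pixels are skipped rather than
    counted as zero; evaluation toolkits differ on this detail.
    """
    per_class = []
    for c, row in enumerate(cm):
        total = sum(row)
        if total > 0:
            per_class.append(row[c] / total)
    return sum(per_class) / len(per_class) if per_class else 0.0


# Toy example: flattened label maps for a hypothetical 3-class problem.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
macc = mean_class_accuracy(cm)
print(f"mAcc = {macc:.2f}")
```

Because every class contributes equally regardless of its pixel count, mAcc is sensitive to small or rare classes such as pedestrians or guardrails, which is one reason it is often reported alongside mIoU for urban scene parsing.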