RGB-T Tracking with Template-Bridged Search Interaction and Target-Preserved Template Updating

IEEE Trans Pattern Anal Mach Intell. 2024 Oct 7:PP. doi: 10.1109/TPAMI.2024.3475472. Online ahead of print.

Abstract

RGB-Thermal (RGB-T) tracking aims to exploit the complementary strengths of the RGB and thermal infrared (TIR) modalities to make tracking more robust across diverse scenarios, and cross-modal interaction is a key component. Earlier methods often simply combine the features of the RGB and TIR search frames, which yields a coarse interaction and introduces redundant background noise. Many other approaches sample candidate boxes from search frames and apply various fusion techniques to individual pairs of RGB and TIR boxes, which confines cross-modal interaction to local regions and results in insufficient context modeling. In addition, mining temporal context from video remains under-explored in RGB-T tracking. To alleviate these limitations, we propose a novel Template-Bridged Search region Interaction (TBSI) module that uses templates as the medium to bridge cross-modal interaction between RGB and TIR search regions by gathering and distributing target-relevant object and environment contexts. An Illumination Guided Fusion (IGF) module is designed to adaptively fuse RGB and TIR search region tokens with a global illumination factor. Furthermore, for the inference stage, we propose an efficient Target-Preserved Template Updating (TPTU) strategy that leverages the temporal context within video sequences to accommodate changes in the target's appearance. The proposed modules are integrated into a ViT backbone for joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance. Code is available at https://github.com/RyanHTR/TBSI.
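
The gather-and-distribute idea and the illumination-weighted fusion described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes plain single-head dot-product attention with no learned projections, a simple two-step gather/distribute order, and a convex combination controlled by a scalar in [0, 1]. The function names `cross_attend`, `template_bridged_interaction`, and `illumination_guided_fusion` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Each query token reads a weighted average of keys_values (residual added).

    Assumption: no learned projections; keys and values are the same tokens.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        context = [sum(w * kv[j] for w, kv in zip(weights, keys_values))
                   for j in range(dim)]
        out.append([qi + ci for qi, ci in zip(q, context)])
    return out


def template_bridged_interaction(template, search_src, search_dst):
    """Pass context from one modality's search tokens to the other's via the template.

    Assumed order: (1) gather: template tokens attend to the source search tokens;
    (2) distribute: destination search tokens attend to the enriched template.
    The two search regions never attend to each other directly.
    """
    bridged = cross_attend(template, search_src)   # gather
    return cross_attend(search_dst, bridged)       # distribute


def illumination_guided_fusion(rgb_tokens, tir_tokens, alpha):
    """Fuse tokens as alpha * RGB + (1 - alpha) * TIR.

    Assumption: alpha is a global scalar in [0, 1]; how it is predicted
    is not modeled here.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [[alpha * r + (1.0 - alpha) * t for r, t in zip(rt, tt)]
            for rt, tt in zip(rgb_tokens, tir_tokens)]


# Toy usage: 3 search tokens per modality, 1 template token, dimension 2.
rgb_search = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
tir_search = [[0.2, 0.1], [0.0, 0.3], [0.4, 0.4]]
template = [[1.0, 1.0]]

tir_enriched = template_bridged_interaction(template, rgb_search, tir_search)
fused = illumination_guided_fusion(rgb_search, tir_enriched, alpha=0.25)
```

Because the template holds far fewer tokens than a search region, routing the interaction through it keeps the exchange focused on target-relevant content rather than on every background token, which is the motivation the abstract gives for the bridge.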