FocusTR: Focusing on Valuable Feature by Multiple Transformers for Fusing Feature Pyramid on Object Detection

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022, DOI: 10.1109/IROS47612.2022.9981047

Abstract: The feature pyramid, a vital component of convolutional neural networks, plays a significant role in several perception tasks, including object detection for autonomous driving. However, how to better fuse multi-level and multi-sensor feature pyramids remains a significant challenge, especially for object detection. This paper presents FocusTR (Focusing on the valuable features by multiple Transformers), a simple yet effective architecture for fusing feature pyramids in single-stream 2D detectors and two-stream 3D detectors. Specifically, FocusTR encompasses several novel self-attention mechanisms: spatial-wise boxAlign attention (SB) for low-level spatial locations, context-wise affinity attention (CA) for high-level context information, and level-wise attention for multi-level features. To alleviate self-attention’s computational complexity and slow training convergence, FocusTR introduces low- and high-level fusion (LHF) to reduce the computational parameters and Pre-LN to accelerate training convergence.
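
The abstract names Pre-LN as the means of speeding up training convergence but does not describe the block itself. As a rough illustration only, the sketch below contrasts the general Pre-LN ordering (normalize, attend, then add the residual) with the Post-LN ordering of the original Transformer. It is not the FocusTR implementation: the function names, the toy token values, and the use of identity query/key/value projections are all assumptions made to keep the example free of learned weights.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize each token vector (row) to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head scaled dot-product attention.

    Assumption: Q, K and V are the input itself (identity projections),
    so the sketch needs no learned parameters.
    """
    d = len(x[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
              for q in x]
    weights = [softmax(r) for r in scores]
    return [[sum(w * v[j] for w, v in zip(wr, x)) for j in range(d)]
            for wr in weights]

def add(x, y):
    """Element-wise sum of two token matrices (the residual connection)."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def pre_ln_block(x):
    # Pre-LN: normalize first, attend, then add the untouched residual.
    return add(x, self_attention(layer_norm(x)))

def post_ln_block(x):
    # Post-LN: attend, add the residual, then normalize the sum.
    return layer_norm(add(x, self_attention(x)))

# Hypothetical toy input: 3 tokens with 4 features each.
tokens = [[1.0, 2.0, 3.0, 4.0],
          [4.0, 3.0, 2.0, 1.0],
          [0.5, 0.5, 2.0, 2.0]]
pre = pre_ln_block(tokens)
post = post_ln_block(tokens)
```

In the Pre-LN variant the residual path carries the input through unnormalized, which is the property usually credited with more stable gradients early in training; the Post-LN variant normalizes the summed output instead.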