Abstract: The feature pyramid, a vital component of convolutional neural networks, plays a significant role in several perception tasks, including object detection for autonomous driving. However, how best to fuse multi-level and multi-sensor feature pyramids remains a significant challenge, especially for object detection. This paper presents FocusTR (Focusing on the valuable features by multiple Transformers), a simple yet effective architecture for fusing feature pyramids in single-stream 2D detectors and two-stream 3D detectors. Specifically, FocusTR encompasses several novel self-attention mechanisms: spatial-wise boxAlign attention (SB) for low-level spatial locations, context-wise affinity attention (CA) for high-level context information, and level-wise attention for multi-level features. To alleviate the computational complexity and slow training convergence of self-attention, FocusTR introduces low- and high-level fusion (LHF) to reduce the number of parameters, and Pre-LN (pre-layer normalization) to accelerate training convergence.
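
For readers unfamiliar with the Pre-LN ordering mentioned above, the following is a minimal sketch of a generic Pre-LN Transformer block in plain Python. It is not FocusTR's implementation: the abstract does not specify the block design, so the identity query/key/value projections, the ReLU-only feed-forward step, and all function names (`layer_norm`, `self_attention`, `feed_forward`, `pre_ln_block`) are assumptions made purely for illustration. The only point it shows is that layer normalization is applied before each sublayer, with the residual added afterwards.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: identity Q/K/V projections, so no learned weights are involved.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def feed_forward(x):
    """Placeholder position-wise feed-forward step: ReLU only (assumption)."""
    return [max(0.0, v) for v in x]

def pre_ln_block(tokens):
    """Pre-LN ordering: normalize, apply the sublayer, then add the residual."""
    normed = [layer_norm(t) for t in tokens]
    attended = self_attention(normed)
    tokens = [[a + b for a, b in zip(t, s)] for t, s in zip(tokens, attended)]
    normed = [layer_norm(t) for t in tokens]
    return [[a + b for a, b in zip(t, feed_forward(n))] for t, n in zip(tokens, normed)]

# Hypothetical input: two tokens with four features each.
features = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
fused = pre_ln_block(features)
```

In a Post-LN block the normalization would instead be applied after each residual addition. In the Pre-LN ordering the residual path bypasses the normalization, which is the property usually credited with more stable and faster training.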