Visual Computing

SAM-Guided Masked Token Prediction for 3D Scene Understanding

Abstract: Foundation models have significantly enhanced 2D task performance, and recent works like Bridge3D have successfully applied these models to improve 3D scene understanding through knowledge distillation, marking considerable advancements. Nonetheless, challenges such as the misalignment between 2D and 3D representations and the persistent long-tail distribution in 3D datasets still restrict the effectiveness of knowledge distillation from 2D to 3D using foundation models.

Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models

Abstract: Foundation models have made significant strides in 2D and language tasks such as image segmentation, object detection, and visual-language understanding. Nevertheless, their potential to enhance 3D scene representation learning remains largely untapped due to the domain gap.

Rethinking 3D Geometric Feature Learning for Neural Reconstruction

Abstract: Recent advances in neural reconstruction using posed image sequences have made remarkable progress. However, due to the lack of depth information, existing volumetric-based techniques simply duplicate 2D image features of the object surface along the entire camera ray.

Disentangling Object Motion and Occlusion for Unsupervised Multi-frame Monocular Depth

Abstract: Conventional self-supervised monocular depth prediction methods are based on a static environment assumption, which leads to accuracy degradation in dynamic scenes due to the mismatch and occlusion problems introduced by object motions.

Advancing Self-Supervised Monocular Depth Learning with Sparse LiDAR

Abstract: Self-supervised monocular depth prediction provides a cost-effective solution to obtain the 3D location of each pixel. However, the existing approaches usually lead to unsatisfactory accuracy, which is critical for autonomous robots.

NARUTO: Neural Active Reconstruction from Uncertain Target Observations

Abstract: We present NARUTO a neural active reconstruction system that combines a hybrid neural representation with uncertainty learning enabling high-fidelity surface reconstruction. Our approach leverages a multi-resolution hash-grid as the mapping backbone chosen for its exceptional convergence speed and capacity to capture high-frequency local features.

FourStr: When Multi-sensor Fusion Meets Semi-supervised Learning

Abstract: This research proposes a novel semi-supervised learning framework FourStr (Four-Stream formed by two two-stream models) that focuses on the improvement of fusion and labeling efficiency for 3D multi-sensor detector. FourStr adopts a multi-sensor single-stage detector named adaptive fusion network (AFNet) as the backbone and trains it through the semi-supervision learning (SSL) strategy Stereo Fusion.

Class-Level Confidence Based 3D Semi-Supervised Learning

Abstract: Recent state-of-the-art method FlexMatch firstly demonstrated that correctly estimating learning status is crucial for semi-supervised learning (SSL). However, the estimation method proposed by FlexMatch does not take into account imbalanced data, which is the common case for 3D semi-supervised learning.

Automated Wall-Climbing Robot for Concrete Construction Inspection

Abstract: Human-made concrete structures require cutting-edge inspection tools to ensure the quality of the construction to meet the applicable building codes and to maintain the sustainability of the aging infrastructure. This paper introduces a wall-climbing robot for metric concrete inspection that can reach difficult-to-access locations with a close-up view for visual data collection and real-time flaws detection and localization.

FocusTR: Focusing on Valuable Feature by Multiple Transformers for Fusing Feature Pyramid on Object Detection

Abstract: The feature pyramid, which is a vital component of the convolutional neural networks, plays a significant role in several perception tasks, including object detection for autonomous driving. However, how to better fuse multi-level and multi-sensor feature pyramids is still a significant challenge, especially for object detection.