

Oral Session

Orals 3A: 3D from single view

Summit Ballroom
Thu 20 Jun 9 a.m. PDT — 10:30 a.m. PDT

Thu 20 June 9:00 - 9:18 PDT

Oral #1
Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

Bingxin Ke · Anton Obukhov · Shengyu Huang · Nando Metzger · Rodrigo Caye Daudt · Konrad Schindler

Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Source code will be made publicly available.
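Affine-invariant depth means the prediction is defined only up to an unknown scale and shift, which are recovered at evaluation time by least squares against ground truth. A minimal NumPy sketch of that standard alignment step (the helper name is ours, not Marigold's code):

```python
import numpy as np

def align_affine_invariant(pred, gt):
    """Recover scale s and shift t minimizing ||s * pred + t - gt||^2,
    then return the aligned prediction.

    Standard least-squares protocol for evaluating affine-invariant
    depth against metric ground truth (illustrative helper only).
    """
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s * pred + t
```

After this alignment, per-pixel error metrics (e.g. absolute relative error) can be computed as for metric-depth methods.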

Thu 20 June 9:18 - 9:36 PDT

Oral #2
EscherNet: A Generative Model for Scalable View Synthesis

Xin Kong · Shikun Liu · Xiaoyang Lyu · Marwan Taher · Xiaojuan Qi · Andrew J. Davison

We introduce EscherNet, a multi-view conditioned diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with a specialised camera positional encoding, allowing precise and continuous relative control of the camera transformation between an arbitrary number of reference and target views. EscherNet offers exceptional generality, flexibility, and scalability in view synthesis: it can generate more than 100 consistent target views simultaneously on a single consumer-grade GPU, despite being trained with a fixed number of 3 reference views to 3 target views. As a result, EscherNet not only addresses zero-shot novel view synthesis, but also naturally unifies single- and multi-image 3D reconstruction, combining these diverse tasks into a single, cohesive framework. Our extensive experiments demonstrate that EscherNet achieves state-of-the-art performance in multiple benchmarks, even when compared to methods specifically tailored for each individual problem. This remarkable versatility opens up new directions for designing scalable neural architectures for 3D vision. Project page: https://kxhit.github.io/EscherNet.

Thu 20 June 9:36 - 9:54 PDT

Oral #3
WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects Under Occlusion

Khiem Vuong · N. Dinesh Reddy · Robert Tamburo · Srinivasa G. Narasimhan

Current methods for 2D and 3D object understanding struggle with severe occlusions in busy urban environments, partly due to the lack of large-scale labeled ground-truth annotations for learning occlusion. In this work, we introduce a novel framework for automatically generating a large, realistic dataset of dynamic objects under occlusions using freely available time-lapse imagery. By leveraging off-the-shelf 2D (bounding box, segmentation, keypoint) and 3D (pose, shape) predictions as pseudo-groundtruth, unoccluded 3D objects are identified automatically and composited into the background in a clip-art style, ensuring realistic appearances and physically accurate occlusion configurations. The resulting clip-art image with pseudo-groundtruth enables efficient training of object reconstruction methods that are robust to occlusions. Our method demonstrates significant improvements in both 2D and 3D reconstruction, particularly in scenarios with heavily occluded objects like vehicles and people in urban scenes.
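Clip-art-style compositing with physically accurate occlusion amounts to pasting object crops back-to-front by depth, so nearer objects correctly cover farther ones. An illustrative NumPy sketch under that assumption (not the WALT3D pipeline itself):

```python
import numpy as np

def composite_by_depth(background, objects):
    """Alpha-composite object crops onto a background, farthest first.

    `objects` is a list of (rgb, alpha, depth) tuples, with rgb and
    background as HxWx3 float arrays and alpha an HxW mask in [0, 1].
    Illustrative sketch of clip-art-style occlusion-aware compositing.
    """
    out = background.copy()
    for rgb, alpha, _ in sorted(objects, key=lambda o: -o[2]):  # far to near
        out = alpha[..., None] * rgb + (1.0 - alpha[..., None]) * out
    return out
```

Sorting by depth before blending is what yields the "physically accurate occlusion configurations" the abstract refers to: a near object overwrites a far one wherever their masks overlap.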

Thu 20 June 9:54 - 10:12 PDT

Oral #4
Diffusion-FOF: Single-View Clothed Human Reconstruction via Diffusion-Based Fourier Occupancy Field

Yuanzhen Li · Fei Luo · Chunxia Xiao

Fourier occupancy field (FOF)-based human reconstruction is a simple method that transforms the occupancy function of the 3D model into a multichannel 2D vector field. However, accurately estimating high-frequency information of the FOF is challenging, leading to geometric distortion and discontinuity. To this end, we propose a wavelet-based diffusion model to predict the FOF, extracting more high-frequency information and enhancing geometric stability. Our method comprises two interconnected tasks: texture estimation and geometry prediction. Initially, we predict the back-side texture from the input image, incorporating a style consistency constraint between the predicted back-side image and the original input image. To enhance network training effectiveness, we adopt a Siamese network training strategy. For geometric estimation, we introduce a wavelet-based diffusion model to generate the Fourier occupancy field. First, we utilize an image encoder module to extract the features of the two images as conditions. Subsequently, we employ a conditional diffusion model to estimate the Fourier occupancy field in the wavelet domain. The predicted wavelet coefficients are then converted into the Fourier occupancy field using the inverse wavelet transform (IWT). A refinement network refines the predicted Fourier occupancy field with image features as guidance, yielding the final output. Through both quantitative and qualitative experiments, we demonstrate the state-of-the-art performance of our method in reconstructing single-view clothed human subjects.
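The key mechanic here is that coefficients predicted in the wavelet domain can be mapped back losslessly with the inverse wavelet transform. A minimal one-level 1D Haar roundtrip illustrates this (the abstract does not specify the wavelet basis; Haar is an assumption for the sketch):

```python
import numpy as np

def haar_dwt(x):
    """One-level 1D Haar transform: (approximation, detail) coefficients."""
    x = np.asarray(x, dtype=float)
    return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

def haar_idwt(approx, detail):
    """Inverse Haar transform (IWT) reconstructing the original signal."""
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out
```

The detail band carries exactly the high-frequency content the paper argues is hard to estimate directly, which motivates predicting it with a diffusion model rather than regressing the field itself.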

Thu 20 June 10:12 - 10:30 PDT

Oral #5
Rethinking Inductive Biases for Surface Normal Estimation

Gwangbin Bae · Andrew J. Davison

Despite the growing demand for accurate surface normal estimation models, existing methods use general-purpose dense prediction models, adopting the same inductive biases as other tasks. In this paper, we discuss the inductive biases needed for surface normal estimation and propose to (1) utilize the per-pixel ray direction and (2) encode the relationship between neighboring surface normals by learning their relative rotation. The proposed method can generate crisp, yet piecewise smooth, predictions for challenging in-the-wild images of arbitrary resolution and aspect ratio. Compared to a recent ViT-based state-of-the-art model, our method shows a stronger generalization ability, despite being trained on a dataset that is orders of magnitude smaller.
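The per-pixel ray direction in (1) follows from the pinhole model: back-project each pixel through the inverse intrinsics and normalize. A NumPy sketch under that standard assumption (helper name is ours, not from the paper's code):

```python
import numpy as np

def pixel_ray_directions(K, H, W):
    """Unit viewing-ray direction per pixel: normalize(K^{-1} [u, v, 1]).

    K is a 3x3 pinhole intrinsics matrix; pixel centers are sampled at
    half-integer coordinates. Illustrative helper for the per-pixel
    ray-direction input the paper proposes to exploit.
    """
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # H x W x 3 homogeneous
    rays = pix @ np.linalg.inv(K).T                    # back-project each pixel
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)
```

Feeding these directions to the network tells it where each pixel sits in the camera's field of view, which matters because the mapping from image appearance to surface normal depends on the viewing ray.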