Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models

3DV 2024

UC San Diego, ByteDance

Abstract

Zero-shot novel view synthesis (NVS) from a single image is an essential problem in 3D object understanding. While recent approaches that leverage pre-trained generative models can synthesize high-quality novel views from in-the-wild inputs, they still struggle to maintain 3D consistency across different views. In this paper, we present Consistent-1-to-3, a generative framework that significantly mitigates this issue. Specifically, we decompose the NVS task into two stages: (i) transforming observed regions to a novel view, and (ii) hallucinating unseen regions. We design a scene representation transformer and a view-conditioned diffusion model to perform these two stages, respectively. Inside the models, we propose epipolar-guided attention to incorporate geometry constraints and multi-view attention to better aggregate multi-view information, both of which enforce 3D consistency. Finally, we design a hierarchical generation paradigm to produce long sequences of consistent views, allowing a full 360° observation of the provided object. Qualitative and quantitative evaluations on multiple datasets demonstrate the effectiveness of the proposed mechanisms against state-of-the-art approaches.
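
For concreteness, the sketch below illustrates the idea behind epipolar-guided attention in PyTorch: cross-attention scores are masked so that each target-view query attends only to source-view pixels lying near its epipolar line. This is our own minimal illustration, not the paper's implementation; the fundamental-matrix convention, the pixel tolerance tau, and all tensor shapes are assumptions made for this example.

import torch
import torch.nn.functional as F_nn

def skew(t):
    # Skew-symmetric matrix [t]_x, so that skew(t) @ v == torch.cross(t, v).
    tx, ty, tz = t.tolist()
    return torch.tensor([[0.0, -tz,  ty],
                         [ tz, 0.0, -tx],
                         [-ty,  tx, 0.0]])

def fundamental_matrix(K_src, K_tgt, R, t):
    # Maps a homogeneous target-view pixel q to its epipolar line in the
    # source view, l = F @ q, with (R, t) the relative pose target -> source.
    return K_src.inverse().T @ skew(t) @ R @ K_tgt.inverse()

def epipolar_attention(q_feat, k_feat, v_feat, q_pix, k_pix, F_mat, tau=2.0):
    # q_feat: (Nq, d) target-view queries; k_feat, v_feat: (Nk, d) source view.
    # q_pix: (Nq, 3) and k_pix: (Nk, 3) homogeneous pixels (last coordinate 1).
    lines = q_pix @ F_mat.T                                  # (Nq, 3) epipolar lines
    num = (lines @ k_pix.T).abs()                            # (Nq, Nk) |l . p|
    den = lines[:, :2].norm(dim=-1, keepdim=True).clamp_min(1e-8)
    dist = num / den                                         # point-to-line distance in pixels
    bias = torch.zeros_like(dist).masked_fill(dist > tau, float('-inf'))
    # Note: a query whose epipolar line misses every source pixel yields an
    # all -inf row (NaN after softmax); a real implementation needs a fallback.
    scores = q_feat @ k_feat.T / q_feat.shape[-1] ** 0.5 + bias
    return F_nn.softmax(scores, dim=-1) @ v_feat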

Video

Pipeline

Given a single image or a sparse set of input images, the encoder of the Scene Representation Transformer (SRT) translates the image(s) into a latent scene representation that captures implicit 3D information. Image rendering then unfolds in two stages. The first stage produces a rough yet geometry-grounded output by cross-attending the queried pixels to the latent scene representation. These intermediate outputs are then fed into the view-conditioned diffusion model, which produces visually appealing images that are consistent both with the input images and with the images generated from other viewpoints.
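
To make the data flow concrete, the following pseudocode sketches one plausible wiring of the two stages under the description above. The module names and their interfaces are hypothetical placeholders, not the authors' released code.

import torch.nn as nn

class Consistent1to3(nn.Module):
    # Hypothetical composition of the pipeline; the encoder, renderer, and
    # diffusion submodules are assumed interfaces, not the official API.
    def __init__(self, encoder, renderer, diffusion):
        super().__init__()
        self.encoder = encoder      # SRT encoder: images -> latent scene tokens
        self.renderer = renderer    # cross-attention renderer: pose queries x tokens -> coarse view
        self.diffusion = diffusion  # view-conditioned diffusion refiner

    def forward(self, images, input_poses, target_pose):
        # Encode the observed view(s) into a latent scene representation.
        scene_tokens = self.encoder(images, input_poses)      # (B, N, d)
        # Stage 1: render a rough, geometry-grounded image of the target view
        # by cross-attending target-view pixel queries to the scene tokens.
        coarse = self.renderer(target_pose, scene_tokens)     # (B, 3, H, W)
        # Stage 2: refine the coarse render with the diffusion model, which
        # hallucinates unseen regions while staying conditioned on the pose.
        refined = self.diffusion.sample(cond_image=coarse,
                                        cond_pose=target_pose,
                                        context=scene_tokens)
        return coarse, refined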

Results

Comparison

With the proposed model architecture, our method substantially improves 3D consistency over state-of-the-art baselines.

Ablation Study

Our extensive ablation study validates the contribution of each proposed mechanism, including the epipolar-guided attention and the multi-view attention.

BibTeX

@article{ye2023consistent,
  title={Consistent-1-to-3: Consistent image to 3d view synthesis via geometry-aware diffusion models},
  author={Ye, Jianglong and Wang, Peng and Li, Kejie and Shi, Yichun and Wang, Heng},
  journal={arXiv preprint arXiv:2310.03020},
  year={2023}
}