FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models

Abstract

Recent work on generalizable NeRFs has shown promising results on novel view synthesis from single or few images. However, such models have rarely been applied to downstream tasks beyond synthesis, such as semantic understanding and parsing. In this paper, we propose a novel framework named FeatureNeRF to learn generalizable NeRFs by distilling pre-trained vision foundation models (e.g., DINO, Latent Diffusion). FeatureNeRF lifts 2D pre-trained foundation models to 3D space via neural rendering, and then extracts deep features for 3D query points from NeRF MLPs. Consequently, it maps 2D images to continuous 3D semantic feature volumes, which can be used for various downstream tasks. We evaluate FeatureNeRF on 2D/3D semantic keypoint transfer and 2D/3D object part segmentation. Our extensive experiments demonstrate the effectiveness of FeatureNeRF as a generalizable 3D semantic feature extractor.

Network Architecture

Given a single image as input, FeatureNeRF uses an encoder to extract an image feature, and then concatenates it with the query point and the view direction as input to the NeRF MLPs. Apart from density and color, we add two MLP branches that predict a feature vector and a coordinate, which are supervised by two novel loss terms. In addition, we propose extracting internal NeRF features as a 3D-consistent feature representation.
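
The data flow described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the class name `TinyFeatureNeRF`, the layer widths, the activations, the random weights, and the absence of positional encoding are all assumptions made for the example. It only shows the concatenated input, the four output branches (density, color, feature vector, coordinate), and an internal activation exposed as a feature.

```python
import numpy as np

rng = np.random.default_rng(0)


def linear(in_dim, out_dim):
    """Random weights and zero bias for one dense layer (illustrative only)."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


class TinyFeatureNeRF:
    """Hypothetical toy model; all sizes below are assumptions, not from the paper."""

    def __init__(self, img_feat_dim=16, hidden=32, distill_dim=8):
        in_dim = img_feat_dim + 3 + 3  # image feature + query point + view direction
        self.trunk = [linear(in_dim, hidden), linear(hidden, hidden)]
        self.density_head = linear(hidden, 1)
        self.color_head = linear(hidden, 3)
        self.feature_head = linear(hidden, distill_dim)  # added feature branch
        self.coord_head = linear(hidden, 3)              # added coordinate branch

    def __call__(self, img_feat, points, view_dirs):
        # Concatenate the per-point image feature with the point and view direction.
        h = np.concatenate([img_feat, points, view_dirs], axis=-1)
        for w, b in self.trunk:
            h = np.maximum(h @ w + b, 0.0)  # ReLU

        def head(params):
            w, b = params
            return h @ w + b

        return {
            "density": np.maximum(head(self.density_head), 0.0),
            "color": 1.0 / (1.0 + np.exp(-head(self.color_head))),
            "feature": head(self.feature_head),
            "coord": head(self.coord_head),
            "internal": h,  # intermediate activation used as a per-point feature
        }


model = TinyFeatureNeRF()
n_points = 5
out = model(
    rng.normal(size=(n_points, 16)),  # stand-in for encoder output per query point
    rng.normal(size=(n_points, 3)),   # query points
    rng.normal(size=(n_points, 3)),   # view directions
)
shapes = {name: value.shape for name, value in out.items()}
```

In a real pipeline the per-point outputs would be composited along camera rays by volume rendering before being compared with 2D targets; that step is omitted here.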

Results

Keypoint Transfer

FeatureNeRF can transfer semantic keypoints to both novel objects and novel views.
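
One common way to transfer keypoints with dense features is nearest-neighbour matching in feature space. The sketch below assumes that approach; this page does not state the exact matching rule used by FeatureNeRF, and the function name `transfer_keypoints` and the toy features are hypothetical.

```python
import numpy as np


def transfer_keypoints(src_feats, src_kp_idx, tgt_feats):
    """For each source keypoint, return the index of the most similar target point.

    src_feats: (N, D) features of source pixels or 3D points
    src_kp_idx: indices into src_feats marking the labeled keypoints
    tgt_feats: (M, D) features of target pixels or 3D points
    Similarity is cosine similarity (an assumption for this sketch).
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    kp = normalize(np.asarray(src_feats, dtype=float)[src_kp_idx])
    tgt = normalize(np.asarray(tgt_feats, dtype=float))
    similarity = kp @ tgt.T            # (K, M) cosine similarities
    return similarity.argmax(axis=1)   # best target index per keypoint


# Toy data: the target holds the same features as the source, in shuffled order.
src_feats = np.eye(4)
perm = np.array([2, 0, 3, 1])
tgt_feats = src_feats[perm]            # target row j equals source row perm[j]
matches = transfer_keypoints(src_feats, np.array([0, 3]), tgt_feats)
```

Here source point 0 reappears at target index 1 and source point 3 at target index 2, so `matches` recovers those positions.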

Part Co-segmentation

FeatureNeRF can also transfer part segmentation labels to novel objects or views in both the 2D and 3D domains.
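
Label transfer can be illustrated the same way in the dense setting: every target point takes the part label of its most similar labeled source point. This is a sketch under that assumption only; the function name `transfer_part_labels` and the two-dimensional toy features are hypothetical, and the page does not specify the actual procedure.

```python
import numpy as np


def transfer_part_labels(src_feats, src_labels, tgt_feats):
    """Assign each target point the label of its nearest source point (cosine similarity)."""
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    src = normalize(np.asarray(src_feats, dtype=float))
    tgt = normalize(np.asarray(tgt_feats, dtype=float))
    nearest = (tgt @ src.T).argmax(axis=1)  # nearest source index per target point
    return np.asarray(src_labels)[nearest]


# Toy data: two source points belong to part 0, one to part 1.
src_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
src_labels = np.array([0, 0, 1])
tgt_feats = np.array([[0.1, 0.9], [1.0, 0.05]])
tgt_labels = transfer_part_labels(src_feats, src_labels, tgt_feats)
```

The same function applies to 2D pixels or 3D points, since it only depends on the per-point feature vectors.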

BibTeX

@inproceedings{ye2023featurenerf,
  title={FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models},
  author={Ye, Jianglong and Wang, Naiyan and Wang, Xiaolong},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={8962--8973},
  year={2023}
}