Learning Intrinsic Neural Radiance Fields for Editable Novel View Synthesis

Arxiv 2022

Weicai Ye*, Shuo Chen*, Chong Bao, Hujun Bao, Marc Pollefeys, Zhaopeng Cui, Guofeng Zhang†

1State Key Lab of CAD & CG, Zhejiang University   2ETH Zurich   3Microsoft  
* denotes equal contribution
† denotes corresponding author


We present intrinsic neural radiance fields, dubbed IntrinsicNeRF, which introduce intrinsic decomposition into the NeRF-based~\cite{mildenhall2020nerf} neural rendering method and can perform editable novel view synthesis in room-scale scenes, whereas existing methods that combine inverse rendering with neural rendering~\cite{zhang2021physg, zhang2022modeling} only work on object-specific scenes. Since intrinsic decomposition is a fundamentally ambiguous and under-constrained inverse problem, we propose a novel distance-aware point sampling and adaptive reflectance iterative clustering optimization method that enables IntrinsicNeRF to be trained with traditional intrinsic decomposition constraints in an unsupervised manner, resulting in temporally consistent intrinsic decomposition results. To prevent different adjacent instances with similar reflectance from being incorrectly clustered together, we further propose a hierarchical clustering method with coarse-to-fine optimization to obtain a fast hierarchical indexing representation. It enables compelling real-time augmented reality applications such as scene recoloring, material editing, and illumination variation. Extensive experiments on Blender Object and Replica Scene demonstrate that we can obtain high-quality, consistent intrinsic decomposition results and high-fidelity novel view synthesis even for challenging sequences.



Given a set of multi-view images with camera poses, IntrinsicNeRF factorizes the scene into temporally consistent components: reflectance, shading, and residual layers. The decomposition supports real-time augmented video applications such as scene recoloring, material editing, illumination variation, and editable novel view synthesis.



IntrinsicNeRF takes a sampled spatial coordinate and viewing direction as input, and outputs the density, reflectance, shading, and residual terms. The semantic branch is optional. Unsupervised Prior and Reflectance Clustering are exploited to train IntrinsicNeRF in an unsupervised manner. With the semantic branch, we can obtain the hierarchical clustering and indexing representation, which supports real-time editing.

Adaptive Reflectance Iterative Clustering


The colors of the reflectance pixels are first converted to a representation better suited to clustering and then clustered with the mean shift algorithm. A voxel grid filter accelerates the cluster operation G, which assigns each point the category of its nearest anchor point and saves the category of the center point as the target clustered category.
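The clustering step described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the flat mean shift kernel, the `voxel_size` and `bandwidth` parameters, the use of voxel means as anchors, and the rule for merging nearby modes are all assumptions, and the color conversion step is omitted.

```python
import numpy as np

def voxel_grid_anchors(points, voxel_size):
    """Voxel grid filter: keep one anchor (the mean) per occupied voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    counts = np.bincount(inverse)
    anchors = np.zeros((counts.size, points.shape[1]))
    np.add.at(anchors, inverse, points)  # sum the points falling in each voxel
    return anchors / counts[:, None]

def mean_shift(anchors, bandwidth, n_iter=30):
    """Flat-kernel mean shift on the anchors (assumed variant).

    Returns an integer category per anchor and the cluster centers.
    """
    modes = anchors.copy()
    for _ in range(n_iter):
        dist = np.linalg.norm(modes[:, None, :] - anchors[None, :, :], axis=-1)
        weights = (dist <= bandwidth).astype(float)
        modes = weights @ anchors / weights.sum(axis=1, keepdims=True)
    # Merge modes that converged to (almost) the same location.
    centers = []
    labels = np.empty(len(modes), dtype=np.int64)
    for i, mode in enumerate(modes):
        for k, center in enumerate(centers):
            if np.linalg.norm(mode - center) < bandwidth / 2:
                labels[i] = k
                break
        else:
            centers.append(mode)
            labels[i] = len(centers) - 1
    return labels, np.asarray(centers)

def cluster_operation(points, anchors, anchor_labels, centers):
    """Sketch of G: each point takes the category of its nearest anchor
    and is mapped to the center of that category."""
    dist = np.linalg.norm(points[:, None, :] - anchors[None, :, :], axis=-1)
    labels = anchor_labels[dist.argmin(axis=1)]
    return labels, centers[labels]
```

Running mean shift only on the anchors keeps the expensive pairwise distance computation small; the full set of reflectance pixels needs just one nearest-anchor lookup.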

Hierarchical Reflectance Clustering and Indexing


Given the reflectance value of each pixel and the corresponding semantic label, the hierarchical clustering operation first queries the semantics of each pixel and then outputs the result of the clustering operation within that semantic category. The clustering information of each pixel is stored in a tree structure, which yields a hierarchical indexing representation.
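One way to picture the resulting index is a two-level tree keyed first by semantic category and then by reflectance category. The sketch below is illustrative only: `build_hierarchical_index` is a hypothetical helper, and a coarse color quantization (the `quantize` parameter) stands in for the actual per-category clustering.

```python
import numpy as np

def build_hierarchical_index(reflectance, semantics, quantize=0.1):
    """Return per-pixel [semantic, reflectance category] labels and a tree
    {semantic_id: {category_id: mean reflectance}}.

    reflectance: (N, 3) floats, semantics: (N,) ints.
    """
    reflectance = np.asarray(reflectance, dtype=float)
    semantics = np.asarray(semantics, dtype=np.int64)
    tree = {}
    categories = np.zeros(len(reflectance), dtype=np.int64)
    for sem in np.unique(semantics):
        mask = semantics == sem
        colors = reflectance[mask]
        # Stand-in for clustering: pixels whose colors round to the same
        # grid cell share a reflectance category (assumption).
        keys = np.round(colors / quantize).astype(np.int64)
        _, inverse = np.unique(keys, axis=0, return_inverse=True)
        inverse = inverse.reshape(-1)
        categories[mask] = inverse
        tree[int(sem)] = {
            int(c): colors[inverse == c].mean(axis=0) for c in np.unique(inverse)
        }
    labels = np.stack([semantics, categories], axis=1)
    return labels, tree
```

Because categories are numbered separately inside each semantic branch, two adjacent instances with similar reflectance but different semantics end up under different index entries.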

Applicability: Scene Recoloring

The reflectance predicted by the IntrinsicNeRF network is saved as [Semantic category, reflectance category], and the last iteration of the hierarchical iterative clustering method saves the reflectance categories in all semantic categories of the whole scene. Therefore, the [Semantic category, reflectance category] label can be used to quickly find the reflectance value of each pixel. Based on this representation, we can perform scene recoloring in real time: modifying the color of a single reflectance category changes the reflectance values of all pixels in the video belonging to that category at the same time. The edited video is then reconstructed from the modified reflectance together with the original shading and residual through Equation 2. The edited scene can perform novel view synthesis with the Play button ($\triangleright$).
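A minimal sketch of this lookup-and-recompose step is shown below. Equation 2 is not reproduced on this page, so the sketch assumes the usual intrinsic model with a residual term, image = reflectance × shading + residual; the `recolor` function, the `palette` dictionary, and the array shapes are hypothetical names chosen for illustration.

```python
import numpy as np

def recolor(labels, palette, shading, residual, target, new_color):
    """Recolor one reflectance category and recompose the frame.

    labels:   (H, W, 2) ints, [semantic category, reflectance category]
    palette:  dict mapping (semantic, category) -> RGB reflectance
    shading:  (H, W, 1) floats, residual: (H, W, 3) floats
    target:   the (semantic, category) key to edit
    """
    edited = dict(palette)  # leave the original palette untouched
    edited[target] = np.asarray(new_color, dtype=float)
    reflectance = np.zeros(labels.shape[:2] + (3,))
    for (sem, cat), color in edited.items():
        mask = (labels[..., 0] == sem) & (labels[..., 1] == cat)
        reflectance[mask] = color
    # Assumed composition: I = R * S + Re, clipped to the displayable range.
    return np.clip(reflectance * shading + residual, 0.0, 1.0)
```

Only the palette entry changes per edit; the per-pixel labels, shading, and residual are reused as they are, which is what makes the edit cheap enough to apply to every frame.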

Applicability: Material Editing

We can edit the surface materials by manipulating the shading layer, defining a simple mapping function between the original and the new shading image. In our video editing software, we use a tone mapping function: the user only needs to choose the ratio by adjusting a slider, and the mapping function is applied directly to the current shading image, which is then recombined with the reflectance and residual images to form a new image. We can make plastic materials (such as lego and hotdog), wooden materials (such as chair), and tile materials (such as ficus and jugs) look like metallic materials. We can also make the scene appear shinier or velvety.
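The exact tone mapping function is not given here, so the following is only a sketch under stated assumptions: `remap_shading` uses a hypothetical power curve controlled by the slider `ratio`, and `recombine` assumes the composition image = reflectance × shading + residual.

```python
import numpy as np

def remap_shading(shading, ratio):
    """Hypothetical tone curve on the shading layer.

    ratio = 1 leaves the shading unchanged; ratio > 1 darkens mid-tones
    while keeping the brightest value fixed, which increases contrast.
    """
    peak = max(float(shading.max()), 1e-8)
    return peak * (np.clip(shading, 0.0, None) / peak) ** ratio

def recombine(reflectance, shading, residual):
    """Assumed composition of the edited layers into an image."""
    return np.clip(reflectance * shading + residual, 0.0, 1.0)
```

For example, `recombine(reflectance, remap_shading(shading, 2.0), residual)` would produce a higher-contrast rendering from the same reflectance and residual layers.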

Applicability: Illumination Variation

Since IntrinsicNeRF also decomposes a residual term beyond the Lambertian assumption, which may capture properties such as specular illumination, we can adjust its overall brightness directly through the sliders of the video editing software. We can enhance or diminish the light to see the effect of different light intensities.
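As a worked illustration, and assuming the image is composed as reflectance times shading plus residual with the slider acting as a scalar gain $k$ on the residual term (both are assumptions, not confirmed details of the software), the edit can be written as:

```latex
\hat{I}(k) = R \odot S + k \cdot Re, \qquad k \ge 0
```

For a single pixel channel with $R = 0.6$, $S = 0.5$, and $Re = 0.2$, the original value is $\hat{I}(1) = 0.3 + 0.2 = 0.5$. Doubling the gain gives $\hat{I}(2) = 0.3 + 0.4 = 0.7$, and removing the residual gives $\hat{I}(0) = 0.3$.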

Applicability: Editable Novel View Synthesis

IntrinsicNeRF gives NeRF the ability to model additional fundamental properties of the scene while retaining the original novel view synthesis functionality. The effects of the video editing applications above, such as scene recoloring, can be applied to editable novel view synthesis while maintaining consistency across views.

Video Editing Software


We have also developed convenient augmented video editing software to help users perform object or scene editing.

Overview Video


    title={IntrinsicNeRF: Learning Intrinsic Neural Radiance Fields for Editable Novel View Synthesis},
    author={Ye, Weicai and Chen, Shuo and Bao, Chong and Bao, Hujun and Pollefeys, Marc and Cui, Zhaopeng and Zhang, Guofeng},


The authors thank Yuanqing Zhang for providing the pre-trained model of InvRender, Jiarun Liu for reproducing the results of PhySG, and Hai Li and Jundan Luo for proofreading the paper. This work was partially supported by NSF of China (No. 61932003) and ZJU-SenseTime Joint Lab of 3D Vision. Weicai Ye was partially supported by China Scholarship Council (No. 202206320316).