Learning Object-Compositional Neural Radiance Field
for Editable Scene Rendering

ICCV 2021


Bangbang Yang1, Yinda Zhang2, Yinghao Xu3, Yijin Li1, Han Zhou1, Hujun Bao1, Guofeng Zhang1, Zhaopeng Cui1

1State Key Lab of CAD & CG, Zhejiang University    2Google    3The Chinese University of Hong Kong

Abstract


Implicit neural rendering techniques have shown promising results for novel view synthesis. However, existing methods usually encode the entire scene as a whole, which makes them generally unaware of object identity and limits high-level editing tasks such as moving or adding furniture. In this paper, we present a novel neural scene rendering system, which learns an object-compositional neural radiance field and produces realistic rendering with editing capability for a cluttered, real-world scene. Specifically, we design a novel two-pathway architecture, in which the scene branch encodes the scene geometry and appearance, and the object branch encodes each standalone object conditioned on learnable object activation codes. To cope with training in heavily cluttered scenes, we propose a scene-guided training strategy to resolve the 3D space ambiguity in the occluded regions and learn sharp boundaries for each object. Extensive experiments demonstrate that our system not only achieves competitive performance for static scene novel-view synthesis, but also produces realistic rendering for object-level editing.


Scene Branch and Object Branch


We design a two-pathway architecture for the object-compositional neural radiance field. The scene branch renders the entire view of the scene, and also renders the background for editable scene rendering. The object branch renders each standalone object conditioned on the object activation code.


Animation Pipeline of Editable Scene Rendering


To obtain a view with object manipulation, we jointly render the transformed objects from the conditioned object branch and the surrounding background from the scene branch.
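The joint rendering described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes standard NeRF-style volume rendering, assumes that background and object samples along a ray are simply merged by depth, and assumes the background samples already exclude the moved object. The function names `to_object_frame` and `composite`, and the `far` bound, are hypothetical.

```python
import numpy as np

def to_object_frame(points, rotation, translation):
    """Map world-space points (N, 3) back to the object's original frame.

    Assumes the edit moved the object by x_new = R @ x_old + t, so the
    object branch should be queried at x_old = R^T @ (x_new - t).
    """
    return (points - translation) @ rotation

def composite(t_bg, sigma_bg, color_bg, t_obj, sigma_obj, color_obj, far):
    """Volume-render one ray from merged background and object samples.

    t_*     : (N,) sample depths along the ray
    sigma_* : (N,) opacities from the scene / object branch
    color_* : (N, 3) colors from the scene / object branch
    far     : depth used to close the last interval (assumption)
    """
    t = np.concatenate([t_bg, t_obj])
    order = np.argsort(t)                      # sort all samples by depth
    t = t[order]
    sigma = np.concatenate([sigma_bg, sigma_obj])[order]
    color = np.concatenate([color_bg, color_obj])[order]

    delta = np.diff(t, append=far)             # distance to the next sample
    alpha = 1.0 - np.exp(-sigma * delta)       # per-sample opacity
    # Transmittance: fraction of light that reaches each sample.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    return weights @ color                     # (3,) pixel color

# Toy ray: an opaque red object placed in front of a blue wall.
blue, red = np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.0, 0.0])
t_bg = np.array([1.0, 2.0, 3.0])
sigma_bg = np.array([0.0, 0.0, 50.0])
color_bg = np.tile(blue, (3, 1))
pixel = composite(t_bg, sigma_bg, color_bg,
                  np.array([1.5]), np.array([1e3]), red[None, :], far=4.0)
```

In this toy ray the object sample sits in front of the wall and is nearly opaque, so it dominates the pixel. Passing empty object arrays leaves only the background contribution.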


Examples on the ToyDesk Dataset



Examples on the ScanNet Dataset



Comparison of Scene Editing


* Dai P, Zhang Y, Li Z, et al. Neural Point Cloud Rendering via Multi-Plane Projection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 7830-7839.


Framework Overview


Architecture

We design a two-pathway architecture for the object-compositional neural radiance field. The scene branch takes the spatial coordinate $\mathbf{x}$, the interpolated scene voxel features $\boldsymbol{f}_{scn}$ at $\mathbf{x}$ and the ray direction $\mathbf{d}$ as input, and outputs the color $\mathbf{c}_{scn}$ and opacity $\sigma_{scn}$ of the scene. The object branch additionally takes the object voxel features $\boldsymbol{f}_{obj}$ as well as an object activation code $\boldsymbol{l}_{obj}$, which conditions the output so that it contains only the color $\mathbf{c}_{obj}$ and opacity $\sigma_{obj}$ of a specific object at its original location, with everything else removed.
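To make the inputs and outputs of the two branches concrete, here is a minimal NumPy sketch of their interfaces. It is only an illustration under stated assumptions: the networks are random-weight two-layer MLPs rather than trained models, the feature sizes `F_SCN`, `F_OBJ` and `L_OBJ` are made up, positional encoding is omitted, the object branch is assumed to receive the scene features in addition to the object features, and, unlike a typical NeRF, opacity is not separated from the view direction.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(in_dim, hidden, out_dim):
    """Random-weight two-layer MLP, a stand-in for a trained network."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.1, (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }

def run_mlp(params, inputs):
    hidden = np.maximum(inputs @ params["w1"] + params["b1"], 0.0)  # ReLU
    return hidden @ params["w2"] + params["b2"]

def decode(raw):
    """Split raw output into color in [0, 1] and non-negative opacity."""
    color = 1.0 / (1.0 + np.exp(-raw[:, :3]))   # sigmoid
    sigma = np.maximum(raw[:, 3], 0.0)          # ReLU
    return color, sigma

# Hypothetical feature sizes, chosen only for this sketch.
F_SCN, F_OBJ, L_OBJ = 8, 8, 4
scene_mlp = make_mlp(3 + F_SCN + 3, 32, 4)
object_mlp = make_mlp(3 + F_SCN + F_OBJ + L_OBJ + 3, 32, 4)

def scene_branch(x, f_scn, d):
    """(x, f_scn, d) -> (c_scn, sigma_scn) for a batch of sample points."""
    return decode(run_mlp(scene_mlp, np.concatenate([x, f_scn, d], axis=1)))

def object_branch(x, f_scn, f_obj, l_obj, d):
    """Same inputs plus f_obj and one activation code l_obj shared by the batch."""
    code = np.broadcast_to(l_obj, (x.shape[0], l_obj.shape[0]))
    inputs = np.concatenate([x, f_scn, f_obj, code, d], axis=1)
    return decode(run_mlp(object_mlp, inputs))

# Query both branches at five random sample points.
n = 5
x, d = rng.normal(size=(n, 3)), rng.normal(size=(n, 3))
f_scn, f_obj = rng.normal(size=(n, F_SCN)), rng.normal(size=(n, F_OBJ))
l_obj = rng.normal(size=L_OBJ)
c_scn, sigma_scn = scene_branch(x, f_scn, d)
c_obj, sigma_obj = object_branch(x, f_scn, f_obj, l_obj, d)
```

Swapping `l_obj` for a different code while keeping the other inputs fixed is what would select a different object in a trained model; with random weights the sketch only demonstrates the data flow.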


Overview Video



Citation


@inproceedings{yang2021objectnerf,
    title={Learning Object-Compositional Neural Radiance Field for Editable Scene Rendering},
    author={Yang, Bangbang and Zhang, Yinda and Xu, Yinghao and Li, Yijin and Zhou, Han and Bao, Hujun and Zhang, Guofeng and Cui, Zhaopeng},
    booktitle = {International Conference on Computer Vision ({ICCV})},
    month = {October},
    year = {2021},
}

Acknowledgements


We thank Hanqing Jiang, Liyang Zhou and Jiaming Sun for their kind help in scene reconstruction and annotation for the ToyDesk dataset.