Semantic labelling is highly correlated with geometry and radiance reconstruction, as scene entities with similar shape and appearance are more likely to come from similar classes. Recent implicit neural reconstruction techniques are appealing because they do not require prior training data, but the same fully self-supervised approach is not possible for semantics, because labels are human-defined properties. We extend neural radiance fields (NeRF) to jointly encode semantics with appearance and geometry, so that complete and accurate 2D semantic labels can be achieved using only a small amount of in-place annotation specific to the scene. The intrinsic multi-view consistency and smoothness of NeRF benefit semantics by enabling sparse labels to propagate efficiently. We show the benefit of this approach when labels are either sparse or very noisy in room-scale scenes, and demonstrate its advantageous properties in applications such as an efficient scene-labelling tool, novel semantic view synthesis, label denoising, super-resolution, label interpolation, and multi-view semantic label fusion in visual semantic mapping systems.
A scene-specific implicit 3D semantic representation is obtained by training on colour images and semantic labels with associated poses. Because the representation enforces multi-view consistency and internal coherence, we find that training Semantic-NeRF is itself a multi-view label fusion and propagation process, i.e., fusion via learning. Consequently, Semantic-NeRF can be trained efficiently to render the full scene accurately from various types of sparse or imperfect labels. We aim to demonstrate the benefits and promising applications of efficiently learning such a joint 3D representation for semantic labelling and understanding.
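As a concrete illustration, the sketch below (PyTorch is an assumption; the class name, layer sizes, and loss weight `lam` are illustrative rather than the authors' exact implementation) shows the core idea: a NeRF-style MLP gains an extra semantic head, per-ray colour and class logits are composited with the same volume-rendering weights, and a photometric loss is combined with a cross-entropy loss on the rendered logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticNeRF(nn.Module):
    def __init__(self, pos_dim=63, hidden=256, num_classes=28):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma = nn.Linear(hidden, 1)               # view-independent density
        self.rgb = nn.Linear(hidden, 3)                 # (view directions omitted for brevity)
        self.semantic = nn.Linear(hidden, num_classes)  # view-independent class logits

    def forward(self, x):
        h = self.trunk(x)
        return self.sigma(h), self.rgb(h), self.semantic(h)

def composite(weights, values):
    # weights: (rays, samples) volume-rendering weights; values: (rays, samples, C).
    return (weights.unsqueeze(-1) * values).sum(dim=1)

def joint_loss(weights, rgb_samples, sem_samples, gt_rgb, gt_label, lam=0.04):
    # Both outputs are composited with the SAME weights, then supervised jointly.
    pred_rgb = composite(weights, torch.sigmoid(rgb_samples))
    pred_logits = composite(weights, sem_samples)
    photometric = F.mse_loss(pred_rgb, gt_rgb)
    semantic = F.cross_entropy(pred_logits, gt_label)
    return photometric + lam * semantic
```

Because the semantic head shares both the trunk and the rendering weights with colour and density, it inherits the multi-view-consistent geometry learned from photometric supervision, which is what lets sparse labels spread across views.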
Our proposed semantic representation can be trained with labels from only a few key-frames. Pixels with high entropy match well to object boundaries and uncertain/unknown regions.
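The entropy maps referred to here can be computed from the rendered class distribution; a minimal sketch (tensor names are assumptions):

```python
import torch.nn.functional as F

def entropy_map(pred_logits):
    # pred_logits: (H, W, num_classes) composited semantic logits per pixel.
    probs = F.softmax(pred_logits, dim=-1)
    # Shannon entropy of the per-pixel class distribution; clamp avoids log(0).
    return -(probs * probs.clamp_min(1e-10).log()).sum(dim=-1)  # (H, W)
```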
Semantic-NeRF, trained only with severely corrupted labels, is able to learn a smooth representation and render denoised versions of its input labels. While pixel-wise denoising at this level of corruption is not a realistic application, it is a very challenging task and highlights our key observation that training Semantic-NeRF is itself a fusion process, which enables smooth renderings thanks to the internal consistency and coherence of the implicit joint representation.
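For context, such corrupted supervision can be simulated by flipping a fraction of label pixels to uniformly random classes before training; the sketch below is one illustrative way to do so, with the flip rate and mechanism as assumptions:

```python
import torch

def corrupt_labels(labels, num_classes, flip_rate=0.5, generator=None):
    # labels: (H, W) integer class map; returns a copy in which `flip_rate`
    # of the pixels have been reassigned to uniformly random classes.
    noisy = labels.clone()
    mask = torch.rand(labels.shape, generator=generator) < flip_rate
    noisy[mask] = torch.randint(0, num_classes, (int(mask.sum()),), generator=generator)
    return noisy
```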
Low-resolution labels from lightweight CNNs or manual annotation are less costly to acquire than high-resolution ones. By supervising Semantic-NeRF with only low-resolution labels, we can accurately super-resolve the input labels. Note that the coarse and sparse training labels use the same amount of information from the low-resolution label maps; the two regimes are sketched below. (Sparse labels have been zoomed in 4 times for ease of visualisation.)
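Under the assumption that a low-resolution label map is the only supervision available, the two regimes could be set up as follows; both carry exactly the same label information. The function names and ignore-index convention are illustrative:

```python
import torch
import torch.nn.functional as F

def coarse_labels(low_res, size):
    # low_res: (h, w) integer map -> dense (H, W) map via nearest-neighbour upsampling.
    return F.interpolate(low_res[None, None].float(), size=size, mode="nearest")[0, 0].long()

def sparse_labels(low_res, size, ignore_index=-1):
    # Dense (H, W) map in which only one pixel per low-resolution cell is labelled;
    # all other pixels are marked with ignore_index and excluded from the loss.
    H, W = size
    h, w = low_res.shape
    dense = torch.full(size, ignore_index, dtype=torch.long)
    ys = (torch.arange(h) * H) // h
    xs = (torch.arange(w) * W) // w
    dense[ys[:, None], xs[None, :]] = low_res.long()
    return dense
```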
Inspired by the success of label super-resolution, practical interactive annotation from users, in the form of clicks, strokes or scratches, is desirable. We show that these partial labels can also be propagated to dense scene labelling by Semantic-NeRF; a masked-loss sketch follows below. Even a single click per class per frame leads to very competitive semantic rendering of the whole scene. (Single-click labels below have been zoomed in 9 times for ease of visualisation.)
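A minimal sketch of supervising from clicks or strokes: unlabelled pixels carry an ignore index, so the cross-entropy is computed only where a user has provided a label, and the shared representation propagates those few labels everywhere else. The `ignore_index=-1` convention is an illustrative choice:

```python
import torch.nn.functional as F

def partial_label_loss(pred_logits, sparse_gt, ignore_index=-1):
    # pred_logits: (N, C) rendered logits for N sampled rays; sparse_gt: (N,)
    # labels, with ignore_index for rays whose pixel was never annotated.
    return F.cross_entropy(pred_logits, sparse_gt, ignore_index=ignore_index)
```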
Explicit 3D meshes can be extracted from Semantic-NeRF by querying the MLP on a dense grid within the scene and then applying marching cubes; a sketch of this procedure follows. A semantic texture is attached by treating the negative normal direction of each mesh vertex as the ray-marching direction during volume rendering. Note that Semantic-NeRF is able to predict decent geometry and semantics even in occluded regions (e.g., areas behind the sofa) and to fill holes in unobserved regions to some extent.
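The sketch below assumes the trained MLP can be queried for density and semantics at arbitrary points; `query_density` and `render_semantics_along_ray` stand in for the model's query and volume-rendering routines, and the grid resolution and density threshold are illustrative:

```python
import numpy as np
from skimage import measure

def extract_semantic_mesh(query_density, render_semantics_along_ray,
                          bounds, res=128, level=10.0):
    lo, hi = bounds  # axis-aligned scene bounds, each a length-3 array
    xs, ys, zs = [np.linspace(lo[i], hi[i], res) for i in range(3)]
    grid = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1)
    sigma = query_density(grid.reshape(-1, 3)).reshape(res, res, res)

    # Marching cubes on the density field gives the explicit surface.
    verts, faces, normals, _ = measure.marching_cubes(sigma, level=level)
    verts = lo + verts / (res - 1) * (hi - lo)  # grid indices -> world coordinates

    # Semantic texture: march a ray into the surface along the negative vertex
    # normal and composite class logits, exactly as in ordinary rendering.
    labels = np.array([render_semantics_along_ray(v, -n).argmax()
                       for v, n in zip(verts, normals)])
    return verts, faces, labels
```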
We have shown that adding a semantic output to a scene-specific implicit MLP model of geometry and appearance means that complete and high-resolution semantic labels can be generated for a scene when only partial, noisy or low-resolution semantic supervision is available. This method has practical uses in robotics and other applications where scene understanding is required in new scenes and only limited labelling is possible.
An interesting direction for future research is interactive labelling, where the continually trained network asks for the new labels that will most reduce semantic ambiguity across the whole scene.
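One conceivable realisation of this direction, sketched purely under assumptions: rank candidate unlabelled frames by the mean entropy of the current model's rendered class distributions and request annotation for the most ambiguous one. `render_logits` is a hypothetical helper, not part of the described system:

```python
import torch
import torch.nn.functional as F

def next_frame_to_label(render_logits, candidate_poses):
    # Score each candidate viewpoint by the mean per-pixel entropy of the
    # current model's rendered class distribution; return the argmax index.
    scores = []
    for pose in candidate_poses:
        probs = F.softmax(render_logits(pose), dim=-1)          # (H, W, C)
        ent = -(probs * probs.clamp_min(1e-10).log()).sum(-1)   # (H, W)
        scores.append(ent.mean().item())
    return int(torch.tensor(scores).argmax())
```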