In-Place Scene Labelling and Understanding with Implicit Scene Representation

ICCV 2021 (Oral)


Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, Andrew Davison

Dyson Robotics Laboratory, Imperial College London

Paper (arXiv)
Video (YouTube)
Video (Bilibili)
Code
Data

Semantic labelling is highly correlated with geometry and radiance reconstruction, as scene entities with similar shape and appearance are more likely to come from similar classes. Recent implicit neural reconstruction techniques are appealing as they do not require prior training data, but the same fully self-supervised approach is not possible for semantics because labels are human-defined properties.

We extend neural radiance fields (NeRF) to jointly encode semantics with appearance and geometry, so that complete and accurate 2D semantic labels can be achieved using a small number of in-place annotations specific to the scene. The intrinsic multi-view consistency and smoothness of NeRF benefit semantics by enabling sparse labels to propagate efficiently. We show the benefit of this approach when labels are either sparse or very noisy in room-scale scenes. We demonstrate its advantageous properties in various interesting applications such as an efficient scene labelling tool, novel semantic view synthesis, label denoising, super-resolution, label interpolation and multi-view semantic label fusion in visual semantic mapping systems.
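As a concrete illustration, below is a minimal PyTorch sketch of one way such a joint model can be structured: a shared trunk predicts density and view-independent semantic logits from position alone, while colour additionally depends on viewing direction. Layer sizes, encoding dimensions and names are illustrative assumptions, not the exact architecture from the paper.

    import torch
    import torch.nn as nn

    class SemanticNeRFSketch(nn.Module):
        """Toy NeRF backbone with an extra view-independent semantic head."""
        def __init__(self, pos_dim=63, dir_dim=27, hidden=256, num_classes=28):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(pos_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.sigma = nn.Linear(hidden, 1)               # density: position only
            self.semantic = nn.Linear(hidden, num_classes)  # class logits: position only
            self.rgb = nn.Sequential(                       # colour also sees view direction
                nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
                nn.Linear(hidden // 2, 3), nn.Sigmoid(),
            )

        def forward(self, x_enc, d_enc):
            h = self.trunk(x_enc)
            sigma = torch.relu(self.sigma(h))
            logits = self.semantic(h)  # view-independent: semantics follow geometry
            rgb = self.rgb(torch.cat([h, d_enc], dim=-1))
            return rgb, sigma, logits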

Applications


A scene-specific implicit 3D semantic representation is obtained by training on colour images and semantic labels with associated poses. Because the representation enforces multi-view consistency and internal coherence, we find that the training process of Semantic-NeRF is itself a multi-view label fusion and propagation process, i.e., fusion via learning. As a result, Semantic-NeRF can be trained efficiently to render accurate labels for the full scene from various types of sparse or imperfect annotations.

We aim to demonstrate the benefits and promising applications of efficiently learning such a joint 3D representation for semantic labelling and understanding.
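The training objective combines a photometric loss with a cross-entropy term on the volume-rendered semantic logits, applied only at pixels that actually carry a label; this masking is what lets sparse or partial annotations supervise the model. A hedged sketch follows, where the weighting lam, the tensor shapes and the function name are assumptions:

    import torch
    import torch.nn.functional as F

    def joint_loss(rgb, logits, weights, gt_rgb, gt_label, labelled, lam=0.04):
        # rgb: (rays, samples, 3); logits: (rays, samples, C); weights: (rays, samples)
        rgb_map = (weights[..., None] * rgb).sum(dim=-2)       # rendered colour per ray
        logit_map = (weights[..., None] * logits).sum(dim=-2)  # rendered class logits per ray
        photometric = ((rgb_map - gt_rgb) ** 2).mean()
        # Supervise semantics only at pixels with a (possibly sparse or noisy) label.
        semantic = F.cross_entropy(logit_map[labelled], gt_label[labelled])
        return photometric + lam * semantic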

Semantic View Synthesis


Our proposed semantic representation can be trained with labels from only a few key-frames. Pixels with high entropy match object boundaries and uncertain or unknown regions well.

Replica Room_2 (using labels from 20% of the frames in the full sequence)
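The uncertainty maps shown here can be reproduced by turning the rendered per-pixel logits into a categorical distribution and computing its Shannon entropy; a short sketch under those assumptions:

    def pixel_entropy(logit_map):
        # logit_map: (pixels, C) volume-rendered semantic logits.
        probs = logit_map.softmax(dim=-1)
        return -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)  # high at boundaries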

Semantic Label Denoising


Semantic-NeRF, trained only with severely corrupted labels, is able to learn a smooth representation and render denoised versions of its input labels. While pixel-wise denoising at this level is not a realistic application, it is a very challenging task and highlights our key observation: training Semantic-NeRF is itself a fusion process, yielding smooth renderings that benefit from the internal consistency and coherence of the implicit joint representation.

Replica Office_0
Replica Room_1

ScanNet Scene0010_00
ScanNet Scene0012_00
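For these experiments the training labels are first synthetically corrupted. A plausible sketch of pixel-wise corruption, assuming a uniform random flip model (the paper's exact noise model and ratios may differ):

    def corrupt_labels(labels, num_classes, flip_ratio=0.5):
        # labels: (H, W) integer class map; re-assign a random fraction of pixels.
        noise = torch.randint_like(labels, num_classes)
        flip = torch.rand(labels.shape, device=labels.device) < flip_ratio
        return torch.where(flip, noise, labels)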

Semantic Label Super-Resolution


Low-resolution labels from lightweight CNNs or manual annotation are less costly to acquire than high-resolution ones. By supervising Semantic-NeRF with only low-resolution labels, we can accurately super-resolve the input labels.
Note that the coarse and sparse training labels use the same amount of information from the low-resolution label maps. (Sparse labels have been zoomed in 4 times for ease of visualisation.)

Replica Office_3 Coarse
Replica Office_3 Sparse

ScanNet Scene0088_00 Coarse
ScanNet Scene0088_00 Sparse
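One way to realise the sparse variant is to map each low-resolution label to a single full-resolution pixel centre, so that the coarse (nearest-upsampled) and sparse variants expose the model to identical label information. A sketch, assuming an integer down-scaling factor scale:

    def sparse_pixels_from_lowres(lowres_labels, scale):
        # lowres_labels: (h, w); returns full-res (row, col) centres and their labels.
        h, w = lowres_labels.shape
        ys = torch.arange(h) * scale + scale // 2
        xs = torch.arange(w) * scale + scale // 2
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([yy.reshape(-1), xx.reshape(-1)], dim=-1)
        return coords, lowres_labels.reshape(-1)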

Label Propagation (Interactive Segmentation)


Inspired by the success of label super-resolution, we consider practical interactive annotation, where users provide partial labels in the form of clicks, strokes or scratches. We show that these partial labels can also be propagated to a dense labelling of the whole scene by Semantic-NeRF.
Even a single click per class per frame leads to very competitive semantic renderings of the whole scene. (Single-click labels below have been zoomed in 9 times for ease of visualisation.)

Replica Room_0 Single-Click
Replica Room_0 1% Label
Replica Room_0 5% Label
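Click-style supervision slots into the same masked cross-entropy loss sketched earlier: each user click simply marks one labelled pixel in a frame. A minimal sketch, where the click format is an assumption:

    def clicks_to_supervision(clicks, H, W):
        # clicks: iterable of (row, col, class_id) tuples for one frame.
        labelled = torch.zeros(H, W, dtype=torch.bool)
        labels = torch.zeros(H, W, dtype=torch.long)
        for r, c, k in clicks:
            labelled[r, c] = True
            labels[r, c] = k
        return labelled, labels  # feed into the masked semantic loss above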

Semantic 3D Reconstruction from Posed Images


Explicit 3D meshes can be extracted from Semantic-NeRF by querying the MLP on a dense grid within the scene and then applying marching cubes. The attached semantic texture is rendered by treating the negative normal direction of each mesh vertex as the ray-marching direction during volume rendering. Note that Semantic-NeRF is able to predict decent geometry and semantics even in occluded regions (e.g., areas behind the sofa) and, to some extent, fill holes in unobserved regions.


Reconstructed mesh of Replica Room_0 with a 256³ grid.
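A hedged sketch of this extraction pipeline, using scikit-image's marching cubes; for brevity it queries semantic logits directly at the mesh vertices rather than volume-rendering along the negative normal as described above, and the query functions, scene bounds and iso level are assumptions:

    import torch
    from skimage import measure

    @torch.no_grad()
    def extract_semantic_mesh(query_sigma, query_logits, lo, hi, res=256, iso=10.0):
        # Evaluate density on a dense res^3 grid spanning the scene bounds [lo, hi].
        t = torch.linspace(lo, hi, res)
        pts = torch.stack(torch.meshgrid(t, t, t, indexing="ij"), dim=-1).reshape(-1, 3)
        sigma = query_sigma(pts).reshape(res, res, res).cpu().numpy()
        verts, faces, _, _ = measure.marching_cubes(sigma, level=iso)
        verts_world = lo + verts / (res - 1) * (hi - lo)  # grid indices -> world coords
        labels = query_logits(torch.from_numpy(verts_world).float()).argmax(dim=-1)
        return verts_world, faces, labels.cpu().numpy()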

Conclusion


We have shown that adding a semantic output to a scene-specific implicit MLP model of geometry and appearance means that complete and high-resolution semantic labels can be generated for a scene even when only partial, noisy or low-resolution semantic supervision is available. This method has practical uses in robotics and other applications where scene understanding is required in new scenes in which only limited labelling is possible.
An interesting direction for future research is interactive labelling, where the continually training network asks for the new labels which will most resolve semantic ambiguity for the whole scene.

Paper


Bibtex


@inproceedings{Zhi:etal:ICCV2021,
  author    = {Shuaifeng Zhi and Tristan Laidlow and Stefan Leutenegger and Andrew Davison},
  title     = {In-Place Scene Labelling and Understanding with Implicit Scene Representation},
  booktitle = {Proceedings of the International Conference on Computer Vision (ICCV)},
  year      = {2021}
}

The website template was borrowed from SIREN.