Semantic labelling is highly correlated with geometry and radiance reconstruction, as scene entities with similar shape and appearance are more likely to come from similar classes. Recent implicit neural reconstruction techniques are appealing because they do not require prior training data, but the same fully self-supervised approach is not possible for semantics, because labels are human-defined properties. We extend neural radiance fields (NeRF) to jointly encode semantics with appearance and geometry, so that complete and accurate 2D semantic labels can be achieved using only a small amount of in-place annotation specific to the scene. The intrinsic multi-view consistency and smoothness of NeRF benefit semantics by enabling sparse labels to propagate efficiently. We show the benefit of this approach when labels are either sparse or very noisy in room-scale scenes, and demonstrate its advantageous properties in applications such as an efficient scene-labelling tool, novel semantic view synthesis, label denoising, super-resolution, label interpolation, and multi-view semantic label fusion in visual semantic mapping systems.
A scene-specific implicit 3D semantic representation is obtained by training on colour images and semantic labels with associated poses. Because the representation enforces multi-view consistency and internal coherence, we find that training Semantic-NeRF is itself a multi-view label fusion and propagation process, i.e., fusion via learning. Consequently, Semantic-NeRF can be trained efficiently to render the full scene accurately from various types of sparse or imperfect labels. We aim to demonstrate the benefits and promising applications of efficiently learning such a joint 3D representation for semantic labelling and understanding.
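As a concrete illustration, the sketch below (PyTorch is an assumption; the class name, layer sizes, and loss weight `lam` are illustrative rather than the authors' exact implementation) shows the core idea: a NeRF-style MLP gains an extra semantic head, per-ray colour and class logits are composited with the same volume-rendering weights, and a photometric loss is combined with a cross-entropy loss on the rendered logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticNeRF(nn.Module):
    def __init__(self, pos_dim=63, hidden=256, num_classes=28):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.sigma = nn.Linear(hidden, 1)               # view-independent density
        self.rgb = nn.Linear(hidden, 3)                 # (view directions omitted for brevity)
        self.semantic = nn.Linear(hidden, num_classes)  # view-independent class logits

    def forward(self, x):
        h = self.trunk(x)
        return self.sigma(h), self.rgb(h), self.semantic(h)

def composite(weights, values):
    # weights: (rays, samples) volume-rendering weights; values: (rays, samples, C).
    return (weights.unsqueeze(-1) * values).sum(dim=1)

def joint_loss(weights, rgb_samples, sem_samples, gt_rgb, gt_label, lam=0.04):
    # Both outputs are composited with the SAME weights, then supervised jointly.
    pred_rgb = composite(weights, torch.sigmoid(rgb_samples))
    pred_logits = composite(weights, sem_samples)
    photometric = F.mse_loss(pred_rgb, gt_rgb)
    semantic = F.cross_entropy(pred_logits, gt_label)
    return photometric + lam * semantic
```

Because the semantic head shares both the trunk and the rendering weights with colour and density, it inherits the multi-view-consistent geometry learned from photometric supervision, which is what lets sparse labels spread across views.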
Our proposed semantic representation can be trained with labels from only a few key-frames. Pixels with high entropy match well to object boundaries and uncertain/unknown regions.
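The entropy maps referred to here can be computed from the rendered class distribution; a minimal sketch (tensor names are assumptions):

```python
import torch.nn.functional as F

def entropy_map(pred_logits):
    # pred_logits: (H, W, num_classes) composited semantic logits per pixel.
    probs = F.softmax(pred_logits, dim=-1)
    # Shannon entropy of the per-pixel class distribution; clamp avoids log(0).
    return -(probs * probs.clamp_min(1e-10).log()).sum(dim=-1)  # (H, W)
```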
Semantic-NeRF, trained only with severely corrupted labels, is able to learn a smooth representation and render denoised versions of its input labels. While pixel-wise denoising at this level of corruption is not a realistic application, it is a very challenging task and highlights our key observation that training Semantic-NeRF is itself a fusion process, which enables smooth renderings thanks to the internal consistency and coherence of the implicit joint representation.
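For context, such corrupted supervision can be simulated by flipping a fraction of label pixels to uniformly random classes before training; the sketch below is one illustrative way to do so, with the flip rate and mechanism as assumptions:

```python
import torch

def corrupt_labels(labels, num_classes, flip_rate=0.5, generator=None):
    # labels: (H, W) integer class map; returns a copy in which `flip_rate`
    # of the pixels have been reassigned to uniformly random classes.
    noisy = labels.clone()
    mask = torch.rand(labels.shape, generator=generator) < flip_rate
    noisy[mask] = torch.randint(0, num_classes, (int(mask.sum()),), generator=generator)
    return noisy
```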
Low-resolution labels from lightweight CNNs or manual annotation are less costly to acquire than high-resolution ones. By supervising Semantic-NeRF with only low-resolution labels, we can accurately super-resolve the input labels. Note that the coarse and sparse training labels use the same amount of information from the low-resolution label maps; the two regimes are sketched below. (Sparse labels have been zoomed in 4 times for ease of visualisation.)
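Under the assumption that a low-resolution label map is the only supervision available, the two regimes could be set up as follows; both carry exactly the same label information. The function names and ignore-index convention are illustrative:

```python
import torch
import torch.nn.functional as F

def coarse_labels(low_res, size):
    # low_res: (h, w) integer map -> dense (H, W) map via nearest-neighbour upsampling.
    return F.interpolate(low_res[None, None].float(), size=size, mode="nearest")[0, 0].long()

def sparse_labels(low_res, size, ignore_index=-1):
    # Dense (H, W) map in which only one pixel per low-resolution cell is labelled;
    # all other pixels are marked with ignore_index and excluded from the loss.
    H, W = size
    h, w = low_res.shape
    dense = torch.full(size, ignore_index, dtype=torch.long)
    ys = (torch.arange(h) * H) // h
    xs = (torch.arange(w) * W) // w
    dense[ys[:, None], xs[None, :]] = low_res.long()
    return dense
```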
Inspired by the success of label super-resolution, practical interactive annotation from users, in the form of clicks, strokes or scratches, is desirable. We show that these partial labels can also be propagated to dense scene labelling by Semantic-NeRF; a masked-loss sketch follows below. Even a single click per class per frame leads to very competitive semantic rendering of the whole scene. (Single-click labels below have been zoomed in 9 times for ease of visualisation.)
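A minimal sketch of supervising from clicks or strokes: unlabelled pixels carry an ignore index, so the cross-entropy is computed only where a user has provided a label, and the shared representation propagates those few labels everywhere else. The `ignore_index=-1` convention is an illustrative choice:

```python
import torch.nn.functional as F

def partial_label_loss(pred_logits, sparse_gt, ignore_index=-1):
    # pred_logits: (N, C) rendered logits for N sampled rays; sparse_gt: (N,)
    # labels, with ignore_index for rays whose pixel was never annotated.
    return F.cross_entropy(pred_logits, sparse_gt, ignore_index=ignore_index)
```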
Explicit 3D meshes can be extracted from Semantic-NeRF by querying the MLP on a dense grid within the scene and then applying marching cubes; a sketch of this procedure follows. A semantic texture is attached by treating the negative normal direction of each mesh vertex as the ray-marching direction during volume rendering. Note that Semantic-NeRF is able to predict decent geometry and semantics even in occluded regions (e.g., areas behind the sofa) and to fill holes in unobserved regions to some extent.
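The sketch below assumes the trained MLP can be queried for density and semantics at arbitrary points; `query_density` and `render_semantics_along_ray` stand in for the model's query and volume-rendering routines, and the grid resolution and density threshold are illustrative:

```python
import numpy as np
from skimage import measure

def extract_semantic_mesh(query_density, render_semantics_along_ray,
                          bounds, res=128, level=10.0):
    lo, hi = bounds  # axis-aligned scene bounds, each a length-3 array
    xs, ys, zs = [np.linspace(lo[i], hi[i], res) for i in range(3)]
    grid = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1)
    sigma = query_density(grid.reshape(-1, 3)).reshape(res, res, res)

    # Marching cubes on the density field gives the explicit surface.
    verts, faces, normals, _ = measure.marching_cubes(sigma, level=level)
    verts = lo + verts / (res - 1) * (hi - lo)  # grid indices -> world coordinates

    # Semantic texture: march a ray into the surface along the negative vertex
    # normal and composite class logits, exactly as in ordinary rendering.
    labels = np.array([render_semantics_along_ray(v, -n).argmax()
                       for v, n in zip(verts, normals)])
    return verts, faces, labels
```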
We have shown that adding a semantic output to a scene-specific implicit MLP model of geometry and appearance means that complete and high-resolution semantic labels can be generated for a scene when only partial, noisy or low-resolution semantic supervision is available. This method has practical uses in robotics and other applications where scene understanding is required in new scenes and only limited labelling is possible.
An interesting direction for future research is interactive labelling, where the continually trained network asks for the new labels that will most reduce semantic ambiguity across the whole scene.
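One conceivable realisation of this direction, sketched purely under assumptions: rank candidate unlabelled frames by the mean entropy of the current model's rendered class distributions and request annotation for the most ambiguous one. `render_logits` is a hypothetical helper, not part of the described system:

```python
import torch
import torch.nn.functional as F

def next_frame_to_label(render_logits, candidate_poses):
    # Score each candidate viewpoint by the mean per-pixel entropy of the
    # current model's rendered class distribution; return the argmax index.
    scores = []
    for pose in candidate_poses:
        probs = F.softmax(render_logits(pose), dim=-1)          # (H, W, C)
        ent = -(probs * probs.clamp_min(1e-10).log()).sum(-1)   # (H, W)
        scores.append(ent.mean().item())
    return int(torch.tensor(scores).argmax())
```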