Easy3D-Labels: Supervising Semantic Occupancy Estimation with 3D Pseudo-Ground-Truth Labels
Description
Self-supervised semantic occupancy estimation aims to predict a discretized 3D voxel grid of semantic labels for the surrounding scene. Because the setting is self-supervised, conventional 3D ground-truth annotations cannot be used, and supervision instead relies on 2D pseudo-labels produced by Visual Foundation Models. Given this 2D supervision space, methods such as novel view synthesis, cross-view rendering, and depth estimation are employed to resolve depth and semantic ambiguity. However, these techniques often introduce substantial computational and memory overhead during training, particularly in the case of novel view synthesis. To address these limitations, we propose Easy3D-Labels, 3D pseudo-ground-truth labels that can be used to supervise models on the Occ3D-nuScenes dataset. They are generated using the foundation models Grounded-SAM and Metric3Dv2, which provide semantic maps and depth maps, respectively. We incorporate temporal information to densify the labels, yielding more accurate final estimates. Easy3D-Labels can be integrated into existing models either as a supplementary loss or as the primary supervision signal, as explored in our paper, EasyOcc: https://arxiv.org/abs/2509.26087. Through the release of this dataset, we underscore the importance of the loss computation space in self-supervised learning for holistic scene understanding and enable more robust and accurate 3D scene estimation.
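As a minimal sketch of what using Easy3D-Labels as a supervision signal could look like, the snippet below computes a voxel-wise cross-entropy loss between predicted per-voxel class logits and the 3D pseudo-ground-truth labels, skipping unobserved voxels. The function name, the flattened `(N_voxels, C)` shape, and the ignore index of 255 are illustrative assumptions, not the exact configuration used in the paper.

```python
import numpy as np

def voxel_cross_entropy(logits, labels, ignore_index=255):
    """Mean cross-entropy over observed voxels.

    logits: (N_voxels, C) per-voxel class scores from the occupancy model.
    labels: (N_voxels,) integer Easy3D-Labels; voxels marked with
            `ignore_index` (assumed to mean "unobserved") are excluded.
    """
    mask = labels != ignore_index          # keep only observed voxels
    z = logits[mask]
    y = labels[mask]
    # Numerically stable log-softmax over the class dimension.
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the pseudo-ground-truth class per voxel.
    return -log_probs[np.arange(len(y)), y].mean()
```

In a training loop this term could be added to the existing 2D objectives as a supplementary loss, or used on its own as the primary supervision signal.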
Files
Steps to reproduce
- Semantic maps are generated by running Grounded-SAM on the nuScenes dataset; these semantic maps are provided in the OccNeRF repository: https://github.com/LinShan-Bin/OccNeRF
- Depth maps are generated locally by running Metric3Dv2 (Giant) on the nuScenes dataset: https://github.com/YvanYin/Metric3D
- For each sample, we project the semantic maps into 3D point clouds using the depth maps, and aggregate the point clouds of previous samples to densify the result.
- The 3D point clouds are then voxelized according to the voxel bounds of the Occ3D-nuScenes dataset.

This process is further detailed in our paper on arXiv: https://arxiv.org/abs/2509.26087
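The projection and voxelization steps above can be sketched as follows: a 2D semantic map is lifted into a 3D semantic point cloud via the per-pixel depth map and the camera intrinsics, and the points are then binned into a label grid. The intrinsics, image size, voxel bounds, and last-write label assignment here are simplifying assumptions for illustration; the actual pipeline follows the Occ3D-nuScenes voxel bounds and additionally transforms point clouds from previous samples into the current frame (via ego poses) before voxelizing.

```python
import numpy as np

def backproject(semantic, depth, K):
    """Lift a semantic map into a 3D semantic point cloud.

    semantic: (H, W) integer class ids from Grounded-SAM.
    depth:    (H, W) metric depth from Metric3Dv2.
    K:        (3, 3) camera intrinsics.
    Returns points (N, 3) in the camera frame and their labels (N,).
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T          # rays at unit depth
    pts = rays * depth.reshape(-1, 1)        # scale each ray by its depth
    return pts, semantic.reshape(-1)

def voxelize(points, labels, bounds, voxel_size=0.4):
    """Bin labeled points into a voxel grid (last write wins; a
    majority vote per voxel would be a natural refinement).

    bounds: (lo, hi) corners of the grid in metres.
    Returns a uint8 grid where 255 marks unobserved voxels (assumption).
    """
    lo, hi = np.array(bounds[0], float), np.array(bounds[1], float)
    dims = np.round((hi - lo) / voxel_size).astype(int)
    idx = np.floor((points - lo) / voxel_size).astype(int)
    keep = np.all((idx >= 0) & (idx < dims), axis=1)   # drop out-of-range points
    grid = np.full(dims, 255, dtype=np.uint8)
    i = idx[keep]
    grid[i[:, 0], i[:, 1], i[:, 2]] = labels[keep]
    return grid
```

To densify with temporal information, the point clouds from earlier samples would be transformed into the current ego frame and concatenated with the current one before calling `voxelize`.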
Institutions
- University of Limerick
Categories
Funders
- Taighde Éireann – Research Ireland, Grant ID: 18/CRT/6049