TL;DR: 3D scene reconstruction performed at the level of 2.5D image regions rather than individual pixels.
Joint camera pose and dense geometry estimation from a set of images or a monocular video remains a challenging problem due to its computational complexity and inherent visual ambiguities. Most dense incremental reconstruction systems operate directly on image pixels and solve for their 3D positions using multi-view geometry cues.
Such pixel-level approaches suffer from ambiguities or violations of multi-view consistency, which can be caused by textureless or specular surfaces.
We address this issue with a new image representation which we call a SuperPrimitive. SuperPrimitives are obtained by splitting images into semantically correlated local regions and enhancing them with estimated surface normal directions, both of which are predicted by state-of-the-art single image neural networks. This provides a local geometry estimate per SuperPrimitive, while their relative positions are adjusted based on multi-view observations.
We demonstrate the versatility of our new representation by addressing three 3D reconstruction tasks: depth completion, few-view structure from motion, and monocular dense visual odometry.
We extract SuperPrimitives from an image by dividing it into a set of image regions, with surface normal directions estimated for each pixel within each segment. The highlighted SuperPrimitives extracted from the image are visualised by showing their estimated normal and colour maps side by side. Note that some of them are scaled up or down for better viewing. While some SuperPrimitives are akin to object-level segments, others tend to represent lower-level image regions.
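To make the representation concrete, below is a minimal sketch of how an image could be grouped into SuperPrimitives once a segmentation mask and a per-pixel normal map have been predicted by off-the-shelf single-image networks. The data layout, field names, and helper function here are our illustrative assumptions, not the exact implementation.

from dataclasses import dataclass
import numpy as np

@dataclass
class SuperPrimitive:
    """One image region carrying a local per-pixel geometry estimate."""
    pixel_mask: np.ndarray   # (H, W) boolean membership mask
    normals: np.ndarray      # (N, 3) unit normals, one per member pixel
    colours: np.ndarray      # (N, 3) RGB values, used for photometric terms
    log_scale: float = 0.0   # free scalar; adjusted from multi-view cues

def extract_superprimitives(image: np.ndarray,
                            segment_ids: np.ndarray,
                            normal_map: np.ndarray) -> list[SuperPrimitive]:
    """Group pixels by segment id and attach their predicted normals.

    image:       (H, W, 3) RGB input
    segment_ids: (H, W) integer labels from a single-image segmentation net
    normal_map:  (H, W, 3) per-pixel normals from a single-image normal net
    """
    primitives = []
    for seg_id in np.unique(segment_ids):
        mask = segment_ids == seg_id
        primitives.append(SuperPrimitive(
            pixel_mask=mask,
            normals=normal_map[mask],
            colours=image[mask],
        ))
    return primitives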
We visualise the output point cloud from our depth completion method. Sparse input depth points are shown in red.
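For intuition, here is a minimal sketch of the per-primitive scale fitting that depth completion reduces to under our representation. It assumes each SuperPrimitive already carries an up-to-scale depth map (e.g. from its predicted normals); the function name and the NaN-based handling of sparse measurements are illustrative assumptions, not the paper's exact code.

import numpy as np

def align_scales(relative_depth: np.ndarray,
                 segment_ids: np.ndarray,
                 sparse_depth: np.ndarray) -> np.ndarray:
    """Fit one multiplicative scale per segment to the sparse depth points.

    relative_depth: (H, W) up-to-scale depth per pixel
    segment_ids:    (H, W) integer SuperPrimitive labels
    sparse_depth:   (H, W) metric depth, NaN where no measurement exists
    Returns a dense (H, W) completed depth map.
    """
    completed = np.full_like(relative_depth, np.nan)
    for seg_id in np.unique(segment_ids):
        mask = segment_ids == seg_id
        obs = mask & np.isfinite(sparse_depth)
        if not obs.any():
            continue  # no sparse measurement lands in this primitive
        # Closed-form least-squares scale: argmin_s ||s * d_rel - d_obs||^2
        d_rel, d_obs = relative_depth[obs], sparse_depth[obs]
        scale = (d_rel * d_obs).sum() / (d_rel * d_rel).sum()
        completed[mask] = scale * relative_depth[mask]
    return completed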
Our method estimates the dense geometry of a target frame given two unposed supplementary frames. Here, every SuperPrimitive's scale is initialised randomly, and all poses are initialised at identity (blue). We visualise our SuperPrimitive-based structure-from-motion optimisation process below.
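The quantity this optimisation drives down is a photometric consistency error between views. The sketch below shows one way such a residual could be evaluated for the current scales and poses, using a pinhole model, greyscale images, and nearest-neighbour sampling; these choices, and the function signature, are simplifying assumptions for illustration rather than the paper's optimiser.

import numpy as np

def photometric_residual(depth_target: np.ndarray,
                         image_target: np.ndarray,
                         image_src: np.ndarray,
                         K: np.ndarray, R: np.ndarray, t: np.ndarray):
    """Warp target pixels into a source view and compare intensities.

    depth_target: (H, W) target-frame depth, given the current
                  per-SuperPrimitive scales
    image_*:      (H, W) greyscale intensities
    K:            (3, 3) pinhole intrinsics shared by both views
    R, t:         current relative pose, target -> source
    """
    H, W = depth_target.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Back-project target pixels, move them into the source frame, re-project.
    pts = (np.linalg.inv(K) @ pix.T) * depth_target.reshape(-1)
    proj = K @ (R @ pts + t[:, None])
    us = (proj[0] / proj[2]).round().astype(int)
    vs = (proj[1] / proj[2]).round().astype(int)
    valid = (proj[2] > 0) & (us >= 0) & (us < W) & (vs >= 0) & (vs < H)
    # Residual: intensity difference at corresponding pixels
    # (nearest-neighbour lookup; a real system would interpolate).
    r = np.zeros(H * W)
    r[valid] = (image_src[vs[valid], us[valid]]
                - image_target.reshape(-1)[valid])
    return r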
The current camera pose is shown in green. Keyframe poses are shown in blue. Dense geometry of the latest keyframe is visualised as a point cloud.
@inproceedings{Mazur:etal:CVPR2024,
  title={{SuperPrimitive}: Scene Reconstruction at a Primitive Level},
  author={Kirill Mazur and Gwangbin Bae and Andrew Davison},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024},
}
If you have any questions, please feel free to contact Kirill Mazur.