
Understanding 3D Scene Reconstruction from 2D images with CVPR 2024 Workshop

  • Writer: Dr. Daniel Dimanov
  • Apr 11, 2024
  • 2 min read

Inspired by the start of the annual Image Matching Challenge competition on Kaggle (https://www.kaggle.com/competitions/image-matching-challenge-2024), we review some of the newest state-of-the-art methods used by competitors, how they work and synergise, and the general working principles behind them.



The Image Matching Challenge is an annual competition run by the Czech Technical University in Prague, with some decent prizes and awards for the most dedicated computer vision enthusiasts. The competition is part of the CVPR 2024 Image Matching: Local Features and Beyond workshop. The main task of the challenge is centred around a niche image registration problem, namely Structure from Motion (SfM). This year, the competition incorporates several key restrictions to stimulate creative solutions. The first is that an entry is only valid if it runs without internet access, so all the models, data and auxiliary repositories that anyone plans to use need to be pre-uploaded to Kaggle. The second is that any submission using third-party code under a non-commercial licence is not prize-eligible.


Structure from Motion (SfM) is the process of reconstructing a 3D scene while simultaneously estimating the pose of the camera for each image w.r.t. that scene. This entails recovering the entire rigid structure from an arbitrary number of images taken from various viewpoints. The competition organisers are specifically interested in the camera poses, i.e. the rotation and translation of each camera relative to a determined origin.


Next, we cover the main working principle of the baseline submissions to the competition and the common pipeline currently followed by competitors:

Finding appropriate image pairs

This process can be completed in many different ways, but recently, one of the most popular ways to group similar images (and data in general) has been to compare embedding distances. In this case, DINOv2 offers an off-the-shelf, zero-shot solution for generating strong general-purpose image embeddings without any task-specific fine-tuning. The images are run through it, and pairs whose embedding distance falls below a certain threshold are kept.
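
As a rough illustration, here is a minimal sketch of this pair-selection step in Python. It assumes the DINOv2 backbone is loaded via torch.hub (the entry point published in the DINOv2 repository); the image file names and the distance threshold are purely illustrative and would need tuning per scene, and for the competition the weights would have to be pre-uploaded since internet access is disabled.

import torch
import torchvision.transforms as T
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Small ViT-S/14 DINOv2 backbone; its forward pass returns a global image embedding.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").to(device).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        return torch.nn.functional.normalize(model(img), dim=-1).squeeze(0)

paths = ["img_0001.jpg", "img_0002.jpg", "img_0003.jpg"]   # hypothetical file names
embeddings = torch.stack([embed(p) for p in paths])
distances = 1.0 - embeddings @ embeddings.T                # pairwise cosine distance
pairs = [(i, j) for i in range(len(paths)) for j in range(i + 1, len(paths))
         if distances[i, j] < 0.4]                         # illustrative threshold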

Computing keypoints

As with the last step, there are a plethora of solutions for this, but one common approach is to detect keypoints and compute local feature descriptors. SIFT is the traditional method of achieving this, of course, together with SURF, ORB, BRIEF, BRISK and many more, but some of these methods carry patent or licensing restrictions and are thus unfeasible to use in this competition. A more modern and arguably better approach is ALIKED features (Link to paper). ALIKED uses an extraction network with the following architecture:


ALIKED architecture (figure taken from the paper).

To address common issues with neural network descriptors (especially their limited geometric invariance), the authors design a novel Sparse Deformable Descriptor Head that learns deformable positions for each keypoint and constructs the corresponding deformable descriptors. The extraction of these keypoints is crucial, as the next steps use them to match keypoints between images and estimate the required transformations.
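
As a sketch of what this step looks like in practice, the snippet below extracts ALIKED keypoints and descriptors using the interface of the open-source LightGlue repository, which bundles an ALIKED extractor; the image path and the keypoint budget of 2048 are illustrative.

import torch
from lightglue import ALIKED
from lightglue.utils import load_image

device = "cuda" if torch.cuda.is_available() else "cpu"
# ALIKED detector/descriptor with an illustrative keypoint budget.
extractor = ALIKED(max_num_keypoints=2048).eval().to(device)

image0 = load_image("scene/image_0001.jpg").to(device)   # hypothetical path
feats0 = extractor.extract(image0)
# feats0 is a dict containing, among other entries, 'keypoints' and 'descriptors' tensors.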


Match keypoints and compute distances

Given the keypoints extracted from each image of a pair in the last step, they can be matched between the images, and the distances between them can then be estimated. As with the other steps, there are multiple ways of completing the matching, and arguably one of the best is SuperGlue (Link to paper). SuperGlue, however, is covered by a restrictive licence agreement, so it cannot be used. Instead, the open-sourced LightGlue (Link to paper) provides similar capabilities without the licensing issue, and according to the paper, it is much more efficient at utilising computing power, which is critical as Kaggle has limited resources available for evaluating the solutions.


The architecture of LightGlue is as follows:

LightGlue architecture (figure taken from the paper): "Given a pair of input local features (d, p), each layer augments the visual descriptors with context based on self- and cross-attention units with positional encoding. A confidence classifier c helps decide whether to stop the inference. If few points are confident, the inference proceeds to the next layer, but we prune points that are confidently unmatchable. Once a confident state is reached, LightGlue predicts an assignment between points based on their pairwise similarity and unary matchability."

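Continuing the extraction sketch above, matching the ALIKED features of an image pair with LightGlue looks roughly as follows, again using the interface of the public LightGlue repository; variable names (device, extractor, feats0) carry over from the previous snippet and the second image path is hypothetical.

from lightglue import LightGlue
from lightglue.utils import load_image, rbd

# Matcher configured for ALIKED descriptors.
matcher = LightGlue(features="aliked").eval().to(device)

image1 = load_image("scene/image_0002.jpg").to(device)   # hypothetical second image
feats1 = extractor.extract(image1)

matches01 = matcher({"image0": feats0, "image1": feats1})
# Remove the batch dimension to simplify indexing.
feats0, feats1, matches01 = [rbd(x) for x in (feats0, feats1, matches01)]

matches = matches01["matches"]                  # (K, 2) indices into the two keypoint sets
points0 = feats0["keypoints"][matches[:, 0]]    # matched keypoint coordinates in image 0
points1 = feats1["keypoints"][matches[:, 1]]    # matched keypoint coordinates in image 1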


Refine Matches using RANSAC

The keypoint correspondences are computed purely from the acquired descriptors, so some noisy matches inevitably remain and need to be filtered out. In order to do that efficiently, let's first present the matching in epipolar geometry terms. The fundamental matrix F is a 3x3 matrix that relates corresponding points in a stereo image pair: if x and x' are corresponding points in the two images, then Fx describes the epipolar line in the second image on which x' must lie, hence x'ᵀFx = 0.


In the context of this competition, suppose we have discovered x and x' and have matched them using the descriptors. Then, we can use this epipolar constraint and the eight-point algorithm to recover the fundamental matrix.


A popular approach to filtering outlier keypoints with this algorithm is RANSAC, which repeatedly samples a minimal set of correspondences, estimates a candidate fundamental matrix from them, and keeps the model for which the largest number of x' points lie on (or close to) their epipolar lines Fx; those matches are the inliers.

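A minimal sketch of this filtering step with OpenCV, assuming points0 and points1 are the matched keypoint tensors from the matching snippet above (the pixel threshold and confidence values are illustrative):

import cv2
import numpy as np

pts0 = points0.cpu().numpy()    # matched coordinates, shape (N, 2)
pts1 = points1.cpu().numpy()

# Robustly estimate the fundamental matrix with RANSAC; the mask flags inlier matches.
F, inlier_mask = cv2.findFundamentalMat(pts0, pts1, cv2.FM_RANSAC, 1.0, 0.999)

inliers0 = pts0[inlier_mask.ravel() == 1]
inliers1 = pts1[inlier_mask.ravel() == 1]
# Only the inlier correspondences are passed on to the reconstruction stage.
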
3D Scene Reconstruction and camera positions

After calculating the fundamental matrix, which is used to find correspondences between pairs of images and to estimate the camera's motion and position, methods like COLMAP transition into the core of the 3D reconstruction process; this phase involves building a coherent 3D model from the keypoints matched across image pairs. Here's a simplified overview of how it happens:

Triangulation

Once the camera poses (positions and orientations) are estimated, COLMAP uses triangulation to generate 3D points from the key points. Triangulation works by finding the point in 3D space corresponding to the intersection of two rays originating from two camera centres and passing through the matched feature points in each image. This process is repeated for many matched key points across multiple images, accumulating a cloud of 3D points that represent the scene structure.


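As an illustration, OpenCV's triangulatePoints implements exactly this ray-intersection step. The intrinsics K and the poses (R0, t0) and (R1, t1) below are placeholders that would come from the pose-estimation stage; inliers0 and inliers1 carry over from the RANSAC snippet.

import cv2
import numpy as np

# Build the 3x4 projection matrices of the two cameras from intrinsics and poses.
P0 = K @ np.hstack([R0, t0.reshape(3, 1)])
P1 = K @ np.hstack([R1, t1.reshape(3, 1)])

# OpenCV expects 2xN point arrays and returns 4xN homogeneous coordinates.
pts4d = cv2.triangulatePoints(P0, P1, inliers0.T, inliers1.T)
pts3d = (pts4d[:3] / pts4d[3]).T    # dehomogenise to an (N, 3) point cloud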

Bundle Adjustment

With an initial set of 3D points and camera poses, COLMAP performs bundle adjustments to refine these estimates. Bundle adjustment is a process of simultaneously refining the 3D coordinates of the points, along with the camera parameters (positions, orientations, and intrinsic parameters), to minimize the reprojection error—the difference between the observed key point positions in the images and the predicted positions based on the 3D point estimates and camera parameters. This optimization is critical for improving the accuracy of the 3D model.

Factors of optimisation during bundle adjustment (figure taken from the Colmap-PCD paper)

Link to Colmap-PCD paper

Bundle adjustment visualisation (Taken from COLMAP-PCD paper https://arxiv.org/pdf/2310.05504.pdf)
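
To make the objective concrete, here is a heavily simplified sketch of bundle adjustment as a non-linear least-squares problem using SciPy. The parameter vector x0, the index arrays and the intrinsics K are hypothetical placeholders, and a real implementation (such as COLMAP's Ceres-based optimiser) would exploit the sparsity of the Jacobian rather than loop in Python.

import numpy as np
import cv2
from scipy.optimize import least_squares

def reprojection_residuals(params, n_cams, n_pts, cam_idx, pt_idx, observed_2d, K):
    # params packs 6 values per camera (Rodrigues rotation + translation), then the 3D points.
    cams = params[: n_cams * 6].reshape(n_cams, 6)
    pts3d = params[n_cams * 6:].reshape(n_pts, 3)
    residuals = []
    for c, p, obs in zip(cam_idx, pt_idx, observed_2d):
        rvec, tvec = cams[c, :3], cams[c, 3:]
        proj, _ = cv2.projectPoints(pts3d[p].reshape(1, 1, 3), rvec, tvec, K, None)
        residuals.append(proj.ravel() - obs)   # predicted minus observed image position
    return np.concatenate(residuals)

# x0 stacks the initial camera parameters and the triangulated 3D points.
result = least_squares(reprojection_residuals, x0,
                       args=(n_cams, n_pts, cam_idx, pt_idx, observed_2d, K))
refined = result.x    # jointly refined cameras and scene structure
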
Incremental or Global Reconstruction

COLMAP can perform either incremental or global reconstruction. In incremental reconstruction, the software starts with a small set of images, gradually adding one image at a time and integrating it into the 3D model by matching, triangulating new points, and performing bundle adjustment. This robust approach allows for detailed error management but can be computationally intensive. In contrast, global reconstruction approaches solve for all camera poses simultaneously using a global optimization framework, which can be faster but might be less robust to outliers.
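
In practice, competitors typically drive this stage through pycolmap, COLMAP's Python bindings. A rough sketch of incremental mapping, with placeholder paths and assuming the matches have already been written into the COLMAP database, looks like this (the calls follow the pycolmap project's documented interface):

import pycolmap

database_path = "scene/database.db"   # placeholder paths
image_dir = "scene/images"
output_path = "scene/sparse"

# COLMAP's built-in SIFT pipeline would populate the database with these two calls;
# competition pipelines usually skip them and import ALIKED/LightGlue matches instead.
pycolmap.extract_features(database_path, image_dir)
pycolmap.match_exhaustive(database_path)

# Incremental mapping registers images one by one, triangulates new points and runs
# bundle adjustment, returning one reconstruction per connected component of the scene.
maps = pycolmap.incremental_mapping(database_path, image_dir, output_path)
maps[0].write(output_path)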



Dense Reconstruction (optional but good for validation)

For applications requiring more detailed models, COLMAP can generate a dense point cloud from the sparse 3D model. It does this by using multi-view stereo algorithms that estimate the depth of each pixel in the images, producing a dense cloud of 3D points that captures finer details of the scene than the sparse point cloud.
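
A hedged sketch of this dense stage with pycolmap, continuing from the sparse reconstruction above (paths are placeholders, and patch-match stereo requires a CUDA-enabled build):

import pycolmap

mvs_path = "scene/dense"   # placeholder output directory for the MVS workspace

# Undistort the images into the MVS workspace, estimate per-pixel depth maps with
# patch-match stereo, then fuse them into a dense point cloud.
pycolmap.undistort_images(mvs_path, "scene/sparse", "scene/images")
pycolmap.patch_match_stereo(mvs_path)
pycolmap.stereo_fusion(mvs_path + "/dense.ply", mvs_path)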

Mesh Generation and Texturing (optional)

As a final step, a surface mesh can be generated from the dense point cloud, providing a solid 3D model of the scene. This mesh can be textured using the original images, resulting in a photorealistic 3D model.

This process, starting from the calculation of the fundamental matrix to the final reconstruction, showcases the complexity and sophistication of the algorithms underlying COLMAP. It balances the need for accuracy in the 3D model with the computational challenges of processing potentially large sets of images and data points.

We wish you good luck and happy coding if you are participating in the competition; if not, we hope you've learnt something new today. If you are interested in a solution that requires a similar process or any of the individual steps, please don't hesitate to contact us.


 
 
 
