
Understanding 3D Scene Reconstruction from 2D images with CVPR 2024 Workshop

  • Writer: Dr. Daniel Dimanov
  • Apr 11, 2024
  • 2 min read

Inspired by the start of the annual Image Matching Challenge competition on Kaggle (https://www.kaggle.com/competitions/image-matching-challenge-2024), we review some of the newest state-of-the-art methods used by competitors, how they work and synergise, and the general working principles behind them.



The Image Matching Challenge is an annual competition run by the Czech Technical University in Prague, with some decent prizes and awards for the most dedicated computer vision enthusiasts. The competition is part of the CVPR 2024 Image Matching: Local Features and Beyond workshop. The main task of the challenge is centred around a niche image registration problem, namely Structure from Motion (SfM). This year, the competition incorporates several key restrictions to stimulate creative solutions. The first is that an entry is only valid if it runs without internet access, so all the models, data and auxiliary repositories that anyone plans to use need to be pre-uploaded to Kaggle. The second is that any submission using third-party code under a non-commercial licence is not prize-eligible.


Structure from Motion (SfM) is the process of reconstructing a 3D scene while simultaneously estimating the pose of the camera for each image w.r.t. that scene. This entails recovering the entire rigid structure from an arbitrary number of images taken from various viewpoints. The competition organisers are specifically interested in the camera poses, i.e. the rotation and translation of each camera relative to a determined origin.


Next, we cover the main working principle of the baseline submissions to the competition and the common pipeline currently followed by competitors:

Finding appropriate image pairs

This process can be completed in many different ways, but recently, one of the most popular ways to group similar images (and data in general) has been to compare embedding distances. In this case, DINOv2 offers an off-the-shelf, zero-shot solution for generating strong general-purpose image embeddings without any task-specific fine-tuning. The images are run through it, and pairs whose embedding distance falls below a certain threshold are kept.
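
As a rough illustration, here is a minimal sketch of this pair-selection step in Python. It assumes the DINOv2 backbone is loaded via torch.hub (the entry point published in the DINOv2 repository); the image file names and the distance threshold are purely illustrative and would need tuning per scene, and for the competition the weights would have to be pre-uploaded since internet access is disabled.

import torch
import torchvision.transforms as T
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Small ViT-S/14 DINOv2 backbone; its forward pass returns a global image embedding.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").to(device).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        return torch.nn.functional.normalize(model(img), dim=-1).squeeze(0)

paths = ["img_0001.jpg", "img_0002.jpg", "img_0003.jpg"]   # hypothetical file names
embeddings = torch.stack([embed(p) for p in paths])
distances = 1.0 - embeddings @ embeddings.T                # pairwise cosine distance
pairs = [(i, j) for i in range(len(paths)) for j in range(i + 1, len(paths))
         if distances[i, j] < 0.4]                         # illustrative threshold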

Computing keypoints

As with the last step, there are a plethora of solutions for this, but one common approach is to detect keypoints and compute local feature descriptors. SIFT is the traditional method of achieving this, of course, together with SURF, ORB, BRIEF, BRISK and many more, but some of these methods carry patent or licensing restrictions and are thus unfeasible to use in this competition. A more modern and arguably better approach is ALIKED features (Link to paper). ALIKED uses an extraction network with the following architecture:


ALIKED architecture (figure taken from the paper).

To address common issues with neural network descriptors (especially their limited geometric invariance), the authors design a novel Sparse Deformable Descriptor Head that learns deformable positions for each keypoint and constructs the corresponding deformable descriptors. The extraction of these keypoints is crucial, as the next steps use them to match keypoints between images and estimate the required transformations.
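
As a sketch of what this step looks like in practice, the snippet below extracts ALIKED keypoints and descriptors using the interface of the open-source LightGlue repository, which bundles an ALIKED extractor; the image path and the keypoint budget of 2048 are illustrative.

import torch
from lightglue import ALIKED
from lightglue.utils import load_image

device = "cuda" if torch.cuda.is_available() else "cpu"
# ALIKED detector/descriptor with an illustrative keypoint budget.
extractor = ALIKED(max_num_keypoints=2048).eval().to(device)

image0 = load_image("scene/image_0001.jpg").to(device)   # hypothetical path
feats0 = extractor.extract(image0)
# feats0 is a dict containing, among other entries, 'keypoints' and 'descriptors' tensors.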


Match keypoints and compute distances

Given the keypoints extracted from each image of a pair in the last step, they can be matched between the images, and the distances between them can then be estimated. As with the other steps, there are multiple ways of completing the matching, and arguably one of the best is SuperGlue (Link to paper). SuperGlue, however, is covered by a restrictive licence agreement, so it cannot be used. Instead, the open-sourced LightGlue (Link to paper) provides similar capabilities without the licensing issue, and according to the paper, it is much more efficient at utilising computing power, which is critical as Kaggle has limited resources available for evaluating the solutions.


The architecture of LightGlue is as follows:

LightGlue architecture (figure taken from the paper): "Given a pair of input local features (d, p), each layer augments the visual descriptors with context based on self- and cross-attention units with positional encoding. A confidence classifier c helps decide whether to stop the inference. If few points are confident, the inference proceeds to the next layer, but we prune points that are confidently unmatchable. Once a confident state is reached, LightGlue predicts an assignment between points based on their pairwise similarity and unary matchability."

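Continuing the extraction sketch above, matching the ALIKED features of an image pair with LightGlue looks roughly as follows, again using the interface of the public LightGlue repository; variable names (device, extractor, feats0) carry over from the previous snippet and the second image path is hypothetical.

from lightglue import LightGlue
from lightglue.utils import load_image, rbd

# Matcher configured for ALIKED descriptors.
matcher = LightGlue(features="aliked").eval().to(device)

image1 = load_image("scene/image_0002.jpg").to(device)   # hypothetical second image
feats1 = extractor.extract(image1)

matches01 = matcher({"image0": feats0, "image1": feats1})
# Remove the batch dimension to simplify indexing.
feats0, feats1, matches01 = [rbd(x) for x in (feats0, feats1, matches01)]

matches = matches01["matches"]                  # (K, 2) indices into the two keypoint sets
points0 = feats0["keypoints"][matches[:, 0]]    # matched keypoint coordinates in image 0
points1 = feats1["keypoints"][matches[:, 1]]    # matched keypoint coordinates in image 1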


Refine Matches using RANSAC

The keypoint correspondences are computed purely from the acquired descriptors, so some noisy matches inevitably remain and need to be filtered out. In order to do that efficiently, let's first present the matching in epipolar geometry terms. The fundamental matrix F is a 3x3 matrix that relates corresponding points in a stereo image pair: if x and x' are corresponding points in the two images, then Fx describes the epipolar line in the second image on which x' must lie, hence x'ᵀFx = 0.


In the context of this competition, suppose we have discovered x and x' and have matched them using the descriptors. Then, we can use this epipolar constraint and the eight-point algorithm to recover the fundamental matrix.


A popular approach to filtering outlier keypoints with this algorithm is RANSAC, which repeatedly samples a minimal set of correspondences, estimates a candidate fundamental matrix from them, and keeps the model for which the largest number of x' points lie on (or close to) their epipolar lines Fx; those matches are the inliers.

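A minimal sketch of this filtering step with OpenCV, assuming points0 and points1 are the matched keypoint tensors from the matching snippet above (the pixel threshold and confidence values are illustrative):

import cv2
import numpy as np

pts0 = points0.cpu().numpy()    # matched coordinates, shape (N, 2)
pts1 = points1.cpu().numpy()

# Robustly estimate the fundamental matrix with RANSAC; the mask flags inlier matches.
F, inlier_mask = cv2.findFundamentalMat(pts0, pts1, cv2.FM_RANSAC, 1.0, 0.999)

inliers0 = pts0[inlier_mask.ravel() == 1]
inliers1 = pts1[inlier_mask.ravel() == 1]
# Only the inlier correspondences are passed on to the reconstruction stage.
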
3D Scene Reconstruction and camera positions

After calculating the fundamental matrix, which is used to find correspondences between pairs of images and to estimate the camera's motion and position, methods like COLMAP transition into the core of the 3D reconstruction process; this phase involves building a coherent 3D model from the keypoints matched across image pairs. Here's a simplified overview of how it happens:

Triangulation

Once the camera poses (positions and orientations) are estimated, COLMAP uses triangulation to generate 3D points from the key points. Triangulation works by finding the point in 3D space corresponding to the intersection of two rays originating from two camera centres and passing through the matched feature points in each image. This process is repeated for many matched key points across multiple images, accumulating a cloud of 3D points that represent the scene structure.


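As an illustration, OpenCV's triangulatePoints implements exactly this ray-intersection step. The intrinsics K and the poses (R0, t0) and (R1, t1) below are placeholders that would come from the pose-estimation stage; inliers0 and inliers1 carry over from the RANSAC snippet.

import cv2
import numpy as np

# Build the 3x4 projection matrices of the two cameras from intrinsics and poses.
P0 = K @ np.hstack([R0, t0.reshape(3, 1)])
P1 = K @ np.hstack([R1, t1.reshape(3, 1)])

# OpenCV expects 2xN point arrays and returns 4xN homogeneous coordinates.
pts4d = cv2.triangulatePoints(P0, P1, inliers0.T, inliers1.T)
pts3d = (pts4d[:3] / pts4d[3]).T    # dehomogenise to an (N, 3) point cloud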

Bundle Adjustment

With an initial set of 3D points and camera poses, COLMAP performs bundle adjustments to refine these estimates. Bundle adjustment is a process of simultaneously refining the 3D coordinates of the points, along with the camera parameters (positions, orientations, and intrinsic parameters), to minimize the reprojection error—the difference between the observed key point positions in the images and the predicted positions based on the 3D point estimates and camera parameters. This optimization is critical for improving the accuracy of the 3D model.

Factors of optimisation during bundle adjustment (figure taken from the Colmap-PCD paper)

Link to Colmap-PCD paper

Bundle adjustment visualisation (Taken from COLMAP-PCD paper https://arxiv.org/pdf/2310.05504.pdf)
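
To make the objective concrete, here is a heavily simplified sketch of bundle adjustment as a non-linear least-squares problem using SciPy. The parameter vector x0, the index arrays and the intrinsics K are hypothetical placeholders, and a real implementation (such as COLMAP's Ceres-based optimiser) would exploit the sparsity of the Jacobian rather than loop in Python.

import numpy as np
import cv2
from scipy.optimize import least_squares

def reprojection_residuals(params, n_cams, n_pts, cam_idx, pt_idx, observed_2d, K):
    # params packs 6 values per camera (Rodrigues rotation + translation), then the 3D points.
    cams = params[: n_cams * 6].reshape(n_cams, 6)
    pts3d = params[n_cams * 6:].reshape(n_pts, 3)
    residuals = []
    for c, p, obs in zip(cam_idx, pt_idx, observed_2d):
        rvec, tvec = cams[c, :3], cams[c, 3:]
        proj, _ = cv2.projectPoints(pts3d[p].reshape(1, 1, 3), rvec, tvec, K, None)
        residuals.append(proj.ravel() - obs)   # predicted minus observed image position
    return np.concatenate(residuals)

# x0 stacks the initial camera parameters and the triangulated 3D points.
result = least_squares(reprojection_residuals, x0,
                       args=(n_cams, n_pts, cam_idx, pt_idx, observed_2d, K))
refined = result.x    # jointly refined cameras and scene structure
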
Incremental or Global Reconstruction

COLMAP can perform either incremental or global reconstruction. In incremental reconstruction, the software starts with a small set of images, gradually adding one image at a time and integrating it into the 3D model by matching, triangulating new points, and performing bundle adjustment. This robust approach allows for detailed error management but can be computationally intensive. In contrast, global reconstruction approaches solve for all camera poses simultaneously using a global optimization framework, which can be faster but might be less robust to outliers.
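
In practice, competitors typically drive this stage through pycolmap, COLMAP's Python bindings. A rough sketch of incremental mapping, with placeholder paths and assuming the matches have already been written into the COLMAP database, looks like this (the calls follow the pycolmap project's documented interface):

import pycolmap

database_path = "scene/database.db"   # placeholder paths
image_dir = "scene/images"
output_path = "scene/sparse"

# COLMAP's built-in SIFT pipeline would populate the database with these two calls;
# competition pipelines usually skip them and import ALIKED/LightGlue matches instead.
pycolmap.extract_features(database_path, image_dir)
pycolmap.match_exhaustive(database_path)

# Incremental mapping registers images one by one, triangulates new points and runs
# bundle adjustment, returning one reconstruction per connected component of the scene.
maps = pycolmap.incremental_mapping(database_path, image_dir, output_path)
maps[0].write(output_path)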



Dense Reconstruction (optional but good for validation)

For applications requiring more detailed models, COLMAP can generate a dense point cloud from the sparse 3D model. It does this by using multi-view stereo algorithms that estimate the depth of each pixel in the images, producing a dense cloud of 3D points that captures finer details of the scene than the sparse point cloud.
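
A hedged sketch of this dense stage with pycolmap, continuing from the sparse reconstruction above (paths are placeholders, and patch-match stereo requires a CUDA-enabled build):

import pycolmap

mvs_path = "scene/dense"   # placeholder output directory for the MVS workspace

# Undistort the images into the MVS workspace, estimate per-pixel depth maps with
# patch-match stereo, then fuse them into a dense point cloud.
pycolmap.undistort_images(mvs_path, "scene/sparse", "scene/images")
pycolmap.patch_match_stereo(mvs_path)
pycolmap.stereo_fusion(mvs_path + "/dense.ply", mvs_path)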

Mesh Generation and Texturing (optional)

As a final step, a surface mesh can be generated from the dense point cloud, providing a solid 3D model of the scene. This mesh can be textured using the original images, resulting in a photorealistic 3D model.

This process, starting from the calculation of the fundamental matrix to the final reconstruction, showcases the complexity and sophistication of the algorithms underlying COLMAP. It balances the need for accuracy in the 3D model with the computational challenges of processing potentially large sets of images and data points.

We wish you good luck and happy coding if you are participating in the competition; if not, we hope you've learnt something new today. If you are interested in a solution that requires a similar process or any of the individual steps, please don't hesitate to contact us.


 
 
 
