US-20260127750-A1 - STRUCTURE FROM MOTION ENHANCEMENTS USING GENERALIZED CAMERA MODEL AND MOTION PARAMETRIZATION

Abstract

This disclosure provides systems, methods, and devices that utilize machine learning models to determine corresponding spatial positions and motion trajectories for images. In one aspect, a method is provided that includes receiving an image of a scene captured by a camera; determining, with a first machine learning model, a plurality of position values relative to the camera for at least a subset of pixels within the image; and training a second machine learning model based on the determined position values. The method further includes determining basis trajectories based on movement of the positions relative to previous image frames, and determining, for each of a subset of pixels, a movement trajectory relative to the previous frames as a weighted combination of the basis trajectories. These techniques can be employed as part of a structure-from-motion pipeline to estimate multi-view three-dimensional geometry, camera poses, and per-pixel object motion. Additional aspects are provided.

Inventors

Andrea Porfiri Dal Cin
Georgi Dikov
Jihong JU
Mohsen Ghafoorian

Assignees

QUALCOMM INCORPORATED

Dates

Publication Date: 20260507
Application Date: 20250730

Claims (20)

1. A system comprising: a processor; and a memory storing instructions which, when executed by the processor, cause the processor to perform operations including: receiving a first image of a scene captured by a camera; determining positions for pixels of the first image relative to the camera; determining basis trajectories based on movement of the positions relative to at least one previous image frame; and determining, for each of at least a subset of the pixels, a movement trajectory relative to the at least one previous image frame as a weighted combination of the basis trajectories.
2. The system of claim 1, wherein the basis trajectories indicate rigid body motion.
3. The system of claim 2, wherein the basis trajectories include three rotation angles and a three-dimensional translation vector.
4. The system of claim 1, wherein the movement trajectory is determined as a weighted linear combination of the basis trajectories.
5. The system of claim 1, wherein the movement trajectories are determined by a third machine learning model that is trained to determine the weights based on the positions for pixels of the first image and previous positions for pixels of the at least one previous image frame.
6. The system of claim 5, wherein the third machine learning model is a multi-layer perceptron (MLP) model.
7. A method comprising: receiving a first image of a scene captured by a camera; determining positions for pixels of the first image relative to the camera; determining basis trajectories based on movement of the positions relative to at least one previous image frame; and determining, for each of at least a subset of the pixels, a movement trajectory relative to the at least one previous image frame as a weighted combination of the basis trajectories.
8. The method of claim 7, wherein the basis trajectories indicate rigid body motion.
9. The method of claim 8, wherein the basis trajectories include three rotation angles and a three-dimensional translation vector.
10. The method of claim 7, wherein the movement trajectory is determined as a weighted linear combination of the basis trajectories.
11. The method of claim 7, wherein the movement trajectories are determined by a third machine learning model that is trained to determine the weights based on the positions for pixels of the first image and previous positions for pixels of the at least one previous image frame.
12. The method of claim 11, wherein the third machine learning model is a multi-layer perceptron (MLP) model.
13. A method comprising: receiving an image of a scene captured by a camera; determining, with a first machine learning model, a plurality of position values relative to the camera for at least a first subset of pixels within the image; and training a second machine learning model based on the determined position values.
14. The method of claim 13, wherein the first machine learning model is trained to implement a first projection function that maps image coordinates to corresponding position values.
15. The method of claim 14, wherein the first machine learning model is an invertible multi-layer perceptron (MLP) model.
16. The method of claim 13, wherein determining, with the first machine learning model, the plurality of position values comprises, for each respective pixel of at least the first subset of the pixels: determining, with the first machine learning model, a first set of position values for a second subset of the pixels; determining a first position value based on the first set of position values and the respective pixel; and determining a respective position for the respective pixel based on the first position value.
17. The method of claim 16, wherein determining the first set of position values comprises querying the first machine learning model for the second subset of the pixels and storing the received values in a lookup table.
18. The method of claim 16, wherein the second subset of the pixels has fewer pixels than the first subset of the pixels.
19. The method of claim 16, wherein determining the first position value comprises: determining a distance measure between the respective pixel and the second subset of pixels; determining two or more pixels from the second subset of pixels with the smallest distance measure; and determining the first position value by interpolating between corresponding position values for the two or more pixels.
20. The method of claim 16, wherein determining the respective position for the respective pixel comprises: determining, with the first machine learning model, a second set of position values for a third subset of the pixels, wherein the third subset of the pixels are located near the respective pixel; determining a second position value based on the second set of position values and the respective pixel; and determining the respective position for the respective pixel based on the second position value.
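
Illustrative sketch (not part of the claims). The lookup-table approach recited in claims 16 to 19 can be illustrated with a short, non-limiting Python example. It rests on several assumptions that do not come from the disclosure: `toy_projection` is a hypothetical stand-in for the first machine learning model, the sparse second subset is taken to be a regular grid with a hypothetical `stride`, the distance measure is Euclidean pixel distance, and the interpolation is inverse-distance weighting over the `k` nearest cached pixels. Other distance measures and interpolation schemes would fit the claim language equally well.

```python
import math

def toy_projection(u, v):
    """Hypothetical stand-in for the first machine learning model.

    Maps pixel coordinates to a position value, here a 3D point on the z = 1 plane.
    """
    return (0.01 * u, 0.01 * v, 1.0)

def build_lookup_table(width, height, stride):
    """Query the model on a sparse pixel grid only and cache the results."""
    table = {}
    for v in range(0, height, stride):
        for u in range(0, width, stride):
            table[(u, v)] = toy_projection(u, v)
    return table

def interpolate_position(pixel, table, k=4):
    """Estimate a position value from the k nearest cached pixels.

    Uses inverse-distance weights (an assumed interpolation scheme).
    """
    u, v = pixel
    nearest = sorted(table, key=lambda p: math.hypot(p[0] - u, p[1] - v))[:k]
    weights = []
    for p in nearest:
        d = math.hypot(p[0] - u, p[1] - v)
        if d == 0.0:
            # The pixel is itself a cached sample, so no interpolation is needed.
            return table[p]
        weights.append(1.0 / d)
    total = sum(weights)
    return tuple(
        sum(w * table[p][i] for w, p in zip(weights, nearest)) / total
        for i in range(3)
    )

# 64 x 48 image, model queried on every 8th pixel in each direction.
table = build_lookup_table(width=64, height=48, stride=8)
estimate = interpolate_position((12, 20), table)
```

In this sketch the model is queried 48 times instead of once per pixel of the 3,072-pixel image, which reflects the relationship in claim 18 between the smaller second subset and the larger first subset. A refinement step in the spirit of claim 20 could re-query the model on a denser neighborhood around the pixel and repeat the interpolation.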

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/717,689, entitled, “STRUCTURE FROM MOTION ENHANCEMENTS USING GENERALIZED CAMERA MODEL AND MOTION PARAMETRIZATION” filed on Nov. 7, 2024, which is expressly incorporated by reference herein in its entirety.

TECHNICAL FIELD

Aspects of the present disclosure relate generally to machine learning techniques, and more particularly, to methods and systems suitable for structure from motion techniques.

INTRODUCTION

Machine learning techniques encompass a diverse array of computational methodologies designed to enable systems to learn from and make predictions or decisions based on data. These techniques typically involve the construction of models, algorithms, or neural network architectures that can infer patterns, trends, or structures within large datasets without explicit programming for each task. Machine learning techniques include supervised learning, where models are trained using labeled datasets; unsupervised learning, which involves the identification of patterns in unlabeled data; semi-supervised learning, which combines both labeled and unlabeled data; and reinforcement learning, where models learn optimal behaviors through trial and error interactions with an environment.

BRIEF SUMMARY OF SOME EXAMPLES

The following summarizes some aspects of the present disclosure to provide a basic understanding of the discussed technology. This summary is not an extensive overview of all contemplated features of the disclosure and is intended neither to identify key or critical elements of all aspects of the disclosure nor to delineate the scope of any or all aspects of the disclosure. Its sole purpose is to present some concepts of one or more aspects of the disclosure in summary form as a prelude to the more detailed description that is presented later.
One embodiment provides a method that includes receiving an image of a scene captured by a camera; determining, with a first machine learning model, a plurality of position values relative to the camera for at least a first subset of pixels within the image; and training a second machine learning model based on the determined position values. Another embodiment provides a system that includes a processor and a memory storing instructions which, when executed by the processor, cause the processor to perform operations including receiving an image of a scene captured by a camera; determining, with a first machine learning model, a plurality of position values relative to the camera for at least a first subset of pixels within the image; and training a second machine learning model based on the determined position values. An additional embodiment provides a non-transitory computer-readable medium storing instructions which, when executed by a processor, cause the processor to perform operations including receiving an image of a scene captured by a camera; determining, with a first machine learning model, a plurality of position values relative to the camera for at least a first subset of pixels within the image; and training a second machine learning model based on the determined position values. A further embodiment provides a method that includes receiving a first image of a scene captured by a camera; determining positions for pixels of the first image relative to the camera; determining basis trajectories based on movement of the positions relative to at least one previous image frame; and determining, for each of at least a subset of the pixels, a movement trajectory relative to the at least one previous image frame as a weighted combination of the basis trajectories. 
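By way of a non-limiting illustration, the weighted combination of basis trajectories described above may be sketched in Python as follows. The sketch assumes, consistent with some aspects, that each basis trajectory stores three rotation angles and a three-dimensional translation vector per frame and that the combination is linear. The names `blend_trajectories`, `apply_rigid`, `slide`, and `spin`, the X-then-Y-then-Z rotation order, and the fixed per-pixel weights are hypothetical choices made only for this example; in some aspects the weights would instead be produced by a trained model.

```python
import math

def blend_trajectories(weights, bases):
    """Return the weighted linear combination of K basis trajectories.

    bases[k][t] is a 6-tuple (rx, ry, rz, tx, ty, tz) for basis k at frame t.
    """
    num_frames = len(bases[0])
    return [
        tuple(
            sum(w * basis[t][i] for w, basis in zip(weights, bases))
            for i in range(6)
        )
        for t in range(num_frames)
    ]

def apply_rigid(motion, point):
    """Rotate a 3D point about X, then Y, then Z (assumed order), then translate."""
    rx, ry, rz, tx, ty, tz = motion
    x, y, z = point
    # Rotation about the X axis.
    y, z = y * math.cos(rx) - z * math.sin(rx), y * math.sin(rx) + z * math.cos(rx)
    # Rotation about the Y axis.
    x, z = x * math.cos(ry) + z * math.sin(ry), -x * math.sin(ry) + z * math.cos(ry)
    # Rotation about the Z axis.
    x, y = x * math.cos(rz) - y * math.sin(rz), x * math.sin(rz) + y * math.cos(rz)
    return (x + tx, y + ty, z + tz)

# Two hypothetical basis trajectories defined over two frames:
# a translation along X and a rotation about Z.
slide = [(0.0, 0.0, 0.0, 1.0, 0.0, 0.0), (0.0, 0.0, 0.0, 2.0, 0.0, 0.0)]
spin = [(0.0, 0.0, math.pi / 2, 0.0, 0.0, 0.0), (0.0, 0.0, math.pi, 0.0, 0.0, 0.0)]

# Hypothetical weights for one pixel: an equal mix of both bases.
pixel_weights = (0.5, 0.5)
trajectory = blend_trajectories(pixel_weights, [slide, spin])

# Position of the pixel's 3D point under the blended motion at each frame.
path = [apply_rigid(m, (1.0, 0.0, 0.0)) for m in trajectory]
```

Because every pixel shares the same small set of basis trajectories and differs only in its weights, the per-pixel motion is described by a handful of coefficients rather than a full trajectory per pixel.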
Another embodiment provides a system that includes a processor and a memory storing instructions which, when executed by the processor, cause the processor to perform operations including: receiving a first image of a scene captured by a camera; determining positions for pixels of the first image relative to the camera; determining basis trajectories based on movement of the positions relative to at least one previous image frame; and determining, for each of at least a subset of the pixels, a movement trajectory relative to the at least one previous image frame as a weighted combination of the basis trajectories. An additional embodiment provides a non-transitory computer-readable medium storing instructions which, when executed by a processor, cause the processor to perform operations that include receiving a first image of a scene captured by a camera; determining positions for pixels of the first image relative to the camera; determining basis trajectories based on movement of the positions relative to at least one previous image frame; and determining, for each of at least a subset of the pixels, a movement trajectory relative to the at least one previous image frame as a weighted combination of the basis trajectories. The foregoing has outlined rather broadly the features and tec