
CN-122029568-A - Machine learning based three-dimensional model generation

CN 122029568 A

Abstract

Systems and techniques for generating a three-dimensional (3D) model are disclosed. For example, a process can include estimating a plurality of features associated with at least a portion of an image of a person having a pose, inverse warping the plurality of features to reference pose features having a reference pose, generating filtered reference pose features by selecting features from the reference pose features based on distances between the selected features and corresponding features of the reference pose, generating modified reference pose features by modifying the filtered reference pose features based on a feature grid associated with a reference model associated with the reference pose, projecting the filtered reference pose features into one or more two-dimensional (2D) planes, identifying first features associated with the person from the one or more 2D planes, and generating a 3D model of the person having the pose using the first features, the modified reference pose features, and pose information.
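
At a high level, the claimed pipeline can be sketched as below. Everything here is illustrative: the function names, array shapes, the single rotation used as a stand-in for the inverse pose warp, and the additive grid fusion are all assumptions, since the abstract does not specify a concrete implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_features(image):
    """Stand-in for a feature-estimation backbone: returns 3D sample
    points on the person's surface and a feature vector per point."""
    pts = rng.uniform(-1.0, 1.0, size=(64, 3))
    feats = rng.normal(size=(64, 8))
    return pts, feats

def inverse_warp(pts, pose_angle):
    """Stand-in inverse warp back to the reference (e.g. T-) pose;
    the patent's warp would undo per-bone pose transforms."""
    c, s = np.cos(-pose_angle), np.sin(-pose_angle)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return pts @ rot_z.T

def filter_by_distance(pts, feats, ref_pts, max_dist=0.5):
    """Keep only features whose point lies within max_dist of some
    reference-pose feature point (the claimed distance-based filter)."""
    d = np.linalg.norm(pts[:, None, :] - ref_pts[None, :, :], axis=-1)
    keep = d.min(axis=1) < max_dist
    return pts[keep], feats[keep]

def modify_with_grid(feats, grid_offset):
    """Stand-in fusion of filtered features with the feature grid
    anchored on the reference model (simple additive blend here)."""
    return feats + grid_offset

def project_to_planes(pts):
    """Project 3D points onto three axis-aligned 2D planes."""
    return pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]

# Wire the steps together in the order recited by the claims.
pose_angle = np.pi / 6
pts, feats = estimate_features(image=None)
ref_pose_pts = inverse_warp(pts, pose_angle)
ref_pts = rng.uniform(-1.0, 1.0, size=(64, 3))
f_pts, f_feats = filter_by_distance(ref_pose_pts, feats, ref_pts)
m_feats = modify_with_grid(f_feats, grid_offset=0.1)
planes = project_to_planes(f_pts)
# m_feats, planes, and pose_angle would feed the 3D-model generator.
```

Projecting the filtered 3D features onto 2D planes before generation is in the spirit of tri-plane feature representations, which make querying 3D space cheap.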

Inventors

  • C. Talegao Enkar
  • LIU PENG
  • L.WANG
  • ZHANG JUNKANG
  • BI NING

Assignees

  • QUALCOMM Incorporated

Dates

Publication Date
2026-05-12
Application Date
2024-09-18
Priority Date
2024-02-16

Claims (20)

  1. An apparatus for generating a variable three-dimensional (3D) model, the apparatus comprising: at least one memory; and at least one processor coupled to the at least one memory and configured to: estimate a plurality of features associated with an image of a person having a pose; inverse warp the plurality of features into reference pose features having a reference pose; generate filtered reference pose features by selecting features from the reference pose features, the selecting based on distances of the selected features from corresponding features of the reference pose; generate modified reference pose features by modifying the filtered reference pose features based on a feature grid associated with a reference model associated with the reference pose; project the filtered reference pose features into one or more two-dimensional (2D) planes; identify first features associated with the person from the one or more 2D planes; and generate a 3D model of the person having the pose using the first features, the modified reference pose features, and pose information corresponding to the pose.
  2. The apparatus of claim 1, wherein the at least one processor is configured to: select one or more features from the filtered reference pose features that correspond to a first reference feature of the reference model; and interpolate the one or more features to represent the first reference feature in the modified reference pose features.
  3. The apparatus of claim 1, wherein the at least one processor is configured to: estimate locations of a plurality of bones in the image using a first model; and estimate locations of the plurality of features based on the locations of the plurality of bones using a second model, wherein the plurality of features correspond to a surface of the person in the image.
  4. The apparatus of claim 1, wherein the reference pose comprises a T-pose.
  5. The apparatus of claim 1, wherein the feature grid associated with the reference model comprises features anchored on a skinned multi-person linear model in the reference pose.
  6. The apparatus of claim 1, wherein the at least one processor is configured to: add a positional encoding to each of the modified reference pose features.
  7. The apparatus of claim 1, wherein the at least one processor is configured to: train parameters of a machine learning model based on first features extracted from one or more images.
  8. The apparatus of claim 1, wherein the reference model comprises a machine learning model trained on a dataset comprising images of people having different physical characteristics.
  9. The apparatus of claim 8, wherein the machine learning model comprises a multi-layer perceptron configured to generate the 3D model based on the first features, the modified reference pose features, and the pose information.
  10. The apparatus of claim 8, wherein the at least one processor is configured to: warp the modified reference pose features and points in the first features based on the pose information using the machine learning model.
  11. The apparatus of claim 1, wherein the at least one processor is configured to: query each of the one or more 2D planes to obtain information related to a feature of the filtered reference pose features to identify the first features.
  12. The apparatus of claim 1, wherein the at least one processor is configured to: combine the filtered reference pose features and the first features into combined features; and provide the combined features to a machine learning model to generate the 3D model of the person.
  13. A method for generating a variable three-dimensional (3D) model, the method comprising: estimating a plurality of features associated with an image of a person having a pose; inverse warping the plurality of features into reference pose features having a reference pose; generating filtered reference pose features by selecting features from the reference pose features, the selecting based on distances of the selected features from corresponding features of the reference pose; generating modified reference pose features by modifying the filtered reference pose features based on a feature grid associated with a reference model associated with the reference pose; projecting the filtered reference pose features into one or more two-dimensional (2D) planes; identifying first features associated with the person from the one or more 2D planes; and generating a 3D model of the person having the pose using the first features, the modified reference pose features, and pose information corresponding to the pose.
  14. The method of claim 13, wherein modifying the filtered reference pose features based on the feature grid comprises: selecting one or more features from the filtered reference pose features that correspond to a first reference feature of the reference model; and interpolating the one or more features to represent the first reference feature in the modified reference pose features.
  15. The method of claim 13, wherein estimating the plurality of features comprises: estimating locations of a plurality of bones in the image using a first model; and estimating locations of the plurality of features based on the locations of the plurality of bones using a second model, wherein the plurality of features correspond to a surface of the person in the image.
  16. The method of claim 13, wherein the feature grid associated with the reference model comprises features anchored on a skinned multi-person linear model in the reference pose.
  17. The method of claim 13, further comprising: adding a positional encoding to each of the modified reference pose features.
  18. The method of claim 13, wherein identifying the first features comprises: training parameters of a machine learning model based on first features extracted from one or more images.
  19. The method of claim 13, wherein the reference model comprises a machine learning model trained on a dataset comprising images of people having different physical characteristics.
  20. The method of claim 19, further comprising: warping the modified reference pose features and points in the first features based on the pose information using the machine learning model.
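
Claims 2 and 14 recite selecting features that correspond to a reference feature of the reference model and interpolating them. One plausible reading, sketched here with a hypothetical inverse-distance weighting scheme (the claims do not fix the interpolation method, and all names and shapes are assumptions), is:

```python
import numpy as np

def interpolate_reference_feature(ref_point, points, feats, k=3, eps=1e-8):
    """Represent one reference-model feature by inverse-distance
    interpolation of its k nearest filtered reference-pose features.

    points : (N, 3) locations of the filtered reference pose features
    feats  : (N, C) feature vectors at those locations
    """
    d = np.linalg.norm(points - ref_point, axis=1)   # distance to each feature
    idx = np.argsort(d)[:k]                          # k nearest neighbours
    w = 1.0 / (d[idx] + eps)                         # closer -> larger weight
    w /= w.sum()                                     # normalise weights
    return (w[:, None] * feats[idx]).sum(axis=0)

# Hypothetical usage: three features with one-hot vectors.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
feats = np.eye(3)
blended = interpolate_reference_feature(
    np.array([0.2, 0.0, 0.0]), pts, feats, k=2
)  # weights favour the nearer point: approximately [0.8, 0.2, 0.0]
```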

Description

Machine learning based three-dimensional model generation

Technical Field

The present disclosure relates generally to generating three-dimensional models. For example, aspects of the present disclosure relate to systems and techniques for generating a three-dimensional (3D) model using a machine learning model (e.g., a deep neural network), such as from monocular video.

Background

Many devices and systems allow a scene to be captured by generating an image (or frame) and/or video data (including a plurality of frames) of the scene. For example, a camera, or a device including a camera, may capture a sequence of frames of a scene (e.g., a video of the scene). In some cases, the sequence of frames may be processed for performing one or more functions, may be output for display, or may be output for processing and/or consumption by other devices, among other uses.

Artificial neural networks may be implemented using computer techniques inspired by the logical reasoning performed by biological neural networks in mammals. Deep neural networks, such as convolutional neural networks, are widely used in applications such as object detection, object classification, object tracking, and big data analysis. For example, a convolutional neural network can extract high-level features (such as facial shapes) from an input image and use these high-level features to output, for example, a probability that the input image includes a particular object.

Disclosure of Invention

In some examples, systems and techniques for generating a three-dimensional (3D) model using a machine learning model are described.
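
Claims 6 and 17 recite adding a "positioning code" (a positional encoding) to each modified reference pose feature. The patent does not specify the encoding; a common sinusoidal form used with coordinate-based networks, shown purely as an illustrative assumption, is:

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    """Sinusoidal encoding of coordinates, as commonly used with
    coordinate-based networks.

    Maps each input dimension p to sin(2^k * pi * p) and
    cos(2^k * pi * p) for k = 0 .. num_freqs-1, concatenated
    after the raw input. For D input dims the output has
    D + 2 * D * num_freqs dims.
    """
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi    # (F,)
    angles = x[..., None] * freqs                    # (..., D, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    flat = enc.reshape(*x.shape[:-1], -1)            # (..., 2*D*F)
    return np.concatenate([x, flat], axis=-1)

# Hypothetical usage: encode a batch of 3D feature locations.
codes = positional_encoding(np.array([[0.1, 0.2, 0.3]]))  # shape (1, 27)
```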
According to at least one example, a method is provided. The method includes: estimating a plurality of features associated with an image of a person having a pose; inverse warping the plurality of features into reference pose features having a reference pose; generating filtered reference pose features by selecting features from the reference pose features, the selecting based on distances of the selected features from corresponding features of the reference pose; generating modified reference pose features by modifying the filtered reference pose features based on a feature grid associated with a reference model associated with the reference pose; projecting the filtered reference pose features into one or more two-dimensional (2D) planes; identifying first features associated with the person from the one or more 2D planes; and generating a 3D model of the person having the pose using the first features, the modified reference pose features, and pose information corresponding to the pose.

In another example, an apparatus is provided that includes at least one memory and at least one processor coupled to the at least one memory and configured to: estimate a plurality of features associated with an image of a person having a pose; inverse warp the plurality of features into reference pose features having a reference pose; generate filtered reference pose features by selecting features from the reference pose features, the selecting based on distances of the selected features from corresponding features of the reference pose; generate modified reference pose features by modifying the filtered reference pose features based on a feature grid associated with a reference model associated with the reference pose; project the filtered reference pose features into one or more 2D planes; identify first features associated with the person from the one or more 2D planes; and generate a 3D model of the person having the pose using the first features, the modified reference pose features, and pose information corresponding to the pose.

In another example, a non-transitory computer-readable medium is provided having instructions stored thereon that, when executed by at least one processor, cause the at least one processor to: estimate a plurality of features associated with an image of a person having a pose; inverse warp the plurality of features into reference pose features having a reference pose; generate filtered reference pose features by selecting features from the reference pose features, the selecting based on distances of the selected features from corresponding features of the reference pose; generate modified reference pose features by modifying the filtered reference pose features based on a feature grid associated with a reference model associated with the reference pose; project the filtered reference pose features into one or more 2D planes; identify first features associated with the person from the one or more 2D planes; and generate a 3D model of the person having the pose using the first features, the modified reference pose features, and pose information corresponding to the pose.

In another example, an apparatus for processing one or more images is provided. The apparatus includes means for estimating a plurality of features associated with an image of a person, means for inverse warping the plurality of features into reference pose features having a reference pose, means for generating a filtered reference pose
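
Claim 11 recites querying each 2D plane to obtain information related to a feature. A standard way to query a continuous location on a discrete feature plane is bilinear interpolation; the sketch below is an assumption (the patent does not pin down the sampling scheme), with a hypothetical (H, W, C) plane layout and coordinates in [0, 1]^2:

```python
import numpy as np

def bilinear_query(plane, uv):
    """Bilinearly sample a (H, W, C) feature plane at continuous
    coordinates uv, an (N, 2) array with values in [0, 1]."""
    H, W, _ = plane.shape
    u = uv[:, 0] * (W - 1)                 # continuous column index
    v = uv[:, 1] * (H - 1)                 # continuous row index
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)         # clamp at the border
    v1 = np.minimum(v0 + 1, H - 1)
    fu, fv = (u - u0)[:, None], (v - v0)[:, None]
    top = plane[v0, u0] * (1 - fu) + plane[v0, u1] * fu
    bot = plane[v1, u0] * (1 - fu) + plane[v1, u1] * fu
    return top * (1 - fv) + bot * fv       # (N, C) sampled features

# Hypothetical usage: query a constant plane at an arbitrary point.
plane = np.ones((4, 4, 2))
sampled = bilinear_query(plane, np.array([[0.3, 0.7]]))
```

Bilinear sampling of a constant plane returns the constant, and sampling a field that is linear in the grid index reproduces the exact interpolated value, which makes the scheme easy to sanity-check.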