CN-121788726-B - Single-image three-dimensional human-object interaction generation method based on multi-modal deep learning
Abstract
The invention belongs to the technical field of image processing and provides a single-image three-dimensional human-object interaction generation method based on multi-modal deep learning. The method comprises: performing instance detection and segmentation on a single image to be processed to obtain two-dimensional region information; performing human-object interaction reasoning with a multi-modal large language model according to the single image to be processed and the two-dimensional region information, and generating a condition vector through encoding; determining two-dimensional image features of the person according to the two-dimensional region information, and constructing three-dimensional geometry and spatial relations based on those features and the condition vector to obtain a three-dimensional human mesh and object point cloud data; and obtaining a three-dimensional human-object interaction result through three-dimensional spatial optimization and interaction probability distribution prediction according to the three-dimensional human mesh, the object point cloud data and the condition vector. The scheme realizes integrated modeling of semantic conditions and three-dimensional geometry under single-image input, and the output three-dimensional human-object interaction result exhibits good three-dimensional consistency.
Inventors
- WANG LIYUAN
- LUO HONGCHEN
Assignees
- Northeastern University (东北大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-03-05
Claims (10)
- 1. A single-image three-dimensional human-object interaction generation method based on multi-modal deep learning, characterized by comprising the following steps: performing instance detection and segmentation on a single image to be processed to obtain two-dimensional region information; performing human-object interaction reasoning with a multi-modal large language model according to the single image to be processed and the two-dimensional region information, and generating a condition vector through encoding; determining two-dimensional image features of the person according to the two-dimensional region information, and constructing three-dimensional geometry and spatial relations based on the two-dimensional image features and the condition vector to obtain a three-dimensional human mesh and object point cloud data; and obtaining a three-dimensional human-object interaction result through three-dimensional spatial optimization and interaction probability distribution prediction according to the three-dimensional human mesh, the object point cloud data and the condition vector (a pipeline sketch follows the claims).
- 2. The single-image three-dimensional human-object interaction generation method based on multi-modal deep learning according to claim 1, wherein performing instance detection and segmentation on the single image to be processed to obtain two-dimensional region information comprises: determining a human instance in the single image to be processed and a target object instance having a potential interaction relation with the human instance; acquiring the bounding box and the instance mask corresponding to the human instance and to the target object instance, respectively; and taking the bounding boxes and instance masks as the two-dimensional region information (sketched after the claims).
- 3. The single-image three-dimensional human-object interaction generation method based on multi-modal deep learning according to claim 1, wherein performing human-object interaction reasoning with a multi-modal large language model according to the single image to be processed and the two-dimensional region information, and generating a condition vector through encoding, comprises: inputting the single image to be processed and the two-dimensional region information into a pre-constructed multi-modal large language model, and obtaining structured prior knowledge through human-object interaction reasoning; and text-encoding the structured prior knowledge to obtain a condition vector in numerical form (sketched after the claims).
- 4. The single-image three-dimensional human-object interaction generation method based on multi-modal deep learning according to claim 1, wherein constructing three-dimensional geometry and spatial relations based on the two-dimensional image features of the person and the condition vector to obtain a three-dimensional human mesh and object point cloud data comprises: predicting three-dimensional human parameters through a pre-constructed pose estimation model based on the two-dimensional image features and the condition vector, and generating an initial human mesh; taking the two-dimensional image features and the condition vector as conditioning input, and generating initial point cloud data through a pre-constructed diffusion generation model; and performing normalization and coordinate alignment on the initial human mesh and the initial point cloud data to obtain the three-dimensional human mesh and the object point cloud data (an alignment sketch follows the claims).
- 5. The single-image three-dimensional human-object interaction generation method based on multi-modal deep learning according to claim 1, wherein obtaining a three-dimensional human-object interaction result through three-dimensional spatial optimization and interaction probability distribution prediction according to the three-dimensional human mesh, the object point cloud data and the condition vector comprises: remapping the three-dimensional human mesh and the object point cloud data onto the single image to be processed, and performing three-dimensional spatial optimization through a pre-constructed multi-dimensional loss function to obtain an optimized three-dimensional human mesh and optimized object point cloud data; extracting three-dimensional geometric features from the optimized three-dimensional human mesh and object point cloud data, and performing cross-modal fusion of the three-dimensional geometric features with the condition vector to obtain a cross-modal fusion result; and performing interaction probability distribution prediction with a pre-constructed interaction prediction model according to the cross-modal fusion result to obtain the three-dimensional human-object interaction result.
- 6. The single-image three-dimensional human-object interaction generation method based on multi-modal deep learning according to claim 5, wherein performing three-dimensional spatial optimization through the pre-constructed multi-dimensional loss function to obtain the optimized three-dimensional human mesh and object point cloud data comprises: fixing the parameters of the three-dimensional human mesh and optimizing only the object pose parameters in the object point cloud data to obtain optimized object pose parameters; keeping the parameters of the three-dimensional human mesh fixed, taking the object scale as an optimization parameter based on the optimized object pose parameters, and performing object joint optimization to obtain the optimized object point cloud data; and performing local optimization of the human pose parameters in the three-dimensional human mesh based on the optimized object point cloud data to obtain the optimized three-dimensional human mesh.
- 7. The single-image three-dimensional human-object interaction generation method based on multi-modal deep learning according to claim 6, wherein optimizing the object pose parameters in the object point cloud data to obtain optimized object pose parameters comprises: constructing a human-object contact correspondence set based on the spatial relation between the object pose parameters and the mesh vertices of the three-dimensional human mesh; establishing a contact distance loss function based on the human-object contact correspondence set; and performing object registration by minimizing the value of the contact distance loss function to obtain the optimized object pose parameters (a contact-loss sketch follows the claims).
- 8. The single-image three-dimensional human-object interaction generation method based on multi-modal deep learning according to claim 7, wherein performing object joint optimization with the object scale as an optimization parameter to obtain the optimized object point cloud data comprises: constructing an object mask consistency loss function, a penetration penalty loss function and an object scale regularization term; weighting the contact distance loss function, the object mask consistency loss function, the penetration penalty loss function and the object scale regularization term to establish a joint optimization loss function; and performing object joint optimization according to the joint optimization loss function to obtain the optimized object point cloud data (sketched after the claims).
- 9. The single-image three-dimensional human-object interaction generation method based on multi-modal deep learning according to claim 8, wherein performing local optimization of the human pose parameters in the three-dimensional human mesh based on the optimized object point cloud data to obtain the optimized three-dimensional human mesh comprises: constructing a human mask consistency loss function and a human pose regularization term; weighting the contact distance loss function, the penetration penalty loss function, the human mask consistency loss function and the human pose regularization term to establish a local optimization loss function; and performing local optimization of the human pose parameters in the three-dimensional human mesh with the local optimization loss function, based on the optimized object point cloud data, to obtain the optimized three-dimensional human mesh.
- 10. The single-image three-dimensional human-object interaction generation method based on multi-modal deep learning according to claim 5, wherein performing interaction probability distribution prediction with the pre-constructed interaction prediction model according to the cross-modal fusion result to obtain the three-dimensional human-object interaction result comprises: inputting the cross-modal fusion result into the pre-constructed interaction prediction model to obtain a mesh-vertex contact probability distribution prediction and a point-cloud affordance interaction probability distribution prediction; binarizing and screening the mesh-vertex contact probability distribution prediction and the point-cloud affordance interaction probability distribution prediction to determine the core interaction region between the human and the object; performing three-dimensional spatial association verification on the point data in the core interaction region to obtain an association verification result; and if the association verification result satisfies the three-dimensional consistency constraint, generating the three-dimensional human-object interaction result through multi-dimensional data integration (a prediction-and-screening sketch follows the claims).
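The method of claim 1 is a four-stage pipeline. A compact skeleton of that control flow is given below as a non-normative illustration; each stage is passed in as a callable so the data flow is explicit, and all names are ours rather than the patent's.

```python
# Skeleton of the four-stage pipeline of claim 1. Nothing here is prescribed
# by the patent beyond the order of operations.
def generate_3d_interaction(image, detect, reason_and_encode,
                            extract_features, build_3d, optimize_3d, predict):
    regions = detect(image)                                # 1) 2D region information
    cond = reason_and_encode(image, regions)               # 2) condition vector
    feats = extract_features(image, regions)               # 3a) 2D image features
    mesh, points = build_3d(feats, cond)                   # 3b) human mesh + object point cloud
    mesh, points = optimize_3d(mesh, points, image, cond)  # 4a) 3D spatial optimization
    return predict(mesh, points, cond)                     # 4b) interaction probability prediction
```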
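For the detection-and-segmentation step of claim 2, the patent names no specific model. A minimal sketch using torchvision's pretrained Mask R-CNN as a stand-in detector/segmenter (the image path is hypothetical):

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import (
    maskrcnn_resnet50_fpn, MaskRCNN_ResNet50_FPN_Weights)

weights = MaskRCNN_ResNet50_FPN_Weights.DEFAULT
model = maskrcnn_resnet50_fpn(weights=weights).eval()

image = read_image("scene.jpg")        # hypothetical single image to be processed
with torch.no_grad():
    out = model([weights.transforms()(image)])[0]

keep = out["scores"] > 0.7             # keep confident detections only
boxes = out["boxes"][keep]             # (K, 4) bounding boxes
masks = out["masks"][keep] > 0.5       # (K, 1, H, W) binary instance masks
labels = out["labels"][keep]           # COCO class ids; "person" is id 1
# The boxes and masks together form the two-dimensional region information.
```

A human instance and candidate interacting objects would then be selected from `labels`, e.g. the person plus nearby non-person detections.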
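Claim 3 leaves both the structure of the prior knowledge and the text encoder open. The sketch below assumes a JSON-like prior (our schema, not the patent's) and encodes it with CLIP's text encoder, one plausible choice:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Example of the kind of structured prior the multi-modal LLM might return
# after reasoning over the image and 2D regions (the schema is an assumption).
prior = {"object": "chair", "interaction": "sitting on",
         "contact_parts": ["hips", "thighs"]}
text = (f"a person {prior['interaction']} a {prior['object']}, "
        f"contact: {', '.join(prior['contact_parts'])}")

tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
enc = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

with torch.no_grad():
    tokens = tok([text], padding=True, return_tensors="pt")
    cond_vec = enc(**tokens).pooler_output  # (1, 512) numeric condition vector
```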
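Claim 4 ends with normalization and coordinate alignment of the initial human mesh and the initial point cloud. One simple reading, sketched under assumed shapes and conventions, is to move both into a shared, unit-scaled coordinate frame:

```python
import torch

def normalize_and_align(human_verts: torch.Tensor, obj_pts: torch.Tensor):
    """human_verts: (V, 3) mesh vertices; obj_pts: (P, 3) object point cloud."""
    all_pts = torch.cat([human_verts, obj_pts], dim=0)
    center = all_pts.mean(dim=0)                    # shared centroid
    scale = (all_pts - center).norm(dim=1).max()    # shared isotropic scale
    return (human_verts - center) / scale, (obj_pts - center) / scale
```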
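Claim 7 registers the object by minimizing a contact distance loss over a human-object contact correspondence set. A minimal sketch, assuming a rigid object pose (rotation `R`, translation `t`) and a precomputed index-pair correspondence set:

```python
import torch

def contact_distance_loss(human_verts, obj_pts, pairs, R, t):
    """human_verts: (V, 3); obj_pts: (P, 3); pairs: (M, 2) long tensor of
    (human vertex index, object point index) contact correspondences."""
    posed = obj_pts @ R.T + t                  # rigidly posed object points
    hv = human_verts[pairs[:, 0]]              # contacted human vertices
    op = posed[pairs[:, 1]]                    # corresponding object points
    return ((hv - op) ** 2).sum(dim=1).mean()  # mean squared contact distance
```

In practice `R` would be parameterized (e.g. as axis-angle) and `R`, `t` updated by a gradient optimizer until the loss converges.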
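Claim 8 combines four terms into a joint optimization loss. The weights below and the signed-distance helper are assumptions; the patent specifies neither how penetration is measured nor how the terms are weighted:

```python
import torch

def penetration_penalty(obj_pts, signed_distance):
    """signed_distance: hypothetical callable returning, per object point, a
    signed distance to the human surface (negative inside the body)."""
    d = signed_distance(obj_pts)          # (P,)
    return torch.relu(-d).mean()          # penalize interpenetrating points

def joint_optimization_loss(l_contact, l_mask, l_penetration, l_scale,
                            w=(1.0, 0.5, 0.1, 0.01)):
    # Weighted sum of the four terms of claim 8; weights are illustrative.
    return (w[0] * l_contact + w[1] * l_mask
            + w[2] * l_penetration + w[3] * l_scale)
```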
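For claim 10, the sketch below thresholds the two predicted distributions to obtain the core interaction region, then applies one possible reading of the "three-dimensional spatial association verification": an affordance point passes if it lies within a small radius of some contacted human vertex. The thresholds and radius are assumptions:

```python
import torch

def core_interaction_region(contact_prob, afford_prob, thr=0.5):
    """contact_prob: (V,) mesh-vertex contact probabilities;
    afford_prob: (P,) point-cloud affordance probabilities."""
    return contact_prob > thr, afford_prob > thr   # binarized masks

def verify_association(human_verts, obj_pts, contact_mask, afford_mask,
                       eps=0.05):
    hv = human_verts[contact_mask]                 # (Kc, 3) contact vertices
    op = obj_pts[afford_mask]                      # (Ka, 3) affordance points
    d = torch.cdist(op, hv)                        # (Ka, Kc) pairwise distances
    return d.min(dim=1).values < eps               # per-point 3D consistency
```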
Description
Single-image three-dimensional human-object interaction generation method based on multi-modal deep learning

Technical Field

The invention relates to the technical field of image processing, in particular to a single-image three-dimensional human-object interaction generation method based on multi-modal deep learning.

Background

At present, the technologies related to single-image three-dimensional human reconstruction, three-dimensional object generation, and human-object interaction prediction still face several bottlenecks in practical deployment, and struggle to meet the requirements of high-precision, high-consistency three-dimensional human-object interaction generation.

In the related art, steps such as three-dimensional mesh generation, monocular three-dimensional point cloud reconstruction of objects, and prediction of human-object interaction elements are mostly implemented as separate, independent modules, lacking an integrated modeling framework and a unified three-dimensional coordinate constraint system. A single image inherently suffers from depth ambiguity, target occlusion, scale ambiguity, and similar problems, so the generated three-dimensional human mesh and object point cloud cannot achieve stable coordinate alignment in three-dimensional space. As a result, the prediction of the human-object contact region and the affordance interaction region is highly prone to geometric inconsistency and spatial position deviation, which degrades the accuracy of the interaction prediction results.

Meanwhile, with the development of multi-modal deep learning, multi-modal large language models can mine semantic prior information from images, such as object categories and potential human-object interaction modes, which makes it possible to improve the semantic rationality of three-dimensional human-object interaction generation. In the prior art, however, the semantic priors output by multi-modal large language models are mostly unstructured free text, which cannot effectively cooperate with steps such as three-dimensional geometry generation and spatial relation optimization; consequently, when facing open scenes with unknown objects and complex occlusion, it is difficult to generate three-dimensional human-object interaction results that are semantically consistent and geometrically reasonable.

In addition, in the prior art, when two-dimensional image features are extracted from two-dimensional region information and three-dimensional geometric and spatial relations are constructed from those features, semantic constraints and spatial alignment constraints at the feature level are absent. Morphological distortion therefore easily occurs when converting two-dimensional features into three-dimensional geometry, the spatial relation between the generated initial three-dimensional human mesh and the object point cloud is poorly conditioned, the computational cost of the subsequent optimization stage is high, and the optimization effect is limited, which further reduces the overall efficiency and result quality of three-dimensional human-object interaction generation.

Therefore, traditional human-object interaction generation schemes suffer from poor coordination between stages, low precision, and poor reliability.
Disclosure of Invention

The invention provides a single-image three-dimensional human-object interaction generation method based on multi-modal deep learning, which is used for overcoming the defects of poor coordination between stages, low precision, and poor reliability in traditional human-object interaction generation schemes.

The invention provides a single-image three-dimensional human-object interaction generation method based on multi-modal deep learning, comprising the following steps: performing instance detection and segmentation on a single image to be processed to obtain two-dimensional region information; performing human-object interaction reasoning with a multi-modal large language model according to the single image to be processed and the two-dimensional region information, and generating a condition vector through encoding; determining two-dimensional image features of the person according to the two-dimensional region information, and constructing three-dimensional geometry and spatial relations based on the two-dimensional image features and the condition vector to obtain a three-dimensional human mesh and object point cloud data; and obtaining a three-dimensional human-object interaction result through three-dimensional spatial optimization and interaction probability distribution prediction according to the three-dimensional human mesh, the object point cloud data and the condition vector. According to t