CN-121999097-A - Image-driven three-dimensional animated asset generation method, apparatus, and program product based on unified field representation

CN121999097A

Abstract

The invention provides a method, device, and program product for generating an image-driven three-dimensional animated asset based on a unified field representation. The method comprises: extracting features from an acquired single RGB image to obtain multi-view features and global semantic features; generating a sparse three-dimensional voxel representation based on the multi-view features and the global semantic features; generating, through a structured encoder, a unified field representation based on the sparse three-dimensional voxel representation, the unified field representation comprising a shape field, a skeleton field, and a skin field; and generating a three-dimensional animated asset with geometry, skeletal structure, and skin weights based on the unified field representation. The skeleton field adopts a confidence-attenuation mechanism to resolve the ambiguity of skeleton connections, and the skin field adopts dual skin feature fields to associate geometric features and skeleton features, respectively. The method can directly generate an animatable three-dimensional asset with high-fidelity geometry, a plausible skeleton structure, and accurate skinning weights from a single image.
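As a rough illustration only, the unified field representation described in the abstract (shape field, skeleton field with connection confidence, and dual skin feature fields) could be held in a per-voxel structure like the following minimal Python sketch. All names and array layouts here are hypothetical assumptions for illustration, not taken from the patent:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class UnifiedField:
    """Hypothetical container for the fields decoded from the sparse 3D voxels.

    N = number of active (sparse) voxels, J = number of joints, D = feature dim.
    """
    # Shape field: mapping from voxels to local geometric parameters
    sdf: np.ndarray            # (N,)   signed distance
    normal: np.ndarray         # (N, 3) normal vector
    color: np.ndarray          # (N, 3) RGB color
    detail_weight: np.ndarray  # (N,)   shape-detail interpolation weight
    # Skeleton field: offset to the nearest joint, plus a connection
    # confidence that the method attenuates for ambiguous parent links
    joint_offset: np.ndarray      # (N, 3)
    joint_confidence: np.ndarray  # (N,)
    # Dual skin feature fields: geometry-side and skeleton-side features
    geom_skin_feat: np.ndarray    # (N, D)
    bone_skin_feat: np.ndarray    # (J, D)

def make_empty_field(n_voxels: int, n_joints: int, feat_dim: int) -> UnifiedField:
    """Allocate a zero-initialized unified field of the given sizes."""
    return UnifiedField(
        sdf=np.zeros(n_voxels),
        normal=np.zeros((n_voxels, 3)),
        color=np.zeros((n_voxels, 3)),
        detail_weight=np.zeros(n_voxels),
        joint_offset=np.zeros((n_voxels, 3)),
        joint_confidence=np.zeros(n_voxels),
        geom_skin_feat=np.zeros((n_voxels, feat_dim)),
        bone_skin_feat=np.zeros((n_joints, feat_dim)),
    )
```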

Inventors

  • CAO YANPEI
  • LIANG DING
  • QI XIAOJUAN
  • HUANG YIHUA
  • ZOU ZIXIN
  • GUO YUANCHEN
  • SONG YACHEN

Assignees

  • 北京哇嘶嗒科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2026-02-02

Claims (10)

  1. An image-driven three-dimensional animated asset generation method based on unified field representation, comprising: extracting features from an acquired single RGB image to obtain multi-view features and global semantic features; generating a sparse three-dimensional voxel representation based on the multi-view features and the global semantic features, the sparse three-dimensional voxel representation comprising sparse voxels of geometry and sparse voxels of skeletal structure; generating, by a structured encoder, a unified field representation based on the sparse three-dimensional voxel representation, the unified field representation comprising a shape field, a skeleton field, and a skin field; and generating a three-dimensional animated asset having geometry, skeletal structure, and skin weights based on the unified field representation; wherein the skeleton field adopts a confidence-attenuation mechanism to handle the ambiguity of skeleton connections, and the skin field adopts dual skin feature fields to associate geometric features and skeleton features, respectively.
  2. The image-driven three-dimensional animated asset generation method based on unified field representation of claim 1, wherein extracting features from the acquired single RGB image comprises: extracting the global semantic features with a vision model, and extracting the multi-view features in combination with a multi-view projection technique.
  3. The image-driven three-dimensional animated asset generation method based on unified field representation of claim 1, wherein generating the sparse three-dimensional voxel representation based on the multi-view features and the global semantic features comprises a two-stage generation: a first stage that generates a sparse structural skeleton comprising sparse voxels of the geometric shape and of the space occupied by the skeleton; and a second stage that generates the data required for the three-dimensional geometry and animation.
  4. The image-driven three-dimensional animated asset generation method based on unified field representation of claim 1, wherein the shape field is a mapping from spatial voxels to local geometric parameters, constructed from signed distance, normal vector, color, and shape-detail interpolation weights; the skeleton field encodes, for each spatial point, the offset vector to the nearest joint and the connection to that joint's parent node in the previous-level skeleton; and the skin field implicitly models geometric and skeletal features, respectively, through dual skin feature fields.
  5. The image-driven three-dimensional animated asset generation method based on unified field representation of claim 4, wherein the skin field implicitly modeling geometric and skeletal features, respectively, through dual skin feature fields comprises: for any point in space, obtaining a geometric feature vector and a skeleton feature vector through the dual skin feature fields; and fusing the geometric feature vector and the skeleton feature vector of the same three-dimensional point to obtain that point's skin weight for each bone.
  6. The image-driven three-dimensional animated asset generation method based on unified field representation of claim 5, wherein the skin weights are computed as: $w_{i,j} = \mathrm{Softmax}_j\left( f_i \cdot g_j / \tau_i \right)$, where $w_{i,j}$ denotes the skin weight of vertex $i$ with respect to the $j$-th bone, $\tau_i$ is the predicted temperature of the vertex, $f_i$ and $g_j$ respectively denote the vertex skin feature and the joint skin feature predicted by the dual skin feature fields, and $\mathrm{Softmax}_j$ denotes a Softmax normalization performed along the bone index $j$.
  7. The image-driven three-dimensional animated asset generation method based on unified field representation of claim 1, wherein generating a three-dimensional animated asset having geometry, skeletal structure, and skin weights based on the unified field representation comprises: decoding the unified field representation into a three-dimensional mesh, a skeletal hierarchy, and a skin weight distribution.
  8. The image-driven three-dimensional animated asset generation method based on unified field representation of claim 1, wherein generating a three-dimensional animated asset with geometry, skeletal structure, and skin weights based on the unified field representation further comprises: dynamically controlling the number of joints based on a joint-density condition, thereby adjusting the skeletal complexity and animation flexibility of the generated three-dimensional animated asset.
  9. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the image-driven three-dimensional animated asset generation method based on unified field representation of any one of claims 1-8.
  10. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the image-driven three-dimensional animated asset generation method based on unified field representation of any one of claims 1-8.
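The temperature-scaled Softmax over bones in claim 6 can be sketched as follows. This is a minimal illustration only, assuming a simple dot-product fusion of the dual skin features; the function and variable names are hypothetical and not from the patent:

```python
import numpy as np

def skin_weights(vertex_feats: np.ndarray,
                 joint_feats: np.ndarray,
                 temperature: np.ndarray) -> np.ndarray:
    """Per-vertex, per-bone skin weights via a temperature-scaled Softmax.

    vertex_feats: (V, D) vertex skin features (geometry branch)
    joint_feats:  (J, D) joint skin features (skeleton branch)
    temperature:  (V,)   per-vertex predicted temperature tau_i
    Returns (V, J) weights; each row sums to 1.
    """
    # Dot-product fusion of the dual feature fields, scaled by tau_i
    logits = vertex_feats @ joint_feats.T / temperature[:, None]  # (V, J)
    # Softmax along the bone index j, with the usual max-shift for stability
    logits -= logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)
```

A lower temperature sharpens the distribution, concentrating a vertex's weight on fewer bones; a higher temperature spreads influence across more bones.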

Description

Image-driven three-dimensional animated asset generation method, apparatus, and program product based on unified field representation

Technical Field

The invention belongs to the technical field of three-dimensional image generation, and particularly relates to an image-driven three-dimensional animated asset generation method, device, and program product based on unified field representation.

Background

With the development of three-dimensional generation technology, deep-learning-based three-dimensional generative models (such as NeRF, triplane, and sparse-voxel methods) have advanced remarkably in recent years. These methods can generate a high-fidelity static three-dimensional asset from text or image input. However, the generated assets are typically limited to "static statues": they lack animation capability and fail to meet the practical needs of interactive graphics, virtual reality, game development, and animation.

To address the animation problem of three-dimensional assets, many studies have attempted to impart animation capabilities to static three-dimensional assets through post-processing (e.g., automatic binding of bones and skin weights). For example, the classical method Pinocchio embeds a template skeleton into a static three-dimensional model, and RigNet predicts bones and skin weights with a deep learning approach. However, these methods rely on a clean topology of the three-dimensional model, whereas generated three-dimensional assets tend to contain small geometric irregularities (e.g., incomplete surfaces, blurred volume boundaries), causing the later automatic binding process to fail. Furthermore, these post-processing techniques often fail to generate bone structures that are fully consistent with the geometry, resulting in bone misalignment or motion distortion in the animation.

There are also prior-art techniques that directly generate animated three-dimensional assets. For example, the UniRig, Anymate, and RigAnything methods directly generate animation models by generating skeletons and skinning weights. However, these methods are generally applicable only to specific categories (e.g., humans and animals), and it is difficult for them to process models of multiple categories or complex shapes. Moreover, they often separate geometry generation from skeleton generation and fail to model shape, skeleton, and skinning weights jointly, so the generated results still lack consistency and functionality.

The prior art mainly falls into the following categories:

1. Static three-dimensional generation, which produces a high-quality static three-dimensional model from an image or text, such as DreamFusion, TRELLIS, and TriplaneGaussian. These methods perform well in detail retention and geometric accuracy, but their results are limited to static geometry and have no animation capability.

2. Automatic binding techniques, including: (1) template-based binding, such as the Pinocchio method, which achieves animation by embedding a predefined template skeleton into the three-dimensional model; however, this approach relies heavily on the topology of the input model, making it difficult to handle geometric irregularities in generated models. (2) Deep-learning-based binding, such as RigNet and DRiVE, which uses neural networks to predict bones and skin weights; however, these methods typically require extensive annotated training data and generalize poorly to unseen categories. (3) Direct generation of animation models, such as UniRig and Anymate, which generate animation models directly by generating skeletons and skinning weights; however, these methods typically separate geometry generation from animation generation and fail to model them jointly, resulting in poor consistency and functionality of the generated results.

The inventors have found, in practicing this embodiment, that the prior art has the following disadvantages: 1. Static models lack utility: existing static three-dimensional generation methods (e.g., DreamFusion and TRELLIS), while capable of generating high-quality geometry, lack animation capability and cannot be directly applied to scenes that require skeletal structures and skinning weights, such as games, animation, and interactive applications. 2. The post-processing pipeline is unreliable: methods that automatically bind bones and skin weights (such as Pinocchio and RigNet) place high demands on the geometric integrity and topological accuracy of the generated model, and generated three-dimensional assets usually contain fine geometric defects, so the post-processing fails or the generated bones do not match the shape. 3. The generation and binding steps are split: existing methods that directly generate animation models (such as UniRig and Anymate) generally separate geometric generation from bone and skin generation, and the inherent correlation among the three is not effectively utilized, so t