CN-116524074-B - Method, device, equipment and storage medium for generating digital human gestures

Abstract

An embodiment of the invention provides a method, a device, equipment, and a storage medium for generating a digital human gesture. The method comprises: obtaining a target audio file for which a digital human gesture is to be generated; and, based on an action generation sequence and a gesture generation model, controlling the generated representative gesture and rhythmic gesture to be synthesized into the digital human gesture corresponding to the target audio file. In the provided method, the action generation sequence corresponding to the target audio file, determined by a script generation model, effectively controls digital human gesture synthesis in synchrony with speech. Gestures are decoupled and modeled separately to obtain a representative gesture generation model and a rhythmic gesture generation model, and the representative and rhythmic gestures produced by these models are combined, so that more natural and richer gestures can be generated and the digital human gesture appears more realistic.

Inventors

  • GAO NAN
  • ZENG ZHI
  • ZHANG SHUWU
  • ZHANG GUIXUAN
  • ZHAO ZEYU

Assignees

  • Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)

Dates

Publication Date
2026-05-05
Application Date
2023-03-23

Claims (9)

  1. A method of digital human gesture generation, comprising: acquiring a target audio file for which a digital human gesture is to be generated; determining an action generation sequence corresponding to the target audio file based on a script generation model, wherein the action generation sequence indicates whether a gesture action exists at any given moment; and, based on the action generation sequence and a gesture generation model, controlling the generated representative gesture and rhythmic gesture to be synthesized into the digital human gesture corresponding to the target audio file; wherein the script generation model is trained on training samples determined from a first video file containing voice information and action information, and the gesture generation model comprises a first gesture generation model for generating the representative gesture and a second gesture generation model for generating the rhythmic gesture; and wherein the controlling, based on the action generation sequence and the gesture generation model, the generated representative gesture and rhythmic gesture to be synthesized into the digital human gesture corresponding to the target audio file comprises: generating a representative gesture corresponding to the target audio file based on the first gesture generation model; generating a rhythmic gesture corresponding to the target audio file based on the second gesture generation model; and fusing the representative gesture and the rhythmic gesture based on the action generation sequence and a preset synthesis rule to obtain the digital human gesture corresponding to the target audio file, wherein the preset synthesis rule requires the digital human gesture at any moment to be determined from either one of, or a combination of, the representative gesture and the rhythmic gesture.
  2. The method of claim 1, wherein the script generation model is trained on training samples determined from a first video file containing voice information and motion information, and the corresponding training method comprises: taking, for each current element of an initial training sample, the N elements preceding it and the M elements following it together as one element of a first training sample, wherein the initial training sample characterizes whether motion occurs in the first video file at each moment; and determining a first loss function from a first label corresponding to the first training sample and the prediction the script generation model outputs when the first training sample is input to it.
  3. The method according to claim 2, wherein the initial training sample characterizes whether an action occurs in the first video file at each moment, and the corresponding acquisition method comprises: acquiring a first position sequence signal from the first video file, wherein the first position sequence signal represents the positions of all key points of the human hand skeleton, body skeleton, and facial skeleton at each moment; determining a starting position corresponding to any gesture characterized by the first position sequence signal; and labeling a target element with a first label based on the distance between the target element of the first position sequence and the starting position and the distance between the element following the target element and the starting position, wherein the target element is any element of the first position sequence serving as the initial training sample, and the first label characterizes whether an action occurs at the target element.
  4. The method of digital human gesture generation according to claim 3, wherein said determining a starting position corresponding to any gesture characterized by the first position sequence signal comprises: simplifying the first position sequence signal according to preset simplified gesture key points, and computing a position histogram of the simplified gesture key points; and determining, according to a preset complete gesture format, the most frequent position in the histogram, and taking that position as the starting position corresponding to any gesture characterized by the first position sequence signal.
  5. The method of digital human gesture generation according to claim 3, wherein the training method for the first gesture generation model comprises: extracting, based on the first label, the samples of the initial training sample in which an action occurs, as second training samples; uniformly sampling the second training samples to obtain third training samples of uniform length, wherein each sample of the third training samples comprises L uniformly sampled second training samples and L is a positive integer; and completing training of the first gesture generation model when a second loss function converges on the third training samples, wherein the second loss function is determined from the labels of the third training samples and the gestures reconstructed by the first gesture generation model.
  6. The method of digital human gesture generation according to claim 5, further comprising, before uniformly sampling the second training samples to obtain third training samples of uniform length: determining, for each sample of the second training samples, whether key-point data is missing; and, if key-point data is missing from any of the second training samples, repairing it by rotating and translating adjacent training samples.
  7. An apparatus for digital human gesture generation, comprising: an acquisition module configured to acquire a target audio file for which a digital human gesture is to be generated; a determining module configured to determine an action generation sequence corresponding to the target audio file based on a script generation model, wherein the action generation sequence indicates whether a gesture action exists at any given moment; and a generation module configured to control, based on the action generation sequence and a gesture generation model, the generated representative gesture and rhythmic gesture to be synthesized into the digital human gesture corresponding to the target audio file; wherein the script generation model is trained on training samples determined from a first video file containing voice information and action information, and the gesture generation model comprises a first gesture generation model for generating the representative gesture and a second gesture generation model for generating the rhythmic gesture; and wherein the generation module is specifically configured to: generate a representative gesture corresponding to the target audio file based on the first gesture generation model; generate a rhythmic gesture corresponding to the target audio file based on the second gesture generation model; and fuse the representative gesture and the rhythmic gesture based on the action generation sequence and a preset synthesis rule to obtain the digital human gesture corresponding to the target audio file, wherein the preset synthesis rule requires the digital human gesture at any moment to be determined from either one of, or a combination of, the representative gesture and the rhythmic gesture.
  8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method of digital human gesture generation of any one of claims 1 to 6.
  9. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of digital human gesture generation of any one of claims 1 to 6.
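The window construction in claim 2 — taking the N elements before and the M elements after each element of the initial training sample as one element of the first training sample — can be sketched as follows. This is a minimal illustration, not the patent's actual implementation: the function name, zero-padding at the sequence edges, and binary occurrence labels are all assumptions.

```python
from typing import List


def make_windowed_samples(labels: List[int], n: int, m: int) -> List[List[int]]:
    """Build first-training-sample elements for the script generation model:
    for each position in the per-frame action-occurrence labels, take the
    window of the n preceding and m following elements (plus the element
    itself), zero-padding at the sequence edges."""
    padded = [0] * n + labels + [0] * m
    return [padded[i:i + n + m + 1] for i in range(len(labels))]


# Toy per-frame labels (1 = an action occurs at that frame), n = m = 1.
windows = make_windowed_samples([1, 0, 1], n=1, m=1)
```

Each window then serves as one model input, and the first loss function compares the model's prediction for the window against the first label of its center element.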

Description

Method, device, equipment and storage medium for generating digital human gestures

Technical Field

The present invention relates to the field of artificial intelligence technologies, and in particular to a method, an apparatus, a device, and a storage medium for generating digital human gestures.

Background

A digital person understands and analyzes external input through a recognition system, generates feedback for the driving signal, synthesizes the corresponding digital human voice and behavior based on those decisions, and thereby interacts with humans. The quality of digital human motion driving is a key factor in how lifelike a digital person appears. Gestures in particular have a strong auxiliary expressive effect and, as non-verbal information, can effectively reinforce expression. Recent advances in deep learning have also driven the development of gesture generation technology, employing large-scale data sets and modeling the relationships between multiple modalities with deep neural networks. Most existing methods for generating digital human gestures rely on fixed rules that match gestures from a predefined database. Such rules require professional designers and prior knowledge; for complex speech scenes, the generated results are not rich enough and lack realism and naturalness, the entry threshold is high, and the results are often unsatisfactory. How to use existing large-scale data sets to generate digital human gestures that are realistic and natural has therefore become a technical problem to be solved in the industry.

Disclosure of Invention

In view of the above technical problems in the prior art, the invention provides a method, a device, equipment, and a storage medium for generating digital human gestures.
In a first aspect, the present invention provides a method for digital human gesture generation, comprising: acquiring a target audio file for which a digital human gesture is to be generated; determining an action generation sequence corresponding to the target audio file based on a script generation model, wherein the action generation sequence indicates whether a gesture action exists at any given moment; and, based on the action generation sequence and a gesture generation model, controlling the generated representative gesture and rhythmic gesture to be synthesized into the digital human gesture corresponding to the target audio file. The script generation model is trained on training samples determined from a first video file containing voice information and motion information, and the gesture generation model comprises a first gesture generation model for generating the representative gesture and a second gesture generation model for generating the rhythmic gesture.

Optionally, the controlling, based on the action generation sequence and the gesture generation model, the generated representative gesture and rhythmic gesture to be synthesized into the digital human gesture corresponding to the target audio file includes: generating a representative gesture corresponding to the target audio file based on the first gesture generation model; generating a rhythmic gesture corresponding to the target audio file based on the second gesture generation model; and fusing the representative gesture and the rhythmic gesture based on the action generation sequence and a preset synthesis rule to obtain the digital human gesture corresponding to the target audio file, wherein the preset synthesis rule requires the digital human gesture at any moment to be determined from either one of, or a combination of, the representative gesture and the rhythmic gesture.
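The fusion step described above can be sketched as a toy example. Everything here beyond the gating idea is an assumption: the patent does not disclose the details of the preset synthesis rule, so the 1-D pose values per frame, the fixed blend weight, and the fallback to the rhythmic gesture on non-action frames are illustrative choices only.

```python
from typing import List


def fuse_gestures(action_seq: List[int],
                  representative: List[float],
                  rhythmic: List[float],
                  weight: float = 0.5) -> List[float]:
    """Fuse per-frame gestures under a simple synthesis rule:
    frames flagged by the action generation sequence use a blend of the
    representative and rhythmic gestures; all other frames keep only the
    rhythmic gesture, so the beat motion is never interrupted."""
    fused = []
    for flag, rep, rhy in zip(action_seq, representative, rhythmic):
        if flag:  # a gesture action occurs at this moment
            fused.append(weight * rep + (1.0 - weight) * rhy)
        else:     # no representative action: fall back to the rhythmic motion
            fused.append(rhy)
    return fused


# Toy 1-D "pose" values per frame, purely for illustration.
digital_human_gesture = fuse_gestures([0, 1, 1, 0],
                                      [2.0, 2.0, 4.0, 2.0],
                                      [1.0, 1.0, 1.0, 1.0])
```

In practice each frame would hold a full key-point vector rather than a scalar, and a real rule would also smooth the transitions between the two gesture streams, but the gating by the action generation sequence is the point the sketch shows.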
Optionally, the script generation model is trained on training samples determined from a first video file containing voice information and motion information, and the corresponding training method includes: taking, for each current element of an initial training sample, the N elements preceding it and the M elements following it together as one element of a first training sample, wherein the initial training sample characterizes whether motion occurs in the first video file at each moment; and determining a first loss function from a first label corresponding to the first training sample and the prediction the script generation model outputs when the first training sample is input to it. Optionally, the initial training samples characterize whether actions occur in the first video file at each moment, and the corresponding acquisition method includes: acquiring a first position sequence signal from the first video file, wherein the first position sequence signal represents the positions of all key points of the human hand skeleton, body skeleton, and facial skeleton at