CN-122023602-A - Digital person generation method, device, equipment and medium


Abstract

The invention provides a digital person generation method, device, equipment and medium. The method comprises: generating a structured execution instruction according to user input content; determining a target synthesis path according to a first cache indicator in the structured execution instruction; if the target synthesis path comprises at least two execution links, extracting intermediate content from a preset resource library in the first execution link, or calling a corresponding generation model to generate the intermediate content; taking the intermediate content as the input of the next execution link; in each subsequent execution link, generating the content corresponding to the current execution link based on the intermediate content of the previous execution link; and generating and outputting a digital person video after all execution links are completed. The invention can greatly reduce repeated inference steps, reduce computing power and energy consumption, and improve digital person generation efficiency.

Inventors

HUANG YUANZHONG
LU QINGHUA
CHEN GAOBO

Assignees

深圳市木愚科技有限公司

Dates

Publication Date: 20260512
Application Date: 20260129

Claims (10)

1. A digital person generation method, the method comprising: generating a structured execution instruction according to user input content; determining a target synthesis path according to a first cache indicator in the structured execution instruction; if the target synthesis path comprises at least two execution links, extracting intermediate content from a preset resource library in the first execution link, or calling a corresponding generation model to generate the intermediate content; and taking the intermediate content as the input of the next execution link, generating, in each subsequent execution link, the content corresponding to the current execution link based on the intermediate content of the previous execution link, and generating and outputting a digital person video after all execution links are completed.
2. The digital person generation method of claim 1, wherein, after the determining of the target synthesis path according to the first cache indicator in the structured execution instruction, the method further comprises: if the target synthesis path comprises one execution link, extracting, in that execution link, video content from the preset resource library as the digital person video according to the image tag, the voice tag and the knowledge tag in the structured execution instruction.
3. The digital person generation method according to claim 1, wherein, if the target synthesis path comprises at least two execution links, the extracting of intermediate content from a preset resource library in the first execution link, or the calling of a corresponding generation model to generate the intermediate content, comprises: if the target synthesis path comprises three execution links and the first cache indicator indicates that reusable cached content exists, extracting text content from the preset resource library in the first execution link according to the knowledge tag in the structured execution instruction; if the target synthesis path comprises three execution links and the first cache indicator indicates that no reusable cached content exists, calling a large language model in the first execution link to analyze the user input content and generate text content; and if the target synthesis path comprises two execution links, extracting voice content from the preset resource library in the first execution link according to the voice tag and the knowledge tag in the structured execution instruction.
4. The digital person generation method according to claim 3, wherein, after the text content is extracted from the preset resource library according to the knowledge tag in the structured execution instruction, or after the large language model is called to analyze the user input content, the method further comprises: in a second execution link, calling a text-to-speech model and performing speech conversion on the text content according to the voice tag in the structured execution instruction to obtain voice content; and in a third execution link, calling a video generation module and generating the digital person video according to the voice content and the image tag in the structured execution instruction.
5. The digital person generation method according to claim 1, wherein the taking of the intermediate content as the input of the next execution link, the generating, in each subsequent execution link, of the content corresponding to the current execution link based on the intermediate content of the previous execution link, and the generating and outputting of the digital person video after all execution links are completed, further comprise: if the value of a second cache indicator in the structured execution instruction is not NO, storing the target content generated by the corresponding execution link into the preset resource library.
6. The digital person generation method according to claim 5, wherein the storing of the target content generated by the corresponding execution link into the preset resource library if the value of the second cache indicator in the structured execution instruction is not NO comprises: if the value of the second cache indicator in the structured execution instruction is not NO, determining the target content according to the second cache indicator; adding a corresponding index tag to the target content, wherein the index tag is used for subsequent retrieval and calling of the target content; and storing the target content, associated with the index tag, into the preset resource library.
7. The digital person generation method according to claim 1, wherein the generating of a structured execution instruction according to user input content comprises: performing feature extraction on the user input content to obtain a feature vector; and inputting the feature vector into a preset policy network, which outputs the structured execution instruction.
8. A digital person generation apparatus, the apparatus comprising: a first generation unit, configured to generate a structured execution instruction according to user input content; a determining unit, configured to determine a target synthesis path according to a first cache indicator in the structured execution instruction; an execution unit, configured to, if the target synthesis path comprises at least two execution links, extract intermediate content from a preset resource library in the first execution link, or call a corresponding generation model to generate the intermediate content; and a second generation unit, configured to take the intermediate content as the input of the next execution link, generate, in each subsequent execution link, the content corresponding to the current execution link based on the intermediate content of the previous execution link, and generate and output a digital person video after all execution links are completed.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the digital person generation method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the digital person generating method according to any of claims 1-7.
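
The cache write-back of claims 5 and 6 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the string encoding of the second cache indicator (`"NO"` or the kind of content to keep), the function name `store_target`, the `INDEX_FIELDS` table and the use of a plain dictionary as the preset resource library are all hypothetical. The index tag is assumed to be composed of the same tags that claims 2 and 3 use for retrieval.

```python
# Tags assumed to make up the index tag for each kind of content. This mirrors
# the tags used for retrieval in claims 2 and 3; the composition is an assumption.
INDEX_FIELDS = {
    "text": ("knowledge",),
    "voice": ("voice", "knowledge"),
    "video": ("image", "voice", "knowledge"),
}


def store_target(second_cache, outputs, tags, library):
    """Store the content selected by the second cache indicator.

    second_cache: "NO", or the kind of content to keep (hypothetical encoding).
    outputs: content produced by the execution links, keyed by kind.
    tags: image, voice and knowledge tags of the structured execution instruction.
    Returns the index tag that was used, or None if nothing was stored.
    """
    if second_cache == "NO":
        return None
    # Determine the target content according to the second cache indicator.
    target = outputs[second_cache]
    # Build the index tag used for later retrieval.
    index_tag = (second_cache,) + tuple(tags[f] for f in INDEX_FIELDS[second_cache])
    # Store the content under its index tag in the resource library.
    library[index_tag] = target
    return index_tag


library = {}
outputs = {"text": "Hello", "voice": "speech-data", "video": "video-data"}
tags = {"image": "img1", "voice": "v1", "knowledge": "faq-1"}
key = store_target("voice", outputs, tags, library)
print(key, library[key])
```

In this sketch a later request carrying the same voice tag and knowledge tag could find the stored voice content under the same key, which is the reuse the second cache indicator is meant to enable.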

Description

Digital person generation method, device, equipment and medium

Technical Field

The present invention relates to the field of digital person technologies, and in particular to a digital person generation method, device, equipment and medium.

Background

On current network platforms, real-time conversation and interactive video with a virtual person have become a mainstream form of internet content. Synthesizing such video typically requires generating text and then speech in sequence, and combining the speech with an underlying portrait video. The prior art generally follows a fixed pipeline: a large language model (LLM) first generates text from the user input, a text-to-speech (TTS) model then converts the text into speech, and a speech-driven portrait video generation model finally outputs the video. Because every link relies on real-time model inference, the overall delay is significant, often several seconds to tens of seconds, which seriously degrades the interactive experience. In addition, in repeated or similar scenarios the system still runs exactly the same generation flow, which wastes computing power on redundant work.

In commercial digital person systems, computing power and electricity make up the main expenses, while storage is comparatively cheap. Existing schemes do not take advantage of this: they lack an effective mechanism for caching and reusing intermediate generation results (such as generated text and speech), so they can neither reduce model calls across multiple requests nor lower response delay and energy consumption.

Therefore, existing digital person generation methods suffer from high response delay and wasted computing resources.

Disclosure of Invention

Embodiments of the present invention provide a digital person generation method, device, equipment and medium, which aim to solve the problems of high response delay and wasted computing resources in existing digital person generation methods.

In a first aspect, an embodiment of the present invention provides a digital person generation method, including:
generating a structured execution instruction according to user input content;
determining a target synthesis path according to a first cache indicator in the structured execution instruction;
if the target synthesis path comprises at least two execution links, extracting intermediate content from a preset resource library in the first execution link, or calling a corresponding generation model to generate the intermediate content;
and taking the intermediate content as the input of the next execution link, generating, in each subsequent execution link, the content corresponding to the current execution link based on the intermediate content of the previous execution link, and generating and outputting a digital person video after all execution links are completed.
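
The method of the first aspect can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Instruction` field names, the string encoding of the first cache indicator (`"video"`, `"voice"`, `"text"`, `"none"`), the tuple keys of the resource library, and the stub functions `run_llm`, `run_tts` and `run_video` are all hypothetical and stand in for real models. The choice of which tags key each lookup follows claims 2 and 3.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    # Hypothetical layout of a structured execution instruction.
    first_cache: str    # assumed encoding: "video", "voice", "text" or "none"
    image_tag: str
    voice_tag: str
    knowledge_tag: str


# Stand-ins for the generation models; real models would be called here.
def run_llm(user_input: str) -> str:
    return f"text({user_input})"


def run_tts(text: str, voice_tag: str) -> str:
    return f"speech[{voice_tag}]({text})"


def run_video(speech: str, image_tag: str) -> str:
    return f"video[{image_tag}]({speech})"


def generate(instr: Instruction, user_input: str, library: dict):
    """Return (digital person video, list of models that were called)."""
    calls = []
    i, v, k = instr.image_tag, instr.voice_tag, instr.knowledge_tag

    if instr.first_cache == "video":
        # One execution link: the finished video is taken from the library.
        return library[("video", i, v, k)], calls

    if instr.first_cache == "voice":
        # Two links: the first link reuses cached voice content.
        speech = library[("voice", v, k)]
    else:
        # Three links: the first link yields text, cached or generated.
        if instr.first_cache == "text":
            text = library[("text", k)]
        else:
            text = run_llm(user_input)
            calls.append("llm")
        # Second link: text to speech, fed by the previous link's output.
        speech = run_tts(text, v)
        calls.append("tts")

    # Final link: speech-driven video generation.
    video = run_video(speech, i)
    calls.append("video")
    return video, calls


library = {
    ("text", "faq-1"): "Hello",
    ("voice", "v1", "faq-1"): "cached-speech",
    ("video", "img1", "v1", "faq-1"): "cached-video",
}
for mode in ("video", "voice", "text", "none"):
    instr = Instruction(mode, "img1", "v1", "faq-1")
    print(mode, generate(instr, "hi", library))
```

The list of model calls returned by `generate` makes the intended saving visible: the more content is found in the resource library, the fewer generation models are invoked for the request.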
In a second aspect, an embodiment of the present invention further provides a digital person generation apparatus, the apparatus including:
a first generation unit, configured to generate a structured execution instruction according to user input content;
a determining unit, configured to determine a target synthesis path according to a first cache indicator in the structured execution instruction;
an execution unit, configured to, if the target synthesis path comprises at least two execution links, extract intermediate content from a preset resource library in the first execution link, or call a corresponding generation model to generate the intermediate content;
and a second generation unit, configured to take the intermediate content as the input of the next execution link, generate, in each subsequent execution link, the content corresponding to the current execution link based on the intermediate content of the previous execution link, and generate and output a digital person video after all execution links are completed.

In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the method described in the first aspect when executing the computer program.

In a fourth aspect, embodiments of the present invention also provide a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, implement the method of the first aspect.

The invention provides a digital person generating method, device, equipment and medium, which comprise the steps of generating a structured execution instruction according to user input content, determining a target synthesis path according to a first cache indicator in the structured execution instruction, extracting intermediate content from a preset resource librar