CN-122002057-A - Real person and digital person synchronous real-time live broadcast method and system

CN122002057A

Abstract

The application relates to the field of information technology, and in particular to a method and system for synchronous real-time live broadcasting by a real person and a digital person. The method comprises: setting an auxiliary broadcast screen in front of the real person; collecting the real person's video stream at the current moment and obtaining the real person's current action sequence features and corresponding speech text; outputting a predicted action label for the real person in the next prediction short period based on the action sequence features; generating a predicted speech text for the next prediction short period based on a speech prediction model trained on live broadcast material and the real person's historical live speech text; driving a digital person to produce actions and speech for the next prediction short period according to the predicted action label and predicted speech text, and synthesizing the corresponding digital person animation; superimposing the digital person animation for the next prediction short period onto the real person video stream in real time to obtain a live video stream; and playing the live video stream at the current moment through the auxiliary broadcast screen.

Inventors

  • Zhu Yuchengxi

Assignees

  • 头流(杭州)网络科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2026-04-09

Claims (10)

  1. A method for synchronous real-time live broadcasting by a real person and a digital person, characterized by comprising the following steps: setting an auxiliary broadcast screen in front of the real person; acquiring the real person's video stream at the current moment, and analyzing the video stream in real time to obtain the real person's current action sequence features and corresponding speech text; outputting a predicted action label for the real person in the next prediction short period according to the action sequence features, based on an action prediction model trained in advance on the real person's historical action data and speech data; generating a predicted speech text for the next prediction short period based on a speech prediction model trained on live broadcast material and the real person's historical live speech text; driving a digital person to produce actions and speech for the next prediction short period according to the predicted action label and predicted speech text, and synthesizing the corresponding digital person animation; superimposing the digital person animation for the next prediction short period onto the real-time real person video stream to obtain a live video stream; and playing the live video stream through the auxiliary broadcast screen at the current moment.
  2. The method for synchronous real-time live broadcasting by a real person and a digital person according to claim 1, wherein analyzing the real person video stream in real time to obtain the real person's current action sequence features comprises: configuring a fine-grained action label library, and extracting the real person's human-body key point coordinates and facial key point coordinates from the video stream; identifying and matching fine-grained action labels, including arm actions, hand actions, upper-body actions, neck actions, head actions, mouth actions and eye actions, based on the human-body key point coordinates and facial key point coordinates; and organizing the fine-grained action labels into an action sequence in chronological order, annotated with duration and amplitude coefficients, to form the action sequence features.
  3. The method for synchronous real-time live broadcasting by a real person and a digital person according to claim 1, wherein training the action prediction model in advance on the real person's historical action data and speech data comprises: collecting the real person's historical live video and its speech, and extracting multi-modal time-series data comprising a time-varying human-body key point coordinate sequence, facial key point coordinate sequence, speech intonation sequence and speech text; aligning the multi-modal time-series data on the time axis, and segmenting it into multi-modal feature data with the prediction short period as the duration; annotating the multi-modal feature data with action labels, and constructing a neural network model that takes the action labels, speech intonation sequence and speech text of the current prediction short period as input and the action labels of the next short period as output; and training the neural network model on the annotated multi-modal feature data to obtain the action prediction model.
  4. The method for synchronous real-time live broadcasting by a real person and a digital person according to claim 1, wherein training the speech prediction model on live broadcast material and the real person's historical live speech text comprises: collecting the speech from historical live broadcasts and recognizing it to obtain speech text; identifying semantic features, emotional tendencies and topic information from the speech text and the corresponding live broadcast material; aligning the speech text, semantic features, emotional tendencies and topic information on the time axis to obtain a text data sequence; segmenting the text data sequence with the prediction short period as the duration to obtain text data; constructing a sequence-to-sequence speech content prediction model that takes the speech text, semantic features, emotional tendencies and topic information of the text data in the current prediction short period as input and the speech text actually spoken in the next prediction short period as the output target; and training the speech content prediction model on the text data, optimizing until the generation accuracy exceeds a preset threshold, to obtain the speech prediction model.
  5. The method for synchronous real-time live broadcasting by a real person and a digital person according to claim 1, wherein driving the digital person to produce actions and speech for the next prediction short period according to the predicted action label and predicted speech text comprises: mapping the predicted action label to skeletal action parameters of the digital person, and generating limb action key frames and facial animation key frames matching the action label; inputting the predicted speech text into a speech synthesis module to obtain synthesized speech; and driving the digital person's mouth-shape sequence based on the synthesized speech waveform, and matching preset mouth animations according to the mouth-shape sequence.
  6. The method for synchronous real-time live broadcasting by a real person and a digital person according to claim 5, wherein synthesizing the corresponding digital person animation comprises: generating the digital person animation from the limb action key frames, facial animation key frames and mouth animations; and superimposing the synthesized speech onto the digital person animation.
  7. The method for synchronous real-time live broadcasting by a real person and a digital person according to claim 1, further comprising: receiving a time-axis adjustment instruction, and superimposing the digital person animation onto the real-time real person video stream after advancing or delaying it according to the time-axis adjustment instruction.
  8. A system for synchronous real-time live broadcasting by a real person and a digital person, characterized in that an auxiliary broadcast screen is set in front of the real person, the system comprising: an acquisition module, which acquires the real person's video stream at the current moment and analyzes it in real time to obtain the real person's current action sequence features and corresponding speech text; a first prediction module, which outputs a predicted action label for the real person in the next prediction short period according to the action sequence features, based on an action prediction model trained in advance on the real person's historical action data and speech data; a second prediction module, which generates a predicted speech text for the next prediction short period based on a speech prediction model trained on live broadcast material and the real person's historical live speech text; a driving module, which drives the digital person to produce actions and speech for the next prediction short period according to the predicted action label and predicted speech text, and synthesizes the corresponding digital person animation; a superposition module, which superimposes the digital person animation for the next prediction short period onto the real-time real person video stream to obtain a live video stream; and a playing module, which plays the live video stream through the auxiliary broadcast screen at the current moment.
  9. An electronic device comprising a processor and a memory, the processor being connected to the memory and the memory being used to store executable program code; the processor runs a program corresponding to the executable program code stored in the memory, by reading the executable program code, to perform the method according to any one of claims 1-7.
  10. A computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the method according to any one of claims 1-7.
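The key-point-to-label matching and sequence organization described in claim 2 can be illustrated with a minimal sketch. The keypoint layout, label names, angle thresholds and frame rate below are assumptions chosen for illustration; the patent does not specify them.

```python
import math
from itertools import groupby

def angle(a, b, c):
    """Angle at vertex b (degrees) formed by 2D points a-b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.degrees(math.acos(dot / (math.hypot(*v1) * math.hypot(*v2))))

def match_arm_label(shoulder, elbow, wrist):
    """Match a coarse fine-grained arm label from three keypoints.

    Image coordinates: y grows downward, so a wrist above the shoulder
    has wrist_y < shoulder_y. Labels and thresholds are illustrative.
    """
    bend = angle(shoulder, elbow, wrist)  # elbow bend angle
    if wrist[1] < shoulder[1]:
        return "arm_raised"
    if bend < 100:
        return "arm_bent"
    return "arm_relaxed"

def to_sequence(labels, fps=25):
    """Run-length encode per-frame labels into (label, duration_s) pairs,
    i.e. organize the fine-grained labels chronologically with durations."""
    return [(lab, len(list(g)) / fps) for lab, g in groupby(labels)]

# One frame's keypoints (x, y) in pixels -- hypothetical values.
print(match_arm_label(shoulder=(320, 200), elbow=(360, 260), wrist=(380, 150)))
print(to_sequence(["arm_raised", "arm_raised", "arm_relaxed"], fps=1))
```

A production system would derive the amplitude coefficient from the same geometry (e.g. normalized joint displacement), which is omitted here for brevity.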

Description

Real person and digital person synchronous real-time live broadcast method and system

Technical Field

The application relates to the field of information technology, and in particular to a method and system for synchronous real-time live broadcasting by a real person and a digital person.

Background

Against the backdrop of the rapid development of live broadcasting technology, live broadcasting by real hosts has become an important form of communication in fields such as e-commerce, education and entertainment. However, conventional live broadcasting suffers from limited expressiveness, a single form of interaction, and poor content reusability. To improve the immersion and intelligence of live broadcasts, digital human technology has gradually been introduced into live scenarios. A digital person offers high controllability and image customization, but if its actions and speech are entirely controlled by a preset script or by manual operation, natural, smooth, real-time collaborative expression with the live host is difficult to achieve. In the prior art, some schemes attempt to superimpose the pictures of a real person and a digital person to enhance visual effects, but they often lack deep understanding and forward-looking prediction of the real person's behavior, so the digital person's responses lag, its actions are stiff, and its mouth shapes are out of sync, seriously affecting the user experience. Accordingly, digital human live broadcasting technology requires continued study.

Disclosure of Invention

Various embodiments of the present specification describe a method and system for synchronous real-time live broadcasting by a real person and a digital person.
In a first aspect, an embodiment of the present disclosure provides a method for synchronous real-time live broadcasting by a real person and a digital person, comprising the following steps: setting an auxiliary broadcast screen in front of the real person; acquiring the real person's video stream at the current moment, and analyzing the video stream in real time to obtain the real person's current action sequence features and corresponding speech text; outputting a predicted action label for the real person in the next prediction short period according to the action sequence features, based on an action prediction model trained in advance on the real person's historical action data and speech data; generating a predicted speech text for the next prediction short period based on a speech prediction model trained on live broadcast material and the real person's historical live speech text; driving a digital person to produce actions and speech for the next prediction short period according to the predicted action label and predicted speech text, and synthesizing the corresponding digital person animation; superimposing the digital person animation for the next prediction short period onto the real-time real person video stream to obtain a live video stream; and playing the live video stream through the auxiliary broadcast screen at the current moment.
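The per-period loop of the first aspect can be sketched as follows. Every function body here is a hypothetical stand-in: `predict_next` would invoke the trained action and speech prediction models, and `render_digital_human` the animation and speech synthesis pipeline; the period length and all returned values are invented for illustration.

```python
from dataclasses import dataclass

PERIOD_S = 2.0  # assumed length of one "prediction short period"

@dataclass
class Prediction:
    action_label: str
    speech_text: str

def predict_next(action_features, speech_text, material):
    """Stand-in for the action prediction and speech prediction models."""
    # A real system would run the trained models on the inputs; we return
    # a fixed, illustrative prediction for the next short period.
    return Prediction(action_label="point_at_product",
                      speech_text="Take a look at this item.")

def render_digital_human(pred):
    """Stand-in for synthesizing digital person animation and voice."""
    return {"animation": f"anim[{pred.action_label}]",
            "audio": f"tts[{pred.speech_text}]"}

def live_loop(periods):
    """One iteration per prediction short period: analyze -> predict ->
    render -> composite with the real-time real person video stream."""
    out = []
    for features, text, material in periods:
        pred = predict_next(features, text, material)
        dh = render_digital_human(pred)
        out.append(("overlay", dh["animation"], dh["audio"]))
    return out

print(live_loop([("features_t0", "hello everyone", "product_sheet")]))
```

The point of the structure is that prediction runs one short period ahead of playback, so the composited stream shown on the auxiliary screen never waits on synthesis.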
In a second aspect, embodiments of the present disclosure provide a system for synchronous real-time live broadcasting by a real person and a digital person, where an auxiliary broadcast screen is set in front of the real person, the system comprising: an acquisition module, which acquires the real person's video stream at the current moment and analyzes it in real time to obtain the real person's current action sequence features and corresponding speech text; a first prediction module, which outputs a predicted action label for the real person in the next prediction short period according to the action sequence features, based on an action prediction model trained in advance on the real person's historical action data and speech data; a second prediction module, which generates a predicted speech text for the next prediction short period based on a speech prediction model trained on live broadcast material and the real person's historical live speech text; a driving module, which drives the digital person to produce actions and speech for the next prediction short period according to the predicted action label and predicted speech text, and synthesizes the corresponding digital person animation; a superposition module, which superimposes the digital person animation for the next prediction short period onto the real-time real person video stream to obtain a live video stream; and a playing module, which plays the live video stream through the auxiliary broadcast screen at the current moment. In a third aspect, embodiments of the present disclosure provide an electronic device comprising a processor and a memory; the processor is connected to the memory; the memory is used to store executable program code; the processor runs a program corresponding to the executable program code stored in the memory, by reading the executable program code, to perform the method described above.
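The superposition module's job, together with the time-axis adjustment of claim 7, amounts to compositing the animation stream onto the real stream after an optional frame shift. A minimal sketch, assuming frames can be indexed and that `offset` counts frames (positive delays the animation, negative advances it); real frames are strings here, whereas a real system would blend pixels.

```python
def superimpose(real_frames, anim_frames, offset=0):
    """Pair each real frame with the animation frame shifted by `offset`.

    Positive offset delays the animation relative to the real stream;
    negative offset advances it. Frames outside the animation buffer
    yield None (no overlay for that frame).
    """
    out = []
    for i, real in enumerate(real_frames):
        j = i - offset
        anim = anim_frames[j] if 0 <= j < len(anim_frames) else None
        out.append((real, anim))
    return out

real = ["r0", "r1", "r2", "r3"]
anim = ["a0", "a1", "a2", "a3"]
# Delaying the animation by one frame pairs r1 with a0, r2 with a1, ...
print(superimpose(real, anim, offset=1))
```

In practice the offset would be derived from the time-axis adjustment instruction and applied continuously, but the buffering idea is the same.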