
CN-121982172-A - Text-driven virtual character lip synchronization deep learning method

CN 121982172 A

Abstract

The invention discloses a text-driven deep learning method for synchronizing the lip shapes of a virtual character, in the technical field of artificial intelligence. The method comprises: identifying format control characters in the original text to generate a perforation mask, computing a perforation field from the mask, and extracting a visible sequence; feeding the visible sequence and the aligned perforation field into a deep sequence network, which outputs a repair representation and a repair uncertainty; generating pronunciation-unit embeddings and base unit durations from the repair representation, and modulating the base durations with the repair uncertainty to obtain final unit durations; and finally generating a frame-level mouth-shape parameter sequence jointly with the repair uncertainty and mapping it into driving quantities. The invention suppresses high-frequency jitter of the mouth shape and improves the robustness of lip synchronization for live-streaming digital humans.

Inventors

  • WANG XIAOWEI
  • SUN HAIDONG
  • CHEN JIA

Assignees

  • 长空数字图像科技(无锡)有限公司 (Changkong Digital Image Technology (Wuxi) Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2026-01-29

Claims (8)

  1. A text-driven virtual character lip synchronization deep learning method, characterized by comprising the following steps: S1, acquiring an original text character sequence and identifying format control characters in it to generate a perforation mask; S2, computing a perforation field based on the perforation mask, and extracting from the original text character sequence a visible sequence and a perforation field aligned with it; S3, inputting the visible sequence and the aligned perforation field into a deep sequence network, which outputs a repair representation and a repair uncertainty; S4, generating pronunciation-unit embeddings and base unit durations from the repair representation, and adjusting the base durations according to the repair uncertainty to obtain final unit durations; S5, unrolling the pronunciation-unit embeddings along the time dimension according to the final unit durations, and generating a frame-level mouth-shape parameter sequence jointly with the repair uncertainty; and S6, mapping the frame-level mouth-shape parameter sequence into virtual character driving quantities.
  2. The text-driven virtual character lip synchronization deep learning method of claim 1, wherein identifying format control characters in the original text character sequence to generate a perforation mask comprises: traversing each character in the original text character sequence; and judging whether the current character belongs to a preset set of format control characters: if so, setting the perforation mask value at the corresponding position to 1; otherwise, setting it to 0.
  3. The text-driven virtual character lip synchronization deep learning method of claim 2, wherein computing a perforation field based on the perforation mask and extracting from the original text character sequence a visible sequence and a perforation field aligned with it comprises: adaptively determining the sliding-window size from the length of the original text character sequence; for each position in the sequence, computing the cumulative perforation-mask value within the window's coverage; normalizing the cumulative value to obtain the perforation field; and extracting the characters at positions where the perforation mask is 0 to form the visible sequence, and extracting the perforation-field components at the same positions to form the aligned perforation field.
  4. The text-driven virtual character lip synchronization deep learning method of claim 1, wherein inputting the visible sequence and the aligned perforation field into the deep sequence network comprises: embedding the visible characters of the visible sequence to obtain character embeddings, and generating the corresponding temporal position encodings; mapping the aligned perforation field into perforation feature vectors; and concatenating the character embeddings, temporal position encodings, and perforation feature vectors, and feeding the result into the deep sequence network.
  5. The text-driven virtual character lip synchronization deep learning method of claim 1, wherein outputting the repair representation and repair uncertainty comprises: outputting, through the deep sequence network, a repair probability distribution at each time step; taking a weighted sum over the repair probability distribution to obtain the repair representation; and computing the entropy of the repair probability distribution as the repair uncertainty.
  6. The text-driven virtual character lip synchronization deep learning method of claim 1, wherein adjusting the base unit duration according to the repair uncertainty comprises: computing the mean repair uncertainty over all time steps of the visible sequence; for each time step, computing the ratio of the current repair uncertainty to that mean to obtain a modulation factor; and obtaining the final unit duration as the product of the base unit duration and the modulation factor.
  7. The text-driven virtual character lip synchronization deep learning method of claim 1, wherein generating the frame-level mouth-shape parameter sequence jointly with the repair uncertainty comprises: determining the number of frames for each pronunciation unit from its final unit duration; generating each frame's relative phase within its pronunciation unit and converting it into a phase embedding; mapping the repair uncertainty into an uncertainty embedding; and concatenating the pronunciation-unit embedding, the phase embedding, and the uncertainty embedding, and feeding the result into a deep generative network to obtain the frame-level mouth-shape parameter sequence.
  8. The text-driven virtual character lip synchronization deep learning method of claim 1, wherein mapping the frame-level mouth-shape parameter sequence into virtual character driving quantities comprises: converting the frame-level mouth-shape parameter sequence into the virtual character's bone control quantities or expression blendshape coefficients through a linear mapping matrix and an offset vector.
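The patent gives no reference implementation of claim 2. As a minimal sketch, one could treat the "preset format control character set" as Unicode general category `Cf` (format characters such as the zero-width space); the patent does not specify this set, so the category test here is an assumption:

```python
import unicodedata

def perforation_mask(text: str) -> list[int]:
    """Mark each character of the text: 1 if it is an invisible format
    control character (Unicode category 'Cf', assumed here to stand in
    for the patent's preset set), 0 for an ordinary visible character."""
    return [1 if unicodedata.category(ch) == "Cf" else 0 for ch in text]

# Example: a zero-width space (U+200B) hidden inside "live"
mask = perforation_mask("li\u200bve")
# mask == [0, 0, 1, 0, 0]
```

Because the mask is computed per original index, the original position information the background section warns about losing is preserved alongside the text.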
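Claim 3's sliding-window perforation field can be sketched as follows. The adaptive window rule (`max(3, len(text) // 10)`) and normalizing by the actual window coverage are illustrative assumptions, since the claim leaves both unspecified:

```python
def perforation_field(text: str, mask: list[int], min_win: int = 3):
    """Compute a normalized sliding-window density of the perforation
    mask, then keep only the components aligned with visible characters."""
    w = max(min_win, len(text) // 10)   # assumed adaptive window rule
    half = w // 2
    field = []
    for i in range(len(mask)):
        lo, hi = max(0, i - half), min(len(mask), i + half + 1)
        field.append(sum(mask[lo:hi]) / (hi - lo))  # normalize by coverage
    visible = "".join(c for c, m in zip(text, mask) if m == 0)
    aligned = [f for f, m in zip(field, mask) if m == 0]
    return visible, aligned

vis, fld = perforation_field("li\u200bve", [0, 0, 1, 0, 0])
# vis == "live"; fld peaks near the removed zero-width character
```

Visible characters adjacent to a perforation thus carry a nonzero field value, which is what lets the downstream network know where the sequence was disturbed.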
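Claim 5 defines the repair uncertainty as the entropy of the per-step repair probability distribution. A sketch of that quantity (Shannon entropy in nats; the patent does not fix the logarithm base):

```python
import math

def repair_uncertainty(probs: list[float]) -> float:
    """Shannon entropy of a repair probability distribution; higher
    entropy means the network is less certain how to repair this step."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

low  = repair_uncertainty([0.97, 0.01, 0.01, 0.01])  # near-certain repair
high = repair_uncertainty([0.25, 0.25, 0.25, 0.25])  # maximally ambiguous
# high > low
```

A uniform distribution over k repair candidates gives the maximum value log(k), so the uncertainty is naturally bounded and comparable across time steps.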
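Claim 6's duration modulation is a simple ratio-to-mean scaling. The following sketch follows the claim directly; the small epsilon guard against a zero mean is an added assumption:

```python
def modulate_durations(base: list[float], uncertainty: list[float],
                       eps: float = 1e-8) -> list[float]:
    """Scale each unit's base duration by the ratio of its repair
    uncertainty to the mean uncertainty over the visible sequence."""
    mean_u = sum(uncertainty) / len(uncertainty)
    return [d * (u / (mean_u + eps)) for d, u in zip(base, uncertainty)]

# Units with above-average ambiguity get stretched, certain ones shrink:
final = modulate_durations([0.2, 0.2], [1.0, 3.0])
# final approximately [0.1, 0.3]
```

Note that the mean of the modulation factors is 1, so total utterance length is roughly preserved while ambiguous stretches receive more frames.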
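The frame expansion of claim 7 (unit durations to per-frame relative phases) can be sketched as below; the 25 fps default and the rounding rule are assumptions, as the patent does not state a frame rate:

```python
def expand_frames(durations: list[float], fps: int = 25):
    """Expand per-unit durations (seconds) into (unit_index, phase)
    pairs, where phase in [0, 1) is the frame's relative position
    inside its pronunciation unit, ready to be mapped to a phase embedding."""
    frames = []
    for unit_idx, d in enumerate(durations):
        n = max(1, round(d * fps))      # at least one frame per unit
        for k in range(n):
            frames.append((unit_idx, k / n))
    return frames

frames = expand_frames([0.12])  # one 120 ms unit at 25 fps -> 3 frames
```

Each `(unit_idx, phase)` pair would then be concatenated with the unit embedding and the uncertainty embedding before entering the deep generative network.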
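Claim 8 is an affine map p -> Wp + b from mouth-shape parameters to bone controls or blendshape coefficients. A minimal dependency-free sketch (W and b would be fitted offline for the target character rig; the names and shapes here are illustrative):

```python
def map_to_drive(frame_params: list[list[float]],
                 W: list[list[float]], b: list[float]) -> list[list[float]]:
    """Apply the linear mapping matrix W and offset vector b to each
    frame's mouth-shape parameter vector, yielding per-frame driving
    quantities (bone controls or blendshape coefficients)."""
    return [[sum(w_ij * p_j for w_ij, p_j in zip(row, p)) + b_i
             for row, b_i in zip(W, b)]
            for p in frame_params]

out = map_to_drive([[1.0, 1.0]],          # one frame, two parameters
                   [[1.0, 0.0],
                    [0.0, 2.0]],          # 2x2 mapping matrix
                   [0.5, 0.0])            # offset vector
# out == [[1.5, 2.0]]
```

In production this would be a single batched matrix multiply, but the per-frame form matches the claim's wording most directly.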

Description

Text-driven virtual character lip synchronization deep learning method

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a deep learning method for synchronizing the lip shapes of a text-driven virtual character.

Background

With the development of artificial intelligence and computer vision technology, virtual digital humans are widely used in live streaming, bullet-chat interaction, product information broadcasting, and similar scenarios. Such systems typically take text generated from a live bullet-chat or product-information stream as input, produce driving parameters through pronunciation-unit modeling and duration prediction, and then control the mouth motion of the virtual character to render a speaking animation.

In practice, however, the input text comes from complex, non-standard sources. Besides ordinary visible characters, it often contains Unicode format control characters that are invisible to the human eye. Although these characters are hard to perceive at the display level, they change the underlying discrete structure of the string, creating an invisible fracture between the visual content and the sequence the computer processes. This structural discrepancy breaks the continuity of word and subword segmentation and produces unknown fragments or boundary anomalies in pronunciation-unit generation, so that the final lip animation exhibits unnatural jitter or temporal misalignment, seriously degrading the visual experience of the digital human.

Prior art typically handles such non-canonical text with rule-based preprocessing cleanup or simple regular-expression filtering. These approaches struggle to preserve the integrity of the text indices, and the original position information is easily lost, so the downstream driving stages cannot be aligned accurately.
Meanwhile, existing deep learning models lack an explicit measurement mechanism for this invisible interference: the disturbance intensity and restoration ambiguity that invisible characters impose on the neighboring sequence cannot be computed effectively. Lacking any means to quantify and modulate this structural uncertainty, existing models cannot adaptively adjust their generation strategy according to the degree of interference during duration expansion and parameter generation. As a result, the mouth-shape parameters are prone to high-frequency jumps or collapse to a neutral pose under noisy input, which falls short of the high robustness and naturalness required for virtual character lip synchronization in live-streaming scenarios.

Disclosure of Invention

The invention aims to overcome the defect in the prior art that the invisible perforation formed by format control characters in the input text, and the disturbance it applies to the sequence structure, cannot be handled effectively, so that repair ambiguity arises in the pronunciation-unit modeling and duration-expansion stages and the generated lip-sync animation of the virtual character exhibits high-frequency jumps and unnatural timing. To this end, the invention provides a text-driven virtual character lip synchronization deep learning method.
To solve the problems in the prior art, the invention adopts the following technical scheme. The text-driven virtual character lip synchronization deep learning method comprises the following steps: S1, acquiring an original text character sequence and identifying format control characters in it to generate a perforation mask; S2, computing a perforation field based on the perforation mask, and extracting from the original text character sequence a visible sequence and a perforation field aligned with it; S3, inputting the visible sequence and the aligned perforation field into a deep sequence network, which outputs a repair representation and a repair uncertainty; S4, generating pronunciation-unit embeddings and base unit durations from the repair representation, and adjusting the base durations according to the repair uncertainty to obtain final unit durations; S5, unrolling the pronunciation-unit embeddings along the time dimension according to the final unit durations, and generating a frame-level mouth-shape parameter sequence jointly with the repair uncertainty; and S6, mapping the frame-level mouth-shape parameter sequence into virtual character driving quantities.

Preferably, identifying the format control characters in the original text character sequence to generate a perforation mask includes: traversing each character in the original text character sequence; and judging whether the current character belongs to a preset set of format control characters: if so, setting the perforation mask value at the corresponding position to 1; otherwise, setting it to 0.