CN-121999763-A - Method for generating speaking actions based on action semantic reference, voice rhythm, and text and voice of large model technology

CN 121999763 A

Abstract

The application provides a method for generating speaking actions based on an action semantic reference, voice rhythm, and the text and voice of large model technology. The method comprises: acquiring synchronized audio data, text data, and a reference action sequence; computing the vocal cord vibration frequency of the audio data and generating a rhythm sequence from that frequency and the signal power of the audio data; fusing the audio features of the audio data extracted by a large speech model, the text features of the text data extracted by a large language model, and the rhythm sequence to obtain context features; and generating a target action sequence according to the context features and the action semantic vector of the reference action sequence. Embodiments of the application can output a personalized action sequence that is highly relevant to the voice content, synchronized in rhythm, and consistent with the semantics of the reference action sequence.

Inventors

  • ZHAO TIANQI
  • BA JUN
  • MIAO YUANYUAN

Assignees

  • 北京聚力维度科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2025-12-12

Claims (10)

  1. A method for generating a speaking action based on an action semantic reference, a speech rhythm, and the text and speech of large model technology, the method comprising: acquiring synchronized audio data, text data, and a reference action sequence; generating a rhythm sequence of the audio data according to the vocal cord vibration frequency and the signal power of the audio data; fusing the audio features of the audio data extracted by a large speech model, the text features of the text data extracted by a large language model, and the rhythm sequence to obtain context features; and generating a target action sequence according to the context features and the action semantic vector of the reference action sequence.
  2. The method of claim 1, wherein after the target action sequence is generated, the method further comprises: judging whether the target action sequence matches the audio data and the text data; judging whether the action semantics of the target action sequence and the action semantics of the reference action sequence are the same; if the target action sequence matches the audio data and the text data, and the action semantics of the target action sequence and the reference action sequence are the same, taking the target action sequence as the final action sequence for output; and if the target action sequence does not match the audio data and the text data, and/or the action semantics of the target action sequence and the reference action sequence are not the same, re-executing the step of generating the target action sequence according to the context features and the action semantic vector of the reference action sequence.
  3. The method according to claim 1 or 2, wherein generating a rhythm sequence of the audio data from the vocal cord vibration frequency and signal power of the audio data comprises: dividing the audio data into a plurality of audio frames, wherein each audio frame corresponds to a vocal cord vibration frequency and a signal power; constructing a pitch track from the vocal cord vibration frequencies of the plurality of audio frames, and constructing a loudness track from the signal powers of the plurality of audio frames; and generating the rhythm sequence of the audio data according to a comparison of the pitch track and the loudness track.
  4. The method according to claim 3, wherein generating the rhythm sequence of the audio data based on the comparison of the pitch track and the loudness track comprises: detecting a plurality of pitch peaks from the pitch track and a plurality of loudness peaks from the loudness track; if the time difference between any pitch peak and any loudness peak is smaller than a time difference threshold, taking the time point corresponding to that pitch peak or loudness peak as a rhythm point; and generating the rhythm sequence from the plurality of rhythm points.
  5. The method of claim 4, further comprising inputting the reference action sequence to an action semantic encoder to obtain the action semantic vector of the reference action sequence, wherein the action semantic encoder is trained by: constructing positive sample pairs and negative sample pairs, wherein a positive sample pair refers to two action fragments with the same action semantics but different speaking moments and speaking contents; and training an initial encoder with the positive sample pairs, the negative sample pairs, and a contrastive loss function to obtain the action semantic encoder, wherein the contrastive loss function reduces the distance between positive sample pairs and increases the distance between negative sample pairs in the feature space, and the action semantic encoder encodes an input action sequence into an action semantic vector.
  6. The method of claim 1, wherein generating the target action sequence from the context features and the action semantic vector of the reference action sequence comprises: inputting the predicted actions at the first n-1 moments and the context features into an autoregressive action generator to obtain the predicted action at the nth moment; and generating the target action sequence from the predicted actions at all moments.
  7. The method of claim 6, wherein inputting the predicted actions at the first n-1 moments and the context features into the autoregressive action generator to obtain the predicted action at the nth moment comprises: performing self-attention processing on the predicted actions at the first n-1 moments to obtain a processed predicted action sequence; performing cross-attention processing with the processed predicted action sequence as the query vector and the context features as the key and value vectors to obtain a conditional context vector, the conditional context vector identifying the audio, text, and rhythm features within the context features that are most strongly associated with the current moment; inputting the conditional context vector and the action semantic vector into a style mapping network to obtain a scale parameter and an offset parameter; performing adaptive instance normalization on the conditional context vector with the scale parameter and the offset parameter to obtain a modulated vector with the same semantic style as the action semantic vector; and obtaining the predicted action at the nth moment from the modulated vector.
  8. The method according to claim 6 or 7, wherein the training of the autoregressive action generator comprises: training an initial generator with a content loss, an action semantic loss, a reconstruction loss, and a rhythm loss to obtain the autoregressive action generator, wherein the content loss represents the degree of content match between the target action sequence generated by the initial generator and the input audio and text data, the action semantic loss represents the degree of action semantic match between the generated target action sequence and the input reference action sequence, the reconstruction loss represents the error between the predicted action and the real action at each moment, and the rhythm loss represents the degree of synchronization between the predicted actions and the rhythm points at the same moments.
  9. An apparatus for generating a speaking action based on an action semantic reference, a speech rhythm, and the text and speech of large model technology, the apparatus comprising: a data acquisition module for acquiring synchronized audio data, text data, and a reference action sequence; a rhythm sequence generation module for generating a rhythm sequence of the audio data according to the vocal cord vibration frequency and the signal power of the audio data; a context feature generation module for fusing the audio features of the audio data extracted by a large speech model, the text features of the text data extracted by a large language model, and the rhythm sequence to obtain context features; and a target action sequence generation module for generating a target action sequence according to the context features and the action semantic vector of the reference action sequence.
  10. A computer device, comprising a memory and a processor in communication with each other, the memory storing computer instructions which, when executed by the processor, perform the method for generating a speaking action based on an action semantic reference, a speech rhythm, and the text and speech of large model technology of any one of claims 1 to 8.
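The fusion step of claim 1 can be pictured as concatenating three time-aligned feature streams and projecting them into a single context representation. The PyTorch sketch below is a minimal illustration, not the patented implementation: the feature dimensions, the linear projection, and the assumption that all three streams are resampled to a common frame rate are hypothetical choices the patent does not specify.

```python
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Fuse audio features, text features, and a rhythm sequence (cf. claim 1).

    All dimensions are assumed; the patent does not specify them.
    """
    def __init__(self, d_audio=768, d_text=1024, d_ctx=512):
        super().__init__()
        # The rhythm sequence enters as one scalar per frame (1 = rhythm point).
        self.proj = nn.Linear(d_audio + d_text + 1, d_ctx)

    def forward(self, audio_feat, text_feat, rhythm):
        # audio_feat: (T, d_audio) from a large speech model
        # text_feat:  (T, d_text) from a large language model, aligned to frames
        # rhythm:     (T,) binary rhythm-point indicators
        x = torch.cat([audio_feat, text_feat, rhythm.unsqueeze(-1)], dim=-1)
        return self.proj(x)  # (T, d_ctx) context features
```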
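Claim 3 builds a pitch track from the per-frame vocal cord vibration frequency (F0) and a loudness track from the per-frame signal power. One plausible realization uses librosa's YIN pitch estimator and RMS energy; the frame and hop lengths and the F0 search range are assumptions, not values from the patent.

```python
import librosa
import numpy as np

def build_tracks(wav_path, frame_length=2048, hop_length=512):
    """Per-frame F0 (pitch track) and signal power (loudness track), cf. claim 3."""
    y, sr = librosa.load(wav_path, sr=None)
    # Vocal cord vibration frequency per frame, via the YIN F0 estimator.
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr,
                     frame_length=frame_length, hop_length=hop_length)
    # Signal power per frame, approximated by RMS energy.
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)[0]
    times = librosa.frames_to_time(np.arange(len(f0)), sr=sr,
                                   hop_length=hop_length)
    return times, f0, rms
```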
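Claim 4 marks a rhythm point wherever a pitch peak and a loudness peak nearly coincide in time. A sketch using scipy.signal.find_peaks follows; the 50 ms threshold is an assumed value, since the claim only names "a time difference threshold".

```python
import numpy as np
from scipy.signal import find_peaks

def rhythm_points(times, pitch_track, loudness_track, max_dt=0.05):
    """Time points where pitch and loudness peaks align within max_dt seconds."""
    p_idx, _ = find_peaks(np.nan_to_num(pitch_track))
    l_idx, _ = find_peaks(loudness_track)
    if len(p_idx) == 0 or len(l_idx) == 0:
        return np.array([])
    points = []
    for i in p_idx:
        # Nearest loudness peak to this pitch peak.
        j = l_idx[np.argmin(np.abs(times[l_idx] - times[i]))]
        if abs(times[i] - times[j]) < max_dt:
            points.append(times[i])  # claim 4: either peak's time may be used
    return np.array(points)  # the rhythm sequence is built from these points
```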
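Claim 5 trains the action semantic encoder contrastively: the loss pulls positive pairs (same action semantics) together and pushes negative pairs apart in feature space. Below is one common margin-based formulation of such a loss; the margin value and the batch layout are assumptions, and the encoder architecture itself is left out because the patent does not describe it.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, same_semantics, margin=1.0):
    """One standard contrastive objective matching claim 5's description.

    z_a, z_b:        encoded action fragments, shape (B, D)
    same_semantics:  1.0 for positive pairs, 0.0 for negative pairs, shape (B,)
    """
    d = F.pairwise_distance(z_a, z_b)
    pos = same_semantics * d.pow(2)                         # pull positives together
    neg = (1 - same_semantics) * F.relu(margin - d).pow(2)  # push negatives apart
    return (pos + neg).mean()
```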
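Claims 6 and 7 describe one autoregressive decoding step: self-attention over the actions predicted so far, cross-attention against the context features (query = processed actions, key/value = context), a style mapping network that turns the conditional context vector and the action semantic vector into scale and offset parameters, and an adaptive-instance-normalization-style modulation. The condensed PyTorch sketch below uses assumed dimensions and, since the conditional context here is a single vector rather than a feature map, stands in a feature-wise normalization for the instance normalization of the claim.

```python
import torch
import torch.nn as nn

class AutoregressiveStep(nn.Module):
    """One decoding step of the action generator (cf. claims 6-7). Sizes assumed."""
    def __init__(self, d_model=512, d_sem=256, d_pose=165, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        # Style mapping network: (context vector, semantic vector) -> scale, offset.
        self.style = nn.Linear(d_model + d_sem, 2 * d_model)
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.head = nn.Linear(d_model, d_pose)

    def forward(self, prev_actions, context, sem_vec):
        # prev_actions: (B, n-1, d_model) embedded predicted actions so far
        # context:      (B, T, d_model) fused audio/text/rhythm features
        # sem_vec:      (B, d_sem) action semantic vector of the reference sequence
        h, _ = self.self_attn(prev_actions, prev_actions, prev_actions)
        # Cross attention: the processed actions query the context features.
        cond, _ = self.cross_attn(h, context, context)
        cond = cond[:, -1]  # conditional context vector for the current moment
        scale, offset = self.style(torch.cat([cond, sem_vec], -1)).chunk(2, -1)
        # AdaIN-style modulation: normalize, then apply learned scale and offset.
        mod = self.norm(cond) * scale + offset
        return self.head(mod)  # predicted action at the nth moment
```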
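Claim 8 trains the generator with four loss terms. The patent does not say how the content, semantic, and rhythm terms are computed internally, so the sketch below treats them as precomputed scores and only shows the weighted combination; the weights are placeholders.

```python
def generator_loss(pred, real, ctx_score, sem_score, beat_score,
                   w=(1.0, 0.5, 1.0, 0.5)):
    """Weighted sum of claim 8's four losses. Weights are assumed, not from the patent.

    pred, real:  predicted / ground-truth action sequences, (B, T, D) tensors
    ctx_score:   content loss (match between generated actions and audio + text)
    sem_score:   action semantic loss (match to the reference action sequence)
    beat_score:  rhythm loss (sync of predicted actions with rhythm points)
    """
    recon = (pred - real).pow(2).mean()  # reconstruction: per-moment pose error
    return w[0] * ctx_score + w[1] * sem_score + w[2] * recon + w[3] * beat_score
```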

Description

Method for generating speaking actions based on action semantic reference, voice rhythm, and text and voice of large model technology

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a method for generating speaking actions from text and speech based on an action semantic reference, a voice rhythm, and large model technology.

Background

With the vigorous development of the metaverse and virtual digital human industries, how to make a virtual character perform speaking actions with a specific state and attitude as naturally as a real person, according to the voice content, has become a key technical bottleneck for improving immersive interactive experience. The prior art mainly has the following defects:

1. Most methods rely only on voice and text to generate actions, so the behavior of all virtual characters tends to be the same. They cannot embody the specific postures required by different scenes or characters (such as standing while lecturing, communicating while seated, or the tottering gait of the elderly); the generated digital human action semantics are uncontrollable, and realism is lacking.

2. Existing models have difficulty accurately correlating specific limb movements (e.g., spreading the hands, pointing, waving the hand) with key semantics in speech (e.g., negation, enumeration, emphasis). The motion often fails to match the content of the utterance; for example, a large, cheerful motion may be generated while a sad event is being described.

3. Some studies attempt to introduce rhythm control, but fail to resolve the core contradiction: how to make the generated motion conform to the speech rhythm while maintaining the specific state or posture implied by the reference action (e.g., a stable sitting posture, a modeling gait, a lazy reclining pose).

Disclosure of Invention

In view of this, the application provides a method for generating speaking actions based on an action semantic reference, voice rhythm, and the text and voice of large model technology, so as to solve the problems of existing action generation technology: a single action state, weak semantic relevance, and the difficulty of reconciling rhythmic feel with action semantics.

An embodiment of the first aspect of the present application provides a method for generating speaking actions based on an action semantic reference, a speech rhythm, and the text and speech of large model technology, the method comprising: acquiring synchronized audio data, text data, and a reference action sequence; generating a rhythm sequence of the audio data according to the vocal cord vibration frequency and the signal power of the audio data; fusing the audio features of the audio data extracted by a large speech model, the text features of the text data extracted by a large language model, and the rhythm sequence to obtain context features; and generating a target action sequence according to the context features and the action semantic vector of the reference action sequence.

By introducing the action semantics of the reference action sequence, embodiments of the application can give distinct behaviors and postures to different digital humans, realizing diversified, situational generation for different application scenes (such as a standing lecture, a seated interview, or a specific character's walk).
Preferably, the context features are obtained by fusing the audio features extracted by the large speech model, the text features extracted by the large language model, and the rhythm sequence, and the target action sequence is generated from the context features and the action semantic vector of the reference action sequence, so that a personalized action sequence that is highly relevant to the voice content, synchronized in rhythm, and consistent with the semantics of the reference action sequence can be output.

In an embodiment of the present application, after the target action sequence is generated, the method further includes: judging whether the target action sequence matches the audio data and the text data; judging whether the action semantics of the target action sequence are the same as those of the reference action sequence; if the target action sequence matches the audio data and the text data, and the action semantics of the target action sequence and the reference action sequence are the same, taking the target action sequence as the final action sequence for output; and if the target action sequence does not match the audio data and the text data, and/or the action semantics of the target action sequence and the reference action sequence are not the same, re-executing the step of generating the target action sequence according to the context features and the action semantic vector of the reference action sequence. This embodiment of the application can thereby verify, before output, that the generated action sequence both matches the speech content and preserves the action semantics of the reference action sequence.
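This verification step is naturally a bounded generate-and-check loop. The sketch below is one way to organize it; the helpers matches_content and same_semantics are hypothetical stubs, since the patent does not say how the two judgments are implemented, and the retry cap is an assumption added so the loop terminates.

```python
def matches_content(seq, audio, text):
    """Hypothetical check that the sequence matches the audio and text."""
    raise NotImplementedError  # the patent leaves this judgment unspecified

def same_semantics(seq, ref_seq):
    """Hypothetical check that action semantics match the reference sequence."""
    raise NotImplementedError  # the patent leaves this judgment unspecified

def generate_with_verification(generator, ctx, sem_vec, audio, text, ref_seq,
                               max_retries=5):
    """Regenerate until the sequence matches audio/text and the reference semantics."""
    seq = None
    for _ in range(max_retries):
        seq = generator(ctx, sem_vec)
        if matches_content(seq, audio, text) and same_semantics(seq, ref_seq):
            return seq  # final action sequence for output
    return seq  # fall back to the last attempt after max_retries
```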