
US-12626689-B2 - Determining emotion sequences for speech for conversational AI systems and applications


Abstract

In various examples, determining emotion sequences for speech in conversational AI systems and applications is described herein. Systems and methods are disclosed that use one or more first machine learning models to determine a sequence of emotional states associated with audio data representing speech. To use the first machine learning model(s), the systems and methods may train the first machine learning model(s) using one or more second machine learning models, where the second machine learning model(s) is trained to determine scores indicating accuracies associated with sequences of emotional states. For instance, the second machine learning model(s) may be trained to determine the scores using audio data representing speech, sequences of emotional states associated with the speech, and indications of which sequences of emotional states better represent the speech as compared to other sequences of emotional states.
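For concreteness, the sketch below shows one way an emotion-sequence predictor of the kind summarized above could be structured: a model that consumes a sequence of audio feature frames and emits one emotional-state prediction per frame (or chunk of frames) rather than a single label for the whole utterance. This is a minimal illustration only; the GRU architecture, the feature dimensions, the class name EmotionSequencePredictor, and the emotion label set are assumptions made for this sketch, not details taken from the patent.

```python
# Minimal sketch (PyTorch) of an emotion-sequence predictor: audio feature
# frames in, one emotion distribution per frame out. All dimensions, names,
# and the label set are illustrative assumptions, not details from the patent.
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "sad", "angry"]  # hypothetical label set

class EmotionSequencePredictor(nn.Module):
    def __init__(self, n_features: int = 80, hidden: int = 256):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, len(EMOTIONS))

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, frames, n_features), e.g. mel-spectrogram frames
        encoded, _ = self.encoder(audio_features)
        return self.head(encoded)  # (batch, frames, num_emotions) logits

# Example: two seconds of 100 frames-per-second features -> one state per frame.
model = EmotionSequencePredictor()
features = torch.randn(1, 200, 80)
logits = model(features)
sequence = [EMOTIONS[i] for i in logits.argmax(dim=-1)[0].tolist()]
```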

Inventors

  • Ilia Fedorov
  • Dmitry Korobchenko

Assignees

  • NVIDIA CORPORATION

Dates

Publication Date
2026-05-12
Application Date
2023-08-01

Claims (20)

  1. A method comprising: determining, using one or more first machine learning models and based at least on first audio data representative of first speech, a first sequence of emotional states, wherein the one or more first machine learning models are trained, at least, by: determining, based at least on the one or more first machine learning models processing second audio data representative of second speech, a second sequence of emotional states associated with the second speech; determining, based at least on one or more second machine learning models processing the second audio data and input data representative of the second sequence of emotional states, a score associated with the second sequence of emotional states as determined using the one or more first machine learning models; and updating one or more parameters of the one or more first machine learning models based at least on the score determined using the one or more second machine learning models.
  2. The method of claim 1, wherein the first sequence of emotional states indicates at least a first emotional state associated with a first portion of the first audio data and a second emotional state associated with a second portion of the first audio data.
  3. The method of claim 1, wherein the score indicates a similarity between one or more predicted emotional states associated with the second sequence of emotional states and one or more actual emotional states associated with the second speech represented by the second audio data.
  4. The method of claim 1, wherein the updating the one or more first machine learning models comprises: determining a loss based at least on the score; and updating, based at least on the loss, the one or more parameters associated with the one or more first machine learning models.
  5. The method of claim 1, wherein the one or more first machine learning models are further trained, at least, by: generating training data representative of at least third speech, a third sequence of emotional states associated with the third speech, a fourth sequence of emotional states associated with the third speech, and an indication that the third sequence of emotional states better reflects the third speech than the fourth sequence of emotional states; and updating one or more second parameters of the one or more second machine learning models based at least on the training data.
  6. The method of claim 5, wherein the generating the training data comprises: generating, based at least on third audio data representative of the third speech and the third sequence of emotional states, first video data representative of a first video depicting a first animation associated with the third sequence of emotional states; generating, based at least on the third audio data representative of the third speech and the fourth sequence of emotional states, second video data representative of a second video depicting a second animation associated with the fourth sequence of emotional states; receiving second input data representative of a selection that the first video better represents the third speech as compared to the second video; and generating the training data based at least on the third audio data, the third sequence of emotional states, the fourth sequence of emotional states, and the selection.
  7. The method of claim 5, wherein the updating the one or more second parameters of the one or more second machine learning models comprises: determining, using the one or more second machine learning models and based at least on third audio data representative of the third speech and the third sequence of emotional states, a second score associated with the third sequence of emotional states; determining, using the one or more second machine learning models and based at least on the third audio data representative of the third speech and the fourth sequence of emotional states, a third score associated with the fourth sequence of emotional states; and updating, based at least on the second score, the third score, and the indication that the third sequence of emotional states better reflects the third speech as compared to the fourth sequence of emotional states, the one or more second parameters of the one or more second machine learning models.
  8. The method of claim 1, wherein the one or more first machine learning models are further trained, at least, by: determining, using one or more third machine learning models and based at least on the second audio data, a third sequence of emotional states, wherein the updating the one or more parameters of the one or more first machine learning models is further based at least on the third sequence of emotional states.
  9. A system comprising: one or more processors to: train, using first audio data representative of first speech and one or more first sequences of emotional states associated with the first speech, one or more first machine learning models to determine one or more scores associated with one or more second sequences of emotional states; and update, using the one or more first machine learning models and based at least on second audio data representative of second speech and the one or more second sequences of emotional states determined using one or more second machine learning models processing input data associated with the second speech, one or more parameters of the one or more second machine learning models to determine one or more third sequences of emotional states associated with third speech.
  10. The system of claim 9, wherein the one or more processors are further to determine, using the one or more second machine learning models and based at least on third audio data representative of the third speech, the one or more third sequences of emotional states associated with the third speech.
  11. The system of claim 9, wherein the one or more second machine learning models are trained, at least, by: determining, using the one or more second machine learning models and based at least on the input data, the one or more second sequences of emotional states associated with the second speech; determining, using the one or more first machine learning models and based at least on the second audio data and the one or more second sequences of emotional states, the one or more scores associated with the one or more second sequences of emotional states; and updating the one or more parameters of the one or more second machine learning models based at least on the one or more scores.
  12. The system of claim 11, wherein the updating the one or more parameters of the one or more second machine learning models comprises: determining one or more losses based at least on the one or more scores; and updating, based at least on the one or more losses, the one or more parameters of the one or more second machine learning models.
  13. The system of claim 9, wherein the one or more first machine learning models are trained, at least, by: determining, using the one or more first machine learning models and based at least on the first audio data and the one or more first sequences of emotional states, one or more scores associated with the one or more first sequences of emotional states; and updating one or more parameters of the one or more first machine learning models based at least on the one or more scores.
  14. The system of claim 13, wherein the one or more first machine learning models are further trained, at least, by: generating training data indicating that at least a first sequence of emotional states of the one or more first sequences of emotional states better represents the first speech as compared to a second sequence of emotional states of the one or more first sequences of emotional states, wherein the updating the one or more parameters of the one or more first machine learning models is further based at least on the training data.
  15. The system of claim 14, wherein the generating the training data comprises: generating, based at least on the first audio data representative of the first speech and the first sequence of emotional states, first video data representative of a first video depicting a first animation associated with the first sequence of emotional states; generating, based at least on the first audio data representative of the first speech and the second sequence of emotional states, second video data representative of a second video depicting a second animation associated with the second sequence of emotional states; receiving input data representative of a selection that the first video better represents the first speech as compared to the second video; and generating the training data based at least on the input data.
  16. The system of claim 9, wherein the system is comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system implemented using one or more large language models; a system for performing conversational AI operations; a system for generating synthetic data; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
  17. One or more processors comprising processing circuitry to: determine, using one or more first machine learning models and based at least on audio data representative of first speech, a first sequence of emotional states associated with the first speech, wherein the one or more first machine learning models are trained using one or more second machine learning models that determine one or more scores associated with one or more second sequences of emotional states that are determined using the one or more first machine learning models processing input data associated with one or more instances of second speech.
  18. The one or more processors of claim 17, wherein the one or more first machine learning models are further trained using one or more third machine learning models that also determine the one or more second sequences of emotional states.
  19. The one or more processors of claim 17, wherein the one or more processors are comprised in at least one of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system implemented using one or more large language models; a system for performing conversational AI operations; a system for generating synthetic data; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
  20. The method of claim 1, wherein: the determining the second sequence of emotional states comprises generating, by the one or more first machine learning models and based at least on the processing the second audio data representative of the second speech, the input data representative of the second sequence of emotional states associated with the second speech; the determining the score comprises generating, by the one or more second machine learning models and based at least on the processing the second audio data and the input data representative of the second sequence of emotional states, output data representative of the score associated with the second sequence of emotional states as determined using the one or more first machine learning models.

Description

BACKGROUND

Many applications, such as gaming applications, interactive applications, communications applications, multimedia applications, and/or the like, use animated characters or digital avatars that interact with users of the applications and/or other animated characters within the applications. To provide more realistic experiences for users, some animated characters interact using both audio, such as speech, as well as visual indicators. For example, when an animated character is interacting with a user, an application may sync the lip movements of the animated character with speech being output by the animated character while also causing the animated character to visually express facial emotions. Visually expressing facial emotions may include causing the animated character to move various features of the face, such as the eyes, the mouth, the eyebrows, the nose, the cheeks, and/or other features of the face. As such, various techniques have been developed to determine emotions associated with speech that is output by animated characters.

For example, a conventional system may process audio data representing user speech using a machine learning model. The machine learning model may then determine, based at least on the processing, an emotional state associated with the speech. However, depending on the length of the speech, the context of the speech, and/or other factors associated with the speech, the speech may actually be associated with multiple emotional states. For example, the user producing the speech may express a first emotional state, such as happiness, during a first part of the speech and a second emotional state, such as anger, during a second part of the speech. As such, by determining only a single emotional state for the speech, the output from the machine learning model may not be adequate for animating a character in a way that accurately expresses these changes in emotion associated with the speech.

SUMMARY

Embodiments of the present disclosure relate to determining emotion sequences for speech in conversational artificial intelligence (AI) systems and applications. Systems and methods are disclosed that use one or more first machine learning models to determine a sequence of emotional states associated with audio data representing speech. To use the first machine learning model(s), the systems and methods may train the first machine learning model(s) using one or more second machine learning models, where the second machine learning model(s) is trained to determine scores indicating accuracies associated with sequences of emotional states. For instance, the second machine learning model(s) may be trained to determine the scores using audio data representing speech, sequences of emotional states associated with the speech, and/or indications of which sequences of emotional states better represent the speech as compared to other sequences of emotional states. The second machine learning model(s) may then process outputs from the first machine learning model(s), such as sequences of emotional states, to determine scores associated with the outputs. Additionally, the first machine learning model(s) may be trained based at least on the scores. In contrast to conventional systems, such as those described above, the current systems, in some embodiments, use first machine learning model(s) that are able to determine a sequence of emotional states associated with speech rather than just a single emotional state.
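As a rough illustration of the training idea in the summary above, the sketch below pairs a hypothetical scoring model with a predictor: the scorer rates how well a proposed emotion sequence fits the audio, the negated score is used as a loss, and only the predictor's parameters are updated. The EmotionSequenceScorer class, the train_predictor_step helper, and the simple negated-score loss are assumptions made for this sketch, not the patent's actual implementation.

```python
# Minimal sketch (PyTorch) of updating the emotion-sequence predictor using a
# scoring model, as outlined in the summary. Names, shapes, and the simple
# "loss = -score" formulation are illustrative assumptions.
import torch
import torch.nn as nn

class EmotionSequenceScorer(nn.Module):
    """Hypothetical scorer: rates how well an emotion sequence matches the audio."""
    def __init__(self, n_features: int = 80, n_emotions: int = 4, hidden: int = 128):
        super().__init__()
        self.encoder = nn.GRU(n_features + n_emotions, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, audio_features: torch.Tensor, emotion_probs: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([audio_features, emotion_probs], dim=-1)
        _, last_hidden = self.encoder(joint)           # (1, batch, hidden)
        return self.head(last_hidden[-1]).squeeze(-1)  # (batch,) score per clip

def train_predictor_step(predictor, scorer, audio_features, optimizer):
    # The predictor proposes per-frame emotion probabilities for the utterance.
    emotion_probs = torch.softmax(predictor(audio_features), dim=-1)
    # The scorer (whose parameters are not registered with this optimizer)
    # rates the proposed sequence; a higher score means a better match.
    score = scorer(audio_features, emotion_probs)
    # Turn the score into a loss and update only the predictor's parameters.
    loss = -score.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```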
As described herein, by determining the sequence of emotional states, the emotional states determined by the first machine learning model(s) may better represent the actual emotions associated with the speech as compared to a single emotional state. For instance, in some examples, the sequence of emotional states may better represent the speech since actual humans change emotional states while speaking, such as by expressing different emotions for different parts of the speech. As such, an animated character that is outputting speech should also change emotional states while outputting the speech rather than maintaining a single emotional state.

BRIEF DESCRIPTION OF THE DRAWINGS

The present systems and methods for determining emotion sequences for speech in conversational AI systems and applications are described in detail below with reference to the attached drawing figures, wherein:

FIG. 1 illustrates an example data flow diagram for a process of training one or more machine learning models to determine sequences of emotional states associated with speech, in accordance with some embodiments of the present disclosure;

FIG. 2 illustrates an example of generating training data using audio data representing speech and sequences of emotional states associated with the speech, in accordance with some embodiments of the present disclosure;

FIG. 3 illustrates an example of training one or more machine learning models to generate scores indicating whether sequences of emotional states
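FIG. 2 and FIG. 3, per the descriptions above, relate to building preference-style training data (two candidate emotion sequences for the same speech, plus an indication of which one a reviewer judged to better match it) and training the scoring model from that data. One common way to realize such preference training is a pairwise ranking loss; the sketch below assumes a scorer like the hypothetical one above and already-collected reviewer selections, and both the helper name and the specific loss are assumptions rather than details from the patent.

```python
# Minimal sketch (PyTorch) of training the scoring model from preference data:
# for one utterance, score the reviewer-preferred emotion sequence and the
# rejected one, then apply a pairwise ranking loss so the preferred sequence
# receives the higher score. Names and the loss choice are assumptions.
import torch
import torch.nn.functional as F

def train_scorer_step(scorer, audio_features, preferred_seq, rejected_seq, optimizer):
    # Score both candidate emotion sequences against the same audio.
    preferred_score = scorer(audio_features, preferred_seq)
    rejected_score = scorer(audio_features, rejected_seq)

    # Pairwise (Bradley-Terry style) ranking loss: push the preferred
    # sequence's score above the rejected sequence's score.
    loss = -F.logsigmoid(preferred_score - rejected_score).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```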