KR-20260065719-A - Electronic device, method, and non-transitory computer-readable storage medium for changing the pose of an avatar

KR 20260065719 A

Abstract

The electronic device may include a memory for storing instructions, a speaker, a display, and at least one processor. The instructions, when executed by the at least one processor, may cause the electronic device to: display, through the display, an avatar in a first pose for interacting with a user of the electronic device; based on identifying user data for an interaction between the electronic device and the user, obtain text data including a natural-language response expressed as text by providing information related to the user data to a trained first model; obtain second pose data representing a second pose continuous with the first pose by providing, to a trained second model, audio data obtained using the text data and first pose data representing the first pose; and, while a voice corresponding to the text data is output through the speaker, display, through the display, an animation of the avatar changing from the first pose to the second pose using the second pose data.
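The abstract describes a two-model pipeline: a first model turns user data into a text response, and a second model turns audio features of that response plus the avatar's current pose into a follow-on pose. The sketch below illustrates only the data flow; all function bodies are toy stand-ins invented for illustration, not the patent's actual trained networks.

```python
# Hypothetical sketch of the abstract's pipeline. Every model here is a
# placeholder; only the shape of the data flow follows the document.
from dataclasses import dataclass


@dataclass
class Pose:
    joints: list  # per-joint scalar values standing in for position/rotation


def first_model(user_data: dict) -> str:
    # Stand-in for the trained first model producing a natural-language reply.
    return f"Hello, {user_data.get('name', 'there')}!"


def text_to_audio_features(text: str) -> list:
    # Stand-in for TTS plus feature extraction (e.g. MFCC frames).
    return [float(len(word)) for word in text.split()]


def second_model(audio_features: list, first_pose: Pose) -> Pose:
    # Stand-in: produce a second pose "continuous with" the first pose by
    # nudging each joint value in proportion to the audio features.
    delta = sum(audio_features) / (len(audio_features) or 1)
    return Pose(joints=[j + 0.01 * delta for j in first_pose.joints])


def respond(user_data: dict, current_pose: Pose) -> tuple:
    text = first_model(user_data)
    feats = text_to_audio_features(text)
    next_pose = second_model(feats, current_pose)
    # The caller would play the TTS audio for `text` while animating the
    # transition from current_pose to next_pose.
    return text, next_pose
```

In the patent's flow, the returned pose drives the on-screen animation while the speaker outputs the synthesized voice for the same text.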

Inventors

  • 김선우
  • 장민욱

Assignees

  • NC Co., Ltd. (주식회사 엔씨)

Dates

Publication Date: 2026-05-11
Application Date: 2024-10-28

Claims (20)

  1. An electronic device comprising: a memory comprising one or more storage media and storing instructions; a speaker; a display; and at least one processor comprising processing circuitry, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: display, through the display, an avatar in a first pose for interacting with a user of the electronic device; based on identifying user data for an interaction between the electronic device and the user, obtain text data including a natural-language response expressed as text by providing information related to the user data to a trained first model; obtain first joint data representing a second pose continuous with the first pose by providing, to a trained second model, audio data obtained using the text data and first pose data representing characteristics of the first pose; and, while a voice corresponding to the text data is output through the speaker, display, through the display, an animation of the avatar changing from the first pose to the second pose using the first joint data.
  2. The electronic device of claim 1, wherein the audio data includes feature data representing characteristics of an audio signal corresponding to the text data, obtained from the audio signal using the Mel-frequency cepstral coefficients (MFCC) technique.
  3. The electronic device of claim 1, wherein the audio data includes beat data indicating a start time of the voice within a beat obtained by dividing an audio signal corresponding to the text data according to a time interval.
  4. The electronic device of claim 1, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: obtain second pose data of the avatar and first latent data for the second pose data by providing the audio data and the first pose data to the second model; and obtain the first joint data representing the second pose by providing the second pose data and the first latent data to a third model.
  5. The electronic device of claim 4, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to obtain, by providing the second pose data and the first latent data to a fourth model, third pose data and second latent data for generating second joint data representing a third pose continuous with the second pose.
  6. The electronic device of claim 5, wherein the third model is trained to output joint data regarding positions of all joints of the avatar using pose data representing positions and rotations of joints of the avatar and latent data regarding a pose of the avatar, and wherein the fourth model is trained to output, using the pose data and the latent data, other pose data and other latent data for a pose consecutive to another pose of the avatar.
  7. The electronic device of claim 1, wherein the first model is trained to output the text data corresponding to information regarding an image and/or a voice of the user, and wherein the second model is trained based on a plurality of discrimination models for identifying the authenticity of generated pose data or of joint data generated based on the pose data.
  8. The electronic device of claim 1, wherein the user data includes an image of the user obtained through a camera and another voice of the user obtained through a microphone.
  9. An electronic device comprising: a memory comprising one or more storage media and storing instructions; a speaker; a display; and at least one processor comprising processing circuitry, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: display, through the display, an avatar in a first pose for interacting with a user of the electronic device; based on identifying user data for an interaction between the electronic device and the user, obtain response data based on information related to the user data by providing the information to a trained first model; based on a determination that the response data includes text data containing a natural-language response expressed as text and includes a directive indicating a second pose of the avatar, display, through the display, while a voice corresponding to the text data is output through the speaker, a first animation of the avatar changing from the first pose to the second pose through a third pose, using first joint data indicating the third pose between the first pose and the second pose; and, based on a determination that the response data includes the text data and does not include the directive: obtain second joint data representing a fourth pose continuous with the first pose by providing, to a trained second model, audio data obtained using the text data and first pose data representing characteristics of the first pose; and, while the voice corresponding to the text data is output through the speaker, display, through the display, a second animation of the avatar changing from the first pose to the fourth pose using the second joint data.
  10. The electronic device of claim 9, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to retrieve, based on the determination that the response data includes the text data and includes the directive representing the second pose, the first joint data representing the third pose from a database regarding poses between the first pose and the second pose.
  11. The electronic device of claim 9, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: based on completion of the display of the first animation, retrieve third joint data representing a fifth pose between the second pose and the first pose from another database regarding poses between the second pose and the first pose; and display, through the display, using the third joint data, a third animation of the avatar changing from the second pose to the first pose through the fifth pose.
  12. The electronic device of claim 9, wherein the audio data includes feature data representing characteristics of an audio signal corresponding to the text data, obtained from the audio signal using the Mel-frequency cepstral coefficients (MFCC) technique.
  13. The electronic device of claim 9, wherein the audio data includes beat data indicating a start time of the voice within a beat obtained by dividing an audio signal corresponding to the text data according to a time interval.
  14. The electronic device of claim 9, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to: obtain second pose data of the avatar and first latent data for the second pose data by providing the audio data and the first pose data to the second model; and obtain, by providing the second pose data and the first latent data to a third model, third pose data and second latent data for generating third joint data representing a fifth pose continuous with the fourth pose.
  15. The electronic device of claim 14, wherein the instructions, when executed individually or collectively by the at least one processor, cause the electronic device to obtain the second joint data representing the fourth pose by providing the second pose data and the first latent data to a fourth model.
  16. The electronic device of claim 9, wherein the second model is trained based on a plurality of discrimination models for identifying the authenticity of generated pose data or of joint data generated based on the pose data.
  17. The electronic device of claim 9, wherein the user data includes an image of the user obtained through a camera and another voice of the user obtained through a microphone.
  18. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs including instructions that, when executed by an electronic device having a speaker and a display, cause the electronic device to: display, through the display, an avatar in a first pose for interacting with a user of the electronic device; based on identifying user data for an interaction between the electronic device and the user, obtain text data including a natural-language response expressed as text by providing information related to the user data to a trained first model; obtain first joint data representing a second pose continuous with the first pose by providing, to a trained second model, audio data obtained using the text data and first pose data representing characteristics of the first pose; and, while a voice corresponding to the text data is output through the speaker, display, through the display, an animation of the avatar changing from the first pose to the second pose using the first joint data.
  19. The non-transitory computer-readable storage medium of claim 18, wherein the one or more programs include instructions that, when executed by the electronic device, cause the electronic device to: obtain second pose data of the avatar and first latent data for the second pose data by providing the audio data and the first pose data to the second model; and obtain the first joint data representing the second pose by providing the second pose data and the first latent data to a third model.
  20. The non-transitory computer-readable storage medium of claim 19, wherein the one or more programs include instructions that, when executed by the electronic device, cause the electronic device to obtain, by providing the second pose data and the first latent data to a fourth model, third pose data and second latent data for generating second joint data representing a third pose continuous with the second pose.
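Claims 4-6 (and 14-15) describe an autoregressive generation loop: the second model maps audio features and the current pose to pose data plus a latent code, the third model decodes that pair into full joint data, and the fourth model predicts the next pose/latent pair so the cycle can repeat. The following is a minimal sketch of that data flow; all model bodies and numeric values are invented placeholders, not the trained networks the claims describe.

```python
# Toy stand-ins showing the claims-4-6 data flow between the three models.
def second_model(audio_features, first_pose):
    """Return (pose_data, latent_data) for the next pose (claim 4)."""
    energy = sum(audio_features) / len(audio_features)
    pose_data = [p + 0.1 * energy for p in first_pose]
    latent_data = [energy]  # compact latent code for this pose
    return pose_data, latent_data


def third_model(pose_data, latent_data):
    """Decode pose data + latent data into full joint data (claims 4, 6)."""
    scale = 1.0 + 0.01 * latent_data[0]
    return [p * scale for p in pose_data]


def fourth_model(pose_data, latent_data):
    """Predict the *following* pose and latent code (claims 5, 6)."""
    next_pose = [p + 0.05 for p in pose_data]
    next_latent = [l * 0.9 for l in latent_data]
    return next_pose, next_latent


def rollout(audio_features, first_pose, steps):
    """Autoregressively generate a short sequence of joint-data frames."""
    pose, latent = second_model(audio_features, first_pose)
    frames = [third_model(pose, latent)]
    for _ in range(steps - 1):
        pose, latent = fourth_model(pose, latent)
        frames.append(third_model(pose, latent))
    return frames
```

The design point the claims encode is that the decoder (third model) and the predictor (fourth model) are separate: joint data is rendered per frame, while the pose/latent pair is what gets fed forward to generate a continuous motion sequence.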

Description

Electronic device, method, and non-transitory computer-readable storage medium for changing the pose of an avatar

The present disclosure relates to an electronic device, a method, and a non-transitory computer-readable storage medium for changing the pose of an avatar. The electronic device may include a display, through which it may display an avatar for interacting with a user of the electronic device. The electronic device may control the form of the avatar based on receiving user input, and may display, through the display, an animation of the avatar's pose changing based on receiving user input that causes the pose to change. The information described above is provided as related art solely to aid understanding of the present disclosure; no claim or determination is made as to whether any of the foregoing is applicable as prior art with respect to the present disclosure.

FIG. 1 illustrates an example of an environment including an electronic device that displays an avatar.
FIG. 2 is a simplified block diagram of an exemplary electronic device.
FIG. 3 is a flowchart illustrating the operation of an electronic device that displays an animation using a model.
FIGS. 4A and 4B illustrate an exemplary operation of an electronic device that acquires joint data of an avatar using a model.
FIG. 5 is a flowchart illustrating the operation of an electronic device that acquires joint data using pose data and latent data.
FIG. 6 illustrates an exemplary operation of an electronic device that generates pose data using audio data.
FIG. 7 is a flowchart illustrating the operation of an electronic device that displays an animation using response data containing a directive.
FIGS. 8A and 8B illustrate an exemplary operation of an electronic device that displays an animation using a database of poses.

Referring to FIG. 1, the environment (150) may include an electronic device (100) and a user (120). The electronic device (100) may be used to display an avatar (160). The avatar (160) may be described as an interface for interacting with the user (120): for example, the electronic device (100) may display the avatar (160) differently as it receives user input, and may display the avatar (160) interacting with the user (120) through a display (e.g., the display (208) in FIG. 2) based on identifying information of the user (120).

The avatar (160) may be used to interact with the user (120), for example to converse with the user (120). The conversation may include verbal and non-verbal elements (e.g., gestures, poses). The avatar (160) may mimic a person and may be used to provide a realistic conversational experience to the user (120). An electronic device (100) capable of displaying the avatar (160) may therefore be required to interact with the user (120) in various ways. By providing an avatar (160) capable of interacting in various ways, the electronic device (100) may offer an enhanced user experience; for example, displaying an avatar (160) with various communicative gestures can help immerse the user (120) in the avatar (160).

The electronic device (100) may provide an avatar (160) capable of interacting with the user (120) based on a turn-based framework. For example, the turn-based framework may be performed based on an idle state, a listening state, and a speaking state.
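The idle/listening/speaking cycle described above can be sketched as a small state machine. State and event names here are illustrative, not taken from the patent:

```python
from enum import Enum, auto


class AvatarState(Enum):
    IDLE = auto()
    LISTENING = auto()
    SPEAKING = auto()


# Allowed transitions in a strict turn-based framework: the avatar listens
# while the user speaks, speaks on its own turn, and otherwise idles.
TRANSITIONS = {
    (AvatarState.IDLE, "user_speaks"): AvatarState.LISTENING,
    (AvatarState.LISTENING, "user_done"): AvatarState.SPEAKING,
    (AvatarState.SPEAKING, "reply_done"): AvatarState.IDLE,
}


def step(state, event):
    # Events outside the table leave the state unchanged -- e.g. the user
    # interrupting mid-reply, which is exactly the kind of behavior a strict
    # turn-based framework cannot express.
    return TRANSITIONS.get((state, event), state)
```

This rigidity is the limitation the disclosure goes on to discuss: human conversation allows interruptions, backchannels, and overlapping modalities that such a state machine ignores.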
However, a turn-based framework may not provide the same sensation as human conversation. A person may communicate using facial expressions, gestures, and voice effects; may interrupt another person while they are speaking; and may pass their turn to speak, nod, or shake their head. The electronic device (100) may therefore be required to acquire images of the user (120) and information about the user's (120) voice, and to provide an avatar (160) capable of communicating with the user (120) through multiple modalities rather than being limited to turn-based logic. The electronic device (100) may provide an avatar (160) for interacting with the user (120) by using trained model(s). For example, the electronic device (100) may use a model trained to acquire joint data representing the pose of the a