
KR-102961971-B1 - APPARATUS AND METHOD FOR LATENT BEHAVIOR DIFFUSION IN LISTENER REACTION GENERATION

KR 102961971 B1

Abstract

The latent behavior diffusion apparatus and method according to the disclosed embodiment propose a novel approach to generating interactive responses. By exploiting the strong performance of a non-autoregressive latent diffusion model, the apparatus automatically generates diverse facial reactions corresponding to a given speaker's behavior, enabling more natural modeling of dyadic interaction. In addition, a context-aware autoencoder is introduced to enhance the latent-space representation so that the spatio-temporal features of listener facial reactions are learned effectively.
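The context-aware autoencoder described in the abstract learns a latent representation of listener-reaction time series through a reconstruction objective. As a minimal illustrative sketch (not the patented architecture), the following trains a plain linear autoencoder by gradient descent on toy one-dimensional reaction sequences; all data, dimensions, and hyperparameters here are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data: 200 listener-reaction "time series", each 16 frames of a
# single facial coefficient. Real systems encode richer features (e.g. 3DMM
# coefficients); these dimensions are illustrative assumptions only.
t = np.linspace(0.0, 2.0 * np.pi, 16)
X = np.stack([np.sin(t + p) for p in rng.uniform(0.0, 2.0 * np.pi, 200)])

# Linear autoencoder: each 16-frame window is encoded into a 4-dim latent
# code and decoded back; training minimizes mean squared reconstruction error.
W_enc = rng.normal(scale=0.1, size=(16, 4))
W_dec = rng.normal(scale=0.1, size=(4, 16))
lr = 0.01

mse0 = float(np.mean((X @ W_enc @ W_dec - X) ** 2))  # error before training
for _ in range(2000):
    Z = X @ W_enc              # latent codes for every sequence
    X_hat = Z @ W_dec          # reconstructions
    err = X_hat - X
    # (Scaled) gradient descent on the reconstruction loss.
    W_dec -= lr * Z.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)
mse1 = float(np.mean((X @ W_enc @ W_dec - X) ** 2))  # error after training
```

Once trained, the decoder half of such an autoencoder is what the diffusion-based conditional generator drives: it samples in the 4-dimensional latent space and decodes the result back into a reaction sequence.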

Inventors

  • 양형정
  • 민듀크누엔
  • 김수형
  • 김승원
  • 신지은

Assignees

  • 전남대학교산학협력단

Dates

Publication Date
2026-05-07
Application Date
2025-02-11

Claims (20)

  1. A latent behavior diffusion apparatus comprising: a memory storing at least one instruction for generating a listener reaction; and a processor that performs operations according to the instruction, wherein the processor synthesizes responsive facial reactions aligned with a conversational partner's behavior through a latent behavior diffusion model and generates human-like interaction simulations based on the synthesis result; wherein the latent behavior diffusion model comprises a context-aware autoencoder and a diffusion-based conditional generator and generates contextually relevant facial reactions from input speaker behaviors; and wherein the diffusion-based conditional generator operates in a latent space produced by the autoencoder and predicts realistic facial reactions in a non-autoregressive manner.
  2. (Deleted)
  3. (Deleted)
  4. (Deleted)
  5. (Deleted)
  6. The apparatus of claim 1, wherein the latent behavior diffusion model generates facial reactions reflecting conversational cues and subtle variations in emotional state.
  7. The apparatus of claim 6, wherein the autoencoder of the latent behavior diffusion model is trained to encode a time series of listener reactions through a reconstruction task.
  8. The apparatus of claim 7, wherein the latent behavior diffusion model is trained to predict future targets based on speaker behavior, first generates a latent representation of the time series in a sampling step, and inputs the latent representation into a decoder to generate a future prediction.
  9. The apparatus of claim 1, wherein the latent behavior diffusion model is a probabilistic generative model that learns the original data distribution by progressively denoising a variable sampled from a normal distribution, learning the reverse steps of a fixed Markov chain.
  10. The apparatus of claim 9, wherein the latent behavior diffusion model uses a noise predictor to estimate the noise added in the forward Markov process and removes the estimated noise to refine the data back to the original distribution.
  11. A latent behavior diffusion method for generating a listener reaction by a latent behavior diffusion apparatus, the method comprising: synthesizing responsive facial reactions aligned with a conversational partner's behavior through a latent behavior diffusion model; and generating human-like interaction simulations based on the synthesis result; wherein the latent behavior diffusion model comprises a context-aware autoencoder and a diffusion-based conditional generator and generates contextually relevant facial reactions from input speaker behaviors; and wherein the diffusion-based conditional generator operates in a latent space produced by the autoencoder and predicts realistic facial reactions in a non-autoregressive manner.
  12. (Deleted)
  13. (Deleted)
  14. (Deleted)
  15. (Deleted)
  16. The method of claim 11, wherein the latent behavior diffusion model generates facial reactions reflecting conversational cues and subtle variations in emotional state.
  17. The method of claim 16, wherein the autoencoder of the latent behavior diffusion model is trained to encode a time series of listener reactions through a reconstruction task.
  18. The method of claim 17, wherein the latent behavior diffusion model is trained to predict future targets based on speaker behavior, first generates a latent representation of the time series in a sampling step, and inputs the latent representation into a decoder to generate a future prediction.
  19. The method of claim 11, wherein the latent behavior diffusion model is a probabilistic generative model that learns the original data distribution by progressively denoising a variable sampled from a normal distribution, learning the reverse steps of a fixed Markov chain.
  20. The method of claim 19, wherein the latent behavior diffusion model uses a noise predictor to estimate the noise added in the forward Markov process and removes the estimated noise to refine the data back to the original distribution.
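Claims 9–10 and 19–20 describe a standard denoising-diffusion formulation: a fixed forward Markov chain gradually adds Gaussian noise, and sampling runs the learned reverse steps, using a noise predictor to estimate and remove the injected noise. The following numpy sketch illustrates that formulation with an assumed linear noise schedule; an oracle (the true noise) stands in for the trained noise-prediction network, so this is a toy illustration of the mechanics, not the patented implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule for the fixed forward Markov chain (assumed values;
# the patent does not specify a schedule).
T = 50
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(z0, t, eps):
    """Forward process: noise a clean latent z0 directly to step t."""
    return np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def p_sample(zt, t, eps_hat):
    """One reverse (denoising) step using the predicted noise eps_hat."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (zt - coef * eps_hat) / np.sqrt(alphas[t])
    if t > 0:  # the final step (t == 0) is deterministic
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(zt.shape)
    return mean

# Toy latent standing in for an encoded listener-reaction sequence.
z0 = np.sin(np.linspace(0.0, np.pi, 8))

# Sampling loop: start from pure Gaussian noise and denoise step by step.
# An oracle supplies the exact noise of a fresh forward sample in place of
# the trained noise predictor.
z = rng.standard_normal(z0.shape)
for t in reversed(range(T)):
    eps = rng.standard_normal(z0.shape)
    zt = q_sample(z0, t, eps)   # oracle: re-noise the ground truth
    z = p_sample(zt, t, eps)    # denoise with the exactly known noise
```

In the claimed apparatus this loop would run in the autoencoder's latent space, conditioned on the speaker's behavior, with a trained network in place of the oracle; the decoded result is the generated facial reaction sequence.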

Description

Apparatus and Method for Latent Behavior Diffusion in Listener Reaction Generation

The present disclosure relates to a latent behavior diffusion apparatus and method for generating listener reactions, and more specifically to an apparatus and method for generating diverse facial reactions that reflect subtle changes in the conversation and emotional state between two people.

Dyadic interaction refers to communication or a relationship between two people and can be defined as a direct, reciprocal exchange. This form of interaction is fundamental to social and psychological research and helps in understanding interpersonal dynamics, mutual influence, and the development of social bonds. According to the Stimulus-Organism-Response (SOR) model, each individual exhibits reactive behaviors influenced by the situation they face. In particular, speakers influence the listener's perceptions, emotions, and responses through factors such as tone of voice, word choice, body language, and emotional expression, thereby shaping the overall communication and interaction dynamics.

In recent years, research on the analysis of human-to-human interaction has increased. These studies analyze verbal and nonverbal cues, emotional exchanges, and the dynamics of two-way interaction to understand the complexity of interpersonal communication. The automatic generation of natural facial and body reactions that mimic the behavior of a conversation partner has been studied extensively, with most work focused on replicating facial reactions corresponding to the input speaker's behavior. However, the potential variation among plausible nonverbal reaction labels for similar speaker behaviors is a critical factor in generating a listener's reaction.
As understanding and reproducing subtle listener feedback has emerged as a new and interesting challenge, the field of computer vision introduced the task of Responsive Listening Head Generation. Existing studies focused on the nonverbal facial feedback provided to speakers during two-way conversations, but their primary goal was to reproduce ground-truth reactions, and they typically used deterministic models to replicate precise responses. To capture movements exhibiting the diverse, perceptually plausible, non-deterministic characteristics of listeners, prior research introduced frameworks for modeling interactive communication in two-way conversations. These frameworks accept multimodal input from the speaker and generate multiple potential listener motions in an autoregressive manner. However, the one-dimensional discrete codebook used in these studies restricted the diversity of movements and emotional expressions.

Subsequent research introduced the concept of Facial Multiple Appropriate Reaction Generation, defining it for the first time in the literature, and presented new objective evaluation metrics for assessing the appropriateness of generated reactions. Over the past few decades, research in listening reaction modeling has focused on simulating the facial expressions and head movements of engaged listeners. Earlier studies pioneered data-driven approaches to generating animated characters that react dynamically to a speaker's voice, and also addressed the generation of nonverbal body behaviors. In contrast, other research has investigated the synchronized movements of conversational agents in dyadic interactions and explored methods for coordinating these movements based on speech.
Furthermore, previous research proposed a method for retrieving candidate videos containing listeners' facial expressions using Large Language Models (LLMs). Other studies trained a Conditional Generative Adversarial Network (Conditional GAN) to generate realistic facial reaction sketches based on the speaker's Facial Action Units (AUs), and proposed person-specific networks that reproduce each listener's unique facial reactions. Several studies have investigated the generation of various nonverbal behaviors, such as hand gestures, posture, and facial reactions, during face-to-face interaction. Moreover, the Responsive Listening Head Generation task was introduced for the first time: the task of generating a video containing the listener's head movements from the speaker's talking-head video and an image of the listener's face, supported by the ViCo dataset developed for this purpose. The baseline approach used a Long Short-Term Memory (LSTM) based model to process the speaker's visual and audio data and, on that basis, generated coefficients for a 3D Morphable Model (3DMM) of the listener's face. However, while these existing methods could generate listening reaction attributes based on the beh