CN-121997254-A - Intelligent companion-oriented emotional personification multi-modal voice interaction large model system
Abstract
The invention provides an intelligent companion-oriented emotional personification multi-modal voice interaction large model system, which relates to the field of language processing and comprises a user interaction module, a multi-modal data fusion module, a database module, a data analysis module and a function processing module, wherein the user interaction module comprises touch interaction, voice interaction, text interaction and local interaction. Through core technologies such as multi-modal data fusion, hierarchical memory with retrieval-augmented generation (RAG), dynamic emotion response and edge-cloud collaboration, the system systematically solves the problems of existing intelligent companion toys: templated emotional interaction, missing role consistency, high response latency, under-utilization of complex modalities and weak long-term memory. It thereby markedly improves the realism, personalization and real-time performance of interaction, providing users with a high-quality companionship experience that carries emotional value and a sense of immersion.
Inventors
- CHEN WEITAO
- LIN JIANG
- WENG XINYI
- YI ZILI
- HU YUAN
Assignees
- 苏州扁平大陆科技有限责任公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-12
Claims (9)
- 1. An intelligent companion-oriented emotional personification multi-modal voice interaction large model system, characterized by comprising a user interaction module, a multi-modal data fusion module, a database module, a data analysis module and a function processing module, wherein the user interaction module comprises touch interaction, voice interaction, text interaction and local interaction; the multi-modal data fusion module is connected with the user interaction module and adopts a cross-modal attention mechanism to perform feature extraction, alignment and fusion on the multi-modal interaction data, generating a unified multi-modal feature representation, the multi-modal data fusion module comprising feature extraction, data alignment and a fusion algorithm; the database module is responsible for the storage and management of multi-dimensional data, including the storage of user and cloud information, user preference records and multi-modal historical interaction records, and supports the long-term memory and personalized service of the intelligent agent; the data analysis module is connected with the database module and relies on the multi-source data of the database to achieve deep mining of user behaviors and demands, its multi-modal feature analysis and dynamic preference updating being connected with the multi-modal data fusion module and the database module respectively, and its user portrait analysis supporting dynamic user-portrait mining; and the function processing module is connected with the multi-modal data fusion module and the data analysis module and is used for achieving personified feedback, emotion interaction strategy adjustment and long-term consistent multi-modal interaction processing.
- 2. The intelligent companion-oriented emotional personification multi-modal voice interaction large model system according to claim 1, wherein the touch interaction is used for collecting the user's touch force, position and frequency and converting them into interaction instructions; the voice interaction is used for collecting the text content and acoustic characteristics of the user's voice through automatic speech recognition; the text interaction supports APP and device-side text input, collecting user questions, chat content and semantic tags; and the local interaction covers both device-side lightweight interactions and cloud interactions for complex-task cloud processing.
- 3. The intelligent companion-oriented emotion personification multi-modal voice interaction large model system according to claim 1, wherein the cross-modal attention mechanism in the multi-modal data fusion module maps multi-modal features such as touch dynamics, emotion, voice intonation, semantics, image vision and scene into a shared semantic space to realize inter-modal information linkage, and the fusion algorithm in the multi-modal data fusion module adopts a multi-modal Transformer and a dynamic fusion network to integrate the modal features into a unified representation (a minimal fusion sketch follows the claims), providing a multi-dimensional decision basis for function processing.
- 4. The intelligent companion-oriented emotional personification multi-modal speech interaction large model system according to claim 1, wherein the function processing module retrieves related historical information and character attributes from the long-term memory layer of the database module through retrieval-augmented generation and feeds them as context into the large language model, generating responses that conform to the character setting and remain consistent over the long term (see the memory-retrieval sketch after the claims).
- 5. The intelligent companion-oriented emotion personification multi-modal voice interaction large model system according to claim 1, wherein the system adopts an edge-cloud collaborative reasoning architecture: the edge device executes basic emotion recognition, simple instruction response and local data encryption, while the cloud server executes complex multi-modal fusion analysis, long-text generation and model training; the edge device and the cloud server collaborate through a task scheduling algorithm (see the scheduling sketch after the claims) and transmit data using homomorphic encryption.
- 6. The intelligent companion-oriented emotion personification multi-modal voice interaction large model system according to claim 1, wherein the function processing module further comprises a reply-strategy dynamic adjustment sub-module for selecting or generating adaptive response content and style from a preset reply strategy library according to the emotion intensity fused in real time and the multi-modal input signals (see the strategy-selection sketch after the claims).
- 7. The intelligent companion-oriented emotional personification multi-modal voice interaction large model system according to claim 1, wherein the database module adopts a hierarchical short-term/medium-term/long-term memory storage architecture and completes information retrieval through retrieval-augmented generation (RAG), and, combined with the data analysis module, realizes dynamic user-portrait updating and a memory-based personalized recommendation algorithm.
- 8. A method for multi-modal voice interaction using the system according to claim 1, comprising: step one, collecting the user's touch, voice, image and text multi-modal interaction data through the user interaction module; step two, performing feature fusion on the multi-modal interaction data through the multi-modal data fusion module using a cross-modal attention mechanism to generate a unified multi-modal feature representation; step three, constructing and updating a user portrait through the data analysis module based on the historical data stored in the database module; and step four, generating and executing personified multi-modal feedback through the function processing module based on the multi-modal feature representation and the user portrait, calling long-term memory through retrieval-augmented generation during generation to keep the character consistent (an end-to-end sketch appears at the end of the description).
- 9. The method of claim 8, wherein in step two the haptic signal fusion specifically comprises mapping touch strength, trajectory and duration information to emotion-intensity sequences through a deep learning model (see the haptic-mapping sketch below), and performing cross-modal semantic alignment with the intonation features in the voice and the expression features in the image.
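The claims describe the fusion mechanism only at the architectural level. Below is a minimal sketch of the cross-modal attention fusion of claims 1 and 3, written in PyTorch; the modality set, feature dimensions, the projection-plus-self-attention design and all names are illustrative assumptions, not the disclosed implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Project per-modality features into a shared space and fuse them with attention."""

    def __init__(self, dims: dict[str, int], shared_dim: int = 256, n_heads: int = 4):
        super().__init__()
        # One linear projection per modality into the shared semantic space.
        self.projections = nn.ModuleDict(
            {name: nn.Linear(d, shared_dim) for name, d in dims.items()}
        )
        # Self-attention over modality tokens realizes the cross-modal linkage.
        self.attention = nn.MultiheadAttention(shared_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(shared_dim)

    def forward(self, features: dict[str, torch.Tensor]) -> torch.Tensor:
        # One token per modality: (batch, n_modalities, shared_dim).
        tokens = torch.stack(
            [self.projections[name](x) for name, x in features.items()], dim=1
        )
        fused, _ = self.attention(tokens, tokens, tokens)
        fused = self.norm(fused + tokens)  # residual connection
        return fused.mean(dim=1)           # unified multi-modal representation

fusion = CrossModalFusion({"touch": 16, "voice": 128, "text": 768, "image": 512})
unified = fusion({
    "touch": torch.randn(1, 16),
    "voice": torch.randn(1, 128),
    "text": torch.randn(1, 768),
    "image": torch.randn(1, 512),
})
print(unified.shape)  # torch.Size([1, 256])
```

A single attention layer over one token per modality is the simplest realization of the claimed shared semantic space; claim 3 itself suggests a deeper multi-modal Transformer and a dynamic fusion network in practice.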
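Claims 4 and 7 combine a short-/medium-/long-term memory hierarchy with retrieval-augmented generation. The sketch below shows one plausible shape of that flow; the embedding stub, the similarity search and the prompt template are assumptions for illustration only.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: a real system would call a sentence-embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

class HierarchicalMemory:
    """Short-term (session), medium-term (recent summaries), long-term (persona, facts)."""

    def __init__(self):
        self.short_term: list = []
        self.medium_term: list = []
        self.long_term: list = []

    def add(self, layer: str, text: str) -> None:
        getattr(self, layer).append((text, embed(text)))

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Rank all memories by cosine similarity (vectors are unit-normalized).
        q = embed(query)
        pool = self.short_term + self.medium_term + self.long_term
        ranked = sorted(pool, key=lambda item: -float(q @ item[1]))
        return [text for text, _ in ranked[:k]]

memory = HierarchicalMemory()
memory.add("long_term", "Persona: a gentle rabbit companion named Momo.")
memory.add("medium_term", "User mentioned an upcoming school recital.")
memory.add("short_term", "User said they are nervous about tomorrow.")

query = "I'm nervous about tomorrow."
context = "\n".join(memory.retrieve(query))
# The assembled prompt is what would be fed to the large language model so the
# reply stays in character and grounded in retrieved memory.
prompt = f"Stay in character.\nMemory:\n{context}\nUser: {query}\nReply:"
print(prompt)
```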
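Claim 5 splits work between an edge device and a cloud server via a task scheduling algorithm. The routing policy below is a load-aware placeholder, since the claim does not disclose the actual scheduling criteria; a real pipeline would also encrypt cloud-bound data homomorphically as the claim requires.

```python
from dataclasses import dataclass

@dataclass
class Task:
    kind: str           # e.g. "basic_emotion_recognition", "long_text_generation"
    payload_bytes: int

# Tasks the claim assigns to the edge device.
EDGE_TASKS = {"basic_emotion_recognition", "simple_instruction_response", "local_encryption"}

def route(task: Task, edge_load: float) -> str:
    """Return "edge" or "cloud" for a task, assuming a simple load-aware policy."""
    if task.kind in EDGE_TASKS and edge_load < 0.8:
        return "edge"   # low-latency local path
    return "cloud"      # complex fusion, long-form generation, model training

print(route(Task("basic_emotion_recognition", 2_000), edge_load=0.3))  # edge
print(route(Task("long_text_generation", 50_000), edge_load=0.3))      # cloud
```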
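Claim 6's reply-strategy sub-module selects a response style from a preset library according to fused emotion intensity. The library contents, the thresholds and the valence split in this sketch are all illustrative assumptions.

```python
# Preset reply-strategy library: (minimum negative-emotion intensity, strategy).
STRATEGY_LIBRARY = [
    (0.8, "soothing"),     # strong negative emotion: comfort first
    (0.5, "empathetic"),   # moderate negative emotion: acknowledge and mirror
    (0.0, "playful"),      # calm baseline: light, engaging tone
]

def select_strategy(emotion_intensity: float, valence: float) -> str:
    """Pick a reply style from the library using fused emotion signals."""
    if valence >= 0:  # positive emotion: match the user's energy
        return "celebratory" if emotion_intensity > 0.6 else "playful"
    for threshold, name in STRATEGY_LIBRARY:
        if emotion_intensity >= threshold:
            return name
    return "playful"

print(select_strategy(0.9, valence=-1.0))  # soothing
print(select_strategy(0.7, valence=1.0))   # celebratory
```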
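Claim 9 maps touch strength, trajectory and duration to emotion-intensity sequences through a deep learning model. The sketch below substitutes a hand-written normalization for that learned model, purely to make the data flow concrete; the constants are arbitrary assumptions.

```python
import numpy as np

def haptic_to_emotion(pressure: np.ndarray, duration_s: float) -> np.ndarray:
    """Map a sampled touch-pressure trace to an emotion-intensity sequence in [0, 1]."""
    norm = np.clip(pressure / pressure.max(), 0.0, 1.0)
    # Assumption: longer, sustained contact reads as higher emotional engagement.
    engagement = min(duration_s / 5.0, 1.0)
    return norm * (0.5 + 0.5 * engagement)

trace = np.array([0.2, 0.6, 0.9, 0.7, 0.3])  # pressure samples from the touch sensor
print(haptic_to_emotion(trace, duration_s=2.0))
# The resulting sequence would then be time-aligned with intonation and
# facial-expression features for cross-modal semantic alignment.
```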
Description
Intelligent companion-oriented emotional personification multi-modal voice interaction large model system
Technical Field
The invention relates to the field of language processing, and in particular to an intelligent companion-oriented emotional personification multi-modal voice interaction large model system.
Background
In recent years, the rapid development of artificial intelligence technology has made multi-modal voice interaction a research hotspot in the field of intelligent companion toys. With the advent of open-source large models such as DeepSeek and Llama, the development threshold for AI toys has dropped markedly, upgrading them from simple voice interaction to intelligent companions with long-term memory, affective computing and scene self-learning capabilities. Product forms have expanded from the traditional robot to plush toys, collectible figurines and story machines, and application scenarios have extended from children's education to a full-age market covering adult companionship and elderly care. With the rapid development of artificial intelligence and multi-modal interaction technology, intelligent companion toys are gradually becoming an important product category meeting users' emotional companionship needs. At present, some intelligent companion toys adopt a multi-modal processing scheme, integrating inputs such as hearing, vision, face recognition, voice perception and touch perception, and attempt to process multi-modal information collaboratively through related algorithms to improve the interactive experience. As an important application of artificial intelligence in human-computer interaction, the intelligent companion toy has realized basic interaction functions based on single-modal technologies such as speech recognition and natural language processing, but it still faces key technical bottlenecks: stiff, rigid interaction owing to an insufficient degree of emotional personification and a lack of emotional resonance; limited accuracy of single-modal emotion recognition and the absence of a dynamic weight allocation mechanism; timing-alignment and feature-matching problems in synchronous multi-modal data processing; roundabout and cumbersome information processing caused by insufficient cross-modal understanding; difficulty in delivering personalized service for lack of accurate user-portrait technology; insufficient user stickiness owing to weak multi-turn context memory; and poor real-time interaction experience caused by model inference and network latency. These technical problems restrict the further development and industrial application of intelligent companion toys, and new solutions are needed to break through the existing limitations.
Disclosure of Invention
The invention aims to provide an intelligent companion-oriented emotion personification multi-modal voice interaction large model system that solves the technical problems of the prior-art intelligent companion toy: templated emotional interaction, low emotion recognition accuracy, difficult multi-modal data synchronization, insufficient cross-modal understanding, missing user portraits, poor multi-turn interaction and high real-time interaction latency.
In order to achieve the aim of the invention, the invention adopts the following technical scheme: an intelligent companion-oriented emotional personification multi-modal voice interaction large model system comprising a user interaction module, a multi-modal data fusion module, a database module, a data analysis module and a function processing module, wherein the user interaction module comprises touch interaction, voice interaction, text interaction and local interaction; the multi-modal data fusion module is connected with the user interaction module and adopts a cross-modal attention mechanism to perform feature extraction, alignment and fusion on the multi-modal interaction data, generating a unified multi-modal feature representation, the multi-modal data fusion module comprising feature extraction, data alignment and a fusion algorithm; the database module is responsible for the storage and management of multi-dimensional data, including the storage of user and cloud information, user preference records and multi-modal historical interaction records, and supports the long-term memory and personalized service of the intelligent agent; the data analysis module is connected with the database module and relies on the multi-source data of the database to achieve deep mining of user behaviors and demands, its multi-modal feature analysis and dynamic preference updating being connected with the multi-modal data fusion module and the database module respectively, and its user portrait analysis supporting dynamic user-portrait mining; and the function processing module is connected with the multi-modal data fusion module and the data analysis module and is used for achieving personified feedback, emotion interaction strategy adjustment and long-term consistent multi-modal interaction processing (an end-to-end sketch of the interaction flow follows).
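As a reading aid, the following end-to-end sketch strings the four method steps of claim 8 into one loop. Every component is a stub with assumed behavior; none of the function bodies come from the patent.

```python
def collect_inputs() -> dict:
    # Step 1: user interaction module gathers touch, voice, image and text data.
    return {"touch": [0.6, 0.9], "voice": "I'm nervous about tomorrow."}

def fuse(inputs: dict) -> dict:
    # Step 2: stand-in for cross-modal attention fusion.
    return {"emotion_intensity": 0.7, "topic": "school recital"}

def update_portrait(portrait: dict, fused: dict) -> dict:
    # Step 3: data analysis module refreshes the user portrait from history.
    portrait.setdefault("topics", []).append(fused["topic"])
    return portrait

def respond(fused: dict, portrait: dict, memory: list) -> str:
    # Step 4: retrieval-augmented generation keeps the character consistent.
    context = "; ".join(memory[-3:])
    return f"[soothing reply about {portrait['topics'][-1]}, grounded in: {context}]"

portrait: dict = {}
memory = ["Persona: gentle rabbit companion Momo"]
fused = fuse(collect_inputs())
portrait = update_portrait(portrait, fused)
memory.append(f"topic={fused['topic']}")
print(respond(fused, portrait, memory))
```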