
CN-121996064-A - AI digital human real-time interaction method, system and equipment based on WEB browser

CN 121996064 A

Abstract

The invention discloses an AI digital human real-time interaction method, system and device based on a WEB browser. The method comprises: processing a large cloud-hosted AI digital human generation model using model distillation and quantization compression techniques, and generating and deploying a lightweight model to the user's WEB browser to form a local generation engine; dynamically constructing and loading a personalized digital human model with the local generation engine, according to user historical interaction data securely synchronized from a server; collecting the user's multimodal input data during interaction, analyzing it in real time, and identifying the user's emotional state and interaction intention; and, based on the emotional state and interaction intention, generating in real time digital human expressions, action frame sequences and corresponding speech synthesis parameters matched to the personalized digital human model. The invention realizes real-time AI digital human interaction in the WEB browser, improves interaction convenience and real-time performance, and reduces dependence on cloud resources.

Inventors

  • LI HAOYING

Assignees

  • 广州三七极耀网络科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2025-12-17

Claims (10)

  1. An AI digital human real-time interaction method based on a WEB browser, characterized by comprising the following steps: processing a large cloud-hosted AI digital human generation model using model distillation and quantization compression techniques, and generating and deploying a lightweight model with core expression, mouth-shape and simple limb-action generation capabilities to the user's Web browser to form a local generation engine; based on the local generation engine, dynamically constructing and loading a personalized digital human model containing user-preferred appearance features, a voice style library and a behavior response strategy, according to user historical interaction data securely synchronized from a server; during interaction, collecting multimodal input data from the user and analyzing it in real time with an integrated lightweight multimodal affective computing model to identify the user's emotional state and interaction intention; based on the emotional state and interaction intention, generating in real time digital human expressions, action frame sequences and corresponding speech synthesis parameters matched to the personalized digital human model; and automatically mining behavior patterns implicit in user feedback using contrastive learning, optimizing the response strategy and generation quality of the personalized digital human model through a generative adversarial network, and incrementally updating the optimized personalized digital human model to the browser side.
  2. The method of claim 1, wherein processing the large cloud-hosted AI digital human generation model using model distillation and quantization compression techniques, and generating and deploying a lightweight model with core expression, mouth-shape and simple limb-action generation capabilities to the user's Web browser to form a local generation engine, comprises: in the cloud, training a student model using model distillation with the large AI digital human generation model as the teacher model, the student model inheriting the teacher's core expression, mouth-shape and limb-action capabilities; quantizing and compressing the student model to generate a lightweight inference model file; compiling the inference model file into a WebAssembly module on the browser side, loading the module, and constructing the local generation engine in the Web browser environment; and, when an interaction is triggered, invoking the local generation engine through the Web browser's JavaScript logic and generating the digital human's image sequence data and associated voice parameters by real-time inference from the drive parameters (a loading sketch follows these claims).
  3. The method according to claim 1, wherein dynamically constructing and loading, based on the local generation engine, the personalized digital human model containing the user-preferred appearance features, the voice style library and the behavior response strategy, according to the user historical interaction data securely synchronized from the server, comprises: in the cloud, analyzing the user's historical interaction data designated on the server, and extracting and generating a personalized configuration parameter set containing appearance feature vectors, voice style parameters and behavior response strategy weights; synchronizing the personalized configuration parameter set from the server to the Web browser on the browser side over a secure communication link; in the Web browser, receiving and parsing the synchronized personalized configuration parameter set with the deployed local generation engine to obtain a parsing result; and, based on the parsing result, constructing a style generation module from the appearance feature vectors, a speech synthesis pipeline from the voice style parameters, and a behavior decision logic module from the behavior response strategy weights, thereby forming the personalized digital human model.
  4. The method according to claim 1, wherein collecting the user's multimodal input data during interaction, analyzing it in real time with the integrated lightweight multimodal affective computing model, and identifying the user's emotional state and interaction intention specifically comprises: in the Web browser environment, collecting in parallel the text streams, audio streams and video streams generated by user interaction as multimodal input data; feeding the multimodal input data into the corresponding local lightweight feature extraction modules to generate text semantic feature vectors, speech acoustic feature vectors and visual expression feature vectors; and synchronously feeding the text semantic, speech acoustic and visual expression feature vectors into a locally deployed lightweight multimodal fusion decision model, which performs fusion computation on the input multimodal feature vectors and identifies the user's real-time emotional state classification and immediate interaction intention (a capture sketch follows these claims).
  5. The method according to claim 1, wherein generating, based on the emotional state and the interaction intention, the digital human expressions, action frame sequences and corresponding speech synthesis parameters matched to the personalized digital human model in real time comprises: based on the emotional state and interaction intention, combined with the user preference parameters in the loaded personalized digital human model and the text content to be replied, obtaining a comprehensive condition encoding vector; feeding the condition encoding vector into a locally deployed lightweight conditional generation model for a single inference pass, synchronously generating a time-aligned digital human visual action parameter sequence and speech synthesis parameter sequence, wherein the visual action parameter sequence defines the digital human's facial expressions and limb actions, and the speech synthesis parameter sequence defines the acoustic characteristics of the synthesized speech; and passing the visual action parameter sequence to the Web browser's graphics rendering interface to animate the digital human figure in real time, and the speech synthesis parameter sequence to the Web browser's audio synthesis interface to generate the corresponding speech waveform (a playback sketch follows these claims).
  6. The method according to claim 1, wherein automatically mining the behavior patterns implicit in user feedback using contrastive learning, optimizing the response strategy and generation quality of the personalized digital human model through a generative adversarial network, and incrementally updating the optimized personalized digital human model to the browser side specifically comprises: based on desensitized implicit feedback sequences collected during local interaction in the Web browser, constructing in the cloud an unlabeled interaction dataset containing user follow-up behavior records; from the unlabeled interaction dataset, automatically mining implicit positive and negative response patterns with unsupervised contrastive learning, and training a pattern encoder that distinguishes response strategies; based on the optimization signal provided by the pattern encoder, driving the personalized digital human model, acting as the generator, to optimize its response strategy and generation quality through the adversarial training of the discriminator and generator in a generative adversarial network deployed on the server, so that the model's output approaches the positive response pattern; and performing a differential comparison between the optimized personalized digital human model parameters and the original model parameters on the browser side to generate an incremental update patch, and synchronizing the incremental update patch to the browser-side Web browser environment (a patch-application sketch follows these claims).
  7. The method according to claim 6, wherein mining the implicit positive and negative response patterns from the unlabeled interaction dataset with unsupervised contrastive learning and training the pattern encoder that distinguishes response strategies specifically comprises: based on the unlabeled interaction dataset, labeling each round of digital human response strategy with a pseudo-label of the positive or negative response pattern according to preset objective session-continuity indicators; constructing positive and negative sample pairs for contrastive learning from the unlabeled interaction dataset based on the pseudo-labels, wherein a positive sample pair comprises response strategies belonging to the same positive response pattern, and a negative sample pair comprises one response strategy from the positive response pattern and one from the negative response pattern; and training the pattern encoder on the positive and negative sample pairs with a contrastive loss function, the pattern encoder mapping the contextual features of an input digital human response strategy to a strategy pattern embedding vector, such that in the embedding space the embedding vectors of positive response patterns are close to each other and far from the embedding vectors of negative response patterns (a contrastive-loss sketch follows these claims).
  8. The method according to claim 7, wherein driving the personalized digital human model, acting as the generator, to optimize its response strategy and generation quality through the adversarial training of the discriminator and generator in the generative adversarial network deployed on the server, based on the optimization signal provided by the pattern encoder, so that the model's output approaches the positive response pattern, specifically comprises: constructing the discriminator of the generative adversarial network based on the strategy pattern embedding vectors output by the pattern encoder, the discriminator inheriting the pattern encoder's ability to recognize the positive response pattern; setting the strategy generation module of the personalized digital human model to be optimized as the generator of the generative adversarial network; alternately executing a discriminator training stage and a generator training stage under the adversarial training framework; in the discriminator training stage, training on positive response pattern samples provided by the pattern encoder against samples produced by the generator; in the generator training stage, optimizing the generator's parameters according to the discriminator's evaluation of the generated samples and the task-completion requirements; and, by iterating the adversarial training process, driving the response strategies produced by the generator to approach, in distribution, the positive response pattern defined by the pattern encoder.
  9. An AI digital human real-time interaction system based on a WEB browser, characterized in that the system specifically comprises: a model distillation module for processing a large cloud-hosted AI digital human generation model using model distillation and quantization compression techniques, and generating and deploying a lightweight model with core expression, mouth-shape and simple limb-action generation capabilities to the user's Web browser to form a local generation engine; a local model module for dynamically constructing and loading, based on the local generation engine, a personalized digital human model containing user-preferred appearance features, a voice style library and a behavior response strategy, according to user historical interaction data securely synchronized from a server; an interaction analysis module for collecting the user's multimodal input data during interaction, analyzing it in real time with the integrated lightweight multimodal affective computing model, and identifying the user's emotional state and interaction intention; a parameter generation module for generating in real time, based on the emotional state and interaction intention, digital human expressions, action frame sequences and corresponding speech synthesis parameters matched to the personalized digital human model; and an optimization update module for automatically mining the behavior patterns implicit in user feedback using contrastive learning, optimizing the response strategy and generation quality of the personalized digital human model through a generative adversarial network, and incrementally updating the optimized personalized digital human model to the browser side.
  10. A computer device comprising a memory, a processor and a computer program stored in the memory, wherein the computer program, when executed on the processor, implements the WEB browser-based AI digital human real-time interaction method of any one of claims 1 to 8.
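The claims leave the browser-side interface of the claim 2 engine unspecified. Below is a minimal, hypothetical TypeScript sketch of loading a distilled, quantized model compiled to WebAssembly and wrapping it as a "local generation engine"; the module URL and every export (init, alloc, infer, frame_bytes, voice_ptr, voice_len, memory) are illustrative assumptions, not part of the patent.

```typescript
// Sketch only: a possible browser-side wrapper for the local generation engine.
interface GenerationEngine {
  // One inference step: drive parameters in, frame pixels + voice params out.
  infer(driveParams: Float32Array): { framePixels: Uint8Array; voiceParams: Float32Array };
}

async function loadLocalEngine(wasmUrl: string): Promise<GenerationEngine> {
  // Stream-compile the quantized inference module inside the browser.
  const { instance } = await WebAssembly.instantiateStreaming(fetch(wasmUrl), {
    env: {}, // whatever imports the compiled module expects (assumed none here)
  });
  const exports = instance.exports as any;
  exports.init?.(); // one-time weight/table setup, if the module defines it

  return {
    infer(driveParams: Float32Array) {
      // Copy drive parameters into the module's linear memory, run inference,
      // then read back the generated frame and voice-synthesis parameters.
      const inPtr: number = exports.alloc(driveParams.length * 4);
      new Float32Array(exports.memory.buffer, inPtr, driveParams.length).set(driveParams);
      const outPtr: number = exports.infer(inPtr, driveParams.length);
      // Output layout (frame size, parameter count) is engine-specific.
      const framePixels = new Uint8Array(exports.memory.buffer, outPtr, exports.frame_bytes());
      const voiceParams = new Float32Array(exports.memory.buffer, exports.voice_ptr(), exports.voice_len());
      return { framePixels: framePixels.slice(), voiceParams: voiceParams.slice() };
    },
  };
}
```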
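Claim 4's parallel collection of text, audio and video streams maps naturally onto standard browser APIs (getUserMedia, Web Audio, canvas frame sampling). A hedged sketch, with trivial placeholder feature extractors standing in for the patent's lightweight extraction modules, and a #chat-input element assumed for the text stream:

```typescript
// Placeholder extractors; real ones would be small local neural networks.
const extractTextFeatures = (t: string) => Float32Array.from([t.length]);
const extractAcousticFeatures = (s: Float32Array) =>
  Float32Array.from([s.reduce((a, v) => a + v * v, 0) / s.length]); // frame energy
const extractVisualFeatures = (px: Uint8ClampedArray) =>
  Float32Array.from([px.reduce((a, v) => a + v, 0) / px.length]);   // mean brightness

async function captureMultimodalInput(
  onFeatures: (f: { text?: Float32Array; audio?: Float32Array; video?: Float32Array }) => void,
) {
  // Audio + video from the user's microphone and camera.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });

  // Audio: tap raw samples via the Web Audio API.
  const audioCtx = new AudioContext();
  const analyser = audioCtx.createAnalyser();
  audioCtx.createMediaStreamSource(stream).connect(analyser);

  // Video: sample frames from a hidden <video> element onto a canvas.
  const video = document.createElement("video");
  video.srcObject = stream;
  video.muted = true;
  await video.play();
  const canvas = document.createElement("canvas");
  const ctx = canvas.getContext("2d")!;

  // Text: whatever the user types into the chat box (assumed element id).
  const input = document.querySelector<HTMLInputElement>("#chat-input");
  input?.addEventListener("input", () => onFeatures({ text: extractTextFeatures(input.value) }));

  // Periodically extract acoustic and visual expression features (~10 Hz).
  setInterval(() => {
    const samples = new Float32Array(analyser.fftSize);
    analyser.getFloatTimeDomainData(samples);
    canvas.width = video.videoWidth;
    canvas.height = video.videoHeight;
    ctx.drawImage(video, 0, 0);
    const frame = ctx.getImageData(0, 0, canvas.width, canvas.height);
    onFeatures({
      audio: extractAcousticFeatures(samples),
      video: extractVisualFeatures(frame.data),
    });
  }, 100);
}
```

The fused feature vectors would then feed the lightweight multimodal fusion decision model of claim 4, which this sketch does not attempt to reproduce.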
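Claim 5's output path hands a time-aligned visual action parameter sequence to the browser's graphics rendering interface and a speech synthesis parameter sequence to its audio interface. A sketch under those assumptions; renderAvatar and synthesizeWaveform are stand-ins for engine-specific logic the patent does not detail:

```typescript
// Stubs for engine-specific pieces (would come from the claim 2 engine).
function synthesizeWaveform(params: Float32Array, rate: number): Float32Array {
  return new Float32Array(rate); // placeholder: 1 s of silence
}
function renderAvatar(ctx: CanvasRenderingContext2D, frame: Float32Array) {
  ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height); // placeholder draw
}

function playResponse(
  visualFrames: Float32Array[], // per-frame expression/limb parameters
  voiceParams: Float32Array,    // acoustic parameters for the reply
  canvas: HTMLCanvasElement,
) {
  const ctx = canvas.getContext("2d")!;
  const audioCtx = new AudioContext();

  // Audio: turn synthesis parameters into PCM samples and schedule playback.
  const samples = synthesizeWaveform(voiceParams, audioCtx.sampleRate);
  const buffer = audioCtx.createBuffer(1, samples.length, audioCtx.sampleRate);
  buffer.copyToChannel(samples, 0);
  const node = audioCtx.createBufferSource();
  node.buffer = buffer;
  node.connect(audioCtx.destination);
  node.start();

  // Video: step through the action-parameter frames in sync with wall time.
  const fps = 30;
  const t0 = performance.now();
  function tick(now: number) {
    const i = Math.min(Math.floor(((now - t0) / 1000) * fps), visualFrames.length - 1);
    renderAvatar(ctx, visualFrames[i]); // draw the digital human for frame i
    if (i < visualFrames.length - 1) requestAnimationFrame(tick);
  }
  requestAnimationFrame(tick);
}
```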
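Claim 7's contrastive objective is not given in closed form. The sketch below shows a generic InfoNCE-style contrastive loss over pseudo-labelled strategy embeddings, which is one plausible instantiation: an anchor is pulled toward a same-mode (positive-pair) embedding and pushed away from opposite-mode embeddings.

```typescript
function dot(a: number[], b: number[]): number {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)) || 1);
}

// Contrastive loss for one anchor: -log( e^{s+/t} / (e^{s+/t} + sum e^{s-/t}) ).
function contrastiveLoss(
  anchor: number[],
  positive: number[],    // embedding of a same-pattern response strategy
  negatives: number[][], // embeddings of opposite-pattern response strategies
  temperature = 0.1,
): number {
  const pos = Math.exp(cosine(anchor, positive) / temperature);
  const neg = negatives.reduce((s, n) => s + Math.exp(cosine(anchor, n) / temperature), 0);
  return -Math.log(pos / (pos + neg));
}
```

Minimizing this loss over many pseudo-labelled pairs is what trains the pattern encoder to separate positive and negative response patterns in the embedding space.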
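Claim 6 ends with a differential comparison of optimized versus original weights, shipped to the browser as an incremental patch. A sketch of one possible browser-side application step, assuming a sparse index/value patch format per tensor (the patent specifies no format):

```typescript
// Assumed patch format: for each named tensor, the changed indices and values.
interface WeightPatch {
  [tensorName: string]: { indices: number[]; values: number[] };
}

function applyIncrementalPatch(
  model: Map<string, Float32Array>, // browser-side model weights by tensor name
  patch: WeightPatch,
) {
  for (const [name, delta] of Object.entries(patch)) {
    const tensor = model.get(name);
    if (!tensor) continue; // unknown tensor: skip rather than fail mid-update
    delta.indices.forEach((idx, k) => {
      tensor[idx] = delta.values[k]; // overwrite only the changed weights
    });
  }
}

// Example: fetch a patch produced by the server-side optimization and apply
// it in place, without re-downloading the whole model.
async function syncModelUpdate(model: Map<string, Float32Array>, patchUrl: string) {
  const patch: WeightPatch = await (await fetch(patchUrl)).json();
  applyIncrementalPatch(model, patch);
}
```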

Description

AI digital human real-time interaction method, system and equipment based on WEB browser

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to an AI digital human real-time interaction method, system, device and medium based on a WEB browser.

Background

With the rapid development of artificial intelligence technology, AI digital humans show great potential and commercial value in numerous application scenarios such as virtual customer service, online education and entertainment, and have attracted wide attention. However, existing AI digital human implementations have several limitations that severely restrict further adoption.

First, most current AI digital human technology relies on specific platforms or software, such as a particular operating system, graphics rendering engine, or dedicated client software. This dependency makes it difficult for AI digital humans to run stably across devices and browsers of different types and versions. For example, some AI digital human applications developed for the Windows platform cannot be used directly on Mac OS or Linux, and some schemes that depend on a specific graphics rendering engine cannot be displayed or interacted with in environments where that engine is not installed. This lack of cross-platform compatibility greatly limits the application range of AI digital humans and cannot meet users' needs across different devices and environments.

Second, implementing existing AI digital human technologies typically requires a professional graphics rendering engine and experienced developers. Professional graphics rendering engines tend to be expensive, and they demand extensive knowledge of graphics, computer vision and related fields, as well as skill with the associated engines. From model design and animation production to interaction logic development, the whole process involves many complex steps and requires substantial manpower, material resources and time. This not only makes it difficult for small businesses and individual developers to afford developing AI digital human applications, but also limits the pace of innovation in the field.

Also, many existing AI digital human schemes require the user to download additional plug-ins or client software before they can be used. This increases the difficulty of use and the number of operating steps, a real barrier for users who are unfamiliar with computers or who have security concerns about downloading software. Users must spend time finding, downloading, installing and configuring the plug-ins or clients, and may have to repeat the process on each device, which greatly reduces their willingness to use AI digital humans and hinders the technology's adoption.

Finally, some existing AI digital human interaction methods first fully process the user's input data, generate complete video or audio content, and only then play it back.
This non-real-time interaction approach can lead to significant delays when processing complex tasks or large amounts of data, requiring users to wait a long time to see the digital human's response and severely affecting the user experience. For example, in a virtual customer service scenario, if the user must wait several seconds or more after posing a question before getting a reply from the digital human, the user may grow restless, and service quality and user satisfaction suffer.

Disclosure of Invention

The invention aims to provide an AI digital human real-time interaction method, system, device and medium based on a WEB browser, so as to improve interaction convenience and real-time performance, reduce dependence on cloud resources, and solve at least one of the problems in the prior art.

In a first aspect, the invention provides an AI digital human real-time interaction method based on a WEB browser, the method specifically comprising: processing a large cloud-hosted AI digital human generation model using model distillation and quantization compression techniques, and generating and deploying a lightweight model with core expression, mouth-shape and simple limb-action generation capabilities to the user's Web browser to form a local generation engine; based on the local generation engine, dynamically constructing and loading a personalized digital human model containing user-preferred appearance features, a voice