
CN-122020539-A - Co-emotion-driven multimode cognition large-model anthropomorphic interaction optimization method

CN 122020539 A

Abstract

The invention provides a co-emotion-driven multimodal cognition large-model anthropomorphic interaction optimization method. The method collects text, voice and facial video under a unified clock and by dialogue turn, synchronizes turns with the speech rhythm, extracts semantic, acoustic and expression-posture features, and computes credibility parameters and an uncertainty covering occlusion, noise and cross-modal contradiction, which are fused to obtain an emotion cognition state. Long-term and short-term user portrait weights are retrieved to form a user cognition state, and a co-emotion decision state is constructed that drives a strategy control model to output a co-emotion demand level, a co-emotion target, expression intensities, personality constraints and a safety flag. According to these results, the multimodal cognition large model generates controlled text, prosody and action responses, invoking a safety template when necessary, and generates a summary and an emotion trajectory from user feedback to update the user portrait. The method improves cognition reliability, personalization and response safety, and is suitable for virtual-avatar terminals and companion robots.

Inventors

  • Mao Taihui
  • Wu Junlin
  • Luo Kai

Assignees

  • 深圳通晤纪人工智能技术有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-28

Claims (7)

  1. A co-emotion-driven multimodal cognition large-model anthropomorphic interaction optimization method, characterized by comprising the following steps: S1, acquiring user text, user voice and facial video during interaction, and forming multimodal interaction segment sequence data according to dialogue turns and timestamps; S2, inputting the multimodal interaction segment sequence data into a co-emotion cue synchronous fusion model, determining turn segmentation moments according to dialogue turns, determining speech rhythm moments according to voice pauses and accents, intercepting face frames at the speech rhythm moments to form a synchronized segment sequence, sequentially extracting semantic features, acoustic features and expression-posture features from the synchronized segment sequence, calculating an occlusion index, a noise index and a contradiction index to obtain a credibility parameter set, and performing fusion under the credibility parameter set to obtain an emotion cognition state result, wherein the emotion cognition state result comprises an emotion cognition state vector, an uncertainty and the credibility parameter set; S3, searching a user portrait library according to the emotion cognition state result to obtain a user cognition state, and combining the user cognition state and the emotion cognition state result to form a co-emotion decision state; S4, inputting the co-emotion decision state into a co-emotion strategy control model, judging a co-emotion demand level based on the emotion cognition state result in the co-emotion decision state, determining a co-emotion target, calculating language intensity, speech rhythm intensity and avatar action intensity under the constraint of the credibility parameter set to form co-emotion expression intensity parameters, calculating a risk score based on negative emotion intensity and the uncertainty, comparing the risk score with a risk threshold to obtain a safety control flag, and outputting a co-emotion strategy sequence arranged by dialogue turn, wherein the co-emotion strategy sequence comprises the co-emotion demand level, the co-emotion target, the co-emotion expression intensity parameters, personality constraints and the safety control flag; S5, generating a multimodal anthropomorphic response by a multimodal cognition large model based on the co-emotion strategy sequence and the emotion cognition state result, and outputting a safety template or an ordinary response according to the safety control flag; and S6, generating a summary and an emotion trajectory according to the multimodal anthropomorphic response and user feedback, and updating the user portrait library according to the summary and the emotion trajectory to form updated user portrait library data.
  2. The co-emotion-driven multimodal cognition large-model anthropomorphic interaction optimization method according to claim 1, wherein step S1 specifically comprises: establishing a session identifier and a unified clock reference at the beginning of an interactive session, acquiring a user identity identifier, and initializing the dialogue turn number; collecting user voice, determining turn segmentation moments by voice activity detection, intercepting user voice segments at the turn segmentation moments, marking start and stop timestamps, and calculating a voice availability based on the user voice segments; acquiring facial video, recording a frame timestamp for each frame, calculating a face detection confidence, and intercepting the corresponding facial video frame sequence at the turn segmentation moments to form turn facial video segments; and obtaining user text, wherein the user text is either typed text or a transcription obtained by speech recognition of the user voice segment, marking a timestamp for the user text, and combining the user text, the user voice segment, the turn facial video segment, the voice availability and the face detection confidence of the same dialogue turn with their timestamps according to the dialogue turn number to form the multimodal interaction segment sequence data.
  3. The co-emotion-driven multimodal cognition large-model anthropomorphic interaction optimization method according to claim 1, wherein step S2 specifically comprises: inputting the multimodal interaction segment sequence data into a turn synchronization layer of the co-emotion cue synchronous fusion model, reading dialogue turn numbers and timestamps to determine turn segmentation moments, and segmenting the user text, user voice segments and turn facial video segments by turn and aligning them to the same dialogue turn; inputting the turn-segmented user voice segments into a rhythm synchronization layer, comparing the frame-level energy and fundamental-frequency change of the user voice segments with a preset pause threshold and a preset emphasis threshold respectively to determine pause positions and stressed positions, generating a speech rhythm moment sequence, intercepting face frames from the turn facial video segments at the speech rhythm moments, and dividing them to form the synchronized segment sequence; inputting the user text token sequence of the synchronized segment sequence into a semantic feature extraction network, which uses a 256-dimensional word-vector embedding layer and two bidirectional gated recurrent layers of 256 neurons each to obtain 512-dimensional sequence features, compressed by a fully connected layer of 256 neurons into 256-dimensional semantic features; inputting the user voice mel-spectrogram sequence of the synchronized segment sequence into an acoustic feature extraction network, which uses two one-dimensional convolution layers with 64 and 128 channels respectively, each followed by a bidirectional gated recurrent layer of 128 neurons, and outputs 256-dimensional acoustic features; inputting the face frame sequence of the synchronized segment sequence into an expression-posture feature extraction network, which uses three two-dimensional convolution layers with 32, 64 and 128 channels, obtains 128-dimensional frame features through global pooling, performs temporal encoding with a gated recurrent layer of 128 neurons, and outputs 256-dimensional expression-posture features; inputting the semantic features, acoustic features and expression-posture features into a credibility calculation layer in the temporal order of the synchronized segment sequence, which uses three two-layer perceptron branches of 64 neurons and a single-neuron output layer each to calculate the occlusion index, the noise index and the contradiction index respectively, forming the credibility parameter set, wherein the occlusion index is calculated from the expression-posture features and the face detection confidence, the noise index is calculated from the acoustic features and the voice availability, and the contradiction index is calculated from the concatenation of the semantic, acoustic and expression-posture features and represents the divergence of cross-modal emotion directions; and inputting the credibility parameter set together with the concatenated semantic, acoustic and expression-posture features into a gated fusion network of a credibility-constrained fusion layer, which outputs fusion weights through two fully connected layers of 128 and 3 neurons, obtains 256-dimensional fused features by weighted summation of the semantic, acoustic and expression-posture features according to the fusion weights, outputs the 64-dimensional emotion cognition state vector through two fully connected layers of 128 and 64 neurons, outputs the uncertainty through a single fully connected layer of 32 neurons based on the credibility parameter set, and combines these to obtain the emotion cognition state result comprising the emotion cognition state vector, the uncertainty and the credibility parameter set (illustrative sketches of the rhythm synchronization and the credibility-gated fusion follow the claims).
  4. The co-emotion-driven multimodal cognition large-model anthropomorphic interaction optimization method according to claim 1, wherein step S3 specifically comprises: searching the user portrait library for a user portrait record according to the user identity identifier, and reading a long-term user cognition state and a short-term user cognition state from the user portrait record, wherein both are 32-dimensional vectors comprising an appeal intensity component and a resistance intensity component, and preset default long-term and short-term user cognition states are used when no record is found in the user portrait library; reading the uncertainty and the contradiction index from the emotion cognition state result, comparing the uncertainty with a preset uncertainty threshold to obtain an uncertainty flag, comparing the contradiction index with a preset contradiction threshold to obtain a contradiction flag, and determining a portrait dependency coefficient according to the uncertainty flag and the contradiction flag; performing weighted fusion of the long-term and short-term user cognition states according to the portrait dependency coefficient to obtain the user cognition state of the current dialogue turn; and concatenating the emotion cognition state vector, the uncertainty, the credibility parameter set and the user cognition state in a preset dimension order to form a 100-dimensional co-emotion decision state vector, arranged by dialogue turn number to form the co-emotion decision state input to the co-emotion strategy control model (see the decision-state sketch after the claims).
  5. The co-emotion-driven multimodal cognition large-model anthropomorphic interaction optimization method according to claim 1, wherein step S4 specifically comprises: inputting the 100-dimensional co-emotion decision state vectors arranged by dialogue turn into the co-emotion strategy control model, generating for each dialogue turn a 16-dimensional turn feature corresponding to the dialogue turn number, and concatenating the 100-dimensional co-emotion decision state vector with the 16-dimensional turn feature to form a 116-dimensional input feature sequence; performing temporal encoding of the 116-dimensional input feature sequence with a sequence gated recurrent layer whose hidden layer comprises 128 neurons, and outputting a 128-dimensional state sequence arranged by dialogue turn; inputting the emotion cognition state vector into a fully connected layer of 32 neurons to obtain the negative emotion intensity, concatenating the negative emotion intensity with the appeal intensity component and the resistance intensity component of the user cognition state, inputting the result together with the 128-dimensional state sequence into a co-emotion demand level output head that uses two fully connected layers of 32 and 3 neurons to output the co-emotion demand level, and determining the co-emotion demand level of the dialogue turn from the maximum of the three-dimensional output; inputting the 128-dimensional state sequence and the co-emotion demand level into a co-emotion target output head that uses two fully connected layers of 32 and 6 neurons, generating expression stage parameters based on the dialogue turn number and the 128-dimensional state sequence, and mapping the co-emotion targets sequentially into a mood-reflection and feeling-validation stage, a problem-clarification stage, and a suggestion or information-explanation stage; inputting the 128-dimensional state sequence into a co-emotion expression intensity parameter output head that uses two fully connected layers of 32 and 3 neurons to output the language intensity, the speech rhythm intensity and the avatar action intensity, comparing the occlusion index, the noise index and the contradiction index of the credibility parameter set with a preset occlusion threshold, a preset noise threshold and a preset contradiction threshold respectively to determine scaling coefficients, and scaling and mapping the language intensity, the speech rhythm intensity and the avatar action intensity by the scaling coefficients, so that the contradiction index triggers adjustment of the clarification ratio within the language intensity, the noise index triggers adjustment of the variation range of the speech rhythm intensity, and the occlusion index triggers adjustment of the action amplitude of the avatar action intensity; inputting the negative emotion intensity and the uncertainty into a risk score output head that calculates the risk score with a fully connected layer of 32 neurons and a single-neuron output layer, and comparing the risk score with the risk threshold to obtain the safety control flag; inputting the user cognition state into a personality constraint output head that uses two fully connected layers of 16 and 8 neurons to output the personality constraints; and combining the co-emotion demand level, the co-emotion target, the co-emotion expression intensity parameters, the personality constraints and the safety control flag by dialogue turn to form the co-emotion strategy sequence (see the policy-control sketch after the claims).
  6. The co-emotion-driven multimodal cognition large-model anthropomorphic interaction optimization method according to claim 1, wherein step S5 specifically comprises: reading the co-emotion demand level, co-emotion target, co-emotion expression intensity parameters, personality constraints and safety control flag of the current dialogue turn from the co-emotion strategy sequence, and combining them with the emotion cognition state vector and the uncertainty of the emotion cognition state result to form a generation condition; inputting the generation condition and the dialogue context of the current turn into the multimodal cognition large model, which constrains the response content and expression style with the co-emotion target and the personality constraints, and uses the co-emotion expression intensity parameters to modulate the language intensity of the response text, the prosody intensity of speech synthesis and the action intensity of the virtual avatar or companion robot respectively; generating, by the multimodal cognition large model, multimodal anthropomorphic response candidates comprising the response text, prosody control parameters and avatar action control parameters; and comparing the safety control flag with a preset safety flag, selecting a safety template according to the co-emotion target and outputting its response text, prosody control parameters and avatar action control parameters when the flags match, and otherwise outputting the multimodal anthropomorphic response candidate as an ordinary response, thereby forming the multimodal anthropomorphic response.
  7. The co-emotion-driven multimodal cognition large-model anthropomorphic interaction optimization method according to claim 1, wherein step S6 specifically comprises: collecting user feedback keyed to the dialogue turn number of the multimodal anthropomorphic response, the user feedback comprising user text, user voice segments, turn facial video segments and explicit rating information, and mapping the user feedback into an appeal intensity increment and a resistance intensity increment; inputting the multimodal anthropomorphic response, the user feedback and the dialogue context into the multimodal cognition large model to generate a summary, the summary comprising the user's appeal points, the execution status of the co-emotion target and the triggering status of the safety control flag; inputting the emotion cognition state results, the appeal intensity increments and the resistance intensity increments arranged by dialogue turn into an emotion trajectory generation module, performing weighted smoothing of the emotion cognition state vectors according to the credibility parameter set, and obtaining turn-by-turn negative emotion intensity with a preset negative emotion mapping table to form the emotion trajectory; and writing the summary and the emotion trajectory into the user portrait library, updating the short-term user cognition state based on the credibility parameter set and a preset credibility threshold, and updating the long-term user cognition state when the emotion trajectory remains stable over a preset number of consecutive turns, forming the updated user portrait library data (see the portrait-update sketch after the claims).
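Illustrative sketches

The rhythm synchronization described in step S2 and claim 3 compares frame-level energy and fundamental-frequency change against preset pause and emphasis thresholds, then intercepts the face frames nearest to the resulting rhythm moments. The Python sketch below illustrates that idea only; the threshold values, the 10 ms hop size and the function names (rhythm_moments, pick_face_frames) are assumptions not stated in the patent.

import numpy as np

def rhythm_moments(energy, f0, hop_s=0.01,
                   pause_thresh=0.02, emph_energy=0.6, emph_f0_delta=20.0):
    # energy, f0: frame-level arrays for one dialogue turn (illustrative inputs).
    # Threshold values are placeholders; the claim only states that frame-level
    # energy and fundamental-frequency change are compared with preset pause
    # and emphasis thresholds.
    energy = np.asarray(energy, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    t = np.arange(len(energy)) * hop_s
    pauses = t[energy < pause_thresh]                     # low-energy frames -> pause positions
    f0_delta = np.abs(np.diff(f0, prepend=f0[0]))         # frame-to-frame F0 change
    emphases = t[(energy > emph_energy) | (f0_delta > emph_f0_delta)]
    return np.unique(np.concatenate([pauses, emphases]))  # sorted speech rhythm moments (seconds)

def pick_face_frames(rhythm_times, frame_times):
    # For each rhythm moment, pick the index of the nearest face-video frame.
    frame_times = np.asarray(frame_times, dtype=float)
    return [int(np.argmin(np.abs(frame_times - rt))) for rt in rhythm_times]

In practice the returned frame indices would be used to cut the turn facial video segment into the synchronized segment sequence that feeds the feature extraction networks.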
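Claim 3's credibility-constrained fusion layer weights the semantic, acoustic and expression-posture features with gates computed from the features and the credibility parameter set. A minimal PyTorch sketch follows; the layer widths (128/3 gating, 128/64 state head, 32-neuron uncertainty branch) follow the claim, while the ReLU activations, the softmax gating and the exact form of the uncertainty head are assumptions.

import torch
import torch.nn as nn

class CredibilityGatedFusion(nn.Module):
    # Sketch of the credibility-constrained fusion layer of claim 3; not the
    # patented implementation, just one plausible realization.
    def __init__(self, feat_dim=256):
        super().__init__()
        self.gate = nn.Sequential(            # fusion-weight network: 128 then 3 units
            nn.Linear(3 * feat_dim + 3, 128), nn.ReLU(),
            nn.Linear(128, 3), nn.Softmax(dim=-1))
        self.state = nn.Sequential(           # emotion cognition state head: 128 then 64 units
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 64))
        self.uncertainty = nn.Sequential(     # uncertainty from the 3 credibility parameters (assumed head)
            nn.Linear(3, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, semantic, acoustic, expression, credibility):
        # credibility: [occlusion, noise, contradiction] indices, shape (batch, 3)
        feats = torch.stack([semantic, acoustic, expression], dim=1)   # (batch, 3, 256)
        w = self.gate(torch.cat([semantic, acoustic, expression, credibility], dim=-1))
        fused = (w.unsqueeze(-1) * feats).sum(dim=1)                   # weighted sum -> 256-d fused features
        return self.state(fused), self.uncertainty(credibility), credibility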
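Claim 4 fuses the long-term and short-term 32-dimensional user cognition states with a portrait dependency coefficient derived from the uncertainty and contradiction flags, then concatenates the result into the 100-dimensional co-emotion decision state (64 + 1 + 3 + 32). A sketch follows; the coefficient values 0.3/0.5/0.7 and both thresholds are illustrative assumptions, since the claim only says the coefficient is determined from the two flags.

import numpy as np

def build_decision_state(emotion_state, uncertainty, credibility,
                         long_term, short_term,
                         unc_thresh=0.5, contra_thresh=0.5):
    # emotion_state: 64-d vector; credibility: [occlusion, noise, contradiction];
    # long_term, short_term: 32-d user cognition states from the portrait library.
    credibility = np.asarray(credibility, dtype=float)
    unc_flag = uncertainty > unc_thresh
    contra_flag = credibility[2] > contra_thresh
    # Assumed mapping from the two flags to the portrait dependency coefficient.
    alpha = 0.7 if (unc_flag and contra_flag) else 0.5 if (unc_flag or contra_flag) else 0.3
    user_state = alpha * np.asarray(long_term) + (1.0 - alpha) * np.asarray(short_term)
    # Concatenate in a fixed dimension order -> 100-d co-emotion decision state vector.
    return np.concatenate([np.asarray(emotion_state), [uncertainty], credibility, user_state])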
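Claim 5 computes a risk score from the negative emotion intensity and the uncertainty, derives the safety control flag by comparison with a risk threshold, and scales the language, speech-rhythm and avatar-action intensities according to the contradiction, noise and occlusion indices. The sketch below shows one possible realization; the averaged risk formula, the 0.5 attenuation factors and all threshold values are assumptions.

def policy_controls(negative_intensity, uncertainty, credibility,
                    raw_intensity, risk_thresh=0.6,
                    occl_thresh=0.5, noise_thresh=0.5, contra_thresh=0.5):
    # raw_intensity: (language, speech rhythm, avatar action) intensities from the output head.
    risk = 0.5 * negative_intensity + 0.5 * uncertainty   # assumed risk combination
    safety_flag = risk > risk_thresh                      # safety control flag
    lang, prosody, action = raw_intensity
    occl, noise, contra = credibility
    if contra > contra_thresh:
        lang *= 0.5        # high contradiction -> shift language toward clarification
    if noise > noise_thresh:
        prosody *= 0.5     # high noise -> narrower prosody variation range
    if occl > occl_thresh:
        action *= 0.5      # heavy occlusion -> smaller avatar action amplitude
    return risk, safety_flag, (lang, prosody, action)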
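Claim 7 smooths the per-turn emotion cognition state vectors with weights derived from the credibility parameter set, refreshes the short-term user cognition state when credibility is sufficient, and folds the result into the long-term state only once the trajectory has been stable for a preset number of turns. A sketch under assumptions the claim leaves open: the weighting formula, the learning rate, the stability test, the use of the first state component as negative intensity, and the 64-to-32-dimensional projection are all illustrative.

import numpy as np

def update_portrait(emotion_states, credibilities, short_term, long_term,
                    cred_thresh=0.5, stable_turns=3, stable_eps=0.1, lr=0.2):
    emotion_states = np.asarray(emotion_states, dtype=float)   # (turns, 64)
    credibilities = np.asarray(credibilities, dtype=float)     # (turns, 3): occlusion, noise, contradiction
    weights = 1.0 - credibilities.mean(axis=1, keepdims=True)  # lower indices -> higher trust (assumed)
    trajectory = weights * emotion_states                      # credibility-weighted emotion trajectory
    negative = trajectory[:, 0]                                # assumed proxy for turn-by-turn negative intensity

    if credibilities[-1].mean() < cred_thresh:                 # trustworthy last turn -> refresh short-term state
        short_term = (1 - lr) * np.asarray(short_term) + lr * trajectory[-1, :32]
    if len(negative) >= stable_turns and np.ptp(negative[-stable_turns:]) < stable_eps:
        long_term = (1 - lr) * np.asarray(long_term) + lr * np.asarray(short_term)
    return trajectory, negative, short_term, long_term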

Description

Co-emotion-driven multimodal cognition large-model anthropomorphic interaction optimization method

Technical Field

The invention relates to the technical field of human-computer interaction and multimodal artificial intelligence, and in particular to a co-emotion-driven multimodal cognition large-model anthropomorphic interaction optimization method.

Background

As virtual avatars and companion robots are deployed in service, education, healthcare and other scenarios, human-computer interaction is shifting from single-channel text to multimodal collaboration across speech, vision and semantics. Co-emotion (empathy) is regarded as key to improving trust and user stickiness, but speech noise, facial occlusion and cross-modal emotion inconsistency in real scenes increase the complexity of perception and generation control. How to align data at the turn and rhythm levels and apply reliable emotional cues on the generation side has become an important background problem of multimodal interaction.

In the prior art, voice activity detection is generally used for turn segmentation, text is obtained through speech recognition, face detection and keypoint extraction are combined to obtain expression and posture features, semantics, acoustics and vision are fused at the feature level with convolutional, recurrent or attention mechanisms to estimate emotion or intention, and dialogue management or a large language model is then called to generate a reply; rules or blacklists provide basic safety filtering, some systems introduce user preferences or portraits as static parameters to adjust tone or content, and at the output end the prosody of speech synthesis and the virtual avatar's actions are mostly realized with fixed templates or range mappings, with interaction generally driven turn by turn. However, common schemes stop at turn-level alignment and lack rhythm-level synchronization based on pauses and emphasis, so cross-modal cue timing inconsistency easily occurs; the fusion stage lacks explicit credibility constraints modeling occlusion, noise and cross-modal contradiction; state estimation is insufficiently coupled with subsequent control; the strategy side lacks unified control linking demand level, targets, expression intensity, persona constraints and risk score; and there is no closed-loop basis for generation, safety-template switching and user portrait updating. Therefore, a multimodal cognition large-model anthropomorphic interaction optimization method that overcomes the defects of the prior art is a problem to be solved by those skilled in the art.

Disclosure of the Invention

The invention aims to provide a co-emotion-driven multimodal cognition large-model anthropomorphic interaction optimization method, so as to solve the core technical problem of constructing, for virtual-avatar terminal and companion-robot scenarios, a co-emotion-driven multimodal anthropomorphic interaction closed loop that unifies turn and rhythm synchronization, credibility-constrained fusion, user-portrait-adaptive and strategy-driven generation, safety convergence and portrait updating, thereby realizing stable and personalized scene interaction.
According to an embodiment of the invention, the co-emotion-driven multimodal cognition large-model anthropomorphic interaction optimization method comprises the following steps: S1, acquiring user text, user voice and facial video during interaction, and forming multimodal interaction segment sequence data according to dialogue turns and timestamps; S2, inputting the multimodal interaction segment sequence data into a co-emotion cue synchronous fusion model, determining turn segmentation moments according to dialogue turns, determining speech rhythm moments according to voice pauses and accents, intercepting face frames at the speech rhythm moments to form a synchronized segment sequence, sequentially extracting semantic features, acoustic features and expression-posture features from the synchronized segment sequence, calculating an occlusion index, a noise index and a contradiction index to obtain a credibility parameter set, and performing fusion under the credibility parameter set to obtain an emotion cognition state result, wherein the emotion cognition state result comprises an emotion cognition state vector, an uncertainty and the credibility parameter set; S3, searching a user portrait library according to the emotion cognition state result to obtain a user cognition state, and combining the user cognition state and the emotion cognition state result to form a co-emotion decision state; S4, inputting the co-emotion decision state into a co-emotion strategy control model, judging a co-emotion demand level based on the emotion cognition state result in the co-emotion decision state, determining a co-emotion target, calculating language intensity, speech rhythm intensity and avatar action intensity under the constraint of a cr