CN-121979438-A - Intelligent interactive intention understanding method and system based on multi-mode fusion

CN121979438A

Abstract

The application provides an intelligent interactive intention understanding method and system based on multi-modal fusion, relating to the technical field of intelligent interaction. Path fractal dimension calculation is performed on a sight free path sequence to obtain a sight intention deterministic index, and fluctuation singular spectrum analysis is performed on a touch pressure fluctuation sequence to obtain a touch intention stability index. An intention understanding model is invoked to perform index competition game processing on the two indexes and generate a user real intention identifier. According to the user real intention identifier, an interactive response behavior chain is extracted from an intention response topological network and compiled into an instruction, which is sent to the target application system to trigger a feedback operation. The application improves the accuracy of intelligent interaction and the user experience.

Inventors

HAN XIANG
Ran Yunlong
HU HAIJUN

Assignees

上海明奇网络科技有限公司

Dates

Publication Date: 2026-05-05
Application Date: 2026-04-08

Claims (10)

1. An intelligent interactive intention understanding method based on multi-modal fusion, characterized by comprising the following steps:
presenting an ambiguity resolution option set corresponding to the current interaction round on a user interaction interface, and synchronously collecting a sight free path sequence and a touch pressure fluctuation sequence of a user for each ambiguity resolution option in the ambiguity resolution option set;
performing path fractal dimension calculation processing on the sight free path sequence to obtain a sight intention deterministic index, and performing fluctuation singular spectrum analysis processing on the touch pressure fluctuation sequence to obtain a touch intention stability index;
invoking a pre-built intention understanding model to perform index competition game processing on the sight intention deterministic index and the touch intention stability index, and generating a user real intention identifier corresponding to the current interaction round; and
dynamically extracting a corresponding interactive response behavior chain from a preset intention response topological network according to the user real intention identifier, compiling the interactive response behavior chain to generate an interactive response instruction, and sending the interactive response instruction to a target application system to trigger a corresponding interactive feedback operation.
2. The intelligent interactive intention understanding method based on multi-modal fusion according to claim 1, wherein performing path fractal dimension calculation processing on the sight free path sequence to obtain the sight intention deterministic index, and performing fluctuation singular spectrum analysis processing on the touch pressure fluctuation sequence to obtain the touch intention stability index comprises:
performing sight starting point calibration processing on the sight free path sequence, identifying from the sight free path sequence the projection starting point coordinates at which the user's line of sight is first projected onto the screen area corresponding to each ambiguity resolution option, and recording the sight projection starting time corresponding to each ambiguity resolution option;
performing sight free track tracking processing on the sight free path sequence, and extracting from the sight free path sequence a sight free track point coordinate sequence formed by the user's line of sight wandering continuously on the screen, starting from the projection starting point coordinates;
performing grid coverage counting processing on the sight free track point coordinate sequence, mapping the sight free track point coordinate sequence into a grid coordinate system of a preset scale, and counting the number of grid cells covered by the sight free track point coordinate sequence as a first coverage count parameter;
performing grid scale scaling processing on the sight free track point coordinate sequence, reducing the grid cell size of the grid coordinate system according to a preset scaling proportion to generate a scaled grid coordinate system, and re-counting the number of grid cells covered by the sight free track point coordinate sequence in the scaled grid coordinate system as a second coverage count parameter;
calculating a fractal dimension approximation value of the sight free track point coordinate sequence according to the first coverage count parameter, the second coverage count parameter and the preset scaling proportion, and taking the fractal dimension approximation value as a sight free track fractal dimension parameter corresponding to the sight free path sequence;
performing negative mapping processing on the sight free track fractal dimension parameter by substituting it into a negative correlation mapping function, and calculating the sight intention deterministic index, wherein the sight intention deterministic index is negatively correlated with the sight free track fractal dimension parameter;
performing pressure starting point detection processing on the touch pressure fluctuation sequence, extracting from the touch pressure fluctuation sequence a touch pressure starting value generated when the user's finger first contacts the screen, and recording a pressure starting moment corresponding to the touch pressure starting value;
performing pressure fluctuation interval division processing on the touch pressure fluctuation sequence, and dividing the touch pressure fluctuation sequence into a pressure rising fluctuation subsequence, a pressure maintaining fluctuation subsequence and a pressure falling fluctuation subsequence according to the touch pressure starting value;
performing rising trend fitting processing on the pressure rising fluctuation subsequence, calculating the difference values of adjacent touch pressure values in the pressure rising fluctuation subsequence, and dividing the sum of all the difference values by the total time span of the pressure rising fluctuation subsequence to generate a pressure rising average rate parameter;
performing maintaining stability calculation processing on the pressure maintaining fluctuation subsequence, calculating the standard deviation of all touch pressure values in the pressure maintaining fluctuation subsequence, and taking the reciprocal of the standard deviation as a pressure maintaining stability parameter;
performing falling trend fitting processing on the pressure falling fluctuation subsequence, calculating the absolute difference values of adjacent touch pressure values in the pressure falling fluctuation subsequence, and dividing the sum of all the absolute difference values by the total time span of the pressure falling fluctuation subsequence to generate a pressure falling average rate parameter;
constructing a touch pressure fluctuation feature vector from the pressure rising average rate parameter, the pressure maintaining stability parameter and the pressure falling average rate parameter, inputting the touch pressure fluctuation feature vector into a preset singular spectrum analysis model for singular value decomposition processing, and generating at least one singular value component; and
extracting, from the at least one singular value component, the main singular value component with the largest energy ratio, and taking the amplitude of the main singular value component as the touch intention stability index.
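The two-scale grid counting and the three pressure parameters recited in claim 2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function names, the mapping `1 / (1 + D)` used as the negative correlation mapping function, the uniform sampling interval `dt`, and the small `eps` guard against a zero standard deviation are all hypothetical choices that the claim does not specify.

```python
import math
import statistics


def count_covered_cells(points, cell_size):
    """Count distinct grid cells of side `cell_size` touched by the track points."""
    return len({(math.floor(x / cell_size), math.floor(y / cell_size))
                for x, y in points})


def fractal_dimension_estimate(points, cell_size, scaling=0.5):
    """Two-scale box-counting estimate: D ~ log(N2 / N1) / log(1 / scaling)."""
    n1 = count_covered_cells(points, cell_size)            # first coverage count
    n2 = count_covered_cells(points, cell_size * scaling)  # second coverage count
    return math.log(n2 / n1) / math.log(1.0 / scaling)


def gaze_certainty_index(dimension):
    """Hypothetical negative-correlation mapping: larger dimension, lower index."""
    return 1.0 / (1.0 + dimension)


def pressure_features(rise, hold, fall, dt, eps=1e-9):
    """Rise rate, hold stability and fall rate for uniformly sampled pressure values.

    `dt` (sampling interval in seconds) and `eps` are assumptions of this sketch.
    """
    rise_rate = sum(b - a for a, b in zip(rise, rise[1:])) / (dt * (len(rise) - 1))
    hold_stability = 1.0 / (statistics.pstdev(hold) + eps)  # reciprocal of std dev
    fall_rate = sum(abs(b - a) for a, b in zip(fall, fall[1:])) / (dt * (len(fall) - 1))
    return rise_rate, hold_stability, fall_rate
```

Under this estimate a gaze track that runs along a straight line yields a dimension near 1, while a track that fills an area yields a dimension near 2, so a wandering gaze maps to a lower certainty index than a direct one.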
3. The intelligent interactive intention understanding method based on multi-modal fusion according to claim 1, wherein invoking the pre-built intention understanding model to perform index competition game processing on the sight intention deterministic index and the touch intention stability index, and generating the user real intention identifier corresponding to the current interaction round comprises:
inputting the sight intention deterministic index into a sight index encoder of the intention understanding model to perform index feature embedding processing and generate a sight deterministic feature vector, wherein the sight deterministic feature vector comprises a numerical code corresponding to the sight intention deterministic index and a path morphology code corresponding to the sight free path sequence;
inputting the touch intention stability index into a touch index encoder of the intention understanding model to perform index feature embedding processing and generate a touch stability feature vector, wherein the touch stability feature vector comprises a numerical code corresponding to the touch intention stability index and a fluctuation morphology code corresponding to the touch pressure fluctuation sequence;
invoking an index competition game module of the intention understanding model to perform zero-sum game calculation processing on the sight deterministic feature vector and the touch stability feature vector, and generating a sight touch game payoff matrix, wherein the sight touch game payoff matrix represents the payoff distribution of the sight deterministic feature vector and the touch stability feature vector under different strategy combinations;
performing a first round of strategy adjustment processing on the sight deterministic feature vector according to the sight touch game payoff matrix, and performing strategy correction processing, according to a preset first game coefficient, on the feature dimensions of the sight deterministic feature vector that are in strategy conflict with the touch stability feature vector, to generate a sight feature vector after the preliminary game;
performing a first round of strategy adjustment processing on the touch stability feature vector according to the sight touch game payoff matrix, and performing strategy correction processing, according to a preset second game coefficient, on the feature dimensions of the touch stability feature vector that are in strategy conflict with the sight deterministic feature vector, to generate a touch feature vector after the preliminary game;
inputting the sight feature vector after the preliminary game and the touch feature vector after the preliminary game into a Nash equilibrium solving layer of the index competition game module to perform equilibrium strategy solving processing, and generating a mixed strategy Nash equilibrium vector that contains an equilibrium strategy probability value corresponding to each ambiguity resolution option; and
performing maximum value index positioning processing on the mixed strategy Nash equilibrium vector, extracting the index position of the element with the maximum value in the mixed strategy Nash equilibrium vector, and determining, according to the index position, the corresponding ambiguity resolution option in the ambiguity resolution option set as the user real intention identifier.
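The last two steps of claim 3, solving for a mixed strategy and then taking the index of its largest element, can be illustrated with a generic zero-sum matrix game. The sketch below is only an assumption-laden illustration: the claim does not say how the payoff matrix is built or which solver the Nash equilibrium solving layer uses, so the payoff matrices here are made up and fictitious play is swapped in as one well-known way to approximate an equilibrium of a zero-sum game. The function names are hypothetical.

```python
def fictitious_play(payoff, iterations=20000):
    """Approximate the row player's equilibrium mixed strategy of a zero-sum game.

    payoff[i][j] is the row player's gain (and the column player's loss) when
    the row player picks i and the column player picks j.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_counts = [0] * n_rows
    col_counts = [0] * n_cols
    row_counts[0] += 1  # arbitrary initial plays
    col_counts[0] += 1
    for _ in range(iterations):
        # Row player maximizes payoff against the column player's empirical play.
        row_values = [sum(payoff[i][j] * col_counts[j] for j in range(n_cols))
                      for i in range(n_rows)]
        best_row = max(range(n_rows), key=row_values.__getitem__)
        # Column player minimizes payoff against the row player's empirical play.
        col_values = [sum(payoff[i][j] * row_counts[i] for i in range(n_rows))
                      for j in range(n_cols)]
        best_col = min(range(n_cols), key=col_values.__getitem__)
        row_counts[best_row] += 1
        col_counts[best_col] += 1
    total = sum(row_counts)
    return [count / total for count in row_counts]


def select_option(options, equilibrium_vector):
    """Maximum value index positioning: return the option with the largest probability."""
    best_index = max(range(len(equilibrium_vector)),
                     key=equilibrium_vector.__getitem__)
    return options[best_index]
```

For example, `select_option(["option_a", "option_b"], fictitious_play([[3, 2], [1, 0]]))` picks the first option, because its row dominates the second regardless of the opponent's choice.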
4. The intelligent interactive intention understanding method based on multi-modal fusion according to claim 1, wherein dynamically extracting the corresponding interactive response behavior chain from the preset intention response topological network according to the user real intention identifier, compiling the interactive response behavior chain to generate the interactive response instruction, and sending the interactive response instruction to the target application system to trigger the corresponding interactive feedback operation comprises:
parsing an intention node code contained in the user real intention identifier, and locating an initial intention node corresponding to the intention node code in the preset intention response topological network;
starting from the initial intention node, performing depth-first traversal processing according to the preset directed edge connection relations in the intention response topological network, and generating at least one candidate response path from the initial intention node to a termination intention node;
acquiring system resource occupation state parameters corresponding to the current interaction round, wherein the system resource occupation state parameters comprise a central processing unit idle rate, a memory available capacity and a network bandwidth remaining amount;
performing resource consumption prediction processing on each candidate response path according to the central processing unit idle rate, the memory available capacity and the network bandwidth remaining amount, and calculating the total amount of system resources that each candidate response path is expected to consume during execution;
selecting, from the at least one candidate response path, the candidate response path with the smallest expected total consumption of system resources as a target interactive response behavior chain;
performing node operation analysis processing on each intention node contained in the target interactive response behavior chain, and extracting a corresponding operation type code and operation parameter list from each intention node;
arranging the operation type codes and operation parameter lists corresponding to the intention nodes in the order in which the intention nodes appear in the target interactive response behavior chain, to generate an operation instruction sequence;
inputting the operation instruction sequence into a preset instruction compiler for instruction compiling, converting the operation instruction sequence into a binary instruction code stream recognizable by the target application system, and thereby generating an instruction code stream part of the interactive response instruction;
extracting a service discovery protocol identifier and a service access endpoint address of the target application system from preset network configuration parameters, and establishing a remote procedure call connection with the target application system according to the service discovery protocol identifier and the service access endpoint address;
encapsulating the instruction code stream part of the interactive response instruction into the message body payload area of a remote procedure call protocol message, adding a call identification field and a call timeout field to the message header, and generating a remote procedure call request message to be sent;
transmitting the remote procedure call request message to the target application system through the established remote procedure call connection, and starting a call timeout timer to wait for a remote procedure call response message returned by the target application system; and
receiving the remote procedure call response message returned by the target application system, parsing a call return status code and call return data from the remote procedure call response message, and writing the call return status code and the call return data into an interaction log record.
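The path enumeration and selection steps of claim 4 can be sketched as below. This is a rough illustration that assumes the intention response topological network is given as an adjacency dictionary and that each node carries a hypothetical `(cpu, memory, network)` demand tuple. The claim does not define the cost model, so scaling each demand by the currently available amount of that resource is an assumption of this sketch, as are all names.

```python
def enumerate_paths(graph, start, terminals):
    """Depth-first enumeration of simple paths from `start` to any terminal node."""
    paths = []

    def dfs(node, path):
        if node in terminals:
            paths.append(path)
            return
        for successor in graph.get(node, []):
            if successor not in path:  # guard against cycles
                dfs(successor, path + [successor])

    dfs(start, [start])
    return paths


def predicted_cost(path, demand, cpu_idle, mem_free, net_free):
    """Hypothetical cost: each node's demand is weighted by how scarce the resource is."""
    total = 0.0
    for node in path:
        cpu, mem, net = demand[node]
        total += cpu / cpu_idle + mem / mem_free + net / net_free
    return total


def select_behavior_chain(graph, start, terminals, demand, cpu_idle, mem_free, net_free):
    """Return the candidate response path with the smallest predicted total cost."""
    candidates = enumerate_paths(graph, start, terminals)
    return min(candidates,
               key=lambda p: predicted_cost(p, demand, cpu_idle, mem_free, net_free))
```

Because the demands are divided by the available amounts, the same path becomes more expensive as the system gets busier, which is one simple way to make the choice depend on the resource state of the current interaction round.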
5. The intelligent interactive intention understanding method based on multi-modal fusion according to claim 1, wherein, before presenting the ambiguity resolution option set corresponding to the current interaction round, the method further comprises:
acquiring a user input voice stream corresponding to the current interaction round, and performing voice recognition processing on the user input voice stream to generate a user input text sequence;
performing semantic slot filling processing on the user input text sequence, and identifying at least one semantic slot contained in the user input text sequence and a slot filling value corresponding to each semantic slot;
performing slot conflict detection processing on the at least one semantic slot, and detecting whether there is a slot conflict event in which the slot filling values of at least two semantic slots contradict each other;
if the slot conflict event is detected, generating slot conflict description information according to the at least two semantic slots involved in the slot conflict event;
querying, from a preset ambiguity intention knowledge graph, at least one ambiguity intention node associated with the slot conflict description information, wherein each ambiguity intention node corresponds to an ambiguity intention to be resolved;
performing node attribute analysis processing on the at least one ambiguity intention node, and extracting a corresponding ambiguity intention description text and ambiguity resolution prompt information from each ambiguity intention node;
generating an ambiguity resolution option corresponding to each ambiguity intention node according to the ambiguity intention description text and the ambiguity resolution prompt information corresponding to that ambiguity intention node, and combining all the ambiguity resolution options into the ambiguity resolution option set; and
presenting the ambiguity resolution option set in a floating window on the user interaction interface, and synchronously displaying the corresponding ambiguity resolution prompt information beside each ambiguity resolution option.
6. The intelligent interactive intention understanding method based on multi-modal fusion according to claim 1, wherein performing path fractal dimension calculation processing on the sight free path sequence to obtain the sight intention deterministic index, and performing fluctuation singular spectrum analysis processing on the touch pressure fluctuation sequence to obtain the touch intention stability index further comprises:
performing sight regression detection processing on the sight free path sequence, identifying in the sight free path sequence regression events in which the user's line of sight leaves the screen area corresponding to a current ambiguity resolution option for the screen area corresponding to another ambiguity resolution option and then returns to the screen area corresponding to the current ambiguity resolution option, and recording the number of sight regressions corresponding to each ambiguity resolution option;
performing regression attenuation adjustment processing on the sight intention deterministic index corresponding to each ambiguity resolution option according to the number of sight regressions corresponding to that ambiguity resolution option, multiplying the sight intention deterministic index by a regression attenuation coefficient that is negatively correlated with the number of sight regressions, and generating a regression-adjusted sight intention deterministic index corresponding to each ambiguity resolution option;
updating the sight intention deterministic index according to the regression-adjusted sight intention deterministic index corresponding to each ambiguity resolution option;
performing pressure mutation detection processing on the touch pressure fluctuation sequence, identifying pressure mutation points in the touch pressure fluctuation sequence, wherein at a pressure mutation point the change of the touch pressure value per unit time exceeds a preset mutation threshold, and recording the mutation time and mutation direction corresponding to each pressure mutation point;
dividing the pressure mutation points into positive pressure mutation points and negative pressure mutation points according to the mutation direction corresponding to each pressure mutation point;
counting the number of positive pressure mutation points and the number of negative pressure mutation points contained in the touch pressure fluctuation sequence, calculating the absolute value of the difference between the two numbers, and dividing the absolute value by the total number of positive and negative pressure mutation points to generate a pressure mutation direction imbalance parameter;
performing imbalance penalty adjustment processing on the touch intention stability index according to the pressure mutation direction imbalance parameter, multiplying the touch intention stability index by an imbalance penalty coefficient that is negatively correlated with the pressure mutation direction imbalance parameter, and generating an imbalance-adjusted touch intention stability index; and
updating the touch intention stability index according to the imbalance-adjusted touch intention stability index.
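The two adjustments in claim 6 reduce to short formulas, sketched below. The imbalance ratio `|pos - neg| / (pos + neg)` follows the claim directly. Everything else is an assumption of this sketch: the attenuation form `1 / (1 + k * n)`, the linear penalty `1 - penalty * imbalance`, the default constants, the uniform sampling interval `dt`, and the choice to return zero imbalance when no mutation points are found.

```python
def regression_adjusted_certainty(certainty, regressions, k=0.2):
    """Attenuate the certainty index as the gaze regression count grows (hypothetical form)."""
    return certainty / (1.0 + k * regressions)


def mutation_direction_imbalance(pressures, dt, threshold):
    """|positive - negative| / (positive + negative) over detected mutation points."""
    positive = negative = 0
    for previous, current in zip(pressures, pressures[1:]):
        rate = (current - previous) / dt  # pressure change per unit time
        if rate > threshold:
            positive += 1
        elif rate < -threshold:
            negative += 1
    total = positive + negative
    if total == 0:
        return 0.0  # assumption: no mutation points means no imbalance
    return abs(positive - negative) / total


def imbalance_adjusted_stability(stability, imbalance, penalty=0.5):
    """Scale the stability index down as the imbalance grows (hypothetical linear penalty)."""
    return stability * (1.0 - penalty * imbalance)
```

With pressure samples `[0, 5, 5, 0, 5]`, a unit sampling interval and a threshold of 3, there are two positive mutation points and one negative, so the imbalance parameter is 1/3.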
7. The intelligent interactive intention understanding method based on multi-modal fusion according to claim 3, wherein invoking the pre-built intention understanding model to perform index competition game processing on the sight intention deterministic index and the touch intention stability index, and generating the user real intention identifier corresponding to the current interaction round further comprises:
acquiring historical interaction trust parameters corresponding to the current interaction round, wherein the historical interaction trust parameters comprise the number of successful interactions and the number of failed interactions between the user and the target application system in the interaction rounds preceding the current interaction round;
calculating a historical interaction success rate parameter as the ratio of the number of successful interactions to the sum of the number of successful interactions and the number of failed interactions;
calculating a trust weighting coefficient according to the historical interaction success rate parameter, and adjusting the first game coefficient and the second game coefficient with the trust weighting coefficient to generate an adjusted first game coefficient and an adjusted second game coefficient;
performing, with the adjusted first game coefficient, a second round of strategy adjustment processing on the sight feature vector after the preliminary game according to the sight touch game payoff matrix, and performing strategy correction processing on the feature dimensions of the sight feature vector after the preliminary game that are in strategy conflict with the touch stability feature vector, to generate a sight feature vector after the secondary game;
performing, with the adjusted second game coefficient, a second round of strategy adjustment processing on the touch feature vector after the preliminary game according to the sight touch game payoff matrix, and performing strategy correction processing on the feature dimensions of the touch feature vector after the preliminary game that are in strategy conflict with the sight deterministic feature vector, to generate a touch feature vector after the secondary game;
inputting the sight feature vector after the secondary game and the touch feature vector after the secondary game into the Nash equilibrium solving layer of the index competition game module to perform secondary equilibrium strategy solving processing, and generating a secondary mixed strategy Nash equilibrium vector; and
performing maximum value index positioning processing on the secondary mixed strategy Nash equilibrium vector, extracting the index position of the element with the maximum value in the secondary mixed strategy Nash equilibrium vector, and determining, according to the index position, the corresponding ambiguity resolution option in the ambiguity resolution option set as the user real intention identifier.
8. The intelligent interactive intention understanding method based on multi-modal fusion according to claim 2, wherein, after generating the user real intention identifier corresponding to the current interaction round, the method further comprises:
storing the sight free path sequence and the touch pressure fluctuation sequence corresponding to the current interaction round in association with the user real intention identifier, to generate a current interaction round game record;
adding the current interaction round game record to a historical game record set, and updating the historical game record set;
extracting, from the updated historical game record set, the historical sight free path sequences and historical touch pressure fluctuation sequences corresponding to all historical interaction rounds that have the same user real intention identifier;
performing path fractal dimension average calculation processing on the historical sight free path sequences corresponding to all the historical interaction rounds, calculating the arithmetic mean of the historical sight free track fractal dimension parameters corresponding to each ambiguity resolution option, and processing the arithmetic mean with the negative correlation mapping function to generate a historical average sight intention deterministic index;
performing fluctuation singular spectrum average analysis processing on the historical touch pressure fluctuation sequences corresponding to all the historical interaction rounds, calculating the arithmetic mean of the amplitudes of the historical main singular value components corresponding to each ambiguity resolution option, and generating a historical average touch intention stability index;
inputting the historical average sight intention deterministic index and the historical average touch intention stability index into the index competition game module of the intention understanding model to perform index competition game processing, and generating a user real intention identifier after the historical game;
performing consistency comparison processing on the user real intention identifier after the historical game and the user real intention identifier, and generating a game model calibration trigger signal if the two are inconsistent; and
invoking a parameter updating module of the intention understanding model according to the game model calibration trigger signal, and fine-tuning the model parameters of the index competition game module of the intention understanding model, using the historical sight free path sequences and the historical touch pressure fluctuation sequences corresponding to all the historical interaction rounds as training data.
9. The intelligent interactive intention understanding method based on multi-modal fusion according to claim 1, wherein dynamically extracting the corresponding interactive response behavior chain from the preset intention response topological network according to the user real intention identifier, compiling the interactive response behavior chain to generate the interactive response instruction, and sending the interactive response instruction to the target application system to trigger the corresponding interactive feedback operation further comprises:
extracting, from the preset intention response topological network, at least two candidate interactive response behavior chains associated with the user real intention identifier, wherein each candidate interactive response behavior chain corresponds to a different response complexity level;
acquiring user cognitive load parameters corresponding to the current interaction round, wherein the user cognitive load parameters comprise the continuous interaction duration of the user before the current interaction round, the number of misoperations by the user in the current interaction round, and the response delay duration of the user in the current interaction round;
constructing a user cognitive load feature vector from the continuous interaction duration, the number of misoperations and the response delay duration, inputting the user cognitive load feature vector into a preset cognitive load evaluation model to perform load level prediction processing, and generating a current cognitive load level identifier of the user;
selecting, from the at least two candidate interactive response behavior chains, the candidate interactive response behavior chain that matches the current cognitive load level identifier of the user as a target interactive response behavior chain;
performing behavior chain decoupling processing on the target interactive response behavior chain, and splitting the target interactive response behavior chain into at least two sub-behavior chains that can be executed in parallel;
performing resource competition analysis processing on the at least two sub-behavior chains, and detecting whether a resource competition conflict exists among them during execution;
if a resource competition conflict is detected, performing execution order adjustment processing on the at least two sub-behavior chains, adjusting the sub-behavior chains involved in the resource competition conflict into a serial execution order, and generating an adjusted sub-behavior chain execution sequence;
if no resource competition conflict is detected, marking the at least two sub-behavior chains as a parallel execution sub-behavior chain set;
performing, according to the adjusted sub-behavior chain execution sequence or the parallel execution sub-behavior chain set, node operation analysis processing on each intention node contained in the target interactive response behavior chain, and extracting a corresponding operation type code and operation parameter list from each intention node; and
arranging the operation type codes and operation parameter lists corresponding to the intention nodes according to the adjusted sub-behavior chain execution sequence, or according to the execution order of the intention nodes in the parallel execution sub-behavior chain set, to generate an optimized operation instruction sequence.
10. A multimodal fusion-based intelligent interactive intention understanding system comprising a processor and a computer readable storage medium storing machine executable instructions that when executed by the processor implement the multimodal fusion-based intelligent interactive intention understanding method of any of claims 1-9.

Description

Intelligent interactive intention understanding method and system based on multi-modal fusion

Technical Field

The application relates to the technical field of intelligent interaction, and in particular to an intelligent interactive intention understanding method and system based on multi-modal fusion.

Background

In the field of intelligent interaction, accurately understanding the user's interaction intention is key to efficient and natural human-machine interaction. Most existing intelligent interactive intention understanding methods rely on data of a single modality, for example inferring the user's intention only from voice input, text input, or simple touch operations. However, single-modality data is often limited and cannot fully and accurately reflect what the user actually intends. Taking voice interaction as an example, the user's speech may be affected by environmental noise, accent and other factors, causing deviations in semantic understanding, while simple touch operations such as clicking and sliding can carry different meanings in different scenarios, making the user's intention difficult to judge accurately. In addition, when the user's input is ambiguous, existing methods generally just ask the user to re-enter the input or to choose among preset options; they lack in-depth mining and analysis of the user's latent intention and cannot disambiguate effectively, which harms the smoothness of interaction and the user experience.

Disclosure of Invention

In view of the above, the present application aims to provide an intelligent interactive intention understanding method and system based on multi-modal fusion.
According to a first aspect of the present application, there is provided an intelligent interactive intention understanding method based on multi-modal fusion, the method comprising: presenting, on a user interaction interface, an ambiguity resolution option set corresponding to the current interaction round, and synchronously collecting a sight free path sequence and a touch pressure fluctuation sequence of the user for each ambiguity resolution option in the ambiguity resolution option set; performing path fractal dimension calculation on the sight free path sequence to obtain a sight intention deterministic index, and performing fluctuation singular spectrum analysis on the touch pressure fluctuation sequence to obtain a touch intention stability index; invoking a pre-built intention understanding model to perform index competition game processing on the sight intention deterministic index and the touch intention stability index, and generating a user real intention identifier corresponding to the current interaction round; and dynamically extracting a corresponding interactive response behavior chain from a preset intention response topological network according to the user real intention identifier, compiling the interactive response behavior chain to generate an interactive response instruction, and sending the interactive response instruction to a target application system to trigger the corresponding interactive feedback operation.
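As an illustration of the path fractal dimension step only, the following is a minimal sketch that estimates the fractal dimension of a two-dimensional gaze path. This passage does not state which estimator the application uses; the Katz estimator, the function name `katz_fractal_dimension`, and the representation of the path as a list of (x, y) screen coordinates are all assumptions made for this example. How the resulting dimension would be turned into a sight intention deterministic index is not shown, since it is not described here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # assumed representation: (x, y) screen coordinates


def katz_fractal_dimension(path: Sequence[Point]) -> float:
    """Estimate the Katz fractal dimension of a 2-D path.

    Hypothetical helper: the Katz estimator is an assumption for
    illustration, not the method confirmed by the source text.
    """
    if len(path) < 3:
        raise ValueError("need at least three gaze samples")

    # Total path length L: sum of distances between consecutive samples.
    total_length = sum(math.dist(a, b) for a, b in zip(path, path[1:]))

    # Extent d: largest distance from the first sample to any later sample.
    extent = max(math.dist(path[0], p) for p in path[1:])

    if total_length == 0.0:
        raise ValueError("degenerate path: all samples coincide")

    n_steps = len(path) - 1

    # Katz: D = log(n) / (log(n) + log(d / L)).
    # The denominator equals log(n * d / L); it must be positive,
    # otherwise the estimate is undefined for this path.
    ratio = n_steps * extent / total_length
    if ratio <= 1.0:
        raise ValueError("path too folded for a Katz estimate")

    return math.log10(n_steps) / (math.log10(n_steps) + math.log10(extent / total_length))
```

Under this estimator a perfectly straight path has dimension 1, and a more wandering path yields a larger value, so a value close to 1 could be read as a direct, decided gaze movement toward an option. That reading is an interpretation for this sketch, not a statement from the application.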
According to a second aspect of the present application, there is provided an intelligent interactive intention understanding system based on multi-modal fusion, the system comprising a processor and a machine-readable storage medium storing machine-executable instructions, wherein the processor implements the above intelligent interactive intention understanding method based on multi-modal fusion when executing the machine-executable instructions. According to any one of the above aspects, the application has the following technical effects. The ambiguity resolution option set is presented on the user interaction interface, and the sight free path sequence and the touch pressure fluctuation sequence of the user are collected synchronously, so that multi-modal information such as the user's vision and touch is fully used. Path fractal dimension calculation is performed on the sight free path sequence to obtain the sight intention deterministic index, and fluctuation singular spectrum analysis is performed on the touch pressure fluctuation sequence to obtain the touch intention stability index, so that the user's intention characteristics are quantified from different angles. The intention understanding model is invoked to perform index competition game processing on the two indexes, which simulates the complex thinking of human beings in the decision-making process and makes the generated user real intention identifier more accurate and reliable. According to the user real intention identifier, an interactive response behavior chain is dynamically extracted and an interactive response instruction is generated, so that personalized interactive feedback is realized, and the accuracy, fluency and user experience of intelligent intera