CN-122020573-A - Interaction method based on multi-mode large model, storage medium and electronic device
Abstract
The application discloses an interaction method based on a multi-modal large model, a storage medium, and an electronic device, and relates to the technical field of smart homes. The interaction method comprises: performing deep semantic parsing on a user instruction through a routing network to generate a routing vector group, the routing vector group comprising an encoder control vector and a fuser control vector. The encoder control vector is used for modulating the self-attention mechanism of each modal encoder of at least one modal encoder in the multi-modal large model to generate modal features of the target modal information input to that modal encoder, and the fuser control vector is used for fusing the modal features of the multi-modal information corresponding to the user instruction to generate interaction information corresponding to the user instruction. The method solves the problem that a multi-modal large model cannot dynamically process multi-modal information based on user instructions, achieving the technical effect of strongly correlating the multi-modal large model's processing of multi-modal information with the user instruction.
Inventors
- LIU YANJIA
- TIAN YUNLONG
- WANG MIAO
- NIU LI
Assignees
- 青岛海尔科技有限公司
- 海尔优家智能科技(北京)有限公司
- 青岛海尔智能家电科技有限公司
Dates
- Publication Date: 20260512
- Application Date: 20260410
Claims (15)
- 1. An interaction method based on a multi-modal large model, characterized by comprising the following steps: performing deep semantic parsing on a user instruction through a routing network to generate a routing vector group, wherein the routing vector group comprises an encoder control vector and a fuser control vector, the encoder control vector is used for controlling at least one modal encoder in the multi-modal large model, and the routing network comprises a text encoder, a multi-layer perceptron connected to the text encoder, and a vector decomposition layer connected to the multi-layer perceptron; generating modal features of the target modal information input to each modal encoder of the at least one modal encoder by modulating the self-attention mechanism of that modal encoder with the encoder control vector; and controlling the multi-modal large model, through the fuser control vector, to fuse the modal features of the multi-modal information corresponding to the user instruction and generate interaction information corresponding to the user instruction, wherein the modal features of the multi-modal information comprise the modal features of the target modal information.
- 2. The interaction method based on the multi-modal large model according to claim 1, wherein performing deep semantic parsing on the user instruction through the routing network to generate the routing vector group comprises: converting the user instruction into a target vector through the text encoder and the multi-layer perceptron; slicing the target vector along a preset dimension through the vector decomposition layer to obtain a plurality of slice vectors; and processing each slice vector with the activation function corresponding to that slice vector to obtain the routing vector group (an illustrative sketch of this routing step is given after the claims).
- 3. The interaction method based on the multi-modal large model according to claim 1, wherein generating the modal features of the target modal information input to each modal encoder of the at least one modal encoder by modulating the self-attention mechanism of that modal encoder with the encoder control vector comprises: modulating the query matrix and the key matrix of the self-attention mechanism respectively through the encoder control vector; and determining the modal features of the target modal information through the modulated query matrix and the modulated key matrix.
- 4. The interaction method based on the multi-modal large model according to claim 3, wherein modulating the query matrix and the key matrix of the self-attention mechanism respectively through the encoder control vector comprises: determining the modulated query matrix and the modulated key matrix respectively by using a spatial routing vector and a semantic routing vector in the encoder control vector, wherein the spatial routing vector is used for indicating the position information in the target modal information that the user instruction requires attention to, and the semantic routing vector is used for indicating the attribute information in the target modal information that the user instruction requires attention to.
- 5. The interaction method based on the multi-modal large model according to claim 3, wherein modulating the query matrix and the key matrix of the self-attention mechanism respectively through the encoder control vector comprises: in the case where the encoder control vector includes a spatial routing vector and a semantic routing vector, determining the modulated query matrix by the formula $Q' = Q \odot (W_q\, r_{\mathrm{spa}})$, and determining the modulated key matrix by the formula $K' = K \odot (W_k\, r_{\mathrm{sem}})$; wherein $Q$ represents the query matrix, $Q'$ represents the modulated query matrix, $K$ represents the key matrix, $K'$ represents the modulated key matrix, $W_q$ and $W_k$ are learnable projection matrices, $\odot$ represents element-wise multiplication, $r_{\mathrm{spa}}$ is the spatial routing vector, and $r_{\mathrm{sem}}$ is the semantic routing vector.
- 6. The interaction method based on the multi-modal large model according to claim 3, wherein determining the modal features of the target modal information through the modulated query matrix and the modulated key matrix comprises: determining the modal features of the target modal information by the formula $\mathrm{ConditionalAttention}(Q, K, V \mid r) = \mathrm{softmax}\!\left(\frac{Q' K'^{\top}}{\sqrt{d_k}}\right) V$; wherein $Q$ represents the query matrix, $Q'$ represents the modulated query matrix, $K$ represents the key matrix, $K'$ represents the modulated key matrix, $\mathrm{ConditionalAttention}(Q, K, V \mid r)$ represents the modal features of the target modal information, $V$ represents the value matrix obtained by linear transformation of the target modal information, and $d_k$ represents the dimension of the key vectors in the key matrix (an illustrative sketch of this conditional attention is given after the claims).
- 7. The interaction method based on the multi-modal large model according to claim 1, wherein generating the modal features of the target modal information input to each modal encoder of the at least one modal encoder by modulating the self-attention mechanism of that modal encoder with the encoder control vector comprises: modulating the query matrix of the self-attention mechanism through a spatial routing vector and a first projection bias term, and modulating the key matrix of the self-attention mechanism through a semantic routing vector and a second projection bias term, wherein the spatial routing vector is used for indicating the position information in the target modal information that the user instruction requires attention to, and the semantic routing vector is used for indicating the attribute information in the target modal information that the user instruction requires attention to; and determining the modal features of the target modal information through an attention mask matrix, the modulated query matrix, and the modulated key matrix.
- 8. The interaction method based on the multi-modal large model according to claim 7, wherein determining the modal features of the target modal information through the attention mask matrix, the modulated query matrix, and the modulated key matrix comprises: determining the modal features of the target modal information by the formula $\mathrm{ConditionalAttention}(Q, K, V \mid r) = \mathrm{softmax}\!\left(\frac{Q' K'^{\top}}{\sqrt{d_k}} + M\right) V$; wherein $Q$ represents the query matrix, $Q'$ represents the modulated query matrix, $K$ represents the key matrix, $K'$ represents the modulated key matrix, $\mathrm{ConditionalAttention}(Q, K, V \mid r)$ represents the modal features of the target modal information, $V$ represents the value matrix obtained by linear transformation of the target modal information, $d_k$ represents the dimension of the key vectors in the key matrix, and $M$ represents the attention mask matrix (an illustrative sketch of this masked variant is given after the claims).
- 9. The interaction method based on the multi-modal large model according to claim 1, wherein controlling the multi-modal large model, through the fuser control vector, to fuse the modal features of the multi-modal information corresponding to the user instruction and generate the interaction information corresponding to the user instruction comprises: performing modal gating on the modal features of the multi-modal information through a modal weight vector to obtain gated features; and fusing the gated features through a fusion control vector to generate the interaction information corresponding to the user instruction; wherein the fuser control vector comprises the modal weight vector and the fusion control vector, the modal weight vector is used for indicating the relative importance of each item of modal information to the user instruction, and the fusion control vector is used for indicating the fusion strategy allowed to be adopted for the multi-modal information.
- 10. The interaction method based on the multi-modal large model according to claim 9, wherein performing modal gating on the modal features of the multi-modal information through the modal weight vector to obtain the gated features comprises: determining the gated features by the formula $\tilde{F}_i = \sigma(w_i) \cdot F_i$; wherein $\tilde{F}_i$ represents the gated feature of the $i$-th modality, $F_i$ represents the modal feature of the $i$-th item of modal information, $\sigma$ is the sigmoid function, and $w_i$ is the component of the modal weight vector corresponding to the $i$-th modality (an illustrative sketch of this gating step is given after the claims).
- 11. The interaction method based on the multi-modal large model according to claim 9, wherein fusing the gated features through the fusion control vector to generate the interaction information corresponding to the user instruction comprises: determining a fusion strategy weight vector corresponding to the fusion control vector through a dynamic fuser; fusing the gated features through the fusion strategy weight vector to obtain a fusion vector; and generating the interaction information based on the fusion vector through the multi-modal large model.
- 12. The interaction method based on the multi-modal large model according to claim 11, wherein determining the fusion strategy weight vector corresponding to the fusion control vector through the dynamic fuser comprises: determining the fusion strategy weight vector by the formula $\alpha = \mathrm{softmax}(r_{\mathrm{fuse}})$; wherein $\alpha$ represents the fusion strategy weight vector, and $r_{\mathrm{fuse}}$ represents the fusion control vector, the fusion control vector being used for specifying the fusion strategy.
- 13. The interaction method based on the multi-modal large model according to claim 11, wherein fusing the gated features through the fusion strategy weight vector to obtain the fusion vector comprises: determining the fusion vector by the formula $F_{\mathrm{fuse}} = \sum_{j=1}^{S} \alpha_j \, \mathrm{Strategy}_j(\tilde{F}_1, \dots, \tilde{F}_N)$; wherein $F_{\mathrm{fuse}}$ represents the fusion vector, $\mathrm{Strategy}_j$ is the $j$-th predefined fusion strategy, $N$ is the number of modalities of the multi-modal information, $S$ represents the number of predefined fusion strategies, $\alpha_j$ represents the weight of the $j$-th fusion strategy, and $\sum_{j=1}^{S} \alpha_j = 1$ (an illustrative sketch of this dynamic fusion is given after the claims).
- 14. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program when run performs the method of any one of claims 1 to 13.
- 15. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, the processor being arranged to perform the method of any of claims 1 to 13 by means of the computer program.
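The claims above describe a concrete pipeline: route the instruction, modulate each modal encoder's attention, then gate and fuse the modal features. The sketches below are editorial illustrations of that pipeline, not the patent's implementation. This first one sketches the routing network of claims 1 and 2 in Python/PyTorch; the text encoder is stubbed as a mean-pooled embedding, and the slice sizes (`d_spa`, `d_sem`, `n_modal`, `n_strategy`) and the per-slice activation choices are assumptions, not details taken from the patent.

```python
# Minimal sketch of the routing network (claims 1-2), assuming PyTorch.
# The text encoder is stubbed; slice layout and activations are assumptions.
import torch
import torch.nn as nn


class RoutingNetwork(nn.Module):
    def __init__(self, vocab_size=1000, d_text=64, d_spa=16, d_sem=16, n_modal=3, n_strategy=4):
        super().__init__()
        # Stub "text encoder": token embedding + mean pooling.
        self.embed = nn.Embedding(vocab_size, d_text)
        # Multi-layer perceptron connected to the text encoder.
        d_route = d_spa + d_sem + n_modal + n_strategy
        self.mlp = nn.Sequential(nn.Linear(d_text, d_text), nn.ReLU(), nn.Linear(d_text, d_route))
        # Slice sizes used by the vector decomposition layer (assumed layout).
        self.sizes = (d_spa, d_sem, n_modal, n_strategy)

    def forward(self, token_ids):
        pooled = self.embed(token_ids).mean(dim=1)            # deep semantic parse (stub)
        target_vec = self.mlp(pooled)                         # target vector
        # Vector decomposition layer: slice the target vector along the feature dim.
        r_spa, r_sem, r_weight, r_fuse = torch.split(target_vec, self.sizes, dim=-1)
        # Per-slice activation functions (assumed): tanh for the encoder control
        # slices; sigmoid/softmax are applied later by the gating and fusion steps.
        return {
            "r_spatial": torch.tanh(r_spa),    # encoder control: spatial routing vector
            "r_semantic": torch.tanh(r_sem),   # encoder control: semantic routing vector
            "modal_weights": r_weight,         # fuser control: modal weight vector (pre-sigmoid)
            "r_fuse": r_fuse,                  # fuser control: fusion control vector
        }


if __name__ == "__main__":
    routes = RoutingNetwork()(torch.randint(0, 1000, (2, 12)))   # batch of 2 instructions
    print({k: v.shape for k, v in routes.items()})
```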
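Next, a sketch of the conditional self-attention of claims 3 to 6. It assumes the multiplicative modulation reconstructed in claim 5 ($Q' = Q \odot (W_q\, r_{\mathrm{spa}})$, $K' = K \odot (W_k\, r_{\mathrm{sem}})$); the original formula images are not reproduced in this text, so the projections `W_q`, `W_k` and the broadcasting over tokens are assumptions.

```python
# Sketch of conditional attention (claims 3-6), assuming PyTorch and the
# multiplicative modulation reconstructed in claim 5; W_q / W_k are assumed
# learnable projections from the routing vectors to the model dimension.
import math
import torch
import torch.nn as nn


class ConditionalAttention(nn.Module):
    def __init__(self, d_model=64, d_spa=16, d_sem=16):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Learnable projections of the routing vectors (assumed shapes).
        self.W_q = nn.Linear(d_spa, d_model, bias=False)
        self.W_k = nn.Linear(d_sem, d_model, bias=False)

    def forward(self, x, r_spatial, r_semantic):
        # x: (batch, tokens, d_model) features of the target modal information.
        Q, K, V = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Q' = Q ⊙ (W_q r_spa), K' = K ⊙ (W_k r_sem), broadcast over tokens.
        Q_mod = Q * self.W_q(r_spatial).unsqueeze(1)
        K_mod = K * self.W_k(r_semantic).unsqueeze(1)
        d_k = Q.size(-1)
        scores = Q_mod @ K_mod.transpose(-2, -1) / math.sqrt(d_k)
        attn = torch.softmax(scores, dim=-1)
        return attn @ V        # modal features of the target modal information


if __name__ == "__main__":
    attn = ConditionalAttention()
    out = attn(torch.randn(2, 10, 64), torch.randn(2, 16), torch.randn(2, 16))
    print(out.shape)   # (2, 10, 64)
```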
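Claims 7 and 8 describe a variant with projection bias terms and an attention mask matrix $M$ added before the softmax. The sketch below shows only how that variant differs from the previous one; the additive placement of the bias terms and the shape of $M$ are assumptions.

```python
# Sketch of the masked variant (claims 7-8), assuming PyTorch; the bias terms
# b_q, b_k and the mask M are placed in assumed positions (bias on the
# modulation, mask added to the attention scores before the softmax).
import math
import torch


def masked_conditional_attention(Q, K, V, r_spa_proj, r_sem_proj, b_q, b_k, M):
    # Q, K, V: (batch, tokens, d); r_*_proj: (batch, d) already-projected routing vectors.
    Q_mod = Q * (r_spa_proj + b_q).unsqueeze(1)   # first projection bias term (assumed additive)
    K_mod = K * (r_sem_proj + b_k).unsqueeze(1)   # second projection bias term (assumed additive)
    d_k = Q.size(-1)
    scores = Q_mod @ K_mod.transpose(-2, -1) / math.sqrt(d_k) + M   # attention mask matrix
    return torch.softmax(scores, dim=-1) @ V


if __name__ == "__main__":
    B, T, d = 2, 6, 32
    M = torch.zeros(B, T, T)
    M[:, :, -2:] = float("-inf")                   # e.g. mask out padded positions
    out = masked_conditional_attention(
        torch.randn(B, T, d), torch.randn(B, T, d), torch.randn(B, T, d),
        torch.randn(B, d), torch.randn(B, d),
        torch.zeros(d), torch.zeros(d), M)
    print(out.shape)   # (2, 6, 32)
```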
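For the modal gating of claims 9 and 10, a sketch using the sigmoid gate reconstructed in claim 10. Applying one scalar weight per modality is an assumption; the patent's modal weight vector could equally be per-dimension.

```python
# Sketch of modal gating (claims 9-10), assuming PyTorch and one scalar gate
# per modality: F~_i = sigmoid(w_i) * F_i.
import torch


def gate_modalities(modal_features, modal_weights):
    # modal_features: list of (batch, d) pooled features, one per modality.
    # modal_weights: (batch, n_modal) modal weight vector from the routing network.
    gates = torch.sigmoid(modal_weights)                     # relative importance per modality
    return [gates[:, i:i + 1] * feat for i, feat in enumerate(modal_features)]


if __name__ == "__main__":
    feats = [torch.randn(2, 64) for _ in range(3)]           # e.g. vision / audio / text features
    gated = gate_modalities(feats, torch.randn(2, 3))
    print([g.shape for g in gated])
```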
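Finally, a sketch of the dynamic fuser of claims 11 to 13: the fusion control vector is normalized into strategy weights (softmax, as reconstructed in claim 12) and used to mix a small set of predefined fusion strategies over the gated features. The three example strategies (concatenate-and-project, sum, max) are purely illustrative assumptions; the patent does not list its predefined strategies in this text.

```python
# Sketch of the dynamic fuser (claims 11-13), assuming PyTorch. The strategy
# weights alpha = softmax(r_fuse) mix S predefined fusion strategies; the
# concrete strategies below are illustrative assumptions.
import torch
import torch.nn as nn


class DynamicFuser(nn.Module):
    def __init__(self, d=64, n_modal=3):
        super().__init__()
        self.concat_proj = nn.Linear(n_modal * d, d)   # Strategy_1: concatenate then project

    def forward(self, gated_features, r_fuse):
        # gated_features: list of (batch, d); r_fuse: (batch, S) fusion control vector.
        stacked = torch.stack(gated_features, dim=1)                  # (batch, n_modal, d)
        strategies = [
            self.concat_proj(torch.cat(gated_features, dim=-1)),     # Strategy_1
            stacked.sum(dim=1),                                       # Strategy_2: additive fusion
            stacked.max(dim=1).values,                                # Strategy_3: max pooling
        ]
        alpha = torch.softmax(r_fuse, dim=-1)                         # strategy weights, sum to 1
        # F_fuse = sum_j alpha_j * Strategy_j(gated features)
        return sum(alpha[:, j:j + 1] * s for j, s in enumerate(strategies))


if __name__ == "__main__":
    fuser = DynamicFuser()
    gated = [torch.randn(2, 64) for _ in range(3)]
    fused = fuser(gated, torch.randn(2, 3))    # S = 3 strategies assumed
    print(fused.shape)   # (2, 64); downstream, the large model generates the interaction information
```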
Description
Interaction method based on multi-mode large model, storage medium and electronic device

Technical Field
The application relates to the technical field of smart homes, and in particular to an interaction method based on a multi-modal large model, a storage medium, and an electronic device.

Background
In the related art, a multi-modal large model faces the following technical bottlenecks when processing user instructions.

1) Static, coarse-grained fusion. The multi-modal large model usually adopts a fixed fusion architecture (such as early fusion, intermediate fusion, or late fusion), so all information of all modalities interacts in the same way regardless of the task indicated by the user instruction. For example, when handling the task "describe this picture", the model analyzes all details of the entire picture; but when handling the specific question "how many cats are in the picture?", it still processes the whole picture in the same way instead of focusing on the regions relevant to the question.

2) Modal interference and noise in multi-modal information. Not all modalities are equally important to the current task. For example, when answering a video-based "what happened" question, visual information may dominate while the audio may be background music, i.e., noise. Because the multi-modal large model treats all modalities equally, a secondary or noisy modality easily interferes with the judgment based on the primary modality, reducing reasoning accuracy.

3) Task intention understanding bias. The multi-modal large model's understanding of the user instruction is relatively decoupled from the subsequent multi-modal information processing flow. The large model may understand the user instruction, but its internal components, such as the visual encoder, are not precisely "guided" to focus on the information most relevant to the user instruction, resulting in a mismatch between intent and perception.

That is, in the related art, the fusion architecture of the multi-modal large model is fixed, all modal information is treated equally, and task understanding is relatively decoupled from the multi-modal information processing flow, so the multi-modal large model cannot dynamically process multi-modal information based on the user instruction. For this problem and similar problems in the related art, no effective solution has been proposed. Accordingly, improvements are needed to overcome the drawbacks of the related art.

Disclosure of Invention
The embodiments of the application provide an interaction method based on a multi-modal large model, a storage medium, and an electronic device, which at least solve the problem in the related art that a multi-modal large model cannot dynamically process multi-modal information based on user instructions.
According to one aspect of the embodiments of the application, an interaction method based on a multi-modal large model is provided. The method comprises: performing deep semantic parsing on a user instruction through a routing network to generate a routing vector group, wherein the routing vector group comprises an encoder control vector and a fuser control vector, and the encoder control vector is used for controlling at least one modal encoder in the multi-modal large model; modulating the self-attention mechanism of each modal encoder of the at least one modal encoder through the encoder control vector to generate modal features of the target modal information input to that modal encoder; and controlling the multi-modal large model, through the fuser control vector, to fuse the modal features of the multi-modal information corresponding to the user instruction and generate interaction information corresponding to the user instruction, wherein the modal features of the multi-modal information comprise the modal features of the target modal information.

In an exemplary embodiment, performing deep semantic parsing on the user instruction through the routing network to generate the routing vector group comprises: converting the user instruction into a target vector through the text encoder and the multi-layer perceptron; slicing the target vector along a preset dimension through the vector decomposition layer to obtain a plurality of slice vectors; and processing each slice vector with its corresponding activation function to obtain the routing vector group, wherein the routing network comprises the text encoder, the multi-layer perceptron connected to the text encoder, and the vector decomposition layer connected to the multi-layer perceptron.

In an exemplary embodiment, the encoder control vector is used for modulating the self-attention mechanism of each modal encoder of the at least one modal encoder to generate the modal features of the target modal information input to that modal encoder.