CN-122021659-A - Image-text question-answering method and system based on mixed expert architecture visual language model
Abstract
The application provides an image-text question-answering method and system based on a mixture-of-experts visual language model. The method comprises: constructing a calibration data set and inputting it into the visual language model, and generating visual Tokens and text Tokens through a modal encoder in the visual language model; extracting the activation distribution characteristics of each expert module based on the visual and text Tokens; obtaining the quantization error weight of each expert module from the activation distribution characteristics; obtaining the affinity of the visual and text Tokens for each expert module from the activation distribution characteristics; quantizing the visual language model based on the affinities and the quantization error weights to obtain a quantized visual language model; and inputting an image to be analyzed and a text into the quantized visual language model to generate an image-text interactive question-answering result. The application can realize high-precision inference of the visual language model under a low-bit budget, thereby realizing accurate and fast image-text question answering.
Inventors
- Zhang Yulun
- Qin Guangshuo
- Li Zhiteng
- Kong Linghe
- Yang Xiaokang
Assignees
- Shanghai Jiao Tong University (上海交通大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260127
Claims (10)
- 1. An image-text question-answering method based on a mixture-of-experts visual language model, characterized by comprising the following steps: acquiring an image to be analyzed and a text interaction instruction, and determining a pre-trained visual language model; constructing a calibration data set and inputting it into the pre-trained visual language model, and generating visual Tokens and text Tokens through a modal encoder built into the visual language model, wherein the visual language model is constructed on a mixture-of-experts architecture and comprises a plurality of mixture-of-experts layers, each comprising a plurality of expert modules arranged in parallel; extracting, based on the visual and text Tokens, the activation distribution characteristics of each expert module for the different modalities; acquiring the quantization error weight of each expert module based on the activation distribution characteristics; obtaining the affinity of the visual and text Tokens to each expert module based on the activation distribution characteristics, and quantizing the pre-trained visual language model based on the affinities and the quantization error weights to obtain a quantized visual language model; and inputting the image to be analyzed and the text interaction instruction into the quantized visual language model to determine and generate a corresponding image-text interactive question-answering result.
- 2. The image-text question-answering method based on a mixture-of-experts visual language model according to claim 1, wherein constructing the calibration data set and inputting it into the pre-trained visual language model, and generating the visual and text Tokens through the modal encoder built into the visual language model, comprises: constructing a calibration data set comprising a number of image-text pairs; inputting the calibration data set into the pre-trained visual language model, which comprises a visual encoder, an embedder and a large language model based on the mixture-of-experts architecture, wherein the large language model is composed of a plurality of stacked mixture-of-experts layers, each comprising a gating network and a plurality of expert modules; processing the image in each image-text pair through the visual encoder to extract visual features; aligning the visual features in dimension and semantics with the text embedding space of the large language model through the embedder to generate visual Tokens; and processing the text in each image-text pair through the text encoder of the large language model to generate text Tokens.
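The modal encoding of claim 2 can be sketched in NumPy. This is a minimal illustration under the assumption that the embedder is a single projection matrix and the text encoder is an embedding lookup; the names `proj` and `embed_table` are hypothetical, not taken from the patent.

```python
import numpy as np

def encode_image_text_pair(image_features, text_ids, proj, embed_table):
    """Sketch of the modal encoder in claim 2.

    image_features: (n_patches, d_vis) output of the visual encoder.
    proj:           (d_vis, d_model) hypothetical embedder matrix that aligns
                    visual features with the LLM text embedding space.
    text_ids:       token ids of the paired text.
    embed_table:    (vocab, d_model) text embedding table of the LLM.
    Returns the concatenated sequence of visual and text Tokens.
    """
    visual_tokens = image_features @ proj          # dimension/semantic alignment
    text_tokens = embed_table[np.asarray(text_ids)]  # text-encoder lookup
    return np.concatenate([visual_tokens, text_tokens], axis=0)
```

The concatenated sequence is what the stacked mixture-of-experts layers then consume.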
- 3. The image-text question-answering method based on a mixture-of-experts visual language model according to claim 1, wherein extracting the activation distribution characteristics of each expert module for the visual and text modalities based on the visual and text Tokens comprises: assigning a label to each Token, the visual Tokens receiving a visual label and the text Tokens a text label; inputting the labeled visual and text Tokens into each mixture-of-experts layer of the large language model; providing the input of each expert module with an independent visual counter and text counter and, after the gating network of a given mixture-of-experts layer selects an expert module for a single labeled Token, incrementing the counter of the selected expert module according to the Token's label type; for each labeled Token input to a mixture-of-experts layer, acquiring the gating value output by that layer's gating network for each expert module; and taking the accumulated total of the visual counter as the visual Token activation count, the accumulated total of the text counter as the text Token activation count, and the visual Token activation count, the text Token activation count and the gating values of the expert module together as the activation distribution characteristics.
- 4. The image-text question-answering method based on a mixture-of-experts visual language model according to claim 3, wherein incrementing the counter of the selected expert module according to the Token's label type comprises: if the Token's label is a visual label, incrementing the visual counter of the selected expert module by 1; and if the Token's label is a text label, incrementing the text counter of the selected expert module by 1.
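The per-expert counting of claims 3-4 amounts to two histograms over the gating decisions. A minimal sketch, assuming the gating network's top-1 choices are already available as a list of expert indices:

```python
import numpy as np

def count_modal_activations(token_labels, expert_choices, num_experts):
    """Visual/text activation counters per expert module (claims 3-4).

    token_labels:   'visual' or 'text' for each Token.
    expert_choices: expert index selected by the gating network per Token.
    Returns (visual_counts, text_counts), each of length num_experts.
    """
    visual = np.zeros(num_experts, dtype=int)
    text = np.zeros(num_experts, dtype=int)
    for label, expert in zip(token_labels, expert_choices):
        if label == "visual":
            visual[expert] += 1     # visual counter of the selected expert
        else:
            text[expert] += 1       # text counter of the selected expert
    return visual, text
```

The accumulated totals are the visual and text Token activation counts used in claim 5.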
- 5. The image-text question-answering method based on a mixture-of-experts visual language model according to claim 3, wherein acquiring the quantization error weight of each expert module based on the activation distribution characteristics comprises: for each expert module, multiplying the visual Token activation count by a preset first parameter to obtain a first product, multiplying the text Token activation count by a preset second parameter to obtain a second product, and taking the sum of the two products as the importance of the expert module; and, within each mixture-of-experts layer, ranking all expert modules from high to low importance and assigning each expert module a quantization error weight according to its relative importance, the quantization error weight being a linear function of the importance.
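Claim 5's importance score and linear weight assignment can be sketched as follows. The weight range `[w_min, w_max]` is an assumed hyper-parameter, since the patent only states that the weight is linear in the importance:

```python
import numpy as np

def quantization_error_weights(visual_counts, text_counts,
                               alpha=1.0, beta=1.0, w_min=0.5, w_max=1.5):
    """Importance = alpha*visual + beta*text; weights scale linearly with it."""
    importance = (alpha * np.asarray(visual_counts, dtype=float)
                  + beta * np.asarray(text_counts, dtype=float))
    lo, hi = importance.min(), importance.max()
    if hi == lo:                          # all experts equally important
        return np.full(importance.shape, (w_min + w_max) / 2)
    # min-max scale: the most important expert gets the largest error weight
    return w_min + (w_max - w_min) * (importance - lo) / (hi - lo)
```

A larger weight penalizes quantization error on that expert more heavily in the loss of claim 7.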
- 6. The image-text question-answering method based on a mixture-of-experts visual language model according to claim 3, wherein obtaining the affinity of the visual and text Tokens to each expert module based on the activation distribution characteristics, and quantizing the pre-trained visual language model based on the affinities and the quantization error weights to obtain the quantized visual language model, comprises: sequentially applying temperature calibration, sparse calibration and normalization to each gating value to obtain the affinity of each labeled Token to each expert module; defining a quantization loss function based on the quantization error weights; constructing an enhanced Hessian matrix from the quantization loss function and the affinities, and inverting it to obtain an enhanced inverse Hessian; and quantizing and updating the expert weights and quantization parameters of each expert module based on the enhanced inverse Hessian, thereby quantizing the pre-trained visual language model.
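The affinity computation in claim 6 (temperature calibration, then sparse calibration, then normalization) can be sketched as below. The temperature value and the top-k sparsification rule are assumptions, since the patent does not fix them:

```python
import numpy as np

def token_affinity(gate_logits, temperature=2.0, top_k=2):
    """Affinity of one labeled Token to each expert module (claim 6 sketch)."""
    z = np.asarray(gate_logits, dtype=float) / temperature   # temperature calibration
    p = np.exp(z - z.max())                                  # stable softmax
    p /= p.sum()
    p[np.argsort(p)[:-top_k]] = 0.0                          # sparse calibration
    return p / p.sum()                                       # normalization
```

The resulting per-Token distribution feeds the enhanced Hessian of claim 8.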
- 7. The image-text question-answering method based on a mixture-of-experts visual language model according to claim 6, wherein the expression of the quantization loss function is as follows: L = L' + Σ_{g=1}^{G} Σ_{j=1}^{M} w_{g,j} ‖(W_{g,j} − Ŵ_{g,j}) X_{g,j}‖_F²; wherein L is the total loss; L' is the task loss of the visual language model; g is the index of a mixture-of-experts layer and G the total number of mixture-of-experts layers; M is the number of expert modules in a single mixture-of-experts layer and j the index of an expert module within the g-th layer; w_{g,j} is the quantization error weight of the j-th expert module in the g-th mixture-of-experts layer; W_{g,j} is the weight matrix of the j-th expert module in the g-th layer; X_{g,j} is the input of the j-th expert module in the g-th layer; Ŵ_{g,j} is the quantized weight matrix of the j-th expert module in the g-th layer; and ‖·‖_F² denotes the squared Frobenius norm of the quantization error.
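A numeric sketch of the weighted quantization loss described in claim 7, with hypothetical toy matrices standing in for the expert weights and inputs:

```python
import numpy as np

def quantization_loss(task_loss, weights, W_list, W_hat_list, X_list):
    """L = L' + sum over experts of w * ||(W - W_hat) X||_F^2 (claim 7).

    weights, W_list, W_hat_list, X_list are parallel lists over the
    (layer, expert) pairs: error weight, original weight matrix,
    quantized weight matrix, and calibration input of each expert.
    """
    total = float(task_loss)
    for w, W, W_hat, X in zip(weights, W_list, W_hat_list, X_list):
        err = (W - W_hat) @ X                       # quantization error on the input
        total += w * np.linalg.norm(err, "fro") ** 2
    return total
```

Experts with larger error weights from claim 5 thus contribute more to the total loss.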
- 8. The image-text question-answering method based on a mixture-of-experts visual language model according to claim 6, wherein the expression of the enhanced Hessian matrix is as follows: H = Σ_i (1 + λ a_{i,j}) ∂²L_i/∂W²; wherein H represents the enhanced Hessian matrix of each mixture-of-experts layer; ∂²L_i/∂W² represents the second-order partial derivatives of the quantization loss function with respect to the expert weights for the i-th Token; a_{i,j} represents the affinity of the i-th Token to the j-th expert module; and λ represents a preset modal weighting factor.
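Claim 8's affinity-enhanced Hessian and its inverse can be sketched as follows. This assumes a GPTQ-style quadratic quantization loss, for which the per-token second-order term reduces to the outer product x_i x_iᵀ of the expert input; the damping term is an added numerical-stability assumption, not part of the claim:

```python
import numpy as np

def enhanced_hessian_inverse(X, affinities, lam=0.5, damp=1e-2):
    """Enhanced inverse Hessian for one expert module (claim 8 sketch).

    X:          (n_tokens, d) calibration inputs routed to this expert.
    affinities: (n_tokens,) affinity a_ij of each Token to this expert.
    H = sum_i (1 + lam * a_ij) * x_i x_i^T, damped, then inverted.
    """
    w = 1.0 + lam * np.asarray(affinities, dtype=float)
    H = (X * w[:, None]).T @ X                       # affinity-weighted outer products
    H += damp * np.mean(np.diag(H)) * np.eye(X.shape[1])  # damping for invertibility
    return np.linalg.inv(H)
```

The inverse is what drives the per-column weight and quantization-parameter updates of claim 6.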
- 9. An image-text question-answering system based on a mixture-of-experts visual language model, characterized by comprising: an acquisition module for acquiring an image to be analyzed and a text interaction instruction and determining a pre-trained visual language model; a multi-modal encoding module for constructing a calibration data set, inputting it into the pre-trained visual language model and generating visual and text Tokens through a modal encoder built into the visual language model, the visual language model being constructed on a mixture-of-experts architecture and comprising a plurality of mixture-of-experts layers, each comprising a plurality of expert modules arranged in parallel; a feature extraction module for extracting the activation distribution characteristics of the expert modules for the different modalities based on the visual and text Tokens; a weight generation module for acquiring the quantization error weight of each expert module based on the activation distribution characteristics; a quantization module for obtaining the affinity of the visual and text Tokens to each expert module based on the activation distribution characteristics, and quantizing the pre-trained visual language model based on the affinities and the quantization error weights to obtain a quantized visual language model; and a question-answering module for inputting the image to be analyzed and the text interaction instruction into the quantized visual language model to determine and generate a corresponding image-text interactive question-answering result.
- 10. An electronic device, comprising: at least one memory for storing program instructions; and at least one processor for invoking the program instructions stored in the memory and performing, in accordance with the obtained program instructions, the steps of the method according to any one of claims 1 to 8.
Description
Image-text question-answering method and system based on mixed expert architecture visual language model
Technical Field
The application relates to the technical field of deep learning, in particular to an image-text question-answering method and system based on a mixture-of-experts visual language model.
Background
With the breakthrough progress of large-scale Visual Language Models (VLMs) on tasks such as multi-modal understanding, image description generation, interactive question answering and cross-modal reasoning, model scale is continuously expanding. The current mainstream visual language model consists of a visual encoder, an embedder and a text-only large language model. On the language-model side, the Mixture-of-Experts (MoE) structure has emerged in recent years: by introducing a sparse expert routing mechanism, it achieves a good balance between model capacity and computational cost, remarkably improving model performance, generalization and scalability. In recently released visual language models, the text large language model part likewise adopts a mixture-of-experts structure. However, the rapid development of mixture-of-experts visual language models also presents new challenges: 1. extremely large parameter counts (on the order of billions of parameters), far exceeding ordinary models in memory and computing requirements; 2. the sparse expert structure causes imbalanced parameter distributions, making quantization more difficult than for dense models; 3. practical deployment is limited, as mobile, edge and embedded terminals still struggle to run such models efficiently; 4. inference costs are too high to achieve adequate throughput improvement on resource-constrained GPU clusters.
A prior-art search locates the Chinese patent with publication number CN120996211A, which provides a remote-sensing multi-modal reasoning method based on a mixture-of-experts mechanism. It uses a visual encoder, a visual projection layer and a word embedding layer as core modules: the visual projection layer maps visual tokens into a feature space aligned with the hidden size of a large language model, ensuring compatibility of visual and text features, while the word embedding layer generates the text token sequence, providing an efficient representation for multi-modal feature fusion. Through deep fusion of visual and text features, it remarkably improves the understanding of complex scenes while maintaining efficiency and generality. However, that method cannot solve the uneven parameter distribution caused by the sparse expert structure, which leads to severe loss of quantization precision. There is therefore a need for an image-text question-answering method based on a quantized visual language model that, for the special architecture of the newly emerging mixture-of-experts visual language models, minimizes the influence of quantization on model performance, overcoming the insufficient applicability and severe quantization performance loss of traditional methods and thereby realizing accurate and fast image-text question answering.
Disclosure of Invention
In view of the above defects in the prior art, the application aims to provide an image-text question-answering method and system based on a mixture-of-experts visual language model.
According to a first aspect of the present application, there is provided an image-text question-answering method based on a mixture-of-experts visual language model, comprising: acquiring an image to be analyzed and a text interaction instruction, and determining a pre-trained visual language model; constructing a calibration data set and inputting it into the pre-trained visual language model, and generating visual Tokens and text Tokens through a modal encoder built into the visual language model, wherein the visual language model is constructed on a mixture-of-experts architecture and comprises a plurality of mixture-of-experts layers, each comprising a plurality of expert modules arranged in parallel; extracting, based on the visual and text Tokens, the activation distribution characteristics of each expert module for the different modalities; acquiring the quantization error weight of each expert module based on the activation distribution characteristics; obtaining the affinity of the visual and text Tokens to each expert module based on the activation distribution characteristics, and quantizing the pre-trained visual language model based on the affinities and the quantization error weights to obtain a quantized visual language model; and inputting the image to be analyzed and the text interaction instruction into the quantized visual language model to determine and generate a corresponding image-text interactive question-answering result.