
CN-121980538-A - Data processing method, device, equipment and storage medium based on visual language model


Abstract

The invention discloses a data processing method, apparatus, device, and storage medium based on a visual language model, relating to the fields of artificial intelligence, deep learning, and computer vision, and applicable to the finance and medical fields. The method comprises: acquiring an input image and an input text; inputting the input image into a pre-trained first encoder to obtain a discrete visual feature sequence; sequentially inputting the discrete visual feature sequence into a pre-trained feature mapping network and a pre-trained second encoder to obtain a continuous visual feature sequence; encoding the input text to obtain a text feature sequence; fusing the continuous visual feature sequence and the text feature sequence to obtain a multi-modal feature sequence; and inputting the multi-modal feature sequence into a visual language model for processing to obtain a task result associated with the input image and the input text. The scheme solves the problem that, under a single autoregressive model framework, generation tasks and understanding tasks are difficult to coordinate due to their differing objectives, and improves overall performance and generalization adaptability.
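The processing pipeline summarized above can be sketched end to end. This is a minimal numpy illustration only: all dimensions, the nearest-neighbour codebook lookup in the first encoder, the single linear layer standing in for the feature mapping network, and the `tanh` standing in for the second encoder are assumptions for demonstration; the patent does not specify these architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not specified in the patent text)
CODEBOOK_SIZE, CODE_DIM = 16, 8       # first encoder's fixed codebook
CONT_DIM, TEXT_DIM = 12, 12           # continuous visual / text feature dims

codebook = rng.normal(size=(CODEBOOK_SIZE, CODE_DIM))

def first_encoder(image_patches):
    """Map each patch to its nearest codebook entry -> discrete visual sequence."""
    dists = np.linalg.norm(image_patches[:, None, :] - codebook[None, :, :], axis=-1)
    ids = dists.argmin(axis=1)
    return codebook[ids]                       # (n_patches, CODE_DIM)

W_map = rng.normal(size=(CODE_DIM, CONT_DIM))  # stand-in feature mapping network

def second_encoder(x):
    """Stand-in for the pre-trained second encoder: a simple nonlinearity."""
    return np.tanh(x)

def fuse(visual_seq, text_seq):
    """Concatenate visual and text features into one multi-modal sequence."""
    return np.concatenate([visual_seq, text_seq], axis=0)

# Toy forward pass
patches = rng.normal(size=(4, CODE_DIM))       # 4 image patches
discrete = first_encoder(patches)              # discrete visual feature sequence
continuous = second_encoder(discrete @ W_map)  # continuous visual feature sequence
text_feats = rng.normal(size=(3, TEXT_DIM))    # 3 encoded text tokens
multimodal = fuse(continuous, text_feats)      # input to the visual language model
print(multimodal.shape)                        # (7, 12)
```

The multi-modal sequence would then be fed to the visual language model for autoregressive processing, which is omitted here.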

Inventors

  • ZHANG XULONG
  • LU RENJIE

Assignees

  • Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)

Dates

Publication Date
2026-05-05
Application Date
2026-01-20

Claims (10)

  1. A data processing method based on a visual language model, comprising: acquiring an input image and an input text; inputting the input image into a pre-trained first encoder to obtain a discrete visual feature sequence; sequentially inputting the discrete visual feature sequence into a pre-trained feature mapping network and a pre-trained second encoder to obtain a continuous visual feature sequence; encoding the input text to obtain a text feature sequence; fusing the continuous visual feature sequence and the text feature sequence to obtain a multi-modal feature sequence; and inputting the multi-modal feature sequence into a visual language model for processing to obtain a task result associated with the input image and the input text.
  2. The method of claim 1, wherein inputting the input image into a pre-trained first encoder to obtain a discrete visual feature sequence comprises: dividing the input image into a plurality of image blocks; encoding each image block separately to obtain a latent feature vector corresponding to each image block; mapping each latent feature vector to a predefined fixed codebook to obtain a corresponding discrete codebook embedding vector; and arranging the corresponding discrete codebook embedding vectors according to the spatial arrangement order of the image blocks in the input image to obtain the discrete visual feature sequence.
  3. The method of claim 1, wherein fusing the continuous visual feature sequence and the text feature sequence to obtain a multi-modal feature sequence comprises: performing a linear projection transformation on the continuous visual feature sequence to obtain a transformed continuous visual feature sequence, wherein the dimension of the transformed continuous visual feature sequence is the same as the dimension of the text feature sequence; adding a visual start marker and a visual end marker at the beginning and end of the transformed continuous visual feature sequence, respectively, to obtain a marked continuous visual feature sequence; and concatenating the marked continuous visual feature sequence and the text feature sequence in a preset order to obtain the multi-modal feature sequence.
  4. The method of claim 1, wherein inputting the multi-modal feature sequence into a visual language model for processing to obtain a task result associated with the input image and the input text comprises: inputting the multi-modal feature sequence into the visual language model; processing the multi-modal feature sequence with the visual language model in an autoregressive manner to generate an output sequence; and determining the task result according to the task type, wherein if the task is a visual understanding task, the output sequence serves as a text-form answer, and if the task is a visual generation task, the output sequence is input into an image decoder to generate a target image.
  5. The method of claim 1, wherein sequentially inputting the discrete visual feature sequence into a pre-trained feature mapping network and a pre-trained second encoder to obtain a continuous visual feature sequence comprises: acquiring at least one supplemental image associated with the input image; inputting the supplemental image into the pre-trained first encoder to obtain a supplemental discrete visual feature sequence; aggregating the supplemental discrete visual feature sequence and the discrete visual feature sequence to obtain an aggregated discrete visual feature sequence; and sequentially inputting the aggregated discrete visual feature sequence into the pre-trained feature mapping network and the pre-trained second encoder to obtain the continuous visual feature sequence.
  6. The method of claim 1, wherein the feature mapping network and the second encoder are obtained by a training process comprising: acquiring a plurality of training samples, each training sample comprising a training image and a corresponding description text; for each training sample, performing the following steps: inputting the training image into a pre-trained first encoder to obtain a training discrete visual feature sequence; sequentially inputting the training discrete visual feature sequence into a feature mapping network to be trained and a second encoder to be trained to obtain a training continuous visual feature sequence, and determining a first global semantic representation of the training image based on the training continuous visual feature sequence; calculating a distillation loss from the first global semantic representation and a second global feature representation extracted from the same training image by a pre-trained teacher visual encoder; calculating a contrastive loss from the first global semantic representation and a text feature representation extracted from the description text of the same training image by a pre-trained teacher text encoder; determining a joint optimization loss based on the distillation loss and the contrastive loss; updating the parameters of the feature mapping network and the second encoder with the goal of minimizing the joint optimization loss while keeping the parameters of the pre-trained first encoder unchanged; and iteratively executing the above steps until training converges to obtain the trained feature mapping network and the trained second encoder.
  7. A data processing apparatus based on a visual language model, comprising: an acquisition module for acquiring an input image and an input text; a first processing module for inputting the input image into a pre-trained first encoder to obtain a discrete visual feature sequence; a second processing module for sequentially inputting the discrete visual feature sequence into a pre-trained feature mapping network and a pre-trained second encoder to obtain a continuous visual feature sequence; an encoding module for encoding the input text to obtain a text feature sequence; a fusion module for fusing the continuous visual feature sequence and the text feature sequence to obtain a multi-modal feature sequence; and a third processing module for inputting the multi-modal feature sequence into a visual language model for processing to obtain a task result associated with the input image and the input text.
  8. The apparatus of claim 7, wherein the first processing module is further configured to: divide the input image into a plurality of image blocks; encode each image block separately to obtain a latent feature vector corresponding to each image block; map each latent feature vector to a predefined fixed codebook to obtain a corresponding discrete codebook embedding vector; and arrange the corresponding discrete codebook embedding vectors according to the spatial arrangement order of the image blocks in the input image to obtain the discrete visual feature sequence.
  9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the data processing method based on a visual language model according to any one of claims 1 to 6 when executing the computer program.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the data processing method based on a visual language model according to any one of claims 1 to 6.
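The joint training objective in claim 6 (distillation loss against a teacher visual encoder plus contrastive loss against a teacher text encoder) is not given a concrete form in the claims. Below is a minimal numpy sketch under stated assumptions: mean-squared error for the distillation term, symmetric InfoNCE for the contrastive term, and a simple weighted sum for the joint loss are all illustrative choices, not the patent's specified formulas.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def distillation_loss(student_global, teacher_global):
    """One plausible choice: MSE between student and teacher global representations."""
    return float(np.mean((student_global - teacher_global) ** 2))

def contrastive_loss(student_global, teacher_text, temperature=0.07):
    """Symmetric InfoNCE over a batch: matching image/text pairs lie on the diagonal."""
    img = l2_normalize(student_global)
    txt = l2_normalize(teacher_text)
    logits = img @ txt.T / temperature             # (B, B) similarity matrix
    labels = np.arange(len(logits))

    def ce(lg):
        lg = lg - lg.max(axis=1, keepdims=True)    # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()        # cross-entropy on the diagonal

    return float((ce(logits) + ce(logits.T)) / 2)

def joint_loss(student_global, teacher_global, teacher_text, alpha=0.5):
    """Weighted combination; the weighting scheme is an assumption."""
    return alpha * distillation_loss(student_global, teacher_global) \
        + (1 - alpha) * contrastive_loss(student_global, teacher_text)

rng = np.random.default_rng(1)
s = rng.normal(size=(4, 16))     # student global representations for a batch of 4
loss = joint_loss(s, s, s)       # perfectly aligned student/teacher: low loss
```

Per the claim, only the feature mapping network and the second encoder would receive gradient updates from this loss; the first encoder and both teachers stay frozen.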

Description

Data processing method, device, equipment and storage medium based on visual language model

Technical Field

The invention relates to the fields of artificial intelligence, deep learning, and computer vision, can be applied to the finance and medical fields, and in particular relates to a data processing method, apparatus, device, and storage medium based on a visual language model.

Background

In recent years, constructing an autoregressive model framework capable of uniformly handling multi-modal understanding and generation tasks has become an important research direction in artificial intelligence, with urgent application demands in fields with extremely high accuracy requirements for information processing, such as finance and medical treatment. For example, in financial analysis the model is required to extract trend information from complex charts while generating analysis reports, and in medical diagnosis the model is required to produce a structured diagnostic description while interpreting medical images. Such applications require models that combine accurate visual understanding with reliable text generation within a single framework. However, it is difficult in the prior art to effectively coordinate these two types of tasks within the unified autoregressive model framework described above.
Mainstream schemes generally follow one of two paths. The first encodes an image into a discrete token sequence using techniques such as vector quantization; this preserves details that support generation tasks, but the resulting features lack high-level semantics, which limits understanding capability. The second discretizes the continuous semantic features extracted by a pre-trained visual encoder; this improves semantic understanding, but there is an inherent conflict between the pixel-level image reconstruction objective and the semantic alignment objective, which often causes training instability and information loss. Neither scheme can, within a single autoregressive model framework, achieve both the detail preservation required by generation tasks and the semantic extraction required by understanding tasks, so the objectives of the two task types are difficult to optimize cooperatively.

Disclosure of Invention

The embodiments of the invention provide a data processing method, apparatus, device, and storage medium based on a visual language model, which solve the problem that generation tasks and understanding tasks are difficult to coordinate due to their differing objectives under a single autoregressive model framework.
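The first path discussed above, which also underlies the first encoder of this scheme (dividing the image into blocks, encoding each block to a latent vector, and quantizing it against a fixed codebook), can be sketched as follows. The codebook size, patch size, and the single linear projection standing in for the block encoder are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
codebook = rng.normal(size=(32, 12))   # 32 entries of dim 12 (sizes are illustrative)

def patchify(image, patch=4):
    """Split an HxW image into non-overlapping patch x patch blocks, row-major."""
    h, w = image.shape
    blocks = [image[i:i + patch, j:j + patch].ravel()
              for i in range(0, h, patch) for j in range(0, w, patch)]
    return np.stack(blocks)            # (n_blocks, patch*patch), in spatial order

def encode_blocks(blocks, W):
    """Toy 'latent feature' extractor: a fixed linear projection of raw pixels."""
    return blocks @ W

def quantize(latents):
    """Nearest-neighbour lookup in the fixed codebook (vector quantization)."""
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d.argmin(axis=1)
    return ids, codebook[ids]          # token indices + embedded vectors

W = rng.normal(size=(16, 12))          # 4x4 patch -> 12-dim latent
image = rng.normal(size=(8, 8))        # toy 8x8 image -> 4 patches
latents = encode_blocks(patchify(image), W)
ids, discrete_seq = quantize(latents)
print(ids.shape, discrete_seq.shape)   # (4,) (4, 12)
```

Because the blocks are enumerated row-major, the resulting discrete sequence already follows the spatial arrangement order required by claim 2.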
In a first aspect, a data processing method based on a visual language model is provided, comprising: acquiring an input image and an input text; inputting the input image into a pre-trained first encoder to obtain a discrete visual feature sequence; sequentially inputting the discrete visual feature sequence into a pre-trained feature mapping network and a pre-trained second encoder to obtain a continuous visual feature sequence; encoding the input text to obtain a text feature sequence; fusing the continuous visual feature sequence and the text feature sequence to obtain a multi-modal feature sequence; and inputting the multi-modal feature sequence into a visual language model for processing to obtain a task result associated with the input image and the input text.

In a second aspect, a data processing apparatus based on a visual language model is provided, comprising: an acquisition module for acquiring an input image and an input text; a first processing module for inputting the input image into a pre-trained first encoder to obtain a discrete visual feature sequence; a second processing module for sequentially inputting the discrete visual feature sequence into a pre-trained feature mapping network and a pre-trained second encoder to obtain a continuous visual feature sequence; an encoding module for encoding the input text to obtain a text feature sequence; a fusion module for fusing the continuous visual feature sequence and the text feature sequence to obtain a multi-modal feature sequence; and a third processing module for inputting the multi-modal feature sequence into a visual language model for processing to obtain a task result associated with the input image and the input text.
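The fusion step of the first aspect is elaborated in claim 3: a linear projection to the text feature dimension, start/end markers around the visual span, then concatenation. A minimal numpy sketch follows; the dimensions, the random stand-ins for the learned projection and marker embeddings, and the visual-before-text ordering are all assumptions (the claim only requires "a preset order").

```python
import numpy as np

rng = np.random.default_rng(3)
TEXT_DIM = 10                              # illustrative text feature dimension

# Learnable pieces (random stand-ins here): projection + boundary markers
W_proj = rng.normal(size=(6, TEXT_DIM))    # maps 6-dim visual feats -> TEXT_DIM
bov = rng.normal(size=(1, TEXT_DIM))       # "begin of vision" marker embedding
eov = rng.normal(size=(1, TEXT_DIM))       # "end of vision" marker embedding

def fuse(visual_seq, text_seq):
    """Project visual features to the text dimension, wrap them in start/end
    markers, then prepend the marked visual span to the text sequence."""
    v = visual_seq @ W_proj                        # (Nv, TEXT_DIM)
    marked = np.concatenate([bov, v, eov], axis=0)
    return np.concatenate([marked, text_seq], axis=0)

visual = rng.normal(size=(5, 6))    # 5 continuous visual features
text = rng.normal(size=(3, TEXT_DIM))
multimodal = fuse(visual, text)
print(multimodal.shape)             # (10, 10): [bov] + 5 visual + [eov] + 3 text
```

The markers let the downstream autoregressive model tell where the visual span ends and the text begins within the single fused sequence.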
In a third aspect, an electronic device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above data processing method based on a visual language model when executing the computer program.

In a fourth aspect, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the above data processing method based on a visual language model. The technica