CN-121010837-B - Data processing method and electronic equipment
Abstract
The application discloses a data processing method and an electronic device, relating to the field of computer technology. The method comprises: acquiring an input text and an input image; generating text tokens corresponding to the input text; determining an image category of the input image; determining an activated visual feature extraction model according to the image category of the input image; performing visual feature extraction on the input image by using the activated visual feature extraction model; generating image tokens corresponding to the input image based on the visual features output by the activated visual feature extraction model; and inputting the text tokens and the image tokens into a pre-trained language model so that the pre-trained language model generates an answer corresponding to the input text and the input image. The application improves the processing performance of the pre-trained language model on different types of image data.
Inventors
- SHEN QIANG
- WANG SHENLING
- WU SHAOHUA
- ZHANG XIAOLAN
Assignees
- 浪潮电子信息产业股份有限公司 (Inspur Electronic Information Industry Co., Ltd.)
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-10-28
Claims (14)
- 1. A data processing method, comprising: acquiring an input text and an input image; generating text tokens corresponding to the input text; determining an image category of the input image, and determining an activated visual feature extraction model according to the image category of the input image, wherein the image category is the type of content represented by the input image and comprises any one of, or a combination of, text, documents and natural scenes; performing visual feature extraction on the input image by using the activated visual feature extraction model, and generating image tokens corresponding to the input image based on the visual features output by the activated visual feature extraction model; and inputting the text tokens and the image tokens into a pre-trained language model so that the pre-trained language model generates an answer corresponding to the input text and the input image; wherein determining the image category of the input image comprises: determining the image category of the input image using an image classification model; wherein generating image tokens corresponding to the input image based on the visual features output by the activated visual feature extraction model comprises: if a plurality of activated visual feature extraction models exist, unifying the visual features output by the plurality of activated visual feature extraction models into target visual features of a preset scale and preset resolution; and determining weights respectively corresponding to the plurality of activated visual feature extraction models, weighting the target visual features based on the weights corresponding to the plurality of activated visual feature extraction models, and generating the image tokens corresponding to the input image; wherein determining an activated visual feature extraction model according to the image category of the input image comprises: determining the probability of each category label output by the image classification model, and determining the visual feature extraction model corresponding to a category label whose probability is greater than a preset value as an activated visual feature extraction model; and correspondingly, determining weights respectively corresponding to the plurality of activated visual feature extraction models comprises: determining the weight corresponding to an activated visual feature extraction model as the probability of the corresponding category label output by the image classification model.
- 2. The data processing method according to claim 1, wherein the image classification model comprises an input layer, an initial convolution layer, a plurality of stage layers, a pooling layer and a multi-label classification layer which are sequentially connected, wherein the input layer is used for receiving the input image, the initial convolution layer comprises a convolution kernel of a preset size, each stage layer comprises a plurality of mobile inverted bottleneck convolution modules, the pooling layer is used for converting the feature map output by the last stage layer into a feature vector, and the multi-label classification layer comprises a plurality of parallel fully connected layers which are respectively used for outputting the probabilities of the corresponding category labels.
- 3. The data processing method according to claim 1, wherein determining an activated visual feature extraction model according to the image category of the input image further comprises: determining a base visual feature extraction model as an activated visual feature extraction model; and correspondingly, determining weights respectively corresponding to the plurality of activated visual feature extraction models further comprises: determining the weight corresponding to the base visual feature extraction model as 1.
- 4. The data processing method according to claim 3, wherein performing visual feature extraction on the input image by using the base visual feature extraction model comprises: dividing the input image into a plurality of non-overlapping image patches, and performing linear embedding on each image patch to map each image patch into an embedding vector of a preset dimension; constructing a learnable class token and applying position encoding to all tokens, wherein the tokens comprise the class token and the embedding vector corresponding to each image patch; inputting the position-encoded tokens into a multi-layer transformer encoder, wherein each transformer encoder layer comprises a multi-head self-attention module, a first residual connection and layer normalization, a feed-forward neural network, and a second residual connection and layer normalization which are sequentially connected; and outputting the feature vector corresponding to the class token.
- 5. The data processing method according to claim 1 or 3, further comprising, after determining the weight corresponding to an activated visual feature extraction model as the probability of the corresponding category label output by the image classification model: receiving, through an input interface, an adjustment command for the weight corresponding to the activated visual feature extraction model, and responding to the adjustment command to adjust the weight corresponding to the activated visual feature extraction model.
- 6. The data processing method according to claim 1, wherein determining the image category of the input image and determining an activated visual feature extraction model according to the image category of the input image comprises: receiving a model determination instruction through an input interface, and determining an activated visual feature extraction model based on the model determination instruction.
- 7. The data processing method according to claim 1, wherein generating image tokens corresponding to the input image based on the visual features output by the activated visual feature extraction model comprises: if only one activated visual feature extraction model exists, taking the visual features output by the activated visual feature extraction model as the image tokens corresponding to the input image.
- 8. The data processing method according to claim 1, wherein, if the activated visual feature extraction model includes a visual feature extraction model corresponding to text, the visual feature extraction model corresponding to text comprising an optical character recognition model based on a transformer architecture, performing visual feature extraction on the input image by using the activated visual feature extraction model comprises: dividing the input image into a plurality of non-overlapping image patches, performing linear embedding on each image patch to map each image patch into an embedding vector of a preset dimension, and applying position encoding to the embedding vectors; inputting the position-encoded embedding vectors into multi-layer transformer encoder layers to extract visual features, inputting the extracted visual features into a transformer decoder, and generating a text sequence by means of cross-attention in an autoregressive manner; and performing activation-function and beam-search processing on the text sequence to obtain a text recognition result as the visual features output by the visual feature extraction model corresponding to text.
- 9. The data processing method according to claim 1, wherein, if the activated visual feature extraction model includes a visual feature extraction model corresponding to documents, the visual feature extraction model corresponding to documents comprising a layout-aware multimodal document understanding model, performing visual feature extraction on the input image by using the activated visual feature extraction model comprises: performing text embedding, image embedding and position embedding on the input image to obtain a text vector, an image vector and a position vector, respectively; fusing the text vector, the image vector and the position vector by using a multimodal transformer encoder to obtain multimodal features; executing a unified mask prediction task based on the multimodal features to obtain a prediction result, wherein the prediction result comprises text information, image information and layout information; and inputting the prediction result into a task-specific head, and taking the document structure information or question-answering result output by the task-specific head as the visual features output by the visual feature extraction model corresponding to documents.
- 10. The data processing method according to claim 9, wherein the pre-training process of the layout-aware multimodal document understanding model comprises: constructing a training data set, and determining labeling results of the training document images in the training data set, wherein the labeling results comprise text bounding boxes and semantic category labels; and executing at least one cross-modal pre-training task based on the training data set to train the layout-aware multimodal document understanding model to understand the associations among text information, image information and layout information, wherein the pre-training task comprises any one of, or a combination of, a cross-modal alignment matching task, a unified mask prediction task and a document relation reasoning task.
- 11. The data processing method according to claim 1, wherein, if the activated visual feature extraction model includes a visual feature extraction model corresponding to natural scenes, the visual feature extraction model corresponding to natural scenes comprising an unsupervised vision transformer model based on label-free knowledge distillation, performing visual feature extraction on the input image by using the activated visual feature extraction model comprises: performing local cropping and global cropping on the input image through a data enhancement module to obtain a first enhanced image and a second enhanced image, respectively; extracting a first image patch token sequence of the first enhanced image through a student vision transformer backbone network, and extracting a second image patch token sequence of the second enhanced image through a teacher vision transformer backbone network; calculating a self-distillation loss based on the first image patch token sequence and the second image patch token sequence; and updating the parameters of the teacher network by momentum based on the self-distillation loss, and outputting the first image patch token sequence as the visual features output by the visual feature extraction model corresponding to natural scenes.
- 12. The data processing method according to claim 11, further comprising, after extracting the first image patch token sequence of the first enhanced image through the student vision transformer backbone network: inputting the multi-scale first image patch token sequence into a feature pyramid network for feature fusion to obtain a fused feature map; inputting the fused feature map into a shared prediction head network for feature encoding, and inputting the encoded features output by the shared prediction head network into a plurality of parallel task branches to obtain an instance segmentation result, wherein the instance segmentation result comprises the category of a target, bounding box coordinates and a pixel-level mask; and extracting corresponding instance region features from the first image patch token sequence based on the instance segmentation result; correspondingly, outputting the first image patch token sequence as the visual features output by the visual feature extraction model corresponding to natural scenes comprises: outputting the instance region features as the visual features output by the visual feature extraction model corresponding to natural scenes.
- 13. The data processing method according to claim 1, wherein, if the activated visual feature extraction model includes a visual feature extraction model corresponding to text, a visual feature extraction model corresponding to documents and a visual feature extraction model corresponding to natural scenes, generating image tokens corresponding to the input image based on the visual features output by the activated visual feature extraction model comprises: mapping the visual features output by the visual feature extraction model corresponding to text, the visual feature extraction model corresponding to documents and the visual feature extraction model corresponding to natural scenes to a preset feature dimension to obtain text sequence features, first spatial features and second spatial features, respectively; performing cross-modal attention computation with the text sequence features as the query, the first spatial features as the key and the second spatial features as the value to obtain a fused multimodal feature representation; and generating the image tokens corresponding to the input image based on the fused multimodal feature representation.
- 14. An electronic device, comprising: a memory for storing a computer program; and a processor for implementing the steps of the data processing method according to any one of claims 1 to 13 when executing the computer program.
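The probability-gated routing and weighted fusion recited in claim 1 can be illustrated with a minimal PyTorch sketch. This is not the patented implementation: the 0.5 activation threshold, the shared feature dimension, the target sequence length of 256, and the function and expert names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_visual_features(image, classifier, experts, threshold=0.5, target_len=256):
    """Route the image to the extractors whose label probability exceeds the
    threshold, unify their outputs to a common length, and fuse them weighted
    by those probabilities (the mechanism of claim 1)."""
    probs = torch.sigmoid(classifier(image)).squeeze(0)     # one probability per category label
    fused = None
    for idx, (name, expert) in enumerate(experts.items()):  # dict order assumed to match classifier outputs
        p = probs[idx].item()
        if p <= threshold:                                   # this extractor is not activated
            continue
        feats = expert(image)                                # (seq_len_i, dim); dim assumed shared
        # Unify the sequence length (the "preset scale / resolution") before weighting.
        feats = F.interpolate(feats.t().unsqueeze(0), size=target_len,
                              mode="linear", align_corners=False).squeeze(0).t()
        fused = p * feats if fused is None else fused + p * feats
    return fused                                             # None if no extractor was activated

# Toy wiring with stand-in modules (shapes and logits are made up):
experts = {
    "text":     lambda img: torch.randn(196, 768),
    "document": lambda img: torch.randn(100, 768),
    "scene":    lambda img: torch.randn(256, 768),
}
classifier = lambda img: torch.tensor([[2.0, -1.0, 0.7]])   # raw multi-label logits
image_token_source = fuse_visual_features(torch.randn(1, 3, 224, 224), classifier, experts)
```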
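Claim 2 outlines a multi-label classifier built from an initial convolution layer, stage layers of mobile inverted bottleneck modules, a pooling layer and parallel per-label heads. The sketch below follows that outline; the channel width, stage depths and three-label setup are assumptions, and the bottleneck block is heavily simplified (no squeeze-and-excitation, expansion schedule or downsampling).

```python
import torch
import torch.nn as nn

class InvertedBottleneck(nn.Module):
    """Heavily simplified mobile inverted bottleneck (MBConv-style) block."""
    def __init__(self, channels, expand=4):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # depthwise conv
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
    def forward(self, x):
        return x + self.block(x)                       # residual connection

class MultiLabelClassifier(nn.Module):
    """Initial conv -> stage layers -> pooling -> parallel per-label heads (claim 2)."""
    def __init__(self, num_labels=3, width=32, blocks_per_stage=(2, 2)):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, width, 3, stride=2, padding=1, bias=False),
                                  nn.BatchNorm2d(width), nn.SiLU())
        self.stages = nn.Sequential(*[InvertedBottleneck(width)
                                      for n in blocks_per_stage for _ in range(n)])
        self.pool = nn.AdaptiveAvgPool2d(1)            # feature map -> feature vector
        self.heads = nn.ModuleList([nn.Linear(width, 1) for _ in range(num_labels)])  # parallel FC heads
    def forward(self, x):
        f = self.pool(self.stages(self.stem(x))).flatten(1)
        return torch.cat([head(f) for head in self.heads], dim=1)   # one logit per category label

label_probs = torch.sigmoid(MultiLabelClassifier()(torch.randn(1, 3, 224, 224)))
```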
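Claim 4's base extractor follows the familiar ViT recipe: non-overlapping patches, linear embedding, a learnable class token, position encoding and a stack of transformer encoder layers. A compact sketch, with image size, patch size, depth and width chosen arbitrarily:

```python
import torch
import torch.nn as nn

class BaseVisualExtractor(nn.Module):
    """ViT-style base extractor of claim 4: non-overlapping patches, linear
    embedding, a learnable class token, position encoding, transformer encoder."""
    def __init__(self, img_size=224, patch=16, dim=768, depth=4, heads=8):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # split + linear embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                   # learnable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))     # position encoding
        # norm_first=False gives the post-norm order named in the claim:
        # self-attention -> residual + LayerNorm -> feed-forward -> residual + LayerNorm.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=False)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)        # (B, N, dim) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed      # prepend class token, add positions
        out = self.encoder(tokens)
        return out[:, 0], out[:, 1:]        # class-token feature vector, per-patch features

cls_feat, patch_feats = BaseVisualExtractor()(torch.randn(1, 3, 224, 224))
```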
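Claim 8's text-oriented extractor is a transformer encoder-decoder recogniser. The sketch below uses greedy decoding in place of the beam search named in the claim, and the vocabulary, model size, BOS token id and maximum output length are assumptions.

```python
import torch
import torch.nn as nn

class OCRRecognizer(nn.Module):
    """Transformer encoder-decoder text recogniser in the spirit of claim 8."""
    def __init__(self, vocab=100, dim=256, patch=16, img=224, heads=8, depth=2):
        super().__init__()
        num_patches = (img // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # patch + linear embedding
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))               # position encoding
        self.tok = nn.Embedding(vocab, dim)
        self.transformer = nn.Transformer(dim, heads, depth, depth, 4 * dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)
    def forward(self, image, max_len=16, bos_id=1):
        patches = self.patch_embed(image).flatten(2).transpose(1, 2) + self.pos
        memory = self.transformer.encoder(patches)                     # visual features
        ids = torch.full((image.shape[0], 1), bos_id, dtype=torch.long)
        for _ in range(max_len):                                       # autoregressive generation
            mask = self.transformer.generate_square_subsequent_mask(ids.shape[1])
            out = self.transformer.decoder(self.tok(ids), memory, tgt_mask=mask)  # cross-attention
            next_id = self.head(out[:, -1]).argmax(-1, keepdim=True)   # greedy pick (beam search in the claim)
            ids = torch.cat([ids, next_id], dim=1)
        return ids                                                     # recognised text token ids

text_ids = OCRRecognizer()(torch.randn(1, 3, 224, 224))
```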
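Claim 9's layout-aware document model combines text, image and position embeddings before a multimodal transformer encoder. A hedged sketch of that embedding fusion only; the vocabulary size, coordinate grid, region-feature dimension and depth are assumptions, and the mask-prediction objective and task-specific heads are omitted.

```python
import torch
import torch.nn as nn

class DocumentEncoder(nn.Module):
    """Layout-aware fusion of text, layout and image embeddings (claim 9)."""
    def __init__(self, vocab=30522, dim=768, max_coord=1000, depth=2, heads=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)
        self.coord_embed = nn.Embedding(max_coord, dim)        # shared for x0, y0, x1, y1
        self.image_proj = nn.Linear(2048, dim)                 # pooled visual region features
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
    def forward(self, token_ids, bboxes, region_feats):
        # token_ids: (B, T)   bboxes: (B, T, 4) in [0, max_coord)   region_feats: (B, R, 2048)
        layout = self.coord_embed(bboxes).sum(dim=2)           # position (layout) vector
        text = self.text_embed(token_ids) + layout             # text vector + position vector
        image = self.image_proj(region_feats)                  # image vector
        return self.encoder(torch.cat([text, image], dim=1))   # fused multimodal features

enc = DocumentEncoder()
doc_feats = enc(torch.randint(0, 30522, (1, 12)),
                torch.randint(0, 1000, (1, 12, 4)),
                torch.randn(1, 4, 2048))
```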
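Claim 11's scene-oriented extractor is trained by self-distillation between a student branch that sees local crops and a teacher branch that sees global crops, with the teacher updated by momentum. One training step might look like the following sketch; the crop functions, temperatures and momentum value are assumptions, the two backbones are assumed to share an architecture and return (B, N, D) token sequences, and an optimizer step on the student would normally follow the backward pass.

```python
import torch
import torch.nn.functional as F

def self_distillation_step(student, teacher, image, local_crop, global_crop,
                           momentum=0.996, temp_s=0.1, temp_t=0.04):
    """One DINO-style self-distillation step in the spirit of claim 11."""
    student_tokens = student(local_crop(image))             # first image patch token sequence
    with torch.no_grad():
        teacher_tokens = teacher(global_crop(image))        # second image patch token sequence
    # Self-distillation loss: pull the student's pooled distribution toward the teacher's.
    s_logits = student_tokens.mean(dim=1) / temp_s
    t_probs = F.softmax(teacher_tokens.mean(dim=1) / temp_t, dim=-1)
    loss = -(t_probs * F.log_softmax(s_logits, dim=-1)).sum(dim=-1).mean()
    loss.backward()                                          # an optimizer step on the student follows
    # Momentum (EMA) update of the teacher's parameters from the student's.
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
    return loss.item(), student_tokens                       # student tokens serve as the output features
```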
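Claim 13 fuses the three extractors' outputs with cross-modal attention, using the text-sequence features as the query, the document features as the key and the scene features as the value after projecting each stream to a preset dimension. A sketch with assumed per-stream dimensions and sequence lengths (attention requires the key and value sequences to have matching lengths, which the unification in claim 1 would provide):

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Cross-modal attention fusion of claim 13: text features as query,
    document features as key, natural-scene features as value."""
    def __init__(self, text_dim=512, doc_dim=768, scene_dim=768, dim=1024, heads=8):
        super().__init__()
        self.to_q = nn.Linear(text_dim, dim)     # map each stream to the preset feature dimension
        self.to_k = nn.Linear(doc_dim, dim)
        self.to_v = nn.Linear(scene_dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, text_feats, doc_feats, scene_feats):
        q, k, v = self.to_q(text_feats), self.to_k(doc_feats), self.to_v(scene_feats)
        fused, _ = self.attn(q, k, v)            # fused multimodal feature representation
        return fused                             # basis for the image tokens

fusion = CrossModalFusion()
tokens = fusion(torch.randn(1, 32, 512), torch.randn(1, 196, 768), torch.randn(1, 196, 768))
```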
Description
Data processing method and electronic equipment

Technical Field
The present application relates to the field of computer technology, and more particularly, to a data processing method and an electronic device.

Background
In the related art, a single visual feature extraction model (with the same structure and parameters) is generally employed to process all types of input images. However, different types of images have distinct feature distributions and task requirements, and if these different types of data are processed by one unified visual feature extraction model, the model may fail to adequately extract the features required for a particular task, resulting in reduced performance. Therefore, how to improve the processing performance of a pre-trained language model on different types of data is a technical problem to be solved by those skilled in the art.

Disclosure of Invention
The application aims to provide a data processing method, apparatus, device, storage medium and computer program product that improve the processing performance of a pre-trained language model on different types of data.
To achieve the above object, the application provides a data processing method comprising: acquiring an input text and an input image; generating text tokens corresponding to the input text; determining an image category of the input image, and determining an activated visual feature extraction model according to the image category of the input image; performing visual feature extraction on the input image by using the activated visual feature extraction model, and generating image tokens corresponding to the input image based on the visual features output by the activated visual feature extraction model; and inputting the text tokens and the image tokens into a pre-trained language model so that the pre-trained language model generates an answer corresponding to the input text and the input image.
To achieve the above object, the application provides a data processing apparatus comprising an acquisition module, a first generation module, a determination module, a second generation module and an input module, wherein the acquisition module is configured to acquire an input text and an input image; the first generation module is configured to generate text tokens corresponding to the input text; the determination module is configured to determine an image category of the input image and to determine an activated visual feature extraction model according to the image category of the input image; the second generation module is configured to perform visual feature extraction on the input image by using the activated visual feature extraction model and to generate image tokens corresponding to the input image based on the visual features output by the activated visual feature extraction model; and the input module is configured to input the text tokens and the image tokens into a pre-trained language model so that the pre-trained language model generates an answer corresponding to the input text and the input image.
To achieve the above object, the present application provides an electronic device comprising a memory for storing a computer program, and a processor for implementing the steps of the data processing method as described above when executing the computer program.
To achieve the above object, the present application provides a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the steps of the data processing method as described above.
To achieve the above object, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the data processing method as described above.
In the data processing method based on a pre-trained language model provided by the application, the category of the input image is first determined, and the corresponding visual feature extraction model is activated according to that category. This dynamic selection mechanism allows the system to adopt the most suitable feature extraction strategy for different types of images, which significantly improves the accuracy and pertinence of feature extraction. Next, visual feature extraction is performed on the input image using the activated visual feature extraction model, and image tokens are generated based on these features. The image tokens are input into the pre-trained language model together with the text tokens corresponding to the input text, enabling the pre-trained language model to understand and process multimodal data more accurately. Therefore, the application not only improves the processing performance of the pre-trained language model on different types of image data, but also enhances its overall performance in multimodal data fusion scenarios, thereby realizing more efficient and more accurate multimodal data processing.
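Read end to end, the disclosed flow routes the image through the activated extractors, turns the result into image tokens, and feeds them to the pre-trained language model alongside the text tokens. The sketch below is a hedged outline of that wiring only; every callable (embedding table, routed visual pipeline, language model) is a stand-in, not the application's implementation.

```python
import torch
import torch.nn as nn

def multimodal_answer(text_ids, image, embed_tokens, visual_pipeline, language_model):
    """Hedged outline of the disclosed flow: embed the text tokens, obtain image
    tokens from the routed visual extractors, concatenate both sequences, and let
    the pre-trained language model produce the answer."""
    text_embeds = embed_tokens(text_ids)                    # (B, T, dim) text-token embeddings
    image_tokens = visual_pipeline(image)                   # (B, I, dim) image tokens (claims 1, 7, 13)
    inputs = torch.cat([image_tokens, text_embeds], dim=1)  # prepend the visual context
    return language_model(inputs)                           # answer logits / generated answer

# Toy wiring with stand-in modules:
embed = nn.Embedding(32000, 768)
visual = lambda img: torch.randn(1, 64, 768)                # stands in for the routed extractors
lm = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 32000))
logits = multimodal_answer(torch.randint(0, 32000, (1, 16)),
                           torch.randn(1, 3, 224, 224), embed, visual, lm)
```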