CN-122024253-A - Dynamic-resolution visual language model training method, retrieval method, and device
Abstract
The application provides a training method, a retrieval method, and a device for a dynamic-resolution visual language model. The training method comprises: acquiring a multimodal document dataset and a plurality of query text feature vectors; and training a preset deep learning model so that, during training, the model detects each document image to generate a plurality of visual region maps of different resolutions corresponding to that image, extracts features from each visual region map and fuses them into an image feature vector, and computes a target loss from the image feature vector and the corresponding query text feature vector to update the model parameters, yielding a trained deep learning model. The trained model serves as a dynamic-resolution visual language model for retrieving, from the plurality of document images, the document page corresponding to a query text feature vector. The application effectively addresses the limited retrieval precision of conventional methods, whose fixed-resolution input ignores the complex layouts and visual details within documents.
Inventors
- YANG DONGMING
Assignees
- Beijing University of Posts and Telecommunications (北京邮电大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-09
Claims (10)
- 1. A method for training a dynamic-resolution visual language model, the method comprising: acquiring a multimodal document dataset and a plurality of query text feature vectors; training a preset deep learning model based on each query text feature vector and each document image in the multimodal document dataset, so that during training the deep learning model detects each document image to generate a plurality of visual region maps of different resolutions corresponding to that document image, extracts features from each visual region map and fuses them into an image feature vector, and computes a target loss from the image feature vector and the corresponding query text feature vector to update the model parameters, yielding a trained deep learning model; and using the trained deep learning model as a dynamic-resolution visual language model for retrieving, from a plurality of document images, the document page corresponding to a query text feature vector.
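For illustration only, the training loop recited in claim 1 can be sketched in Python. Every function below (`detect_regions`, `encode_and_fuse`, `target_loss`, `train`) is a toy stand-in for the claimed component of the same role, not the patented implementation, and the finite-difference update merely stands in for backpropagation.

```python
import math
import random

def l2_normalize(v):
    """Normalize a vector to unit length (zero vectors pass through)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def detect_regions(image):
    """Claim step: detect the document image and emit visual region maps
    at different resolutions (here just a global map and one toy crop)."""
    return [("global", image), ("entity", image[:2])]

def encode_and_fuse(regions, params):
    """Claim step: extract a feature per region map and fuse them into a
    single image feature vector (a weighted sum, purely illustrative)."""
    dim = len(params)
    fused = [0.0] * dim
    for _, region in regions:
        for i in range(dim):
            fused[i] += params[i] * sum(region)
    return l2_normalize(fused)

def target_loss(img_vec, txt_vec):
    """Claim step: target loss from the image feature vector and its
    matching query text feature vector (1 - cosine similarity here)."""
    return 1.0 - sum(a * b for a, b in zip(img_vec, txt_vec))

def train(images, query_vecs, dim=4, epochs=200, lr=0.1, eps=1e-4):
    """Claim step: update model parameters to minimize the target loss."""
    params = [random.uniform(0.1, 1.0) for _ in range(dim)]
    for _ in range(epochs):
        for image, txt_vec in zip(images, query_vecs):
            regions = detect_regions(image)
            # Coordinate-wise finite-difference descent: a stand-in for
            # gradient-based parameter updates, kept dependency-free.
            for i in range(dim):
                base = target_loss(encode_and_fuse(regions, params), txt_vec)
                params[i] += eps
                bumped = target_loss(encode_and_fuse(regions, params), txt_vec)
                params[i] -= eps
                params[i] -= lr * (bumped - base) / eps
    return params
```

In practice the claimed deep learning model would be trained with an autograd framework; the sketch only mirrors the claimed sequence detect → extract and fuse → compute target loss → update parameters.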
- 2. The dynamic-resolution visual language model training method according to claim 1, wherein the deep learning model comprises: a dynamic-resolution image extraction layer for detecting and segmenting each semantic entity in the document image, so as to output a visual representation structure graph corresponding to the document image, each visual region map, and the bounding-box coordinates corresponding to each semantic entity, wherein the visual representation structure graph consists of nodes and edges between the nodes, the nodes are the visual region maps, the edges are the spatial-adjacency, semantic-correlation, or logical-membership relations between the visual regions, and the visual region maps comprise a global map, combined region maps, and entity region maps; a visual feature refinement layer for extracting the feature vector corresponding to each visual region map and, based on the bounding-box coordinates and the visual representation structure graph, aligning and fusing these feature vectors to output an image feature vector, wherein the feature vectors comprise a global-map visual feature vector, combined-region visual feature vectors, and entity-region visual feature vectors; and a cross-modal alignment layer for computing a late-interaction similarity between the image feature vector and the query text feature vector to obtain a similarity score, aligning the image feature vector and the query text feature vector with a symmetric contrastive loss function to obtain a symmetric contrastive loss, combining the symmetric contrastive loss with each auxiliary task loss to obtain the target loss, and updating the model parameters of the deep learning model to output the trained deep learning model, wherein the auxiliary tasks comprise an image-text contrastive task, an image-grounded text generation task, and an image-text matching task.
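The cross-modal alignment layer of claim 2 names a late-interaction similarity and a symmetric contrastive loss without fixing their form. A common reading (ColBERT-style max-sum interaction, and bidirectional InfoNCE over a batch of matched pairs) is sketched below; both formulas are assumptions, not quotations from the patent, and the temperature value is illustrative.

```python
import math

def late_interaction_score(query_tokens, image_patches):
    """Late-interaction similarity: each query token vector is matched to
    its best-scoring image patch vector, and the maxima are summed."""
    return sum(
        max(sum(q * p for q, p in zip(qt, patch)) for patch in image_patches)
        for qt in query_tokens
    )

def symmetric_contrastive_loss(img_vecs, txt_vecs, temperature=0.07):
    """Symmetric (image-to-text plus text-to-image) InfoNCE over a batch
    in which img_vecs[i] and txt_vecs[i] are the matched pair."""
    n = len(img_vecs)
    sims = [[sum(a * b for a, b in zip(iv, tv)) / temperature
             for tv in txt_vecs] for iv in img_vecs]

    def nll(row, target):
        # Numerically stable -log softmax(row)[target].
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        return log_z - row[target]

    loss_i2t = sum(nll(sims[i], i) for i in range(n)) / n
    cols = [[sims[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(nll(cols[j], j) for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

The symmetry refers to averaging the two retrieval directions, so neither image-to-text nor text-to-image alignment dominates training.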
- 3. The dynamic-resolution visual language model training method according to claim 2, wherein the dynamic-resolution image extraction layer comprises: an image detection unit for detecting and segmenting each semantic entity in the document image to obtain, for each semantic entity, an original region map, bounding-box coordinates, an entity type, and a visual complexity score; an image division unit for dividing the original region maps into two classes, key regions and non-key regions, based on the entity types and the visual complexity scores, determining the resolution corresponding to the key regions and to the non-key regions respectively, and enhancing each original region map belonging to the non-key regions; and an image fusion unit for merging original region maps belonging to the same logical unit according to preset logic to obtain combined region maps, and constructing a visual representation structure graph stored as a graph structure.
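As an illustration of the image division unit in claim 3, the rule below assigns each detected region to the key or non-key class, and a target resolution, from its entity type and visual complexity score. The type list, threshold, and resolution values are assumed examples; the patent does not disclose them.

```python
# Entity types assumed (for this sketch) to always need high resolution.
DETAIL_HEAVY_TYPES = {"table", "formula", "chart", "small_text"}

def assign_resolution(entity_type, complexity_score,
                      hi_res=1024, lo_res=256, threshold=0.5):
    """Return (region_class, target_resolution) for one detected region.

    A region is "key" if its entity type is detail-heavy or its visual
    complexity score reaches the threshold; otherwise it is "non-key"
    and kept at the lower resolution to save computation.
    """
    is_key = entity_type in DETAIL_HEAVY_TYPES or complexity_score >= threshold
    return ("key", hi_res) if is_key else ("non-key", lo_res)
```

This is the mechanism that lets the model spend resolution where dense tables or small text demand it, while plain paragraphs stay cheap.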
- 4. The dynamic-resolution visual language model training method of claim 2, wherein the visual feature refinement layer comprises: a visual feature encoding unit for extracting the feature vector corresponding to each visual region map while preserving its position information, to obtain a feature map sequence; a regional feature fusion unit for aligning the combined-region visual feature vectors and the entity-region visual feature vectors with a convolutional network, based on the feature map sequence and the bounding-box coordinates and taking the global-map visual feature vector as the reference, and fusing the global-map visual feature vector with the aligned combined-region and entity-region visual features by means of an attention mechanism, to obtain a document visual feature vector; and a visual feature enhancement unit for updating the feature vector of each node from the feature vectors of its neighboring nodes in the visual representation structure graph, so as to enhance the document visual feature vector and obtain the image feature vector.
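The fusion step of claim 4 (global-map feature fused with aligned region features by an attention mechanism) can be sketched as single-head dot-product attention in which the global feature acts as the query and the region features as keys and values. This is one plausible instantiation, not the claimed architecture.

```python
import math

def attention_fuse(global_vec, region_vecs):
    """Fuse a global-map feature with aligned region features.

    The global feature is the attention query; each region feature is
    both key and value. The attended value is added back to the global
    feature (a residual combination of global context and local detail).
    """
    d = len(global_vec)
    scores = [sum(g * r for g, r in zip(global_vec, rv)) / math.sqrt(d)
              for rv in region_vecs]
    # Softmax over the region scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    attended = [sum(w * rv[i] for w, rv in zip(weights, region_vecs))
                for i in range(d)]
    return [g + a for g, a in zip(global_vec, attended)]
```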
- 5. The dynamic-resolution visual language model training method according to claim 1, further comprising, before said training of the preset deep learning model based on each query text feature vector and each document image in the multimodal document dataset: optimizing the dynamic-resolution image extraction layer of the deep learning model with a weighted cross-entropy loss and a regression loss; and pre-training the visual feature refinement layer of the deep learning model on preset image-text data and retaining the resulting model parameters.
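Claim 5 names a weighted cross-entropy loss and a regression loss for pre-optimizing the extraction layer. A minimal sketch follows, assuming smooth L1 as the regression loss for bounding-box coordinates; the patent does not specify the concrete regression loss, and the class weights are illustrative.

```python
import math

def weighted_cross_entropy(probs, target_idx, class_weights):
    """Weighted CE for the detector's entity-type classification: rarer
    entity classes can be up-weighted. `probs` are softmax outputs."""
    return -class_weights[target_idx] * math.log(probs[target_idx])

def smooth_l1(pred_box, true_box, beta=1.0):
    """Smooth L1 over bounding-box coordinates: quadratic for small
    errors (|d| < beta), linear for large ones."""
    total = 0.0
    for p, t in zip(pred_box, true_box):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total
```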
- 6. The dynamic-resolution visual language model training method of claim 1, further comprising, prior to said acquiring of the multimodal document dataset and the plurality of query text feature vectors: collecting documents of all document types and storing each document as document images at a preset resolution to construct the multimodal document dataset; constructing a document retrieval dataset comprising query texts and the document pages corresponding to the query texts; and tokenizing and encoding the query texts to obtain the query text feature vectors.
- 7. A document retrieval method based on a dynamic-resolution visual language model, the method comprising: acquiring a query text and a document image library, and tokenizing and encoding the query text to obtain a query text feature vector; extracting visual features of each document page in the document image library with the dynamic-resolution visual language model to construct a hierarchical index; performing coarse retrieval over the document image library with the dynamic-resolution visual language model, based on the query text feature vector and the hierarchical index, to obtain candidate document pages, wherein the dynamic-resolution visual language model is trained by the training method according to any one of claims 1 to 6; and comprehensively scoring each candidate document page with a combined multidimensional score to generate a ranking result, wherein the multidimensional score comprises visual similarity, text semantic matching degree, and layout structure consistency.
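The final scoring stage of claim 7 combines visual similarity, text semantic matching, and layout structure consistency into one ranking score. A linear combination is the simplest reading; the weights below are illustrative assumptions, not values from the patent.

```python
def combined_score(visual_sim, text_match, layout_consistency,
                   weights=(0.5, 0.3, 0.2)):
    """Combine the three multidimensional scores into one ranking score."""
    wv, wt, wl = weights
    return wv * visual_sim + wt * text_match + wl * layout_consistency

def rank_pages(candidates):
    """Re-rank coarse-retrieval candidates best-first.

    candidates: list of (page_id, visual_sim, text_match, layout_score).
    """
    scored = [(combined_score(v, t, l), pid) for pid, v, t, l in candidates]
    return [pid for _, pid in sorted(scored, reverse=True)]
```

Coarse retrieval over the hierarchical index keeps the candidate set small; only those candidates pay the cost of the full multidimensional scoring.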
- 8. The document retrieval method based on a dynamic-resolution visual language model as claimed in claim 7, further comprising, after said obtaining of the candidate document pages: inputting the document images corresponding to the document pages, together with the query text, into a multimodal large language model, generating a natural-language answer, and highlighting the corresponding regions in the document.
- 9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the dynamic-resolution visual language model training method according to any one of claims 1 to 6 and/or the document retrieval method based on a dynamic-resolution visual language model according to any one of claims 7 to 8.
- 10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the dynamic-resolution visual language model training method according to any one of claims 1 to 6 and/or the document retrieval method based on a dynamic-resolution visual language model according to any one of claims 7 to 8.
Description
Dynamic-resolution visual language model training method, retrieval method, and device

Technical Field

The application relates to the fields of artificial intelligence and multimodal information retrieval, and in particular to a dynamic-resolution visual language model training method, a retrieval method, and a device.

Background

With the advance of digital transformation, document processing has evolved from plain text to multimodal forms that include images, tables, charts, formulas, and the like. Traditional document retrieval systems mainly extract text content with optical character recognition (OCR) and then perform text-based retrieval; however, OCR accuracy drops on complex formats, multi-column layouts, handwriting, slanted text, and low-resolution images, and the visual structure and semantic associations of the document are not preserved. In addition, because non-text elements such as charts, flowcharts, and mathematical formulas often carry important information, text-only retrieval loses this information, further harming the completeness and accuracy of retrieval results. In recent years, document retrieval methods based on visual language models have emerged; however, most existing methods adopt fixed-resolution input, making it difficult to preserve both global layout information and local detail features under limited computing resources. In particular, for documents containing dense tables, small-size text, and complex charts, a fixed low-resolution input blurs details and hampers the model's capture of key information, while a fixed high-resolution input markedly increases the computational burden and reduces retrieval efficiency.
In addition, existing cross-modal alignment methods are mostly based on global feature matching and lack modeling of the spatial and semantic relationships between entities inside a document (such as titles, paragraphs, charts, and tables), leaving models limited when understanding complex document structures. Therefore, achieving multi-granularity, high-fidelity extraction of document visual features under limited computing resources, and establishing an effective image-text alignment mechanism, are key challenges for improving multimodal document retrieval performance.

Disclosure of Invention

In view of the foregoing, embodiments of the present application provide a dynamic-resolution visual language model training method, retrieval method, and device, so as to eliminate or mitigate one or more drawbacks of the prior art. A first aspect of the present application provides a dynamic-resolution visual language model training method, the method comprising: acquiring a multimodal document dataset and a plurality of query text feature vectors; training a preset deep learning model based on each query text feature vector and each document image in the multimodal document dataset, so that during training the deep learning model detects each document image to generate a plurality of visual region maps of different resolutions corresponding to that document image, extracts features from each visual region map and fuses them into an image feature vector, and computes a target loss from the image feature vector and the corresponding query text feature vector to update the model parameters, yielding a trained deep learning model; and using the trained deep learning model as a dynamic-resolution visual language model for retrieving, from a plurality of document images, the document pages corresponding to the query text feature vectors.

In some embodiments of the application, the deep learning model comprises: a dynamic-resolution image extraction layer for detecting and segmenting each semantic entity in the document image, so as to output a visual representation structure graph corresponding to the document image, each visual region map, and the bounding-box coordinates corresponding to each semantic entity, wherein the visual representation structure graph consists of nodes and edges between the nodes, the nodes are the visual region maps, the edges are the spatial-adjacency, semantic-correlation, or logical-membership relations between the visual regions, and the visual region maps comprise a global map, combined region maps, and entity region maps; and a visual feature refinement layer for extracting the feature vector corresponding to each visual region map, and aligning and fusing the feature vectors corresponding to the visual region maps