
CN-121982461-A - Scene information extraction method and system based on multi-mode large model

CN121982461A

Abstract

The invention relates to the technical field of information extraction and discloses a scene information extraction method and system based on a multi-modal large model. The method comprises the following steps: generating an initial feature vector by combining optical character recognition with original layout data; fusing text and image features through an attention mechanism to form a cross-domain adaptive intermediate representation; for insufficient distinction between handwriting and print, enhancing local details with a neural network to generate refined features; propagating relationships among nodes through a graph reasoning model; normalizing the output for consistency; and finally generating accurate key value pairs. The method achieves efficient fusion and cross-domain adaptation of multi-modal information, improves the accuracy and consistency of document processing, and intelligently processes diversified documents.
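The abstract describes a staged pipeline: OCR plus layout data, attention-based fusion, local detail enhancement, graph reasoning, and key value output. The control flow can be sketched as below; every function name and all stubbed behavior are illustrative assumptions, not details from the patent.

```python
# Hypothetical skeleton of the extraction pipeline outlined in the abstract.
# Every stage is a stub; a real system would wrap OCR, attention fusion,
# detail-enhancement, and graph-reasoning models behind these names.

def ocr_with_layout(image):
    # Stand-in for OCR: return (text, bounding box) pairs.
    return [("Invoice No.", (0.1, 0.1, 0.4, 0.15)),
            ("INV-2024-001", (0.45, 0.1, 0.8, 0.15))]

def fuse_features(tokens):
    # Stand-in for multi-modal fusion: pair each token with its box.
    return [{"text": t, "bbox": b} for t, b in tokens]

def enhance_details(features):
    # Stand-in for local detail enhancement (e.g. handwriting vs. print).
    for f in features:
        f["is_handwritten"] = False
    return features

def graph_extract(features):
    # Stand-in for graph reasoning: adjacent tokens become a key/value pair.
    pairs = {}
    for key, value in zip(features[::2], features[1::2]):
        pairs[key["text"]] = value["text"]
    return pairs

def extract(image):
    return graph_extract(enhance_details(fuse_features(ocr_with_layout(image))))

print(extract(None))  # {'Invoice No.': 'INV-2024-001'}
```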

Inventors

  • LIU YUN
  • SHEN HAOYANG
  • CHEN KAIXING
  • WANG QIANSHENG
  • ZHENG BOWEI

Assignees

  • 交通银行股份有限公司深圳分行 (Bank of Communications Co., Ltd., Shenzhen Branch)

Dates

Publication Date
2026-05-05
Application Date
2025-12-26

Claims (8)

  1. A method for extracting general scene information based on a multi-modal large model, executed by a computer and comprising the following steps: acquiring an initial feature vector of a document image after multi-modal fusion, determining related components in the initial feature vector, and feeding the related components into an attention mechanism to acquire a unified text-image part; fusing the unified text-image part in the initial feature vector by means of the attention mechanism to obtain a cross-domain adaptive intermediate representation; performing local detail enhancement on the intermediate representation to obtain an enhanced feature representation, and weighting local features of the feature representation by means of an attention mechanism to obtain a refined structural feature representation; constructing a graph inference model based on the structural feature representation to propagate relationships among nodes, extracting node attributes, and generating accurate key value pairs; labeling the key value pairs to obtain a final key value pair output, and performing cross-domain adaptive iterative optimization on the final key value pair output to obtain an iterative optimization result; and fusing the iterative optimization result with historical data of various document types by means of a difference feature optimization method to obtain a reinforced universal framework component.
  2. The method for extracting general scene information based on a multi-modal large model according to claim 1, wherein acquiring the initial feature vector of the document image after multi-modal fusion, determining the related components in the initial feature vector, feeding the related components into an attention mechanism, and obtaining the unified text-image part comprises the following steps: extracting text elements from the document image by optical character recognition, and generating original layout data containing text content based on the text elements combined with image pixel information; generating, according to the text element set and the original layout data, an initial feature vector containing text and layout information after multi-modal fusion by adopting a multi-modal fusion method; extracting a text semantic feature set from the initial feature vector by adopting a convolutional neural network, and generating weighted semantic features through an attention mechanism based on the text semantic feature set; evaluating the typesetting complexity of the weighted semantic features by adopting a graph neural network, and determining that the initial feature vector contains a related component if the typesetting complexity score exceeds a preset complexity score threshold; and adaptively adjusting, based on the related components, the fusion proportion of text and image features through an attention mechanism to generate the unified text-image part.
  3. The method for extracting general scene information based on a multi-modal large model according to claim 1, wherein fusing the unified text-image part in the initial feature vector by means of the attention mechanism to obtain the cross-domain adaptive intermediate representation comprises the following steps: calculating association weights between text features and image features by adopting a multi-head self-attention mechanism according to the initial feature vector; and performing weighted fusion of the text features and the image features based on the association weights, and introducing a domain-adaptive regularization term to obtain the cross-domain adaptive intermediate representation.
  4. The method for extracting general scene information based on a multi-modal large model according to claim 1, wherein performing local detail enhancement based on the intermediate representation to obtain an enhanced feature representation, and weighting local features of the enhanced feature representation by adopting an attention mechanism to obtain a refined structural feature representation comprises the following steps: performing key field coverage enhancement based on the intermediate representation, and detecting the degree of distinction between handwritten and printed content based on the enhanced intermediate representation; if the degree of distinction does not reach a preset distinction standard, extracting local detail features and fusing them with the enhanced intermediate representation to obtain a refined language-format representation; extracting node relationships based on the refined language-format representation, constructing a relationship graph to acquire node representations, and reinforcing node connections through a graph neural network to generate a consistency feature set; and generating and outputting a refined structural feature representation through fully connected layer mapping and contrastive learning based on the consistency feature set.
  5. The method for extracting general scene information based on a multi-modal large model according to claim 1, wherein constructing a graph inference model based on the structural feature representation to propagate relationships among nodes, extracting node attributes, and generating accurate key value pairs comprises the following steps: standardizing the format of the structural features by adopting a preprocessing method to acquire preprocessed standardized initial data; constructing a graph inference model according to the standardized initial data, and iteratively propagating node relationships based on a relationship propagation mechanism to generate a propagation result; extracting node attributes based on the propagation result to generate a preliminary key value pair set; and performing normalization judgment on the preliminary key value pair set, adjusting the attribute output, and detecting and correcting residual misrecognitions to generate accurate key value pairs.
  6. The method for extracting general scene information based on a multi-modal large model according to claim 1, wherein labeling the key value pairs to obtain the final key value pair output, and performing cross-domain adaptive iterative optimization on the final key value pair output to obtain the iterative optimization result comprises the following steps: performing preliminary labeling on the key value pair input data through a preset key value pair template to generate an initial key value pair set; if the initial key value pair set has missing information, supplementing the missing label fields to generate an updated key value pair set; performing data fusion and boundary division optimization on the updated key value pair set to obtain a boundary-optimized key value pair set; and performing classification misrecognition correction and cross-domain adaptation optimization on the boundary-optimized key value pair set to generate an adapted key value pair set, optimizing through an iterative adjustment algorithm, and outputting the iterative optimization result.
  7. The method for extracting general scene information based on a multi-modal large model according to claim 1, wherein fusing the iterative optimization result with historical data of various document types by means of the difference feature optimization method to obtain the reinforced universal framework component comprises the following steps: generating an initial feature set by adopting a key value pair extraction algorithm based on the iterative optimization result; merging historical data of various document types based on the initial feature set to generate a fused feature vector; verifying cross-domain feature consistency by adopting the difference feature optimization method based on the fused feature vector, and applying principal component analysis to extract and enhance the structural features of the document to obtain a feature-enhanced data set; and constructing the reinforced universal framework component based on the feature-enhanced data set.
  8. A general scene information extraction system based on a multi-modal large model, comprising: a data acquisition module, configured to acquire an initial feature vector of a document image after multi-modal fusion, determine related components in the initial feature vector, and feed the related components into an attention mechanism to acquire a unified text-image part; a feature fusion module, configured to fuse the unified text-image part in the initial feature vector by means of the attention mechanism to obtain a cross-domain adaptive intermediate representation; a detail enhancement module, configured to perform local detail enhancement on the intermediate representation to obtain an enhanced feature representation, and weight local features of the feature representation by means of an attention mechanism to obtain a refined structural feature representation; a standard output module, configured to construct a graph inference model based on the structural feature representation to propagate relationships among nodes, extract node attributes, and generate accurate key value pairs; an iterative optimization module, configured to label the key value pairs to obtain a final key value pair output, and perform cross-domain adaptive iterative optimization on the final key value pair output to obtain an iterative optimization result; and a fusion optimization module, configured to fuse the iterative optimization result with historical data of various document types by means of a difference feature optimization method to obtain a reinforced universal framework component.
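Claim 3 fuses text and image features by computing association weights with a self-attention mechanism. A minimal single-head NumPy sketch of that fusion step follows; the shapes, random inputs, and absence of learned projection matrices are all simplifying assumptions, not details from the patent.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(text_feats, image_feats):
    """Fuse text and image features with scaled dot-product attention.

    text_feats:  (n_tokens, d) text features, used as queries
    image_feats: (n_regions, d) image features, used as keys and values
    Returns a (n_tokens, d) fused representation: each text token is a
    weighted combination of image regions (the association weights of
    claim 3).
    """
    d = text_feats.shape[-1]
    weights = softmax(text_feats @ image_feats.T / np.sqrt(d))
    return weights @ image_feats

rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))    # 4 text tokens
image = rng.standard_normal((6, 8))   # 6 image regions
fused = attention_fuse(text, image)
print(fused.shape)  # (4, 8)
```

Because each output row is a convex combination of the image-region rows, every fused value stays within the per-dimension range of the image features.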
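Claims 1 and 5 construct a graph inference model that propagates relationships among nodes before node attributes are extracted. One common concrete form of such propagation, used here purely as an illustrative assumption, is mean-aggregation message passing over an adjacency matrix:

```python
import numpy as np

def propagate(node_feats, adjacency, steps=2):
    """Mean-aggregation message passing over a relationship graph.

    node_feats: (n, d) initial node features
    adjacency:  (n, n) 0/1 matrix, 1 where two nodes are related
    Each step replaces a node's features with the average of itself and
    its neighbours, spreading relational context through the graph.
    """
    # Add self-loops, then row-normalise so each row sums to 1.
    a = adjacency + np.eye(len(adjacency))
    a = a / a.sum(axis=1, keepdims=True)
    for _ in range(steps):
        node_feats = a @ node_feats
    return node_feats

feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])  # chain: 0 - 1 - 2
result = propagate(feats, adj, steps=1)
print(result)
```

After one step, node 0 averages itself with node 1, and node 2 (initially all zeros) picks up half of node 1's features, illustrating how relational context reaches nodes that carried no signal of their own.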

Description

Scene information extraction method and system based on multi-mode large model

Technical Field

The invention relates to the technical field of information extraction, and discloses a scene information extraction method and system based on a multi-modal large model.

Background

At present, key information extraction is widely applied to voucher processing in industries such as finance, medical care, and logistics. With the acceleration of digital transformation, the information extraction requirements for vouchers such as invoices, contracts, and medical reports are growing rapidly, making extraction a key link in promoting intelligent applications. In the prior art, information extraction typically relies on models or rules tailored to a single voucher type, which are difficult to adapt when faced with vouchers of varying formats and complex content. For example, a model designed for invoices may not accurately identify key fields in a medical report because of large differences in semantics and typesetting. When multi-modal information is processed, multiple data sources such as text and images are often difficult to fuse effectively, reducing the accuracy of information extraction. Limitations also arise when complex scenes are processed, and diversified business requirements are difficult to meet. The prior art therefore lacks universality, struggles with complex cross-domain scenes, cannot comprehensively capture information by relying solely on text analysis or image processing, and easily misses or misidentifies key fields.

A unified key value pair format for standardizing the output of different credential types is also lacking, so that when diversified credentials are processed, the output structured results are inconsistent; at the same time, unified structured data is difficult to adapt and output quickly, which increases subsequent service costs.

Disclosure of Invention

The invention provides a scene information extraction method and system based on a multi-modal large model, which achieve efficient fusion and cross-domain adaptation of multi-modal information, improve the accuracy and consistency of document processing, and intelligently process diversified documents. To solve the above technical problems, the invention provides a method for extracting scene information based on a multi-modal large model, comprising: acquiring an initial feature vector of a document image after multi-modal fusion, determining related components in the initial feature vector, and feeding the related components into an attention mechanism to acquire a unified text-image part; fusing the unified text-image part in the initial feature vector by means of the attention mechanism to obtain a cross-domain adaptive intermediate representation; performing local detail enhancement on the intermediate representation to obtain an enhanced feature representation, and weighting local features of the feature representation by means of an attention mechanism to obtain a refined structural feature representation; constructing a graph inference model based on the structural feature representation to propagate relationships among nodes, extracting node attributes, and generating accurate key value pairs; labeling the key value pairs to obtain a final key value pair output, and performing cross-domain adaptive iterative optimization on the final key value pair output to obtain an iterative optimization result; and fusing the iterative optimization result with historical data of various document types by means of a difference feature optimization method to obtain a reinforced universal framework component.

In one implementation manner, acquiring the initial feature vector of the document image after multi-modal fusion, determining the related components in the vector, feeding the related components into an attention mechanism, and obtaining the unified text-image part includes: extracting text elements from the document image by optical character recognition, and generating original layout data containing text content based on the text elements combined with image pixel information; generating, according to the text element set and the original layout data, an initial feature vector containing text and layout information after multi-modal fusion by adopting a multi-modal fusion method; extracting a text semantic feature set from the initial feature vector by adopting a convolutional neural network, and generating weighted semantic features through an attention mechanism based on the text semantic feature set; and evaluating the typesetting complexity of the weighted semantic features by adopting a graph neural network, and determining that the initial feature vector contains a related component if the typesetting complexity score exceeds a preset complexity score threshold.
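The description above combines OCR text elements with original layout (bounding-box) data into one initial feature vector. A minimal illustration of such a fusion follows; the hash-based text embedding and page-normalized box coordinates are assumptions made for the sketch, not the patent's multi-modal fusion method.

```python
import hashlib
import numpy as np

def text_embedding(text, dim=8):
    # Toy deterministic embedding: hash the text into `dim` floats in [0, 1].
    # A real system would use a learned text encoder here.
    h = hashlib.sha256(text.encode()).digest()
    return np.frombuffer(h[:dim], dtype=np.uint8) / 255.0

def fuse_token(text, bbox, page_w, page_h):
    """Concatenate a text embedding with page-normalised layout coordinates.

    bbox is (x0, y0, x1, y1) in pixels; the result is one initial feature
    vector carrying both textual content and layout position.
    """
    x0, y0, x1, y1 = bbox
    layout = np.array([x0 / page_w, y0 / page_h, x1 / page_w, y1 / page_h])
    return np.concatenate([text_embedding(text), layout])

vec = fuse_token("Invoice No.", (100, 50, 300, 80), page_w=1000, page_h=800)
print(vec.shape)  # (12,)
```

The last four entries of the vector are the normalized box coordinates, so downstream models can attend to both what a token says and where it sits on the page.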