CN-122021824-A - Text-knowledge-constrained multimodal information extraction method and system

Abstract

The invention relates to the technical field of multimodal information processing and discloses a text-knowledge-constrained multimodal information extraction method and system. The method comprises: performing multimodal information segmentation on a chemical document to be processed to obtain text information and chart information; processing the text information to generate a text knowledge graph containing a plurality of triples; extracting information from the chart information under the constraint of the text knowledge graph to obtain a preliminary structured extraction result in triple form; retrieving a related knowledge subgraph from the text knowledge graph based on the preliminary structured extraction result, and verifying and completing that result to generate a chart extraction result; and fusing the chart extraction result with the text knowledge graph and updating to obtain a document-level multimodal knowledge graph. The invention overcomes shortcomings of the prior art in multimodal information extraction from chemical literature and provides a path toward constructing a high-quality, interpretable chemical knowledge graph.
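
The subgraph retrieval and the verification and completion steps described above can be illustrated with a minimal sketch. It is not the patented implementation: the `Triple` class, the function names, the status strings and the exact-string matching are all assumptions made for illustration, and a real system would presumably use entity alignment and a large model rather than string equality.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Triple:
    """Hypothetical triple; tail=None marks a semantically incomplete chart triple."""
    head: str
    relation: str
    tail: Optional[str]


def retrieve_subgraph(text_kg, chart_triples):
    """Return the text triples that share an entity with any chart triple."""
    entities = {t.head for t in chart_triples}
    entities |= {t.tail for t in chart_triples if t.tail}
    return [t for t in text_kg if t.head in entities or t.tail in entities]


def verify_and_complete(chart_triples, subgraph):
    """Check each chart triple against the subgraph and fill missing tails.

    Returns (triple, status) pairs. The status names are assumptions:
    consistent, conflict, chart_only, completed, incomplete.
    """
    results = []
    for t in chart_triples:
        # Text triples that talk about the same entity and relation.
        same_slot = [s for s in subgraph
                     if s.head == t.head and s.relation == t.relation]
        if t.tail is None:
            # Completion: borrow the missing value from the text knowledge.
            if same_slot:
                filled = Triple(t.head, t.relation, same_slot[0].tail)
                results.append((filled, "completed"))
            else:
                results.append((t, "incomplete"))
        elif not same_slot:
            results.append((t, "chart_only"))
        elif any(s.tail == t.tail for s in same_slot):
            results.append((t, "consistent"))
        else:
            results.append((t, "conflict"))
    return results
```

In this sketch a chart triple whose value disagrees with the text is only flagged as a conflict; how the conflict is resolved is left to the later steps of the method.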

Inventors

WU LE
YANG FAN
ZHANG KUN

Assignees

Hefei University of Technology (合肥工业大学)

Dates

Publication Date: 20260512
Application Date: 20260112

Claims (9)

1. A text-knowledge-constrained multimodal information extraction method, comprising: performing multimodal information segmentation on a chemical document to be processed to obtain text information and chart information; processing the text information to generate a text knowledge graph containing a plurality of triples; extracting information from the chart information under the constraint of the text knowledge graph, which comprises classifying the chart information to determine a chart type, calling a corresponding agent tool according to the chart type to obtain an explicit extraction result of the chart information, inputting the explicit extraction result, the chart information and a corresponding basic description text into a multimodal large model to obtain a preliminary structured extraction result in triple form, and retrieving a related knowledge subgraph from the text knowledge graph based on the preliminary structured extraction result; verifying and completing the preliminary structured extraction result using the knowledge subgraph to generate a chart extraction result; and fusing the chart extraction result with the text knowledge graph and updating to obtain a document-level multimodal knowledge graph.
2. The text-knowledge-constrained multimodal information extraction method of claim 1, wherein processing the text information to generate a text knowledge graph containing a plurality of triples specifically comprises: performing extraction on the text information using a large language model to obtain the subject of the document; performing shingled segmentation on the text information based on that subject; assessing the relevance of the resulting paragraphs and retaining the strongly relevant paragraphs whose relevance to the subject of the document is higher than a threshold; and extracting structured information from the strongly relevant paragraphs using the large language model to generate the triples and construct the text knowledge graph.
3. The text-knowledge-constrained multimodal information extraction method of claim 1, wherein the triples include entities, relations and entity attributes, to express core facts and logical relationships in the chemical literature.
4. The text-knowledge-constrained multimodal information extraction method of claim 1, wherein verifying and completing the preliminary structured extraction result using the knowledge subgraph to generate a chart extraction result specifically comprises: performing a consistency check between the entities in the preliminary structured extraction result and the entities in the knowledge subgraph; performing a relation plausibility check between the relations in the preliminary structured extraction result and the relations in the knowledge subgraph; and, when the semantics of the preliminary structured extraction result are incomplete, retrieving information from the knowledge subgraph to supplement the semantics of the preliminary structured extraction result.
5. The text-knowledge-constrained multimodal information extraction method of claim 1, further comprising a conflict handling step: when information in the text knowledge graph conflicts with the preliminary structured extraction result, the conflicting information is introduced as context to trigger a secondary extraction of the corresponding modal information, verification is performed again based on the secondary extraction result, and a confidence label is assigned to each triple.
6. The text-knowledge-constrained multimodal information extraction method of claim 1, wherein fusing the chart extraction result with the text knowledge graph and updating to obtain the document-level multimodal knowledge graph specifically comprises: providing each triple in the text knowledge graph with a confidence label indicating the source and consistency state of the triple's information; performing alignment mapping between the entities and relations of the chart extraction result and those of the text knowledge graph; when an entity or relation in the chart extraction result has a corresponding knowledge node or edge in the text knowledge graph, writing the corresponding entity or relation into the document-level multimodal knowledge graph using a graded strategy according to the confidence label of the corresponding triple, and updating the weight of the corresponding knowledge node based on that confidence label; and when the chart extraction result contains a new entity or new relation not covered by the text knowledge graph, temporarily storing the new entity or new relation as a candidate knowledge node or candidate relation and attaching a confidence label.
7. The text-knowledge-constrained multimodal information extraction method of claim 6, wherein writing the corresponding entity or relation into the document-level multimodal knowledge graph using a graded strategy according to the confidence label of the corresponding triple specifically comprises: the confidence labels of triples comprise text information, chart information, text-chart consistency verification and text-chart inconsistency verification; for each entity or relation to be written, differentiated storage rules are applied according to the confidence label of the corresponding triple: if the triple corresponding to the entity or relation is derived from text information or from text-chart consistency verification, the entity or relation is written directly into the document-level knowledge graph; if the triple is derived from chart information, a confidence label and an extraction time identifier are added to the entity or relation when it is written directly into the document-level knowledge graph; and if the triple is derived from text-chart inconsistency verification, the entity or relation is retained and stored as a conflict node or conflict relation, with the conflict state and modality source clearly marked.
8. The text-knowledge-constrained multimodal information extraction method of claim 1, wherein the chart types include at least one of quantitative statistics, chemical structure and reaction, and characterization and analysis, and wherein the agent tools include at least one of optical character recognition, object detection, numerical analysis, chemical structure analysis, reaction analysis and spectral analysis.
9. A computer system comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
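
The graded write strategy of claims 6 and 7 can be sketched as follows. This is an illustrative reading of the claim language, not the applicant's code: the `Confidence` enum, the `DocumentKG` container, the `write_triple` function and the dictionary field names are hypothetical, and the weight update and candidate-node handling of claim 6 are omitted.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Confidence(Enum):
    """The four confidence labels listed in claim 7 (names are assumptions)."""
    TEXT = "text"
    CHART = "chart"
    CONSISTENT = "text_chart_consistent"
    INCONSISTENT = "text_chart_inconsistent"


@dataclass
class DocumentKG:
    """Hypothetical document-level graph: accepted facts plus conflict records."""
    facts: list = field(default_factory=list)
    conflicts: list = field(default_factory=list)


def write_triple(kg, triple, label, source_modality, now=None):
    """Store one triple according to its confidence label."""
    if label in (Confidence.TEXT, Confidence.CONSISTENT):
        # Text-derived or cross-verified: written directly.
        kg.facts.append({"triple": triple})
    elif label is Confidence.CHART:
        # Chart-only: written, but tagged with label and extraction time.
        stamp = (now or datetime.now(timezone.utc)).isoformat()
        kg.facts.append({"triple": triple,
                         "confidence": label.value,
                         "extracted_at": stamp})
    else:
        # Inconsistent: retained as a conflict record with its modality source.
        kg.conflicts.append({"triple": triple,
                             "status": "conflict",
                             "source": source_modality})
```

Keeping conflicting triples in a separate collection mirrors the claim's requirement that they be retained and marked rather than discarded; whether the real system uses separate storage or flagged nodes in one graph is not stated in the source.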

Description

Text-knowledge-constrained multimodal information extraction method and system

Technical Field

The invention relates to the technical field of multimodal information processing, and in particular to a text-knowledge-constrained multimodal information extraction method and system, suitable for the automatic construction of structured knowledge from complex chemical research literature.

Background

As chemical research deepens, chemical literature contains increasingly diverse forms of information. Besides traditional continuous text, it includes many reaction schemes, structural diagrams, statistical charts, spectra, microscopic images and the like. Charts and text together carry key information such as experimental design, reaction conditions, performance comparisons and scientific conclusions, and form an important part of the knowledge system of the chemical literature. However, chart information is often highly specialized, rich in implicit semantics and expressed in non-uniform ways, which poses a significant challenge for automated information extraction.

When processing chemical literature, existing information extraction techniques generally treat text and charts as parallel information sources and perform feature encoding and semantic modeling on each separately, or rely directly on a multimodal large model for end-to-end generation from combined image-text input. These approaches are reasonably effective for general documents, but struggle to produce stable and reliable results in the chemical field.
On the one hand, chemical charts contain many symbolic expressions, professional abbreviations and implicit experimental assumptions, so their scientific meaning is difficult to understand accurately from the image alone. On the other hand, the key information corresponding to a chart is often scattered through analytical text in different sections of the document and, owing to layout constraints, a chart and its core interpretive text are often not spatially adjacent, so traditional image-text alignment methods struggle to obtain the complete contextual semantics. In addition, chart types in chemical literature are highly diverse, and different chart types differ fundamentally in information structure and semantic goals. For example, statistical charts emphasize variable relationships and trend analysis, reaction schemes emphasize substance conversion pathways, and spectra are used for structure verification and characteristic peak analysis. A single unified extraction strategy can hardly accommodate the expressive characteristics and information needs of different charts, and easily leads to information loss or semantic misjudgment.

Therefore, how to achieve fine-grained understanding and structured information extraction of multiple types of chemical charts while making full use of full-text knowledge, and to construct a unified, complete and logically consistent document-level knowledge representation, has become a key technical problem in intelligent chemical literature processing.

For multimodal document information extraction, the prior art mainly focuses on the following schemes.
The first type of scheme is chart information extraction based on image analysis. It typically uses optical character recognition, object detection or image segmentation to identify text, numerical values and graphic elements directly from a chart and convert them into structured data. This works to some extent for simple statistical charts, but because it lacks an understanding of the chart's semantic background, it struggles to interpret complex experimental designs or chemical meaning correctly and cannot supply information that the chart does not present explicitly.

The second type of scheme relies on the end-to-end capability of a multimodal large model, taking the chart and the text as joint input and generating extraction results directly. This generalizes well on general multimodal tasks, but in highly specialized fields such as chemistry the model struggles to understand complex symbols, reaction mechanisms and technical terms accurately; the generated results are highly uncertain and lack controllability and interpretability.

The third type of scheme adopts an image-text alignment strategy: it encodes the chart and its corresponding explanatory text separately, maps them to a unified semantic space and completes information generati