CN-122019790-A - Intelligent reading and structured data generation system for chemical literature

Abstract

The invention discloses an intelligent reading and structured data generation system for chemical literature, and relates to the technical field of data processing. The system comprises a document resource acquisition module, a document preprocessing module, a chemical experiment information extraction module, a structured data generation module, and a chemical and material database construction module. The five modules cover document resource acquisition, preprocessing, experimental information extraction, structured data generation and database construction, and together form a fully automated processing pipeline that converts original documents end to end into a high-quality database. This greatly reduces manual intervention, improves the efficiency of chemical literature processing, and provides standardized, high-credibility data support for intelligent research and development in the chemical and material fields. It addresses the problems of difficult screening of invalid documents, non-uniform formats, low accuracy of experimental information extraction, insufficient data structuring, and the lack of domain-adapted databases in chemical literature processing.

Inventors

CHEN XIANG
LI CHENG
ZHOU JUNLEI

Assignees

北京机数小来智能科技有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-31

Claims (10)

1. An intelligent reading and structured data generation system for chemical literature, characterized by comprising: a document resource acquisition module, configured to screen the acquired chemical document resources, remove invalid documents that contain no chemical experiment information, and form a multi-type original chemical document collection; a document preprocessing module, configured to perform format standardization and semantic annotation on the multi-type original chemical document collection to generate parsable chemical document text; a chemical experiment information extraction module, configured to extract chemical experiment information and the association relations among the pieces of information based on an information extraction model, and to generate an unstructured chemical experiment information and association relation data set; a structured data generation module, configured to map the unstructured information in the unstructured chemical experiment information to the corresponding structured fields one by one through a field matching algorithm, generating a standardized chemical structured data set; and a chemical and material database construction module, configured to perform data validity checking and redundant data elimination on the standardized chemical structured data set, and to perform domain-specific screening to generate a chemical and material database.
2. The intelligent reading and structured data generation system for chemical literature according to claim 1, wherein the process of screening the acquired chemical document resources, removing invalid documents that contain no chemical experiment information, and forming the multi-type original chemical document collection is as follows: receiving professional corpora and document structure analysis data in the chemical and material fields, extracting feature words, and integrating them into a chemical experiment information feature word library, wherein the feature words comprise chemical experiment entity words, chemical experiment action words and chemical experiment parameter words; setting a document screening rule base based on the structural features of the documents in combination with the chemical experiment information feature word library; collecting chemical document resources in the chemical and material fields in a targeted manner, based on web crawler technology and document database interface calls; applying the document screening rule base and the chemical experiment information feature word library to the chemical document resources for matching and screening, to obtain a to-be-fine-screened document set and a preliminary invalid document cache pool; converting the full text of each document in the to-be-fine-screened document set into text vectors, inputting them into a semantic analysis model optimized from a pre-trained BERT model, classifying each document based on the chemical experiment information feature word library, and outputting a result of either "contains valid experiment information" or "contains no valid experiment information"; if the result is that valid experiment information is contained, including the document in a to-be-verified document set, and outputting the to-be-verified document set and a precise invalid document cache pool; obtaining a manual review result for each document in the to-be-verified document set based on the chemical experiment information feature word library; if the review finds that the model misjudged, marking the document as a review-invalid document and merging it into a review invalid document cache pool; if the review confirms that the document contains valid experiment information, merging it with the other documents containing valid experiment information; outputting a valid original document set and the review invalid document cache pool; and deleting all invalid documents in the preliminary invalid document cache pool, the precise invalid document cache pool and the review invalid document cache pool, then classifying and sorting the valid original document set by document type to form the multi-type original chemical document collection.
3. The intelligent reading and structured data generation system for chemical literature according to claim 2, wherein the process of applying the document screening rule base and the chemical experiment information feature word library to the chemical document resources for matching and screening, to obtain the to-be-fine-screened document set and the preliminary invalid document cache pool, is as follows: if the title or abstract of a chemical document resource contains non-experimental keywords, or its section titles carry no positive marker, directly marking the document as a preliminary invalid document; for documents not so marked, calling the chemical experiment entity words and chemical experiment action words in the chemical experiment information feature word library, and counting the total frequency with which they occur in the title and abstract of each document; if the total frequency is lower than a set frequency, marking the document as a preliminary invalid document and temporarily storing it in the preliminary invalid document cache pool; and including the documents that remain unmarked in the to-be-fine-screened document set, and outputting the to-be-fine-screened document set.
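The rule-based pre-screening described in claim 3 can be sketched roughly as follows. This is a minimal illustration only: the word lists, the positive section markers, the threshold value and the `Document` type are all assumptions made for the example and are not taken from the patent.

```python
import re
from dataclasses import dataclass

# Illustrative assumptions: none of these lists or values come from the patent.
NON_EXPERIMENTAL_KEYWORDS = {"review", "editorial", "erratum"}
POSITIVE_SECTION_MARKERS = {"experimental", "methods", "synthesis"}
ENTITY_WORDS = {"catalyst", "solvent", "ethanol", "precursor"}
ACTION_WORDS = {"stirred", "heated", "calcined", "filtered"}
MIN_TOTAL_FREQUENCY = 3  # hypothetical "set frequency"


@dataclass
class Document:
    title: str
    abstract: str
    section_titles: list


def _tokens(text):
    """Lowercase alphabetic tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())


def prescreen(docs):
    """Split documents into (to_fine_screen, preliminary_invalid_pool)."""
    to_fine_screen, invalid_pool = [], []
    for doc in docs:
        words = _tokens(doc.title + " " + doc.abstract)
        has_non_experimental = any(w in NON_EXPERIMENTAL_KEYWORDS for w in words)
        has_positive_marker = any(
            marker in title.lower()
            for title in doc.section_titles
            for marker in POSITIVE_SECTION_MARKERS
        )
        # Rule 1: non-experimental keyword, or no positive section marker.
        if has_non_experimental or not has_positive_marker:
            invalid_pool.append(doc)
            continue
        # Rule 2: total frequency of entity and action words in title + abstract.
        frequency = sum(w in ENTITY_WORDS or w in ACTION_WORDS for w in words)
        if frequency < MIN_TOTAL_FREQUENCY:
            invalid_pool.append(doc)
        else:
            to_fine_screen.append(doc)
    return to_fine_screen, invalid_pool
```

Documents that pass both rules would then go on to the BERT-based semantic classification of claim 2, which this sketch does not cover.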
4. The intelligent reading and structured data generation system for chemical literature according to claim 1, wherein the process of performing format standardization and semantic annotation on the multi-type original chemical document collection to generate the parsable chemical document text is as follows: performing line-by-line scanning and redundant mark recognition on each original document in the multi-type original chemical document collection using preset redundant format mark recognition rules, and removing the matched redundant format content; processing the noise-reduced documents of the original chemical document collection with a PDF text extraction engine based on Apache PDFBox and a DOC parsing component of Apache POI, wherein, for PDF documents, the PDF text extraction engine parses the document object model, extracts the text layer content and removes format rendering residue, and, for DOC documents, the DOC parsing component reads the paragraph structure of the document so that macro commands and format style codes do not interfere with text extraction; uniformly converting the text extracted from both formats into UTF-8 encoded plain text, and outputting a plain text document set in a standardized format; locating chemical proper nouns in the plain text of the standardized plain text document set using a semantic annotation tool and adding semantic identifiers in the form of tags, supplementing the annotation, by similarity threshold judgment, for words that do not fully match the feature word library but are semantically related to it, and outputting a preliminary semantically annotated chemical document text set; and performing annotation consistency verification on the preliminary semantically annotated chemical document text set, correcting the annotation errors found by the verification, and, once the correction is completed, integrating the result to form the parsable chemical document text.
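As a rough illustration of the supplementary annotation step in claim 4, the sketch below tags exact feature-word matches and then falls back to a similarity threshold for near matches. It is only a stand-in: the claim speaks of semantic association, whereas this example uses character-level similarity from Python's standard `difflib` module. The word library, tag names and threshold are assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical feature word library: word -> semantic tag.
FEATURE_WORDS = {
    "catalyst": "ENTITY",
    "calcination": "ACTION",
    "temperature": "PARAMETER",
}
SIMILARITY_THRESHOLD = 0.8  # assumed value, not specified in the patent


def annotate(tokens):
    """Wrap recognized tokens in tag-style semantic identifiers."""
    annotated = []
    for token in tokens:
        key = token.lower()
        label = FEATURE_WORDS.get(key)  # exact match first
        if label is None:
            # Supplementary annotation: closest library word above the threshold.
            best_word, best_score = None, 0.0
            for word in FEATURE_WORDS:
                score = SequenceMatcher(None, key, word).ratio()
                if score > best_score:
                    best_word, best_score = word, score
            if best_score >= SIMILARITY_THRESHOLD:
                label = FEATURE_WORDS[best_word]
        annotated.append(f"<{label}>{token}</{label}>" if label else token)
    return annotated
```

A production system would more plausibly compare embedding vectors than character sequences; the control flow (exact match, then thresholded fallback) is the part this sketch is meant to show.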
5. The intelligent reading and structured data generation system for chemical literature according to claim 1, wherein the information extraction model comprises a lower-layer CRF algorithm and an upper-layer fine-tuned BERT model, wherein: the lower-layer CRF algorithm is responsible for learning the contextual sequence dependencies of chemical domain vocabulary in the text and outputting a preliminary sequence positioning result for the chemical information in the text; and the upper-layer fine-tuned BERT model takes the lower-layer sequence positioning result and the domain vocabulary in the chemical experiment information feature word library as joint input features, strengthens the recognition weight of chemical proper nouns through an attention mechanism, and corrects ambiguities in the preliminary sequence positioning result.
6. The intelligent reading and structured data generation system for chemical literature according to claim 5, wherein the process of extracting the chemical experiment information and the association relations among the pieces of information based on the information extraction model, and generating the unstructured chemical experiment information and association relation data set, is as follows: calling the lower-layer CRF algorithm to perform sequence positioning on the semantically labeled words in the text and output entity candidate sequences, then calling the upper-layer fine-tuned BERT model to determine the type of each entity candidate sequence in combination with the chemical experiment entity words in the chemical experiment information feature word library, generating a chemical experiment entity subset; calling the upper-layer fine-tuned BERT model to identify descriptive phrases associated with entities in the text, and, in combination with the characteristic and property keywords and the value association rules in the chemical experiment information feature word library, splitting out and extracting characteristic and property names together with their corresponding values and units, generating a functional characteristic subset and a physicochemical property subset respectively; calling the lower-layer CRF algorithm to locate action sequences in the text, then calling the upper-layer fine-tuned BERT model to confirm the action types and the entities involved in combination with the action words in the chemical experiment information feature word library, generating an experiment action subset; meanwhile, in combination with the domain association rules in the chemical experiment information feature word library, screening high-frequency valid association combinations to form an association list, marking the source position in the text of each relation, and generating a chemical experiment information association relation subset; and combining the chemical experiment entity subset, the functional characteristic subset, the physicochemical property subset, the experiment action subset and the chemical experiment information association relation subset into complete records according to the logical chain of entity, characteristic and property, action, and association relation, generating the unstructured chemical experiment information and association relation data set.
7. The intelligent reading and structured data generation system for chemical literature according to claim 1, wherein the process of mapping the unstructured information in the unstructured chemical experiment information to the corresponding structured fields one by one through a field matching algorithm to generate the standardized chemical structured data set is as follows: defining structured fields and subfields and outputting a chemical data structuring specification; constructing a structured field dictionary that defines the mapping relations between unstructured data keywords and structured fields; calling the mapping relations in the structured field dictionary to locate the structured field and subfield corresponding to each piece of unstructured information, and, for association relation information, filling the keywords into the structured fields according to the mapping relations in the dictionary, so that a preliminarily matched structured data prototype is generated once item-by-item matching is complete; standardizing the format of the preliminarily matched structured data according to the format standards of each structured field in the chemical data structuring specification, and outputting standardized structured data in a unified format; performing an integrity check and a logic check on the standardized structured data in a unified format, and outputting verified and corrected chemical structured data; and integrating the verified and corrected chemical structured data and outputting the standardized chemical structured data set.
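The dictionary-driven field matching and the integrity check of claim 7 might look like the following sketch. The field names, the dictionary contents and the list of required fields are hypothetical; the patent does not specify them.

```python
# Hypothetical structured field dictionary: keyword -> (field, subfield).
FIELD_DICTIONARY = {
    "temperature": ("reaction_conditions", "temperature"),
    "time": ("reaction_conditions", "duration"),
    "yield": ("physical_indexes", "yield"),
    "catalyst": ("substances", "catalyst"),
}
# Assumed integrity rule: every record needs these top-level fields.
REQUIRED_FIELDS = ("substances", "reaction_conditions")


def map_to_structured(extracted):
    """Map (keyword, value) pairs onto structured fields.

    Returns the preliminarily matched record and the pairs that had
    no entry in the dictionary.
    """
    record, unmatched = {}, []
    for keyword, value in extracted:
        target = FIELD_DICTIONARY.get(keyword.lower())
        if target is None:
            unmatched.append((keyword, value))
            continue
        field, subfield = target
        record.setdefault(field, {})[subfield] = value
    return record, unmatched


def missing_fields(record):
    """Integrity check: list required fields absent from the record."""
    return [field for field in REQUIRED_FIELDS if field not in record]
```

Keeping the unmatched pairs rather than discarding them leaves room for a later correction step, which is one way to read the "verified and corrected" output the claim describes.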
8. The intelligent reading and structured data generation system for chemical literature according to claim 7, wherein the process of standardizing the format of the preliminarily matched structured data according to the format standards of each structured field in the chemical data structuring specification comprises dosage parameter field standardization, reaction condition field standardization, physical index field standardization and substance name field standardization.
9. The intelligent reading and structured data generation system for chemical literature according to claim 1, wherein the process of performing data validity checking and redundant data elimination on the standardized chemical structured data set and performing domain-specific screening to generate the chemical and material database is as follows: performing, on the standardized chemical structured data set, a contradiction check on dosage parameters and reaction logic and a reasonable-range check on physical indexes, and outputting a validity-checked chemical structured data set; eliminating redundant data from the validity-checked chemical structured data set and unifying units to obtain a unit-consistent chemical structured data set; performing domain screening on the unit-consistent chemical structured data set to obtain a per-domain adapted chemical structured data set; and establishing the chemical and material database based on the per-domain adapted chemical structured data set.
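The validity check, redundancy elimination and unit unification of claim 9 can be illustrated with the sketch below. The record layout, conversion factors and plausible temperature range are assumptions. Note that the sketch converts units before deduplicating, so that the same measurement reported in different units is recognized as a duplicate; this ordering is a choice of the example and differs from the order in which the claim lists the steps.

```python
# Illustrative assumptions: conversion table and plausible range.
MASS_TO_GRAMS = {"mg": 1e-3, "g": 1.0, "kg": 1e3}
PLAUSIBLE_TEMPERATURE_C = (-273.15, 3000.0)


def to_celsius(value, unit):
    """Convert a temperature in K or C to degrees Celsius."""
    if unit == "K":
        return float(value) - 273.15
    if unit == "C":
        return float(value)
    raise ValueError(f"unknown temperature unit: {unit}")


def unify_units(record):
    """Return a copy of the record with mass in grams and temperature in Celsius."""
    mass_value, mass_unit = record["mass"]
    temp_value, temp_unit = record["temperature"]
    return {
        "substance": record["substance"],
        "mass_g": round(mass_value * MASS_TO_GRAMS[mass_unit], 6),
        "temperature_c": round(to_celsius(temp_value, temp_unit), 2),
    }


def is_valid(unified):
    """Reasonable-range check on a unit-consistent record."""
    low, high = PLAUSIBLE_TEMPERATURE_C
    return unified["mass_g"] > 0 and low <= unified["temperature_c"] <= high


def clean(records):
    """Unify units, drop out-of-range records, and remove exact duplicates."""
    seen, cleaned = set(), []
    for record in records:
        unified = unify_units(record)
        key = tuple(sorted(unified.items()))
        if is_valid(unified) and key not in seen:
            seen.add(key)
            cleaned.append(unified)
    return cleaned
```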
10. The intelligent reading and structured data generation system for chemical literature according to claim 9, wherein the process of establishing the chemical and material database based on the per-domain adapted chemical structured data set is as follows: establishing per-domain data tables and a shared association table based on the per-domain adapted chemical structured data set and on the entity, characteristic, association relation logic of the structured fields of each domain subset; taking the table structures of the per-domain data tables and the shared association table as the basis for field mapping, importing each data item of each domain subset in the per-domain adapted chemical structured data set into the corresponding per-domain data table according to a one-to-one correspondence between subset fields and data table fields, and outputting an initially loaded set of per-domain data tables; based on the query requirements of each domain, establishing dedicated B+ tree indexes on the corresponding fields of the per-domain data tables, while establishing a primary key index on the material identification field of every per-domain data table and an ordinary index on the association type code field to ensure efficient cross-table association queries, and outputting a set of per-domain data tables with dedicated indexes; verifying the association integrity and field data validity of each per-domain data table and the shared association table, tracing any abnormal data found by the verification back to the per-domain adapted chemical structured data set and correcting it; and integrating the verified set of per-domain data tables with dedicated indexes according to domain classification, and outputting the chemical and material database.
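The table and index layout of claim 10 can be sketched with Python's built-in `sqlite3` module. SQLite is used here only as a convenient stand-in: it builds its own B-tree structures for indexes, so the example shows the layout (one per-domain table, a shared association table, a primary key on the material identifier, an ordinary index on the association type code) and not the specific B+ tree implementation named in the claim. All table and column names are hypothetical.

```python
import sqlite3


def build_database(conn):
    """Create one per-domain table, the shared association table, and indexes."""
    conn.executescript(
        """
        CREATE TABLE catalysis_data (
            material_id TEXT PRIMARY KEY,      -- primary key index
            reaction_temperature_c REAL,
            yield_percent REAL
        );
        CREATE TABLE shared_association (
            material_id TEXT NOT NULL,
            association_type_code TEXT NOT NULL,
            related_material_id TEXT NOT NULL
        );
        -- Dedicated index for a domain-specific query field.
        CREATE INDEX idx_catalysis_temperature
            ON catalysis_data(reaction_temperature_c);
        -- Ordinary index supporting cross-table association queries.
        CREATE INDEX idx_association_type
            ON shared_association(association_type_code);
        """
    )


def orphaned_associations(conn):
    """Association integrity check: rows whose material_id has no data row."""
    rows = conn.execute(
        """
        SELECT a.material_id
        FROM shared_association AS a
        LEFT JOIN catalysis_data AS d ON d.material_id = a.material_id
        WHERE d.material_id IS NULL
        """
    ).fetchall()
    return [row[0] for row in rows]
```

Any identifiers returned by `orphaned_associations` would be the "abnormal data" that the claim traces back to the per-domain data set for correction.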

Description

Intelligent reading and structured data generation system for chemical literature

Technical Field

The invention relates to the technical field of data processing, and in particular to an intelligent reading and structured data generation system for chemical literature.

Background

Intelligent processing of chemical literature and generation of structured data from it are core links supporting data-driven research and development in the chemistry and materials fields. In real research settings, chemical literature is characterized by heterogeneous sources, specialized expression and multimodal information, and is further affected by factors such as ambiguous domain terminology and irregular record keeping. The prior art shows the following defects:

Firstly, the prior art relies on exact keyword matching for document screening. It cannot handle synonyms, polysemous terms or contextual ambiguity, and short documents are prone to misjudged semantic similarity caused by the curse of dimensionality. As a result, invalid documents are not filtered out or relevant documents are missed, so invalid documents are difficult to screen. Typesetting conventions, compound naming and symbolic notation differ markedly between documents from different sources, and general-purpose preprocessing tools cannot parse chemical charts, tables and proprietary formats, which makes unifying document formats difficult.

Secondly, NLP models lack domain pre-training, struggle with the contextual ambiguity of terms such as "activity", and have insufficient ability to parse complex chart information and reaction-entity associations. Experimental information is therefore extracted with low accuracy, and problems such as missing reactions and misidentified entities occur.

In addition, the prior art lacks unified format definitions and a default data completion mechanism for information such as dosages and reaction conditions, so data are insufficiently structured and a standard data set is difficult to form.

Therefore, a chemical literature intelligent processing technology that integrates semantic screening, proprietary format processing, accurate domain-specific extraction, standardized structuring and per-domain adaptation is needed to overcome these technical bottlenecks and improve the accuracy and practicality of chemical data processing.

Disclosure of Invention

In view of the defects of the prior art, the invention provides an intelligent reading and structured data generation system for chemical literature, which solves the problems of difficult screening of invalid documents, non-uniform formats, low accuracy of experimental information extraction, insufficient data structuring and the lack of domain-adapted databases in chemical literature processing.

To achieve this purpose, the invention adopts the following technical scheme. An intelligent reading and structured data generation system for chemical literature comprises:

A document resource acquisition module, configured to screen the acquired chemical document resources and remove invalid documents that contain no chemical experiment information, forming a multi-type original chemical document collection.

A document preprocessing module, configured to perform format standardization and semantic annotation on the multi-type original chemical document collection to generate parsable chemical document text.

A chemical experiment information extraction module, configured to extract chemical experiment information and the association relations among the pieces of information based on an information extraction model, and to generate an unstructured chemical experiment information and association relation data set.

A structured data generation module, configured to map the unstructured information in the unstructured chemical experiment information to the corresponding structured fields one by one through a field matching algorithm, generating a standardized chemical structured data set.

A chemical and material database construction module, configured to perform data validity checking and redundant data elimination on the standardized chemical structured data set, and to perform domain-specific screening to generate a chemical and material database.

The invention has the following beneficial effects:

The invention integrates five modules, covering document resource acquisition, preprocessing, experimental information extraction, structured data generation and database construction, into a fully automated processing system. This realizes end-to-end conversion from original documents to a high-quality database, greatly reduces manual intervention, improves the efficiency of chemical literature processing, and provides standardized, high-reliability data support for intelligent research and development in the chemical and material fie