CN-122019654-A - Biomedical literature scientific problem extraction system and extraction method


Abstract

The application discloses a biomedical literature scientific problem extraction system and extraction method, belonging to the field of biomedical literature data processing. The extraction system comprises four modules that cooperate in sequence: semantic perception and preprocessing, cascade logic screening, core element abstraction and extraction, and structured output. The method comprises five steps: data access and cleaning, cascade logic screening, core element extraction, logic review and reconstruction, and structured packaging and storage. Precise screening is achieved through research paradigm recognition, macroscopic scientific problems are extracted through de-molecular reconstruction, a Chinese-English term mapping is established at the same time, and structured data in JSON format standardized to ICD-11 is finally output. The method filters out a large share of irrelevant documents, improves the efficiency of acquiring scientific research information, enables cross-domain reuse of scientific problems, and provides high-quality data support for biomedical knowledge graph construction and vertical-domain AI model training. It addresses the technical problems of existing tools: low retrieval accuracy, distorted extraction of core scientific problems, and unstructured text that cannot be reused.

Inventors

SUN YIWEI
GUO GAOXIAN
LIU JINGJING
LI HAILING

Assignees

解螺旋(上海)科技有限公司
上海尤里卡信息科技有限公司

Dates

Publication Date: 20260512
Application Date: 20260415

Claims (10)

1. A biomedical literature scientific problem extraction system, characterized by comprising a semantic perception and preprocessing module, a cascade logic screening module, a core element abstraction and extraction module and a structured output module, which cooperate in sequence; the semantic perception and preprocessing module is used for accessing multi-source heterogeneous biomedical literature data, extracting the text fields of two dimensions, title and abstract, for structural parsing, removing irrelevant non-semantic information, and unifying the output format into standardized semantic input data, namely structured text containing title and abstract fields; the cascade logic screening module is used for performing two-level semantic screening on the standardized semantic input data based on semantic features, eliminating step by step documents outside the biomedical field and documents of non-basic research paradigms, and outputting the semantic data of basic research documents that meet the requirements; the core element abstraction and extraction module is used for performing multidimensional information extraction and de-molecular logic reconstruction on the basic research literature semantic data, extracting standardized disease entities and macroscopic core scientific problems, and establishing a mapping between Chinese scientific concepts and the original English academic terms; the structured output module is used for aggregating and associating the standardized disease entities, the macroscopic core scientific problems and the Chinese-English term mappings through a unique task_id and pmid, packaging them into flattened-format data, and storing that data in a target database.
2. The biomedical literature scientific problem extraction system according to claim 1, wherein the irrelevant non-semantic information removed by the semantic perception and preprocessing module includes copyright statements, copyright symbols, reference citation marks, author profiles, acknowledgments, formula symbols, special characters, garbled characters and advertising information; the two-level semantic screening of the cascade logic screening module comprises domain suitability judgment and research paradigm recognition, wherein the domain suitability judgment identifies the core entity types and context of the text through a semantic analysis model, performs semantic similarity calculation in combination with a biomedical domain ontology library, and accurately eliminates non-biomedical documents; the research paradigm recognition analyzes the logical structure of the text and the features of the research methods to distinguish the core distinguishing features of three research paradigms, namely basic research, clinical research and bioinformatics analysis, and rejects documents of the two non-basic research paradigms.
3. The biomedical literature scientific problem extraction system according to claim 1, wherein the core element abstraction and extraction module comprises an entity standardization subunit, a logic induction subunit and a de-molecular reconstruction subunit; the technical modules use RabbitMQ message queues for asynchronous decoupling, data flows in one direction with JSON data packets as the carrier, and hardware deployment uses a Kubernetes containerized architecture that supports GPU acceleration; the entity standardization subunit calls a large language model API and maps the identified disease name to the ICD-11 standard coding system through string matching and semantic similarity calculation; when multiple codes match, the sub-classification code at the deeper level is preferred; when mapping fails, the original Chinese name is retained and the ICD code is set to null to trigger manual review; a standalone standardized disease entity is finally generated; the logic induction subunit parses the sentence structure of the text based on a dependency syntax parse tree, extracts 'subject-predicate-object' triples to capture the core causal chain, and extracts a preliminary scientific conclusion containing at least a complete 'molecule-function-phenotype' chain; the de-molecular reconstruction subunit uses more than 50 negative constraint rules in 4 major categories, updated quarterly by medical professionals, to scan for and filter out microscopic details such as molecular entities in the preliminary scientific conclusion, retains the core causal chain comprising direct causal mappings, converts experimental observation indicators into descriptions of general biological processes or cell functional states using a semantic mapping technique based on instruction fine-tuning with a million-scale corpus, and reconstructs them into a macroscopic core scientific problem.
4. The biomedical literature scientific problem extraction system according to claim 1, wherein the core element abstraction and extraction module traces back to key English phrases in the source document through an attention mechanism to achieve precise anchoring and structured mapping between Chinese scientific concepts and the original English academic terms; the structured output module packages the standardized structured data into flattened JSON data that must not contain nested objects, stores the data in a MongoDB target database through a RESTful API, and provides GraphQL interfaces to serve multiple downstream scenarios.
5. A biomedical literature scientific problem extraction method, characterized by being applied to the biomedical literature scientific problem extraction system according to any one of claims 1 to 4, comprising the following steps: Step S1, data access and cleaning: receiving the biomedical literature metadata stream to be processed, denoising the document titles and abstracts, removing irrelevant non-semantic information, and outputting standardized semantic input data containing title and abstract fields; Step S2, cascade logic screening: sequentially performing domain suitability judgment and research paradigm recognition on the standardized semantic input data, eliminating step by step documents outside the biomedical field and documents of non-basic research paradigms, and outputting the semantic data of basic research documents that meet the requirements; Step S3, core element extraction: performing, in parallel, standardized extraction of disease entities and induction of original scientific conclusions on the basic research literature semantic data, and outputting standardized disease entities and preliminary scientific conclusions; Step S4, logic review and reconstruction: performing de-molecular processing on the preliminary scientific conclusion, stripping out microscopic detail, reconstructing the conclusion into a macroscopic core scientific problem, and at the same time establishing a precise mapping between Chinese scientific concepts and the original English academic terms; Step S5, structured packaging and storage: performing primary key association based on pmid, aggregating and packaging the standardized disease entities, the macroscopic core scientific problems and the Chinese-English term mappings into standardized structured data in a flattened format, and storing it in a target database.
6. The biomedical literature scientific problem extraction method according to claim 5, wherein in step S2, the domain suitability judgment identifies the core entity types and context of the text through a semantic analysis model and detects whether the proportion of biomedical entities in the text reaches a set threshold; if the proportion does not reach the set threshold, the document is judged to be a non-biomedical document and the processing flow is terminated; the research paradigm recognition analyzes the predicate verbs and research object features in the text; if clinical statistical features are recognized, the document is judged to be a clinical research document and the processing flow is terminated; if bioinformatics analysis features are recognized, the document is judged to be a bioinformatics analysis document and the processing flow is terminated; if basic experimental features are recognized, the document is judged to be a basic research document and proceeds to the next step; after the two-level screening, basic research literature semantic data is output with a retention ratio of about 25%-30%.
7. The biomedical literature scientific problem extraction method according to claim 5, wherein in step S3, parallel processing is performed by a directed acyclic graph task scheduling mechanism; the standardized extraction of disease entities calls a large language model to extract a disease name from the basic research literature semantic data and maps the disease name to the ICD-11 standard coding system through string matching and semantic similarity calculation, generating a standardized disease entity containing the disease name and the corresponding ICD-11 code; the induction of original scientific conclusions summarizes, based on a dependency syntax parse tree, a preliminary scientific conclusion containing a complete "molecule-function-phenotype" molecular mechanism.
8. The biomedical literature scientific problem extraction method according to claim 5, wherein in step S4, the de-molecular processing includes the following sub-steps: Step S41, scanning the microscopic detail in the preliminary scientific conclusion using more than 50 negative constraint rules in 4 major categories, updated quarterly by medical professionals, and marking it as redundant features with tags such as [REDUNDANT_MOLECULE]; Step S42, filtering out the redundant features and retaining the core causal chain comprising the direct causal mapping from 'biological state or structural change' to 'cell or tissue dysfunction'; Step S43, converting the filtered experimental observation indicators into a description of a general biological process or a cell functional state through a semantic mapping technique based on large language model instruction fine-tuning, and reconstructing it into a macroscopic core scientific problem written as a declarative sentence in pure Chinese with a length strictly limited to 80-100 characters; the mapping between Chinese scientific concepts and the original English academic terms is established by tracing back to key English phrases in the source document through an attention mechanism to achieve precise anchoring, followed by cosine similarity calculation.
9. The biomedical literature scientific problem extraction method according to claim 5, wherein in step S1, the biomedical literature metadata stream contains only document title and abstract information, and the irrelevant non-semantic information removed by the denoising process includes copyright statements, copyright symbols, reference citation marks, author profiles, acknowledgments, formula symbols, special characters, garbled characters and advertising information.
10. The biomedical literature scientific problem extraction method according to claim 5, wherein in step S5, the standardized structured data is associated using pmid as the primary key and packaged into a flattened JSON format with data type constraints; the packaged content comprises the standardized disease name and the corresponding ICD-11 code, a macroscopic core scientific problem of 80-100 Chinese characters, the original English academic terms accurately mapped to the Chinese scientific concepts, task_id and publicish_year; high-concurrency storage of the standardized structured data in a MongoDB database is completed through a RESTful interface, and the stored structured data is used for front-end retrieval, biomedical knowledge graph construction or AI model training in the biomedical vertical domain.
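
As an illustration only, steps S2 and S5 of the claimed method might be sketched as follows. This is not the applicants' implementation: the keyword cue lists stand in for the semantic analysis model described in the claims, the 0.1 threshold is an arbitrary placeholder for the "set threshold", and the function names and the field name `publish_year` are assumptions (the claim text spells that field `publicish_year`).

```python
import json
from typing import Optional

# Hypothetical cue lists; the claims describe a semantic analysis model,
# not keyword matching, and do not publish the actual features.
CLINICAL_CUES = ("randomized", "cohort", "patients were enrolled")
BIOINFO_CUES = ("tcga", "geo dataset", "differentially expressed genes")
BASIC_CUES = ("knockdown", "in vitro", "western blot")


def screen(abstract: str, entity_ratio: float, threshold: float = 0.1) -> str:
    """Two-level cascade: domain suitability first, then research paradigm."""
    # Level 1: stop early if too few biomedical entities were detected.
    if entity_ratio < threshold:
        return "rejected:non_biomedical"
    # Level 2: only the basic research paradigm continues downstream.
    text = abstract.lower()
    if any(cue in text for cue in CLINICAL_CUES):
        return "rejected:clinical"
    if any(cue in text for cue in BIOINFO_CUES):
        return "rejected:bioinformatics"
    if any(cue in text for cue in BASIC_CUES):
        return "accepted:basic"
    return "rejected:unrecognized"


def package_record(task_id: str, pmid: str, disease_name: str,
                   icd11_code: Optional[str], question_zh: str,
                   term_en: str, publish_year: int) -> str:
    """Build a flattened JSON record keyed by pmid (step S5)."""
    # The claims limit the Chinese question to 80-100 characters.
    if not 80 <= len(question_zh) <= 100:
        raise ValueError("question must be 80-100 characters long")
    record = {
        "task_id": task_id,
        "pmid": pmid,
        "disease_name": disease_name,
        "icd11_code": icd11_code,  # None (JSON null) marks a failed mapping
        "question_zh": question_zh,
        "term_en": term_en,
        "publish_year": publish_year,
    }
    # Flattened format: no nested objects or arrays are allowed.
    if any(isinstance(value, (dict, list)) for value in record.values()):
        raise ValueError("record must not contain nested values")
    return json.dumps(record, ensure_ascii=False)
```

A document would pass through `screen` first, and only an `accepted:basic` result would go on to extraction and `package_record`. A record with `icd11_code` set to `None` corresponds to the case in claim 3 where the mapping fails and manual review is triggered.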

Description

Biomedical literature scientific problem extraction system and extraction method

Technical Field

The invention belongs to the field of biomedical literature data processing, and particularly relates to a biomedical literature scientific problem extraction system and an extraction method.

Background

Amid the explosion of biomedical big data, researchers face severe information overload when designing projects, tracking the research frontier, or building domain knowledge bases. Existing literature retrieval and analysis tools have technical shortcomings that prevent them from meeting the demands of high-precision research. The core pain points fall into three areas. First, retrieval systems have insufficient semantic understanding: they match keywords with Boolean logic only and cannot semantically discriminate the "research paradigm" of a document, so research types are mixed together in the results and researchers must spend a great deal of time sorting them manually. Second, automatic summarization technology is poorly adapted to the biomedical field: it either over-generalizes and omits key scientific findings, or buries the core biological logic under excessive detail, so core scientific problems cannot be captured quickly to support the construction of scientific hypotheses. Third, scientific literature exists as unstructured natural language and lacks a standardized definition and an automated extraction method, so it is difficult to convert into reusable structured knowledge; this creates "data islands" and limits downstream knowledge construction, cross-disciplinary discovery and vertical-domain training.
At present, technologies that assist with reading scientific literature and analyzing information fall into three main types, each with clear technical limitations:

Traditional keyword-based retrieval relies on string matching between the user's query terms and document index entries. It inherently cannot match synonyms or resolve ambiguity, cannot understand the methodological features of a document, and cannot distinguish documents of different research paradigms on the same research topic.

Extractive automatic summarization computes sentence weights with a statistical model and extracts key sentences from the original text to form a summary. The resulting summary lacks semantic coherence and tends to be fragmented, and it is difficult to logically integrate key conclusions scattered across multiple paragraphs of a biomedical document.

Generative tools based on general-purpose large language models use the model's sequence generation capability to summarize the full text. They suffer from poor controllability, domain knowledge bias and hallucination risk; their output format is unstable and hard to reconcile with the standardization required for database storage; and they tend to pile up terminology, which introduces noise, and may even fabricate experimental data or causal relationships.

In summary, the prior art lacks an automated system that deeply understands biomedical research paradigms and performs "denoising" and "induction" according to specific academic standards, and it cannot accurately convert massive, disordered biomedical literature into high-value structured scientific problem data. A literature scientific problem extraction technology adapted to the biomedical field is needed to solve these problems.
Disclosure of Invention

To solve the technical problems that existing literature processing tools have low retrieval accuracy, distort core scientific problems during extraction, and cannot convert unstructured text into standardized structured knowledge, the application provides a biomedical literature scientific problem extraction system and extraction method that accurately screens biomedical literature and extracts core scientific problems in structured form, improves the efficiency of acquiring research information, and provides high-quality structured data for knowledge graph construction and AI model training in the biomedical field.

The biomedical literature scientific problem extraction system comprises a semantic perception and preprocessing module, a cascade logic screening module, a core element abstraction and extraction module and a structured output module, which cooperate in sequence.

The semantic perception and preprocessing module is used for accessing multi-source heterogeneous biomedical literature data (the data format is limited to XML or JSON), extracting the text fields of two dimensions, title and abstract, for structural parsing, and precisely removing irrelevant non-semantic information such as copyright statement, cop