CN-121998101-A - Knowledge graph enhancement-based medical field large language model training method and system


Abstract

The invention relates to the technical field of machine learning and discloses a knowledge-graph-enhanced training method and system for a large language model in the medical field. The method comprises: obtaining a medical corpus and a medical knowledge graph; generating an entity sequence through fine-grained semantic parsing; extracting a multi-hop subgraph through multidimensional association path reasoning; dynamically injecting graph embeddings into the text according to dynamic weights to generate enhanced training samples; parsing the enhanced training samples and encoding them into a standardized instruction set; and obtaining a target model through iterative optimization. Through adaptive knowledge fusion and standardized instruction encoding, the invention improves the training efficiency and professional reasoning capability of the medical large language model.

Inventors

LIN QIFENG
SONG HUI
XIE MINGHUI
LIN CHENG
WANG XIANJUN

Assignees

福州中康智慧科技股份有限公司

Dates

Publication Date: 20260508
Application Date: 20260408

Claims (10)

1. A knowledge-graph-enhancement-based medical field large language model training method, characterized by comprising the following steps: A1. acquiring a medical corpus of the medical field and a medical field knowledge graph; A2. performing fine-grained semantic parsing on the original medical text records in the medical corpus to generate an unstructured entity sequence of the medical field; A3. mapping the unstructured entity sequence into the medical field knowledge graph based on the pathological-semantic association edges of the medical field knowledge graph to perform multidimensional association path reasoning, and extracting a multi-hop subgraph related to the original medical text record as a structured semantic graph subset of the medical field; A4. dynamically injecting the structured semantic graph subset into the corresponding entity positions of the original medical text record by graph structure embedding to generate an enhanced training sample of the medical field; A5. performing structural parsing on the enhanced training sample, and encoding the parsed sample into a standardized training instruction set of the medical field; A6. performing iterative parameter optimization on an initial language processing model of the medical field based on the standardized training instruction set to obtain a target medical language processing model of the medical field.
2. The knowledge-graph-enhancement-based medical field large language model training method of claim 1, wherein acquiring the medical corpus of the medical field and the medical field knowledge graph comprises: collecting original medical data from an authoritative medical data source, and performing data cleaning on the original medical data to obtain an initial corpus of the medical field; performing chapter structure parsing and clinical entity recognition on the documents in the initial corpus, and semantically annotating the parsed data according to the recognized clinical entities to construct the medical corpus of the medical field; extracting concept nodes from a preset medical ontology library and clinical guidelines, and establishing hierarchical relations, attribute relations and logical reasoning rules among the concept nodes according to the semantic associations among them, so as to construct an initial knowledge graph of the medical field; and performing path pruning on the initial knowledge graph to obtain the medical field knowledge graph.
3. The knowledge-graph-enhancement-based medical field large language model training method of claim 1, wherein performing fine-grained semantic parsing on the original medical text records in the medical corpus to generate the unstructured entity sequence of the medical field comprises: performing sentence boundary recognition and clause segmentation on the original medical text records in the medical corpus to obtain text fragments of the original medical text records; performing part-of-speech tagging on the text fragments, and identifying core entities and attribute entities in the text fragments; performing dependency syntactic analysis on the core entities and the attribute entities to obtain the semantic modification relations of the text fragments; and, based on the core entities, the attribute entities and the semantic modification relations, concatenating them in the original order of the text fragments in the original medical text record to generate the unstructured entity sequence of the medical field.
4. The knowledge-graph-enhancement-based medical field large language model training method of claim 1, wherein mapping the unstructured entity sequence into the medical field knowledge graph for multidimensional association path reasoning based on the pathological-semantic association edges of the medical field knowledge graph comprises: performing semantic similarity matching between the entities in the unstructured entity sequence and the nodes in the medical field knowledge graph to generate a node mapping set of the medical field; taking the nodes in the node mapping set as starting points, and performing breadth-first traversal along the pathological-semantic association edges in the medical field knowledge graph to obtain a candidate path set of the medical field; based on the semantic relevance of the pathological-semantic association edges, scoring each path in the candidate path set along multiple dimensions, wherein the dimensions are pathological logic rationality, semantic continuity and diagnosis-and-treatment matching degree, weighted 0.4, 0.3 and 0.3 respectively; and screening the scored paths one by one against a preset standard threshold of 0.7, and retaining the paths whose scores are greater than or equal to the standard threshold to obtain a screened path set of the medical field.
5. The knowledge-graph-enhancement-based medical field large language model training method of claim 4, wherein extracting the multi-hop subgraph related to the original medical text record as the structured semantic graph subset of the medical field comprises: extracting the nodes covered by the paths in the screened path set, and constructing an initial subgraph of the medical field in combination with the pathological-semantic association edges; performing connectivity detection on the initial subgraph, and merging the disconnected branches of the initial subgraph according to the semantic associations among the entities in the unstructured entity sequence to obtain a connected subgraph of the medical field; and topologically sorting the connected subgraph based on the order in which the entities appear, and packaging the sorted subgraph into a graph data structure to serve as the structured semantic graph subset of the medical field.
6. The knowledge-graph-enhancement-based medical field large language model training method of claim 1, wherein dynamically injecting the structured semantic graph subset into the corresponding entity positions of the original medical text record by graph structure embedding comprises: performing graph neural network encoding on the nodes in the structured semantic graph subset to generate graph embedding vectors corresponding to the nodes; performing context encoding on the entities in the original medical text record to obtain context embedding vectors corresponding to the entities; calculating the dynamic injection weight of each entity from the graph embedding vector and the context embedding vector, wherein the dynamic injection weight is calculated as: $$w_i = \frac{\exp\left(\mathrm{sim}(g_i, c_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(g_j, c_j)/\tau\right)}$$ where $w_i$ is the dynamic injection weight of the $i$-th entity, $\exp$ is the exponential function, $\mathrm{sim}$ is the vector similarity function, $g_i$ and $g_j$ are the graph embedding vectors corresponding to the $i$-th and $j$-th entities, $c_i$ and $c_j$ are the context embedding vectors corresponding to the $i$-th and $j$-th entities, $\tau$ is a preset temperature coefficient, and $N$ is the total number of entities in the original medical text record; and performing weighted fusion of the graph embedding vector and the context embedding vector according to the dynamic injection weight to obtain the enhanced embedding vector of the medical field.
7. The knowledge-graph-enhancement-based medical field large language model training method of claim 6, wherein generating the enhanced training sample of the medical field comprises: replacing the original embedded representation of the corresponding entity in the original medical text record with the enhanced embedding vector to obtain an entity-enhanced text sequence of the medical field; performing position encoding on the entity-enhanced text sequence to obtain a position-enhanced text sequence of the medical field; and structurally reorganizing the position-enhanced text sequence according to the structure of the original medical text record to obtain the enhanced training sample of the medical field.
8. The knowledge-graph-enhancement-based medical field large language model training method of claim 1, wherein performing structural parsing on the enhanced training sample and encoding the parsed sample into the standardized training instruction set of the medical field comprises: performing text structure parsing on the enhanced training sample to obtain a structured component set of the enhanced training sample; performing entity boundary recognition on the components in the structured component set to obtain the key medical entities of the components; performing semantic role labeling on the key medical entities, determining the semantic relations and functional roles of the key medical entities, and integrating the semantic relations and functional roles into a fine-grained semantic annotation result of the enhanced training sample; performing feature matching of the structured component set and the fine-grained semantic annotation result against a preset instruction template library to obtain a target instruction template for the enhanced training sample; filling the component contents of the structured component set into the placeholder positions of the target instruction template to obtain the formatted instruction text of the enhanced training sample; and serializing and encoding the formatted instruction text to obtain the standardized training instruction set of the medical field.
9. The knowledge-graph-enhancement-based medical field large language model training method of claim 1, wherein performing iterative parameter optimization on the initial language processing model of the medical field based on the standardized training instruction set to obtain the target medical language processing model of the medical field comprises: dividing the standardized training instruction set into training batches to generate batched training data of the medical field; performing forward propagation on each instruction in a training batch using the initial language processing model of the medical field to obtain the predicted output sequence of the training batch; measuring the difference between the predicted output sequence and the corresponding label sequence in the training batch to obtain the loss value of the training batch; backpropagating the loss value through the trainable parameters of the initial language processing model to obtain the gradient values of the trainable parameters; and iteratively updating the trainable parameters based on the gradient values to obtain the target medical language processing model of the medical field.
10. A knowledge-graph-enhancement-based medical field large language model training system, characterized by being used to implement the knowledge-graph-enhancement-based medical field large language model training method of claim 1, and comprising a data acquisition module, a semantic analysis module, a graph construction module, an enhancement training module, an instruction encoding module and a parameter optimization module, wherein: the output of the data acquisition module is connected to the input of the semantic analysis module and outputs the medical corpus and the medical field knowledge graph; the output of the semantic analysis module is connected to the input of the graph construction module and outputs the unstructured entity sequence; the output of the graph construction module is connected to the input of the enhancement training module and outputs the structured semantic graph subset; the enhancement training module comprises: a graph neural network encoding unit for generating graph embedding vectors; a context encoding unit for generating context embedding vectors; a dynamic weight calculation unit for calculating dynamic injection weights according to vector similarity; and a vector fusion unit for generating enhanced embedding vectors through weighted fusion; the output of the instruction encoding module is connected to the input of the parameter optimization module and outputs the standardized training instruction set; and the parameter optimization module outputs the target medical language processing model.
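
The path reasoning of claim 4 can be illustrated with a minimal Python sketch. Only the dimension weights (0.4, 0.3, 0.3) and the threshold (0.7) come from the claim. The adjacency-list graph, the `max_hops` limit, and the names `bfs_paths`, `path_score` and `screen_paths` are hypothetical, and the per-dimension scores are assumed to be supplied by a caller-provided scorer, since the claims do not say how they are computed.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# Values stated in claim 4: weights for pathological logic rationality,
# semantic continuity, and diagnosis-and-treatment matching degree.
WEIGHTS = (0.4, 0.3, 0.3)
THRESHOLD = 0.7

Graph = Dict[str, List[str]]          # hypothetical adjacency list: node -> neighbours
Path = Tuple[str, ...]
Scores = Tuple[float, float, float]   # one score in [0, 1] per dimension


def bfs_paths(graph: Graph, start: str, max_hops: int) -> List[Path]:
    """Enumerate cycle-free paths of 1..max_hops edges from `start`, breadth first."""
    paths: List[Path] = []
    queue = deque([(start,)])
    while queue:
        path = queue.popleft()
        if len(path) > 1:
            paths.append(path)
        if len(path) - 1 == max_hops:
            continue  # hop limit reached, do not extend further
        for neighbour in graph.get(path[-1], []):
            if neighbour not in path:  # skip nodes already on this path
                queue.append(path + (neighbour,))
    return paths


def path_score(scores: Scores) -> float:
    """Weighted sum of the three dimension scores."""
    return sum(w * s for w, s in zip(WEIGHTS, scores))


def screen_paths(paths: List[Path], scorer: Callable[[Path], Scores]) -> List[Path]:
    """Keep paths whose weighted score is greater than or equal to the threshold."""
    return [p for p in paths if path_score(scorer(p)) >= THRESHOLD]
```

For example, a path scored (0.9, 0.8, 0.5) receives 0.4 × 0.9 + 0.3 × 0.8 + 0.3 × 0.5 = 0.75 and is retained, while a path scored (0.5, 0.5, 0.5) receives 0.5 and is discarded.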

Description

Knowledge graph enhancement-based medical field large language model training method and system

Technical Field

The invention relates to the technical field of machine learning, in particular to a knowledge-graph-enhanced training method and system for a large language model in the medical field.

Background

Existing methods mostly adopt static knowledge injection and do not dynamically adjust fusion weights according to the degree of match between the text context and the graph semantics, so irrelevant knowledge interferes or key knowledge is lost. Subgraph extraction in existing methods considers only semantic relevance and ignores the temporal logic of medical diagnosis and treatment, so the node order of the subgraph is inconsistent with the clinical reasoning chain. Existing methods also lack a standardized instruction template library for medical scenarios; training sample formats are not uniform, making it difficult to adapt to multi-task scenarios such as medical question answering, medical record generation and diagnostic reasoning.

Disclosure of Invention

The invention provides a knowledge-graph-enhanced training method and system for a large language model in the medical field, which solve the problems described in the Background. To achieve this object, the knowledge-graph-enhancement-based medical field large language model training method provided by the invention comprises the following steps:

A1. acquiring a medical corpus of the medical field and a medical field knowledge graph;
A2. performing fine-grained semantic parsing on the original medical text records in the medical corpus to generate an unstructured entity sequence of the medical field;
A3. mapping the unstructured entity sequence into the medical field knowledge graph based on the pathological-semantic association edges of the medical field knowledge graph to perform multidimensional association path reasoning, and extracting a multi-hop subgraph related to the original medical text record as a structured semantic graph subset of the medical field;
A4. dynamically injecting the structured semantic graph subset into the corresponding entity positions of the original medical text record by graph structure embedding to generate an enhanced training sample of the medical field;
A5. performing structural parsing on the enhanced training sample, and encoding the parsed sample into a standardized training instruction set of the medical field;
A6. performing iterative parameter optimization on an initial language processing model of the medical field based on the standardized training instruction set to obtain a target medical language processing model of the medical field.
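
The dynamic injection of step A4 is specified in claim 6 as an exponential of a vector similarity, scaled by a preset temperature coefficient and normalized over all entities of the record. The following Python sketch illustrates one possible reading. Several choices are assumptions not fixed by the source: cosine similarity as the similarity function, division of the similarity by the temperature, the default temperature 0.5, and the convex combination used for the weighted fusion. The function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity (assumed choice of the vector similarity function)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def injection_weights(graph_vecs: List[Vector], ctx_vecs: List[Vector],
                      tau: float = 0.5) -> List[float]:
    """Temperature-scaled softmax over per-entity similarities.

    graph_vecs[i] and ctx_vecs[i] are the graph and context embedding
    vectors of the i-th entity; tau is the preset temperature coefficient
    (0.5 is an arbitrary illustrative value).
    """
    logits = [cosine(g, c) / tau for g, c in zip(graph_vecs, ctx_vecs)]
    if not logits:
        return []
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(graph_vec: Vector, ctx_vec: Vector, weight: float) -> List[float]:
    """Assumed fusion rule: weight * graph + (1 - weight) * context."""
    return [weight * g + (1.0 - weight) * c for g, c in zip(graph_vec, ctx_vec)]
```

Under this reading, an entity whose graph embedding agrees closely with its textual context receives a larger share of the total weight, so more graph knowledge is injected at that position. A smaller temperature sharpens the contrast between entities.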
In a preferred embodiment, acquiring the medical corpus of the medical field and the medical field knowledge graph includes: collecting original medical data from an authoritative medical data source, and performing data cleaning on the original medical data to obtain an initial corpus of the medical field; performing chapter structure parsing and clinical entity recognition on the documents in the initial corpus, and semantically annotating the parsed data according to the recognized clinical entities to construct the medical corpus of the medical field; extracting concept nodes from a preset medical ontology library and clinical guidelines, and establishing hierarchical relations, attribute relations and logical reasoning rules among the concept nodes according to the semantic associations among them, so as to construct an initial knowledge graph of the medical field; and performing path pruning on the initial knowledge graph to obtain the medical field knowledge graph.

In a preferred embodiment, performing fine-grained semantic parsing on the original medical text records in the medical corpus to generate the unstructured entity sequence of the medical field includes: performing sentence boundary recognition and clause segmentation on the original medical text records in the medical corpus to obtain text fragments of the original medical text records; performing part-of-speech tagging on the text fragments, and identifying core entities and attribute entities in the text fragments; performing dependency syntactic analysis on the core entities and the attribute entities to obtain the semantic modification relations of the text fragments; and, based on the core entities, the attribute entities and the semantic modification relations, concatenating them in the original order of the text fragments in the original medical text record to generate the unstructured entity sequence of the medical field.
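
The first and last steps of the semantic parsing embodiment above can be sketched in Python as follows. This is a simplified illustration, not the patented implementation: the punctuation-based splitter stands in for sentence boundary recognition and clause segmentation, and the part-of-speech tagging and dependency analysis are assumed to have been run upstream by some tagger and parser, with their results stored in a hypothetical `Fragment` record. All names are illustrative.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def split_fragments(text: str) -> List[str]:
    """Naive clause segmentation on sentence and clause punctuation.

    Limitation: a full stop inside a number such as "38.5" is also treated
    as a boundary; a real segmenter would need to handle that.
    """
    parts = re.split(r"[.;!?。；！？]+\s*", text)
    return [p.strip() for p in parts if p.strip()]


@dataclass
class Fragment:
    position: int                  # index of the fragment in the original record
    core_entities: List[str]       # e.g. symptoms, findings (assumed tagger output)
    # (attribute entity, core entity it modifies), assumed dependency-parser output
    modifications: List[Tuple[str, str]] = field(default_factory=list)


def build_entity_sequence(fragments: List[Fragment]) -> List[Dict[str, object]]:
    """Concatenate entities in the original order of their fragments."""
    sequence: List[Dict[str, object]] = []
    for frag in sorted(fragments, key=lambda f: f.position):
        modifiers: Dict[str, List[str]] = {}
        for attribute, core in frag.modifications:
            modifiers.setdefault(core, []).append(attribute)
        for core in frag.core_entities:
            sequence.append({
                "entity": core,
                "attributes": modifiers.get(core, []),
                "fragment": frag.position,
            })
    return sequence
```

Keeping the fragment index on each item preserves the order of the original record, which the later topological sorting by entity appearance order relies on.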
In a preferred embodiment, the mapping the unstructured entity sequence to the medical domain knowledge graph based on the pathological and seman