CN-121981221-A - Intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model

Abstract

The application relates to the technical field of knowledge graph construction and discloses an intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model. The method comprises: performing unified format conversion and deep cleaning on multi-source railway disaster text to obtain a sentence-level corpus; inputting the corpus into a Qwen model for coarse triplet extraction; fine-tuning the model using the LoRA low-rank adaptation method; performing batch inference over the whole corpus; de-duplicating and normalizing the results according to a fusion algorithm of minimum edit distance and Jaccard similarity; and importing the results into a Neo4j graph database to construct the railway disaster knowledge graph. The application improves the degree of automation, the knowledge extraction accuracy, and the structural normalization of railway disaster knowledge graph construction.

Inventors

  • YANG HAIBO
  • NIU MU
  • ZHANG YANG
  • HAN XINJIAN
  • LI LA
  • SONG LIN
  • LIU XUEBIN
  • JIANG BIAO
  • DONG TIANJI
  • HAN HAOBO
  • ZHAN GUANGZHEN
  • ZHANG HUI
  • LI SHAOYUAN
  • MA XIZHONG
  • LI LIN
  • LIANG BING

Assignees

  • Zhengzhou University (郑州大学)
  • China Railway Zhengzhou Group Co., Ltd. (中国铁路郑州局集团有限公司)

Dates

Publication Date
2026-05-05
Application Date
2025-12-25

Claims (10)

  1. An intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model, characterized by comprising the following steps: step S1, performing unified format conversion and character standardization on multi-source railway disaster text to obtain a unified-format text data set; step S2, performing deep cleaning and sentence-level segmentation on the unified-format text data set according to a railway disaster domain stop-word lexicon to obtain a railway disaster sentence-level corpus; step S3, inputting the railway disaster sentence-level corpus into a Qwen model, performing coarse extraction in combination with a triplet extraction prompt template, obtaining a refined labeled triplet set after manual correction, and fine-tuning the Qwen model using the LoRA low-rank adaptation technique to obtain a fine-tuned Qwen model; and step S4, inputting the railway disaster sentence-level corpus into the fine-tuned Qwen model for batch inference to obtain a full triplet set, de-duplicating and normalizing the entities and relations in the full triplet set according to a fusion algorithm of minimum edit distance and Jaccard similarity to obtain a normalized triplet set, and importing the normalized triplet set into a Neo4j graph database to obtain the railway disaster knowledge graph.
  2. The intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model according to claim 1, wherein step S1 includes: collecting PDF reports, DOCX documents, and HTML web pages related to railway disasters, extracting text from the PDF files page by page using the fitz library, extracting text from the DOCX files in paragraph order using the python-docx library, and extracting text from the HTML web pages using the requests library and a parsing library, to obtain multi-source format text; removing runs of consecutive whitespace characters in the multi-source format text using a regular expression, and uniformly replacing line breaks with spaces, to obtain format-unified text; traversing each character in the format-unified text, determining from its Unicode code point whether the character is a full-width character, and converting full-width characters to half-width characters by subtracting 0xFEE0 from the code point, to obtain half-width standard text; and filtering out, using a regular expression, special characters in the half-width standard text other than Chinese characters, English letters, digits, and common Chinese and English punctuation marks, to obtain the unified-format text data set.
  3. The intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model according to claim 1, wherein step S2 includes: for academic journal papers and scientific report texts in the unified-format text data set, deleting in batches, using regular expressions, the reference lists at the end of the text, annotation notes, and appendix content, to obtain appendix-free text; for official notices and news media reports in the appendix-free text, identifying and deleting fixed-format meta-information fields using regular expressions, and for social media platform texts, identifying and removing structural markers that are not part of the narrative body using regular expressions, to obtain body content text; performing string matching on the body content text against the railway disaster domain stop-word lexicon and deleting modifier words that carry no substantive disaster facts, to obtain cleaned text; and splitting the cleaned text at sentence level using a regular expression that treats the period, semicolon, question mark, and exclamation mark as natural semantic boundaries, filtering out overly short sentences of fewer than 10 characters and overly long sentences of more than 500 characters, and deleting invalid sentences that consist only of digits, punctuation, or repetitions of a single word, to obtain the railway disaster sentence-level corpus.
  4. The intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model according to claim 1, wherein step S3 includes: constructing a triplet extraction prompt template that defines a subject type, a relation type, and an object type, concatenating the triplet extraction prompt template with each sentence in the railway disaster sentence-level corpus, and inputting the result into the Qwen model for inference, to obtain coarse-extraction triplet text; splitting the coarse-extraction triplet text on line breaks, performing a format check and a field-length check on each triplet, and filtering out triplets that do not meet the conditions, to obtain a coarse-extraction triplet set; sampling triplets from the coarse-extraction triplet set and having domain experts manually check the relation type, the entity expression, and the triplet completeness, to obtain the refined labeled triplet set; and organizing the refined labeled triplet set into a supervised data format and dividing it into a training set, a validation set, and a test set, selecting Qwen-14B-Chat as the base model, and performing fine-tuning by introducing a low-rank decomposition increment on top of the pre-trained model weight matrix using the LoRA low-rank adaptation method, to obtain the fine-tuned Qwen model.
  5. The intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model according to claim 4, wherein organizing the refined labeled triplet set into a supervised data format and dividing it into a training set, a validation set, and a test set, selecting Qwen-14B-Chat as the base model, and performing fine-tuning by introducing a low-rank decomposition increment on top of the pre-trained model weight matrix using the LoRA low-rank adaptation method, to obtain the fine-tuned Qwen model, comprises: organizing each triplet in the refined labeled triplet set into a training sample in JSON format, each training sample comprising an instruction field, an input field, and an output field, and dividing the samples into a training set, a validation set, and a test set in a ratio of 8 to 1; introducing a low-rank decomposition increment ΔW = BA on top of the pre-trained weight matrix W of Qwen-14B-Chat, where B and A are trainable low-rank matrices, freezing the pre-trained weight matrix W, optimizing only the low-rank matrices B and A, and setting the rank parameter, the scaling factor, and the dropout parameter; applying LoRA low-rank adaptation to the projection matrices in the Transformer layers, setting the learning rate, the batch size, the number of gradient accumulation steps, and the number of training epochs, and training the model on the training set; and monitoring the triplet extraction F1 score on the validation set during training, and, when the F1 score has not improved for a preset number of consecutive epochs, triggering an early-stopping mechanism and saving a model checkpoint, to obtain the fine-tuned Qwen model.
  6. The intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model according to claim 5, further comprising: inputting the test set into the fine-tuned Qwen model for inference and comparing the inference results with the reference answers to obtain test results; from the test results, counting the number of correct and total subject entity identifications and calculating the subject entity identification accuracy, counting the number of correct and total object entity identifications and calculating the object entity identification accuracy, and counting the number of correct and total relation type judgments and calculating the relation type judgment accuracy; calculating the overall triplet extraction F1 score from the subject entity identification accuracy, the object entity identification accuracy, and the relation type judgment accuracy; and comparing the overall triplet extraction F1 score of the fine-tuned Qwen model with that of the Qwen base model without fine-tuning, and, when the score of the fine-tuned Qwen model is higher, confirming that the fine-tuned Qwen model meets the knowledge extraction performance requirement of the railway disaster domain.
  7. The intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model according to claim 1, wherein, in step S4, inputting the railway disaster sentence-level corpus into the fine-tuned Qwen model for batch inference to obtain a full triplet set comprises: inputting each sentence in the railway disaster sentence-level corpus into the fine-tuned Qwen model with a temperature parameter, a sampling probability parameter, and a repetition penalty parameter set, to obtain generated text; extracting triplets that conform to the subject-relation-object format from the generated text using a regular expression, and stripping leading and trailing whitespace from the subject, relation, and object fields of the extracted triplets; checking whether the lengths of the subject and object fields are within 2 to 80 characters and whether the length of the relation field is within 2 to 20 characters, and filtering out triplets that do not meet the length requirements; and checking whether a triplet contains more than 3 consecutive punctuation marks or special control characters, filtering out triplets that contain such abnormal characters, and merging the triplets that pass the checks, to obtain the full triplet set.
  8. The intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model according to claim 7, wherein, in step S4, de-duplicating and normalizing the entities and relations in the full triplet set according to a fusion algorithm of minimum edit distance and Jaccard similarity to obtain a normalized triplet set comprises: splitting the full triplet set into an entity table and a relation table, where the entity table is the union of the subject fields and object fields of all triplets and the relation table is the union of the relation fields of all triplets; for each pair of candidate entities in the entity table, constructing a two-dimensional matrix by a dynamic programming algorithm, initializing the first row and first column of the matrix, and calculating the matrix elements according to a recurrence formula in which the substitution cost is 0 when the characters are equal and 1 otherwise, to obtain the minimum edit distance; for each pair of candidate entities in the entity table, splitting each string into a set of characters, counting the elements in the intersection and in the union of the two sets, and dividing the intersection count by the union count, to obtain the Jaccard similarity; and calculating a combined similarity according to a combined similarity function, where the combined similarity equals a first weight coefficient multiplied by the normalized edit distance similarity plus a second weight coefficient multiplied by the Jaccard similarity, the first weight coefficient is set to 0.6, the second weight coefficient is set to 0.4, and the threshold is set to 0.8, and, when the combined similarity is greater than or equal to 0.8, judging that the two entities denote the same concept.
  9. The intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model according to claim 8, further comprising: managing the entity equivalence classes in the entity table with a union-find (disjoint-set) data structure, initializing each entity as an independent set, traversing all entity pairs, and performing a merge operation that places two entities in the same equivalence class when their combined similarity is greater than or equal to 0.8; selecting a representative entity for each equivalence class, preferring the entity with the shortest string length and, when several entities share the shortest length, selecting the one with the highest occurrence frequency; performing the same similarity calculation and equivalence-class merging procedure on the relation table, to obtain normalized representative relations; and traversing the full triplet set, replacing the subject, relation, and object fields of each triplet with the corresponding representative entities and representative relations, and removing exact duplicate triplets, to obtain the normalized triplet set.
  10. The intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model according to claim 9, wherein, in step S4, importing the normalized triplet set into a Neo4j graph database to obtain the railway disaster knowledge graph comprises: splitting the normalized triplet set, generating an entity table from the subject and object fields and a relation table from the relation fields; assigning type labels to the nodes in the entity table according to entity category, the type labels comprising a disaster event category, a railway infrastructure category, an operating state category, an emergency response category, and an environmental factor category, and setting relation types for the relations in the relation table; importing the entity table and the relation table into the Neo4j graph database using Neo4j's batch import command, and recording time attributes, spatial attributes, facility type attributes, and operating state attributes according to the node type labels and relation types during import; and marking nodes of different entity types with different colors and edges of different relation types with different arrow styles in the Neo4j visualization interface, to obtain the railway disaster knowledge graph.
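The character standardization of claim 2 and the sentence-level segmentation of claim 3 can be illustrated with a minimal Python sketch. This is not the applicant's implementation: the function names, the separate handling of the ideographic space (U+3000), the exact boundary character set (ASCII "." is deliberately left out so that decimals are not split), and the two filter patterns are all assumptions made for illustration. Only the 0xFEE0 offset and the 10 and 500 character limits come from the claims.

```python
import re

FULLWIDTH_OFFSET = 0xFEE0  # offset named in claim 2


def to_halfwidth(text: str) -> str:
    """Convert full-width ASCII variants (U+FF01..U+FF5E) to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - FULLWIDTH_OFFSET))
        elif code == 0x3000:
            # Assumption: the ideographic space is outside the 0xFEE0 range
            # and is mapped to an ASCII space separately.
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)


# Assumption: boundaries are the Chinese period plus both widths of
# semicolon, question mark, and exclamation mark.
SENTENCE_BOUNDARY = re.compile(r"[。；;？?！!]")
# Hypothetical patterns for the "invalid sentence" filter of claim 3.
ONLY_DIGITS_OR_PUNCT = re.compile(r"[\d\W_]+")
SINGLE_UNIT_REPEATED = re.compile(r"(.+?)\1+")


def split_sentences(text: str, min_len: int = 10, max_len: int = 500) -> list:
    """Split on sentence boundaries and drop short, long, or invalid parts."""
    kept = []
    for part in SENTENCE_BOUNDARY.split(text):
        sentence = part.strip()
        if not (min_len <= len(sentence) <= max_len):
            continue
        if ONLY_DIGITS_OR_PUNCT.fullmatch(sentence):
            continue
        if SINGLE_UNIT_REPEATED.fullmatch(sentence):
            continue
        kept.append(sentence)
    return kept
```

Because full-width question marks, exclamation marks, and semicolons become their ASCII forms after `to_halfwidth`, the boundary pattern accepts both widths, so the two functions can be applied in either order.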
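The entity de-duplication of claims 8 and 9 can be sketched in Python as follows. The weights (0.6 and 0.4), the 0.8 threshold, the character-set Jaccard similarity, and the shortest-then-most-frequent representative rule are taken from the claims. The normalization of the edit distance (1 minus distance divided by the longer string length) is an assumption, since the claims do not define it, and all function and class names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Minimum edit distance via a dynamic-programming matrix."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # first column: delete i characters
    for j in range(cols):
        d[0][j] = j  # first row: insert j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the sets of characters of two strings."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def combined_similarity(a: str, b: str, w_edit: float = 0.6, w_jac: float = 0.4) -> float:
    longest = max(len(a), len(b))
    # Assumption: normalized edit similarity = 1 - distance / longer length.
    edit_sim = 1.0 - edit_distance(a, b) / longest if longest else 1.0
    return w_edit * edit_sim + w_jac * jaccard(a, b)


class UnionFind:
    """Disjoint-set structure; each item starts in its own set."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def normalize_entities(mentions, threshold: float = 0.8) -> dict:
    """Map every entity mention to the representative of its equivalence class."""
    freq = Counter(mentions)
    entities = sorted(freq)
    uf = UnionFind(entities)
    for a, b in combinations(entities, 2):
        if combined_similarity(a, b) >= threshold:
            uf.union(a, b)
    groups = {}
    for e in entities:
        groups.setdefault(uf.find(e), []).append(e)
    mapping = {}
    for members in groups.values():
        # Shortest string first; ties broken by highest frequency.
        rep = min(members, key=lambda e: (len(e), -freq[e]))
        for e in members:
            mapping[e] = rep
    return mapping
```

The same `normalize_entities` function could be applied to the relation table. Note that the pairwise loop is quadratic in the number of distinct entities, so a large entity table would probably need some form of candidate blocking, which the claims do not describe.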

Description

Intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model

Technical Field

The application relates to the technical field of knowledge graph construction, and in particular to an intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model.

Background

Railways are critical national infrastructure. Their operational safety is affected by geological conditions, meteorological factors, and the environment surrounding the lines, and they are prone to disaster events such as line collapse, rockfall intrusion, subgrade settlement, debris flow, rainstorm washout, wind damage, and snow damage. To support railway disaster risk analysis, early warning, and decision support, a railway disaster knowledge graph covering disaster types, causal mechanisms, influencing factors, and response measures needs to be constructed, and extracting structured knowledge from massive unstructured text is a key step.

Existing knowledge graph construction methods are mainly based on rules, traditional machine learning, open information extraction, or distant supervision. These methods have achieved some results in the railway disaster field and can extract entity and relation information from some texts. However, the prior art still has significant shortcomings with respect to the specialized nature of railway disaster text, the abundance of implicit relations, and the difficulty of data annotation. Rule-based methods rely on manually designed templates and keyword rules, adapt poorly to the complex expressions and extensive terminology of railway disaster text, and scale poorly.

Methods based on traditional machine learning require a large amount of manually annotated data, and because annotating railway domain corpora is costly and the available scale is limited, the resulting models generalize poorly. Methods based on open information extraction are prone to inaccurate extraction and incomplete relations when processing specialized texts such as accident reports and maintenance records. Methods based on distant supervision rely on an existing knowledge base, but a systematic knowledge base for railway disasters is lacking, and sudden disaster types and continuously changing environmental factors are difficult to cover. These limitations make it difficult for the prior art to meet the railway disaster field's need for high-quality knowledge extraction.

Disclosure of Invention

The application provides an intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model. The method fine-tunes a Qwen model on the railway disaster domain using the LoRA low-rank adaptation method, and de-duplicates and normalizes the extracted entities and relations using a fusion algorithm that combines minimum edit distance with Jaccard similarity. This addresses the problems in the prior art of insufficient domain adaptation of general-purpose models, low knowledge extraction accuracy, and knowledge redundancy caused by differences in expression across multi-source texts, and improves the degree of automation, the knowledge extraction accuracy, and the structural normalization of railway disaster knowledge graph construction.
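The LoRA low-rank adaptation referred to above can be written out as follows. This follows the standard LoRA formulation rather than anything stated in this passage: the dimensions d and k, the rank r, the input x, and the α/r scaling are assumptions borrowed from that formulation, while W, B, A, and ΔW = BA match the symbols used in claim 5.

```latex
h = W x + \frac{\alpha}{r}\,\Delta W\,x, \qquad \Delta W = B A,
\qquad W \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

With W frozen, only B and A are trained, so the number of trainable parameters per adapted matrix is r(d + k) instead of dk.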
The application provides an intelligent construction method for a railway disaster knowledge graph based on a fine-tuned Qwen model, comprising the following steps: step S1, performing unified format conversion and character standardization on multi-source railway disaster text to obtain a unified-format text data set; step S2, performing deep cleaning and sentence-level segmentation on the unified-format text data set according to a railway disaster domain stop-word lexicon to obtain a railway disaster sentence-level corpus; step S3, inputting the railway disaster sentence-level corpus into a Qwen model, performing coarse extraction in combination with a triplet extraction prompt template, obtaining a refined labeled triplet set after manual correction, and fine-tuning the Qwen model using the LoRA low-rank adaptation technique to obtain a fine-tuned Qwen model; and step S4, inputting the railway disaster sentence-level corpus into the fine-tuned Qwen model for batch inference to obtain a full triplet set, de-duplicating and normalizing the entities and relations in the full triplet set according to a fusion algorithm of minimum edit distance and Jaccard similarity to obtain a normalized triplet set, and importing the normalized triplet set into a Neo4j graph database to obtain the railway disaster knowledge graph. According to the technical