CN-121997943-A - Semantic knowledge block dividing method and system based on AUTOSAR protocol document
Abstract
The invention discloses a semantic knowledge block dividing method and system based on an AUTOSAR protocol document. The method comprises the following steps: carrying out standardized processing on the AUTOSAR protocol document; extracting multi-modal features from the processed AUTOSAR protocol document to generate a plurality of high-dimensional embedded vectors and a plurality of visual embedded vectors; constructing a document knowledge graph from the plurality of high-dimensional embedded vectors and the plurality of visual embedded vectors, based on a formalized knowledge base of an AUTOSAR domain ontology; and carrying out semantic knowledge block division on the document knowledge graph through a semantic subgraph division method. Because the semantic knowledge blocks are divided on the constructed document knowledge graph by the semantic subgraph division method, all logically related multi-modal information is aggregated in the same knowledge block, which solves the context fragmentation caused by traditional knowledge block division methods.
Inventors
- ZHU DUNYAO
- XU DAOMING
- ZHOU WENZHONG
- LU KAIXUAN
- LUO YUEJUN
- WANG SHICHAO
- ZHANG LONG
Assignees
- 武汉光庭信息技术股份有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20260313
Claims (9)
- 1. A semantic knowledge block partitioning method based on an AUTOSAR protocol document, the method comprising the steps of: S1, carrying out standardized processing on an AUTOSAR protocol document; S2, extracting multi-modal features from the processed AUTOSAR protocol document to generate a plurality of high-dimensional embedded vectors and a plurality of visual embedded vectors; S3, constructing a document knowledge graph from the plurality of high-dimensional embedded vectors and the plurality of visual embedded vectors, based on a formalized knowledge base of an AUTOSAR domain ontology; S4, carrying out semantic knowledge block division on the document knowledge graph through a semantic subgraph division method.
- 2. The method of claim 1, wherein S1 comprises: S1.1, parsing the AUTOSAR protocol document to obtain raw document data; S1.2, cleaning the raw document data, and performing Unicode processing on the cleaned raw document data.
- 3. The method of claim 1, wherein S2 comprises: S2.1, extracting text features from the processed AUTOSAR protocol document through a feature extractor of the LayoutLMv framework; S2.2, generating the plurality of high-dimensional embedded vectors according to the text features; S2.3, extracting key visual features from the processed AUTOSAR protocol document through a visual encoder of the DiT architecture; S2.4, generating the plurality of visual embedded vectors according to the key visual features.
- 4. The method of claim 1, further comprising, prior to S3: defining a plurality of entities in the AUTOSAR standard, and determining the semantic relations among the entities; and constructing the formalized knowledge base of the AUTOSAR domain ontology according to the plurality of entities and the semantic relations among the entities.
- 5. The method of claim 4, wherein S3 comprises: S3.1, taking each of the plurality of high-dimensional embedded vectors and each of the plurality of visual embedded vectors as a graph node; S3.2, generating a plurality of node pairs from the graph nodes; S3.3, traversing the plurality of node pairs, and taking the currently traversed node pair as the node pair to be processed; S3.4, matching the node pair to be processed against a plurality of entity pairs in the formalized knowledge base through an entity linking technique; S3.5, after a successful match, obtaining the semantic relation of the node pair to be processed through a relation classifier according to the node pair to be processed and the successfully matched entity pair, and returning to S3.3 until the semantic relations of all node pairs are obtained; and S3.6, constructing the document knowledge graph according to the graph nodes and the semantic relations of all node pairs.
- 6. The method of claim 5, wherein obtaining the semantic relation of the node pair to be processed through a relation classifier according to the node pair to be processed and the successfully matched entity pair comprises: concatenating the node pair to be processed with the successfully matched entity pair to obtain joint features; and inputting the joint features and the semantic relation of the successfully matched entity pair into the relation classifier, which outputs the semantic relation of the node pair to be processed.
- 7. The method of claim 6, wherein S4 comprises: S4.1, selecting a seed node from the plurality of graph nodes; S4.2, performing a directed graph traversal on the document knowledge graph starting from the seed node, and determining the semantic cohesion of the currently growing subgraph; and S4.3, when the semantic cohesion is smaller than a preset threshold, carrying out semantic knowledge block division on the document knowledge graph according to the currently growing subgraph.
- 8. The method of claim 1, further comprising, after S4: S5, serializing the divided semantic knowledge blocks, and generating retrieval embedded vectors from the serialized semantic knowledge blocks; S6, storing the serialized semantic knowledge blocks and the corresponding retrieval embedded vectors in a vector database for knowledge indexing.
- 9. A semantic knowledge block partitioning system based on an AUTOSAR protocol document, the system comprising: a processing module, used for carrying out standardized processing on the AUTOSAR protocol document; an extraction module, used for extracting multi-modal features from the processed AUTOSAR protocol document to generate a plurality of high-dimensional embedded vectors and a plurality of visual embedded vectors; a construction module, used for constructing a document knowledge graph from the plurality of high-dimensional embedded vectors and the plurality of visual embedded vectors, based on a formalized knowledge base of an AUTOSAR domain ontology; and a dividing module, used for carrying out semantic knowledge block division on the document knowledge graph through a semantic subgraph division method.
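The graph construction and subgraph growth of claims 5 to 7 can be illustrated with a minimal Python sketch. It is not the patented implementation and rests on several assumptions that the claims do not fix: a dictionary lookup on hypothetical entity types stands in for entity linking and the relation classifier, semantic cohesion is taken to be the mean pairwise cosine similarity of node embeddings, the seed is the unassigned node with the most outgoing edges, and a block is closed once no neighbour can be admitted without cohesion dropping below the threshold. All node names, entity types, relations, and vectors are made up.

```python
from itertools import permutations
from math import sqrt


def build_graph(node_types, ontology):
    """S3.2-S3.6 sketch: link every ordered node pair whose types match an ontology entry."""
    edges = []
    for a, b in permutations(sorted(node_types), 2):
        # A dictionary lookup stands in for entity linking plus the relation classifier.
        relation = ontology.get((node_types[a], node_types[b]))
        if relation is not None:
            edges.append((a, relation, b))
    return edges


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def cohesion(members, emb):
    """Assumed cohesion measure: mean pairwise cosine similarity (1.0 for a single node)."""
    ids = sorted(members)
    pairs = [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]]
    if not pairs:
        return 1.0
    return sum(cosine(emb[a], emb[b]) for a, b in pairs) / len(pairs)


def partition(emb, edges, threshold=0.8):
    """S4 sketch: grow a block from each seed along directed edges while cohesion holds."""
    succ = {n: [] for n in emb}
    for src, _rel, dst in edges:
        succ[src].append(dst)
    unassigned, blocks = set(emb), []
    while unassigned:
        # Assumed seed rule: the unassigned node with the most outgoing edges.
        seed = max(sorted(unassigned), key=lambda n: len(succ[n]))
        unassigned.discard(seed)
        block, frontier = {seed}, [seed]
        while frontier:
            node = frontier.pop(0)
            for nxt in succ[node]:
                # Admit a neighbour only if the grown subgraph stays cohesive enough.
                if nxt in unassigned and cohesion(block | {nxt}, emb) >= threshold:
                    block.add(nxt)
                    unassigned.discard(nxt)
                    frontier.append(nxt)
        blocks.append(block)
    return blocks


# Toy data; every name, type, relation, and vector below is made up for illustration.
node_types = {
    "req_text": "Requirement",
    "param_table": "ParameterTable",
    "state_diagram": "StateMachineDiagram",
    "glossary": "Glossary",
}
ontology = {
    ("Requirement", "ParameterTable"): "constrained_by",
    ("Requirement", "StateMachineDiagram"): "illustrated_by",
    ("StateMachineDiagram", "Glossary"): "refers_to",
}
emb = {
    "req_text": [1.0, 0.0],
    "param_table": [0.9, 0.1],
    "state_diagram": [0.8, 0.2],
    "glossary": [0.0, 1.0],
}
edges = build_graph(node_types, ontology)
blocks = partition(emb, edges)
```

In this toy run the requirement text, its parameter table, and its state machine diagram end up in one block, while the glossary node is cut off into its own block because admitting it would pull the mean similarity below the threshold.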
Description
Semantic knowledge block dividing method and system based on AUTOSAR protocol document
Technical Field
The invention relates to the technical field of document processing, and in particular to a semantic knowledge block dividing method and system based on an AUTOSAR protocol document.
Background
The AUTomotive Open System ARchitecture (AUTOSAR) is a core standard for software development in the global automotive industry. Its official specification documents are the fundamental basis on which engineers perform system design, software development, and functional safety analysis. These documents are highly complex and are typically published in PDF format. Their content combines a large number of technical terms, definition lists, and parameter tables with visual elements that illustrate complex logic, such as system architecture diagrams, software component interaction diagrams, state machine diagrams, and timing diagrams.
To build a knowledge source for a retrieval-augmented generation (RAG) system, the documents need to be partitioned. Conventional partitioning strategies include fixed-size partitioning, recursive character partitioning, and semantic partitioning based on sentences or paragraphs, but these strategies lose the cross-page and cross-modal logical associations in an AUTOSAR protocol document. Therefore, how to solve the context fragmentation caused by conventional partitioning strategies is an urgent problem.
The foregoing is provided merely to facilitate understanding of the technical solutions of the present invention and is not an admission that the foregoing is prior art.
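The context fragmentation attributed to fixed-size partitioning in the Background can be illustrated with a minimal Python sketch. The two-line document, the identifier REQ-1, and the chunk size of 40 characters are made up for illustration and are not taken from any AUTOSAR specification.

```python
def fixed_size_chunks(text, size):
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical requirement sentence followed by the table row it depends on.
doc = "REQ-1: the timeout shall follow Table 7.\nTable 7: timeout = 100 ms."
chunks = fixed_size_chunks(doc, 40)

# The requirement and the value it refers to land in different chunks.
together = any("REQ-1" in c and "100 ms" in c for c in chunks)
```

Here `together` is false: a retriever that returns only the chunk containing REQ-1 never sees the 100 ms value, which is the kind of lost association the invention targets.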
Disclosure of Invention
The main purpose of the invention is to provide a semantic knowledge block dividing method and system based on an AUTOSAR protocol document, aiming to solve the context fragmentation caused by traditional partitioning strategies.
To achieve the above purpose, the invention provides a semantic knowledge block dividing method based on an AUTOSAR protocol document, which includes: S1, carrying out standardized processing on an AUTOSAR protocol document; S2, extracting multi-modal features from the processed AUTOSAR protocol document to generate a plurality of high-dimensional embedded vectors and a plurality of visual embedded vectors; S3, constructing a document knowledge graph from the plurality of high-dimensional embedded vectors and the plurality of visual embedded vectors, based on a formalized knowledge base of an AUTOSAR domain ontology; S4, carrying out semantic knowledge block division on the document knowledge graph through a semantic subgraph division method.
Optionally, S1 includes: S1.1, parsing the AUTOSAR protocol document to obtain raw document data; S1.2, cleaning the raw document data, and performing Unicode processing on the cleaned raw document data.
Optionally, S2 includes: S2.1, extracting text features from the processed AUTOSAR protocol document through a feature extractor of the LayoutLMv framework; S2.2, generating the plurality of high-dimensional embedded vectors according to the text features; S2.3, extracting key visual features from the processed AUTOSAR protocol document through a visual encoder of the DiT architecture; S2.4, generating the plurality of visual embedded vectors according to the key visual features.
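Step S1.2 can be sketched in Python as follows. The source does not say which cleaning rules or which Unicode operation is meant, so the sketch assumes typical PDF-extraction cleanup (dropping control characters, rejoining words hyphenated across line breaks, collapsing whitespace) followed by NFKC normalization; the function name `standardize` is hypothetical.

```python
import re
import unicodedata


def standardize(raw_text: str) -> str:
    """S1.2 sketch: clean extracted text, then apply Unicode normalization (NFKC assumed)."""
    # Drop control and format characters, keeping newlines and tabs.
    cleaned = "".join(
        ch for ch in raw_text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Rejoin words that were hyphenated across a line break.
    cleaned = re.sub(r"(\w)-\n(\w)", r"\1\2", cleaned)
    # Collapse runs of spaces and tabs, then trim each line.
    cleaned = re.sub(r"[ \t]+", " ", cleaned)
    cleaned = "\n".join(line.strip() for line in cleaned.splitlines())
    # NFKC folds compatibility characters such as ligatures into plain letters.
    return unicodedata.normalize("NFKC", cleaned)


sample = "\ufb01le  con-\nfig\x00"
result = standardize(sample)
```

For the sample above, which starts with the "fi" ligature and ends with a NUL byte, the result is the plain string "file config". The de-hyphenation rule is deliberately naive and would also join genuinely hyphenated terms that happen to break at a line end, so a production pipeline would need a more careful rule.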
Optionally, before S3, the method includes: defining a plurality of entities in the AUTOSAR standard, and determining the semantic relations among the entities; and constructing the formalized knowledge base of the AUTOSAR domain ontology according to the plurality of entities and the semantic relations among the entities.
Optionally, S3 includes: S3.1, taking each of the plurality of high-dimensional embedded vectors and each of the plurality of visual embedded vectors as a graph node; S3.2, generating a plurality of node pairs from the graph nodes; S3.3, traversing the plurality of node pairs, and taking the currently traversed node pair as the node pair to be processed; S3.4, matching the node pair to be processed against a plurality of entity pairs in the formalized knowledge base through an entity linking technique; S3.5, after a successful match, obtaining the semantic relation of the node pair to be processed through a relation classifier according to the node pair to be processed and the successfully matched entity pair, and returning to S3.3 until the semantic relations of all node pairs are obtained; and S3.6, constructing the document knowledge graph according to the graph nodes and the semantic relations of all node pairs.
Optionally, the obtaining, according to the pair of nodes to be processed and the successfully matched entity pair, the semantic relationship between the pair of nodes to be processed through a relationship classifier inclu