CN-121683963-B - Heterogeneous information-based large language model expansion method, system and storage medium

Abstract

The invention provides a large language model expansion method, system and storage medium based on heterogeneous information. The method comprises: obtaining a target document and identifying the document contents in the target document by region to obtain a plurality of content nodes; analyzing the hierarchical relationships among the different document contents and constructing, with the content nodes as tree nodes, a tree topology structure representing the hierarchical relationships; analyzing the reference relationships among the different document contents and assigning the reference relationships to the corresponding content nodes; storing the tree topology structure in a database as extension data of a large language model, wherein the content nodes are stored using different storage structures according to the type of the document content; and interacting with the database by means of the large language model to obtain an interaction result. The invention facilitates dynamic granularity adaptation and multi-round feedback retrieval, and helps improve the intelligence and interaction capability of AI systems.

Inventors

Yan Baixu
XIA JIN

Assignees

睿思芯科(深圳)技术有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2026-02-09

Claims (9)

1. A large language model expansion method based on heterogeneous information, characterized by comprising the following steps:
S101, acquiring a target document for knowledge expansion of a large language model, and identifying document contents in the target document by region to obtain a plurality of content nodes;
S102, analyzing hierarchical relationships among the document contents corresponding to different content nodes, and constructing, with the content nodes as tree nodes, a tree topology structure representing the hierarchical relationships; analyzing reference relationships among the document contents corresponding to different content nodes, and assigning the reference relationships to the corresponding content nodes that already form the tree topology structure; and storing the tree topology structure built from the content nodes carrying the reference relationships in a database as extension data of the large language model, wherein the content nodes are stored in the database using different storage structures according to the type of the document content corresponding to each content node;
S103, interacting with the database by means of the large language model to obtain an interaction result;
wherein step S102 comprises the sub-steps of:
acquiring containment or adjacency hierarchical relationships between the document contents corresponding to different content nodes based on a preset hierarchy derivation method;
constructing the tree topology structure based on the hierarchical relationships with the content nodes as tree nodes, so that the hierarchical relationships become structure edges of the tree topology structure, and assigning a corresponding node ID to each content node in the tree topology structure;
acquiring the reference relationships between the document contents corresponding to different content nodes based on a preset semantic analysis method, and assigning the reference relationships to the corresponding content nodes that already form the tree topology structure, so that the reference relationships become semantic edges of the tree topology structure;
defining the content nodes using different storage structures according to the type of the document content corresponding to each content node, and storing the storage structures in the database as extension data of the large language model; and
constructing, according to the target document, the content nodes, the hierarchical relationships, the reference relationships and the tree topology structure, an association table for querying the content nodes in the database, and storing the association table in the database.
2. The heterogeneous information-based large language model expansion method according to claim 1, wherein the types of the document content include text-type information and non-text-type information, and step S101 further includes the steps of: extracting the text contained in the text-type information, and taking the obtained text as the content node corresponding to the text-type information; and extracting an original coordinate frame of the non-text-type information in the page of the target document, capturing a corresponding screenshot from the page according to the original coordinate frame, and taking the obtained screenshot as the content node corresponding to the non-text-type information.
3. The heterogeneous information-based large language model expansion method according to claim 2, wherein the storage structure comprises at least one of: a node ID, the type of the document content, metadata of the document content, the hierarchical relationship, and the reference relationship; wherein, for the text-type information, the storage structure further comprises at least one of: text content, a feature vector of the text content, a parent node ID of the corresponding content node, and a sibling node ID of the corresponding content node; and for the non-text-type information, the storage structure further comprises at least one of: screenshot content, the original coordinate frame, drawing text, text in the screenshot content, a storage path of the screenshot content, a feature vector of the screenshot content, notes on the screenshot content, and a parent node ID of the corresponding content node.
4. The heterogeneous information-based large language model expansion method according to claim 1, wherein the database comprises: a physical sub-database for storing the document content corresponding to each content node; a logical sub-database for storing the hierarchical relationship corresponding to each content node; and a semantic sub-database for storing the reference relationship corresponding to each content node.
5. The heterogeneous information-based large language model expansion method according to claim 4, wherein the association table comprises at least one of: a document metadata table for storing metadata of the target document; a core structure table for storing an index of the hierarchical relationships; a content detail table for storing metadata of the document content corresponding to the content nodes; a semantic graph edge table for storing an index of the reference relationships; and a vector table for storing a vector index of the document content corresponding to the content nodes.
6. The heterogeneous information-based large language model expansion method according to claim 1, wherein step S103 comprises the sub-steps of: obtaining an item to be queried, performing vector-similarity retrieval in the database according to the item to be queried to obtain a plurality of content nodes, and taking the obtained content nodes as anchor nodes; based on the association table, performing upward expansion and/or adjacent expansion and/or cross-level expansion retrieval on the anchor nodes in the database according to the hierarchical relationships and the reference relationships to obtain a plurality of content nodes associated with the anchor nodes, and taking the obtained content nodes as expansion nodes; and uniformly arranging the document contents corresponding to the anchor nodes and the expansion nodes into interaction text for output to the large language model, and taking the interaction text as the interaction result.
7. A large language model expansion system based on heterogeneous information, comprising:
a document analysis module for acquiring a target document for knowledge expansion of a large language model, and identifying document contents in the target document by region to obtain a plurality of content nodes;
a tree structure module for analyzing hierarchical relationships among the document contents corresponding to different content nodes, and constructing, with the content nodes as tree nodes, a tree topology structure representing the hierarchical relationships; analyzing reference relationships among the document contents corresponding to different content nodes, and assigning the reference relationships to the corresponding content nodes that already form the tree topology structure; and storing the tree topology structure built from the content nodes carrying the reference relationships in a database as extension data of the large language model, wherein the content nodes are stored in the database using different storage structures according to the type of the document content corresponding to each content node;
a retrieval reasoning module for interacting with the database by means of the large language model to obtain an interaction result;
wherein the tree structure module is specifically configured for:
acquiring containment or adjacency hierarchical relationships between the document contents corresponding to different content nodes based on a preset hierarchy derivation method;
constructing the tree topology structure based on the hierarchical relationships with the content nodes as tree nodes, so that the hierarchical relationships become structure edges of the tree topology structure, and assigning a corresponding node ID to each content node in the tree topology structure;
acquiring the reference relationships between the document contents corresponding to different content nodes based on a preset semantic analysis method, and assigning the reference relationships to the corresponding content nodes that already form the tree topology structure, so that the reference relationships become semantic edges of the tree topology structure;
defining the content nodes using different storage structures according to the type of the document content corresponding to each content node, and storing the storage structures in the database as extension data of the large language model; and
constructing, according to the target document, the content nodes, the hierarchical relationships, the reference relationships and the tree topology structure, an association table for querying the content nodes in the database, and storing the association table in the database.
8. A computer device comprising a memory, a processor, and a heterogeneous information-based large language model expansion program stored on the memory and executable on the processor, wherein the processor, when executing the heterogeneous information-based large language model expansion program, implements the steps of the heterogeneous information-based large language model expansion method according to any one of claims 1-6.
9. A storage medium having stored thereon a heterogeneous information-based large language model expansion program which, when executed by a processor, implements the steps of the heterogeneous information-based large language model expansion method according to any one of claims 1-6.
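
As an illustration only, the retrieval flow of claim 6 (anchor retrieval by vector similarity, followed by upward, adjacent and cross-level expansion along structure edges and semantic edges) could be sketched in Python as follows. The names ContentNode, DocumentTree, retrieve and build_context, the in-memory dictionary standing in for the database and association table, and the toy three-dimensional vectors are all assumptions introduced for this sketch; none of them come from the patent text.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Optional


@dataclass
class ContentNode:
    """Hypothetical storage structure for one content node."""
    node_id: str
    kind: str                        # "text" or "non_text" (e.g. a screenshot)
    content: str                     # text, or a storage path for a screenshot
    vector: list                     # feature vector (toy values below)
    parent_id: Optional[str] = None  # structure edge (hierarchical relationship)
    refs: list = field(default_factory=list)  # semantic edges (reference relationships)


class DocumentTree:
    """In-memory stand-in for the database and its association table."""

    def __init__(self):
        self.nodes = {}

    def add(self, node):
        self.nodes[node.node_id] = node

    def siblings(self, node_id):
        parent = self.nodes[node_id].parent_id
        return [n.node_id for n in self.nodes.values()
                if n.parent_id == parent and n.node_id != node_id]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(tree, query_vec, top_k=1):
    """Return (anchor node IDs, expansion node IDs) for a query vector."""
    ranked = sorted(tree.nodes.values(),
                    key=lambda n: cosine(query_vec, n.vector), reverse=True)
    anchors = [n.node_id for n in ranked[:top_k]]
    expanded = []
    for anchor_id in anchors:
        node = tree.nodes[anchor_id]
        candidates = []
        if node.parent_id is not None:
            candidates.append(node.parent_id)     # upward expansion
        candidates += tree.siblings(anchor_id)    # adjacent expansion
        candidates += node.refs                   # cross-level expansion via references
        for cid in candidates:
            if cid not in anchors and cid not in expanded:
                expanded.append(cid)
    return anchors, expanded


def build_context(tree, anchors, expanded):
    """Arrange anchor and expansion contents into one block of text."""
    lines = []
    for node_id in anchors + expanded:
        node = tree.nodes[node_id]
        lines.append(f"[{node.node_id}|{node.kind}] {node.content}")
    return "\n".join(lines)


# Toy document: two sections, two paragraphs, one figure screenshot.
tree = DocumentTree()
tree.add(ContentNode("sec1", "text", "1 Results", [1.0, 0.0, 0.0]))
tree.add(ContentNode("p1", "text", "Latency is shown in Figure 2.",
                     [0.0, 1.0, 0.0], "sec1", ["fig2"]))
tree.add(ContentNode("p2", "text", "Throughput is discussed next.",
                     [0.0, 0.0, 1.0], "sec1"))
tree.add(ContentNode("sec2", "text", "2 Figures", [1.0, 0.0, 1.0]))
tree.add(ContentNode("fig2", "non_text", "shots/fig2.png",
                     [0.0, 1.0, 1.0], "sec2"))

anchors, expanded = retrieve(tree, [0.1, 0.9, 0.1])
context = build_context(tree, anchors, expanded)
```

With this toy data the query vector is closest to p1, which becomes the anchor; expansion then adds its parent section sec1, its sibling p2, and the referenced figure fig2, in that order. A real system would presumably replace the linear scan with a vector index and the dictionary with the sub-databases and association tables described in claims 4 and 5.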

Description

Heterogeneous information-based large language model expansion method, system and storage medium

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to a large language model expansion method, system and storage medium based on heterogeneous information.

Background

With the development of deep learning, Transformer-based large language models perform excellently on natural language processing tasks. When a general-purpose large model is deployed in a vertical domain, however, achieving low-cost, efficient knowledge expansion becomes a core technical problem, and optimizing and upgrading the related technical schemes has become an urgent need for the industry.

Existing large-model knowledge expansion techniques mainly follow three schemes. The first is full retraining (Retraining) and parameter-efficient fine-tuning (Fine-tuning/LoRA), in which the model is retrained or fine-tuned on a new corpus to internalize knowledge. The second is model editing (Model Editing/Knowledge Editing), in which specific knowledge is updated by directly modifying model parameters. The third is retrieval-augmented generation (Retrieval-Augmented Generation, RAG), currently the mainstream scheme, which generally adopts a core paradigm of PDF parsing, text chunking (Chunking) and vector retrieval, and supplies the model with relevant context from an external knowledge base without retraining the model.
However, each of these knowledge expansion techniques has significant technical defects, and the shortcomings of the mainstream retrieval-augmented generation scheme are especially prominent. Full retraining and fine-tuning incur high compute and time costs, are prone to catastrophic forgetting or semantic drift, and make it difficult to precisely correct erroneous knowledge that has already been learned. Model editing is complex to carry out, can hardly guarantee the internal self-consistency of the model, and easily damages the model's original general reasoning ability. The core problem of the retrieval-augmented generation scheme is the lack of a general storage format that preserves a document's complete logical structure, layout and spatial information, and multimodal associations. Specifically, document structure information is lost and multimodal information is fragmented at the parsing stage; retrieval granularity is rigid and there is no dynamic interaction capability at the retrieval stage; the visual understanding capability of multimodal large models is limited by lossy compression of the data; and system scalability is poor. Together these ultimately limit improvements in the long-document reasoning ability of retrieval-augmented generation systems and the construction of advanced agents.

Therefore, a new knowledge expansion scheme for large language models is needed to solve the above problems.

Disclosure of Invention

The invention provides a large language model expansion method, system and storage medium based on heterogeneous information, and aims to solve the technical problems that existing expansion methods have difficulty fully retaining multimodal information and have poor retrieval and expansion capability.
In order to solve the above technical problems, in a first aspect, the present invention provides a large language model expansion method based on heterogeneous information, including the following steps:
S101, acquiring a target document for knowledge expansion of a large language model, and identifying document contents in the target document by region to obtain a plurality of content nodes;
S102, analyzing hierarchical relationships among the document contents corresponding to different content nodes, and constructing, with the content nodes as tree nodes, a tree topology structure representing the hierarchical relationships; analyzing reference relationships among the document contents corresponding to different content nodes, and assigning the reference relationships to the corresponding content nodes that already form the tree topology structure; and storing the tree topology structure built from the content nodes carrying the reference relationships in a database as extension data of the large language model, wherein the content nodes are stored in the database using different storage structures according to the type of the document content corresponding to each content node;
S103, interacting with the database by means of the large language model to obtain an interaction result.
Still further, step S102