
CN-121615792-B - Standardized data management system based on RAG and large model warehouse


Abstract

The invention relates to the technical field of data management, and in particular to a standardized data management system based on RAG and a large model warehouse. The system comprises a data fusion module, a standard knowledge evolution module, a primary detection module, and a RAG enhancement and model warehouse module. The data fusion module obtains a multi-source unified semantic data set; the standard knowledge evolution module forms a standard subject library; the primary detection module generates a first search result based on an external task request; and the RAG enhancement and model warehouse module performs reasoning generation based on a preset multi-modal model, the external task request, and the first search result to obtain a second search result. By constructing the standard subject library to form the first search result, and using the first search result as a boundary condition to verify the result produced by the RAG enhancement and model warehouse module, the invention improves the accuracy of the generated data and prevents model reasoning deviation and the spread of erroneous knowledge.
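
By way of illustration only, the boundary-condition verification described above might be sketched in Python as follows. The sketch follows the outline of claim 8 (build a constraint set from the first search result, check the reasoning result against it, correct according to the deviation degree). The names `Constraint`, `build_constraints`, `verify`, and `patch_limit`, the field-to-allowed-values representation, and the correction policy are all assumptions made for this example; the source does not specify how constraints are encoded or how the adaptive correction is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    """One boundary condition taken from the first search result (assumed shape)."""
    field: str
    allowed: frozenset


def build_constraints(first_result: dict) -> list[Constraint]:
    """Turn each non-empty field of the first search result into an allowed-value set."""
    return [Constraint(k, frozenset(v)) for k, v in first_result.items() if v]


def verify(inference: dict, constraints: list[Constraint], patch_limit: float = 0.5):
    """Return (second search result, deviation degree in [0, 1])."""
    violated = [c for c in constraints if inference.get(c.field) not in c.allowed]
    deviation = len(violated) / len(constraints) if constraints else 0.0
    if not violated:
        # The reasoning result satisfies the constraint set: output it unchanged.
        return dict(inference), deviation
    if deviation <= patch_limit:
        # Assumed policy for small deviations: patch only the violating fields.
        corrected, targets = dict(inference), violated
    else:
        # Assumed policy for large deviations: rebuild the answer from the constraints alone.
        corrected, targets = {}, constraints
    for c in targets:
        corrected[c.field] = sorted(c.allowed)[0]
    return corrected, deviation
```

In this sketch the deviation degree is simply the fraction of violated constraints. A real system would presumably use a richer semantic measure, which the source leaves open.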

Inventors

  • HAO WENJIAN
  • MA WENHUI
  • ZHU XIANGYU
  • HOU XUEYING
  • Gao Yanxuan
  • LIU XIAOHUI
  • HU CHEN
  • Tan Ruihu
  • FAN XIAOYIN
  • QIU SHIRUI
  • WANG KUNYANG
  • LI DI

Assignees

  • 北京赛西科技发展有限责任公司
  • 赛西(深圳)电子信息产品标准化工程中心有限公司

Dates

Publication Date
20260512
Application Date
20260202

Claims (8)

  1. A standardized data management system based on RAG and a large model warehouse, characterized by comprising a data fusion module, a standard knowledge evolution module, a primary detection module, and a RAG enhancement and model warehouse module; the data fusion module is used for extracting structured and unstructured data from a plurality of distributed databases and obtaining a multi-source unified semantic data set based on a cross-source semantic alignment algorithm; the standard knowledge evolution module is used for performing standard semantic reconstruction on the semantic data set by using a preset corpus algorithm to form a standard subject library; the primary detection module is used for searching the standard subject library based on an external task request to generate a first search result for the task request; the RAG enhancement and model warehouse module is used for performing reasoning generation based on a preset multi-modal model, the external task request, and the first search result to obtain a second search result, wherein the preset multi-modal model comprises at least one general large model and a plurality of topic models; wherein performing reasoning generation based on the preset multi-modal model, the external task request, and the first search result to obtain the second search result comprises: determining a corresponding topic model based on the external task request; performing joint semantic reasoning on the external task request through the general large model and the corresponding topic model to obtain a reasoning result; and verifying the reasoning result by taking the first search result as a boundary condition to obtain the second search result.
  2. The standardized data management system based on RAG and a large model warehouse of claim 1, wherein the data fusion module extracting structured and unstructured data from a plurality of distributed databases and obtaining a multi-source unified semantic data set based on a cross-source semantic alignment algorithm comprises: performing connection configuration on the plurality of distributed databases and establishing a data access channel; performing field mapping and primary key identification on the structured data to generate a preliminary structured index table; performing feature extraction on the unstructured data and extracting semantic vectors by using a text embedding model; based on the cross-source semantic alignment algorithm, performing semantic aggregation matching on the structured index table and the semantic vectors to obtain a cross-source entity alignment result; and fusing, according to the cross-source entity alignment result, to generate the multi-source semantic data set in a unified semantic space.
  3. The standardized data management system based on RAG and a large model warehouse of claim 1, wherein the standard knowledge evolution module performing standard semantic reconstruction on the semantic data set by using a preset corpus algorithm to form a standard subject library comprises: performing domain feature recognition and term extraction on the semantic data set to obtain a topic and a related term set; according to the topic, calculating the semantic similarity among terms in the related term set by using a preset standard domain corpus algorithm, and performing standard semantic reconstruction on semantically similar terms according to a similarity threshold to generate a standardized term set; establishing semantic relation edges according to the semantic levels, logical dependencies, and association strengths among the terms in the standardized term set; and generating the standard subject library based on the standardized term set and the semantic relation edges.
  4. The standardized data management system based on RAG and a large model warehouse of claim 3, wherein calculating, according to the topic, the semantic similarity among terms in the related term set by using a preset standard domain corpus algorithm, and performing standard semantic reconstruction on semantically similar terms according to a similarity threshold to generate a standardized term set comprises: determining the corresponding preset standard domain corpus algorithm according to the topic, wherein the preset standard domain corpus algorithm is constructed based on a topic corpus; performing word vectorization on each term in the related term set to obtain a semantic embedding vector of the term; calculating a semantic similarity score for any two terms by using a cosine similarity calculation model; when the semantic similarity score is greater than or equal to a set similarity threshold, assigning the corresponding terms to the same semantic cluster, wherein the similarity threshold is determined based on the topic; identifying a primary term within each semantic cluster, the primary term being determined based on the frequency of occurrence of each term of the semantic cluster in the topic corpus; and replacing the synonymous or near-synonymous terms in the cluster with the primary term to form the standardized term set.
  5. The standardized data management system based on RAG and a large model warehouse of claim 1, wherein the primary detection module searching the standard subject library based on an external task request to generate a first search result for the task request comprises: performing semantic parsing on the received external task request, extracting a task intention vector, and determining a corresponding standard subject library; searching the corresponding standard subject library based on the task intention vector to obtain an initial candidate set; and generating structured answer content based on the initial candidate set, wherein the structured answer content is the first search result.
  6. The standardized data management system based on RAG and a large model warehouse of claim 5, wherein performing semantic parsing on the received external task request, extracting a task intention vector, and determining the corresponding standard subject library comprises: performing text preprocessing on the external task request, wherein the text preprocessing comprises word segmentation, stop word removal, and part-of-speech tagging; extracting contextual semantic features of the preprocessed text, and identifying semantic components and key entities in the task request; vectorizing the semantic components and the key entities to generate a semantic vector, wherein the semantic vector is used as the task intention vector; and based on the task intention vector, performing similarity calculation against the semantic vectors of all topics in the standard subject library to determine the corresponding standard subject library.
  7. The standardized data management system based on RAG and a large model warehouse of claim 6, wherein generating structured answer content based on the initial candidate set, the structured answer content being the first search result, comprises: performing semantic encoding on the initial candidate set to generate candidate semantic vectors; fusing the candidate semantic vectors with the task intention vector to obtain a fused semantic representation; and performing content generation on the fused semantic representation to generate the structured answer content, wherein the structured answer content is the first search result.
  8. The standardized data management system based on RAG and a large model warehouse of claim 1, wherein verifying the reasoning result by taking the first search result as a boundary condition to obtain the second search result comprises: constructing a constraint set based on the first search result; verifying the reasoning result by using the constraint set; when the reasoning result satisfies the constraint set, confirming that the reasoning result is valid and outputting it as the second search result; and when the reasoning result deviates from the constraint set, performing adaptive correction on the reasoning result based on the degree of deviation, and outputting the corrected result as the second search result.
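
As a non-limiting illustration, the term standardization steps of claim 4 (cosine similarity between term embeddings, threshold-based clustering, primary term selection by corpus frequency, and synonym replacement) might be sketched in Python as follows. The function names, the precomputed toy embeddings, and the use of union-find to merge every pair at or above the threshold (which amounts to single-linkage clustering) are assumptions made for this example; the source does not prescribe an embedding model or a clustering procedure.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def standardize_terms(embeddings, corpus_freq, threshold):
    """Map every term to the primary term of its semantic cluster.

    embeddings:  term -> embedding vector (assumed to be precomputed)
    corpus_freq: term -> occurrence count in the topic corpus
    threshold:   topic-specific similarity threshold
    """
    parent = {t: t for t in embeddings}

    def find(t):
        # Union-find root lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    # Put any two terms whose similarity reaches the threshold into one cluster.
    for a, b in combinations(embeddings, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for t in embeddings:
        clusters.setdefault(find(t), []).append(t)

    # The primary term is the cluster member that is most frequent in the topic corpus.
    mapping = {}
    for members in clusters.values():
        primary = max(members, key=lambda t: corpus_freq.get(t, 0))
        for t in members:
            mapping[t] = primary
    return mapping
```

Applying the returned mapping to the related term set replaces each synonymous or near-synonymous term with its primary term, which yields the standardized term set.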

Description

Standardized data management system based on RAG and large model warehouse

Technical Field

The invention relates to the technical field of data management, and in particular to a standardized data management system based on RAG and a large model warehouse.

Background

With the rapid development of big data and artificial intelligence technology, enterprises and institutions have accumulated large amounts of data with diverse sources, complex formats, and inconsistent semantics. Such data is widely distributed across different business systems and databases, and includes both structured data, such as tables and records, and unstructured data, such as documents, pictures, and log files. Because data standards are not uniform and semantic expression differs markedly between sources, traditional data management systems face clear bottlenecks in fusing cross-source data, managing it in a standardized way, and unifying knowledge.

In the prior art, data management mainly relies on approaches such as rule matching and static ontology alignment for semantic mapping and standardization. These approaches cannot achieve high-precision learning-based reconstruction and semantic reasoning when faced with a dynamically evolving business knowledge system and complex external task requests. Meanwhile, the traditional model calling mechanism is monolithic and lacks multi-modal collaborative reasoning and adaptive enhancement capabilities, so the system suffers from low result accuracy and poor response efficiency when responding to complex task requests (such as cross-domain data analysis, intelligent question answering, and governance decision-making).

In recent years, the development of Retrieval-Augmented Generation (RAG) technology has provided a new approach to knowledge calling and generation in data management. By combining the advantages of information retrieval and generative models, RAG can perform dynamic information retrieval and semantically enhanced generation over a large-scale knowledge base. However, existing RAG applications are mostly limited to a single-model architecture, lack a mechanism for deep fusion with a localized model warehouse, and have difficulty supporting complex, multi-scenario data management tasks.

Disclosure of Invention

(I) Object of the invention

The invention aims to provide a standardized data management system based on RAG and a large model warehouse, which forms a first search result by constructing a standard subject library and uses the first search result as a boundary condition to verify the result produced by the RAG enhancement and model warehouse module, thereby improving the accuracy of the generated data and preventing model reasoning deviation and the spread of erroneous knowledge.

(II) Technical scheme

In order to solve the above problems, the invention provides a standardized data management system based on RAG and a large model warehouse, which comprises a data fusion module, a standard knowledge evolution module, a primary detection module, and a RAG enhancement and model warehouse module; the data fusion module is used for extracting structured and unstructured data from a plurality of distributed databases and obtaining a multi-source unified semantic data set based on a cross-source semantic alignment algorithm; the standard knowledge evolution module is used for performing standard semantic reconstruction on the semantic data set by using a preset corpus algorithm to form a standard subject library; the primary detection module is used for searching the standard subject library based on an external task request to generate a first search result for the task request; the RAG enhancement and model warehouse module is used for performing reasoning generation based on a preset multi-modal model, the external task request, and the first search result to obtain a second search result.

In another aspect of the present invention, preferably, the data fusion module extracting structured and unstructured data from a plurality of distributed databases and obtaining a multi-source unified semantic data set based on a cross-source semantic alignment algorithm comprises: performing connection configuration on the plurality of distributed databases and establishing a data access channel; performing field mapping and primary key identification on the structured data to generate a preliminary structured index table; performing feature extraction on the unstructured data and extracting semantic vectors by using a text embedding model; based on the cross-source semantic alignment algorithm, performing semantic aggregation matching on the structured index table and the semantic vectors to obtain a cross-source entity alignment result; and fusing, according to the cross-source entity alignment result, to generate the multi-source semantic data set in a unified semantic space.

In another aspect of the present invention, preferabl