CN-122019734-A - Database metadata information processing method and system based on RAG and large model


Abstract

The invention discloses a database metadata information processing method and system based on RAG and a large model, relating to the technical field of artificial intelligence. In the technical scheme, a large model performs semantic analysis on legal documents and business data, and a sensitive tag knowledge base is automatically extracted and constructed. The large model is used for semantic understanding of field names and context information, realizing automatic sensitivity reasoning for fields without remarks. RAG technology is further introduced, combining multi-source information such as field values, structures and historical data for semantic completion and enhanced recognition. The method can improve recognition accuracy, generalization capability and interpretability, is suitable for complex and changeable data scenarios, and provides a feasible path for deploying large language models in the field of sensitive data recognition.

Inventors

WU LEI
LIU JIAN
ZHANG YUFAN
WANG HEYU
DIAO YANGJIE

Assignees

天府绛溪实验室

Dates

Publication Date: 20260512
Application Date: 20260413

Claims (10)

1. A database metadata information processing method based on RAG and a large model, characterized by comprising the following steps: constructing a standard knowledge base based on regulatory requirements and domain business experience information, wherein the standard knowledge base comprises a tag library, sensitive identification definitions, value range descriptions, general business term interpretations and interpretability knowledge; extracting meta information of a data table to be processed, wherein the meta information at least comprises a data table name, field annotation information, a field data type and an association relation between a field and an external dictionary/mapping table; after the data table to be processed is formatted and input, calling a large language model to execute a semantic restoration or translation operation, so as to understand and correct the context semantics of the data table; labeling and hierarchically classifying the semantically processed data table based on the standard knowledge base, retrieving regulation and clause fragments, industry guideline documents and historical compliance records by retrieval-augmented generation, and fusing them into a structured evidence chain to generate standard compliance record labels; and performing structuring processing and vectorized representation on the hierarchically classified data, generating a structured vector result associated with compliance labels, compliance reasons and evidence references, and writing the result back to a data catalog and to decision processing.
2. The database metadata information processing method based on RAG and a large model according to claim 1, wherein constructing the standard knowledge base comprises: parsing laws and regulations with a large language model, automatically extracting keywords and mapping them to the tag library; formally defining sensitive identifications and value ranges to form an extensible sensitive feature library; and establishing association relations among legal terms, business terms and sensitive labels by knowledge graph technology.
3. The database metadata information processing method based on RAG and a large model according to claim 1, wherein calling the large language model to execute the semantic restoration or translation operation comprises: based on the general business term interpretations and the value range descriptions in the standard knowledge base, performing context semantic correction on ambiguous field annotations and non-standardized expressions in the data table; and performing semantic disambiguation on abnormally named fields by combining field types and historical data distribution characteristics.
4. The database metadata information processing method based on RAG and a large model according to claim 1, wherein the labeling and hierarchical classification comprises: performing multi-source information retrieval by retrieval-augmented generation, wherein the multi-source information comprises applicable regulation fragments in a regulation knowledge base, business classification and link information in an industry compliance guideline base, field-level label annotation information in historical compliance records, and structured tables of similar business scenarios; fusing the retrieved information to generate a structured evidence package, wherein the structured evidence package comprises context compliance evidence, value range constraint evidence, business similarity matching evidence, and mapping relations between metadata and compliance standard items; and completing label generation and hierarchical classification judgment based on the structured evidence package.
5. The database metadata information processing method based on RAG and a large model according to claim 1, wherein the structuring processing and vectorized representation comprises: performing format conversion, missing value filling and business logic completion on unstructured fields; generating a high-dimensional feature vector with an embedding model by combining the compliance label, the field association relation and the evidence reference; and realizing rapid retrieval and persistent storage of the structured vector result by hash index technology.
6. The database metadata information processing method based on RAG and a large model according to claim 5, wherein the embedding model is constructed based on a Transformer architecture.
7. The database metadata information processing method based on RAG and a large model according to claim 1, wherein the method further comprises: inputting the structured evidence package into a large language model reasoning module, so that the large language model outputs, based on the evidence package, a candidate label, a compliance grading result, an explanation of the judgment reason, evidence citation identifiers and a judgment confidence; executing a policy and routing operation, in which a data processing policy is determined along the dimensions of automatic threshold judgment, device-side and cloud-side cooperative scheduling, multi-model routing, a gray release mechanism, A/B test traffic splitting and cost constraint evaluation; and when the confidence of the reasoning result falls within a preset low-confidence interval, triggering a low-confidence branch processing flow and executing a re-judgment operation or organizing collaborative review-and-judgment logic by domain experts.
8. The database metadata information processing method based on RAG and a large model according to claim 7, further comprising: constructing a continuous-learning closed-loop mechanism, in which misjudged samples from business judgments are fed back to the knowledge base to complete incremental updating of the knowledge base; and generating interpretation and audit outputs, wherein the output content comprises rule ID identifiers, a complete evidence chain trace, clause mapping association relations, responsibility tracing information, data source tracing information and review judgment path information.
9. The database metadata information processing method based on RAG and a large model according to claim 8, further comprising: processing the updated knowledge base and model parameters by model distillation and quantization, and delivering the updated content to the inference and rule engine by over-the-air (OTA) download.
10. A database metadata information processing system based on RAG and a large model, comprising: a knowledge base construction module configured to construct a standard knowledge base, wherein the standard knowledge base is constructed based on regulatory requirements and domain business experience information and comprises a tag library, sensitive identification definitions, value range descriptions, general business term interpretations and interpretability knowledge; a meta information extraction module configured to extract meta information of a data table to be processed, wherein the meta information at least comprises a data table name, field annotation information, a field data type and an association relation between a field and an external dictionary/mapping table; a large model calling module configured to call a large language model to execute a semantic restoration or translation operation after the data table to be processed is formatted and input, so as to understand and correct the context semantics of the data table; a RAG retrieval module configured to label and hierarchically classify the semantically processed data table based on the standard knowledge base, retrieve regulation and clause fragments, industry guideline documents and historical compliance records by retrieval-augmented generation, and fuse them into a structured evidence chain to generate standard compliance record labels; and a structuring processing module configured to perform structuring processing and vectorized representation on the hierarchically classified data, generate a structured vector result associated with compliance labels, compliance reasons and evidence references, and write the result back to the data catalog and to decision processing.
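
The low-confidence branch described in claim 7 can be illustrated with a minimal Python sketch. The names `Judgment` and `route`, the threshold value 0.6, the level string "L3" and the single-retry rule are illustrative assumptions. The claims do not specify the low-confidence interval or how to choose between re-judgment and expert review.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Judgment:
    """Hypothetical record of the model outputs listed in claim 7."""
    candidate_label: str
    compliance_level: str
    reason: str
    confidence: float
    evidence_ids: List[str] = field(default_factory=list)


def route(judgment: Judgment, low_threshold: float = 0.6, retries_left: int = 1) -> str:
    """Pick the next step for one judgment (threshold and retry rule are assumed)."""
    if judgment.confidence >= low_threshold:
        return "write_back"      # confident enough: write to the data catalog
    if retries_left > 0:
        return "re_judge"        # low confidence: run the judgment again
    return "expert_review"       # low confidence, no retries left: escalate to experts


example = Judgment("id_number", "L3", "field comment mentions an ID card", 0.42, ["ev-1"])
print(route(example))
print(route(example, retries_left=0))
```

In this sketch the decision depends only on the confidence and a retry counter. The other routing dimensions named in claim 7 (device/cloud scheduling, multi-model routing, cost constraints) are left out for brevity.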

Description

Database metadata information processing method and system based on RAG and large model

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a database metadata information processing method and system based on RAG and a large model.

Background

With the wide application of big data technology and the tightening of data security regulatory policies, identifying sensitive information in databases has become a core data governance requirement in fields such as finance, government affairs and healthcare. Some regulations explicitly require enterprises to classify and grade the data they store, but the traditional, manually driven way of defining sensitive labels suffers from low efficiency, strong subjectivity and a low degree of standardization, which leads to a high miss rate for sensitive data and sharply increased compliance costs. Meanwhile, massive historical databases commonly have legacy problems such as non-standard table and field names, missing remarks and semantic ambiguity, which further increases the complexity of sensitive information identification.

The prior art mainly relies on three kinds of technical schemes to process database metadata information: manual rule-driven, mapping table matching and shallow model matching.

The manual rule-driven scheme defines sensitive labels and constructs a knowledge base through manual interpretation of legal documents and business data. It relies on expert experience, has high labor cost, and is prone to subjective bias because label definitions lack a unified standard.

The mapping table matching scheme performs keyword matching on database names, table names and field names based on a predefined dictionary or regular expressions. It cannot handle fields that have no remarks or are abnormally named (such as pinyin abbreviations and mixed Chinese-English fields), and its rule base needs frequent updating and maintenance, so it cannot adapt to a rapidly changing data environment.

The shallow model matching scheme uses keyword indexes, regular expressions or traditional classification models for sensitive data identification. Limited by its narrow feature set, it has difficulty capturing complex semantic associations (such as sensitive scenarios formed by combining field values across tables) and cannot dynamically infer sensitive attributes from context, so the accuracy of sensitive data identification is insufficient.

Therefore, how to research and design a database metadata information processing method and system based on RAG and a large model that overcomes the above defects is a problem that currently needs to be solved.
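
The limitation of dictionary and regular-expression matching on abnormally named fields can be seen in a small Python sketch. The pattern dictionary, the label names and the sample field names (including the pinyin abbreviations `sfzh` and `lxdh`) are hypothetical and chosen only for illustration. They are not taken from any real rule base.

```python
import re

# Hypothetical keyword dictionary: sensitive label -> regex over field names.
SENSITIVE_PATTERNS = {
    "id_number": re.compile(r"id_?card|identity", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile|tel", re.IGNORECASE),
}


def match_label(field_name):
    """Return the first label whose pattern matches the field name, else None."""
    for label, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(field_name):
            return label
    return None


# English-style names match; the pinyin abbreviations match no pattern.
for name in ["id_card_no", "user_mobile", "sfzh", "lxdh"]:
    print(name, "->", match_label(name))
```

Covering each new abbreviation would require another dictionary entry, which is the maintenance burden the background attributes to this class of scheme.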
Disclosure of Invention

The invention aims to provide a database metadata information processing method and system based on RAG and a large model. A large model performs semantic analysis on legal documents and business data, and a sensitive tag knowledge base is automatically extracted and constructed. The large model is used for semantic understanding of field names and context information, realizing automatic sensitivity reasoning for fields without remarks. RAG technology is further introduced, combining multi-source information such as field values, structures and historical data for semantic completion and enhanced recognition. This improves recognition accuracy, generalization capability and interpretability, is applicable to complex and changeable data scenarios, and provides a feasible path for deploying large language models in the field of sensitive data recognition.

The technical aim of the invention is achieved by the following technical scheme:

In a first aspect, a database metadata information processing method based on RAG and a large model is provided, comprising the following steps: constructing a standard knowledge base based on regulatory requirements and domain business experience information, wherein the standard knowledge base comprises a tag library, sensitive identification definitions, value range descriptions, general business term interpretations and interpretability knowledge; extracting meta information of a data table to be processed, wherein the meta information at least comprises a data table name, field annotation information, a field data type and an association relation between a field and an external dictionary/mapping table; after the data table to be processed is formatted and input, calling a large language model to execute a semantic restoration or translation operation, so as to understand and correct the context semantics of the data table; labeling and hierarchically classifying the semantically processed data table based on the standard knowledge base, combining retrieval-augmented generation technology to retrieve rule and