CN-121722785-B - NL2SQL method based on external key completion and knowledge graph

CN121722785BCN 121722785 BCN121722785 BCN 121722785BCN-121722785-B

Abstract

The invention belongs to the technical field of database query and natural language processing, and provides an NL2SQL method based on external key completion and a knowledge graph, which comprises the steps of extracting candidate external key relations from database metadata and data content, pruning and completing the candidate external key relations through a three-layer funnel verification mechanism, and constructing a Schema knowledge graph; the method comprises the steps of determining a seed table set directly related to query intention based on natural language query, calculating connection paths among seed table nodes based on the Schema knowledge graph, complementing intermediate tables on the paths, expanding the seed table into a sub-table set containing complete connection relations, generating a plurality of candidate SQL sentences based on the sub-table set, executing the candidate SQL sentences, voting according to consistency of execution results, and selecting a final SQL sentence.

Inventors

LUO YE
SUN TONG
Wang Yituan
ZHOU ZHENGYAN
ZHOU LI

Assignees

厦门大学
阿米智能(厦门)科技有限公司

Dates

Publication Date: 20260505
Application Date: 20260225

Claims (6)

1. An NL2SQL method based on external key completion and knowledge graph, wherein the method comprises: extracting candidate foreign key relations from database metadata and data contents, pruning and complementing the candidate foreign key relations through a three-layer funnel checking mechanism, and constructing a Schema knowledge graph, wherein the three-layer funnel checking mechanism comprises: The physical layer verifies whether the uniqueness index of the candidate external key relation reaches a preset threshold value or not by executing uniqueness statistics operation; The lexical rule layer is used for carrying out word segmentation and heuristic semantic matching check on the candidate external key list names passing through the physical layer; The semantic layer is used for inputting the candidate foreign key relation still blurred through the first two layers into a large language model, carrying out semantic arbitration based on column names, table notes and sample data, and determining the finally reserved foreign key relation; Extracting candidate external key relations from database metadata and data contents, pruning and complementing the candidate external key relations through a three-layer funnel verification mechanism, and constructing a Schema knowledge graph, wherein the method comprises the following steps of: reading the metadata of the database through a data access module, and obtaining Schema information of the database; Introducing an external dependency discovery tool for Inclusion Dependency analysis, and scanning the full data content to identify cross-table reference relationships between columns; generating a candidate external key relation set, and converting the Schema information into a graph structure; based on natural language query, determining a seed table set directly related to the query intention by a two-way mixed recall mechanism through vector similarity and large model semantic judgment; Calculating connection paths among the nodes of the seed table based on the Schema knowledge graph, complementing an intermediate table on the paths, and expanding the seed table into a sub-graph set containing complete connection relations; Generating a plurality of candidate SQL sentences based on the sub-graph set; Executing the plurality of candidate SQL sentences, and voting to select a final SQL sentence according to the consistency of the execution results.
2. The method of claim 1, wherein calculating connection paths between nodes of the seed table based on the Schema knowledge graph, complementing intermediate tables on paths, expanding the seed table into a sub-graph set containing complete connection relationships, comprises: And calculating the shortest path based on the outer key edge between the seed table nodes, and incorporating all intermediate bridging tables related to the shortest path into the sub-graph set.
3. The method of claim 1, wherein generating a plurality of candidate SQL statements based on the set of sub-graphs comprises: and inputting the Schema information corresponding to the sub-graph set into a large language model to generate a plurality of candidate SQL sentences.
4. The method of claim 1, wherein executing the plurality of candidate SQL statements and voting for a final SQL statement based on consistency of execution results comprises: Executing the plurality of candidate SQL sentences, and grouping the candidate SQL sentences according to the consistency of the execution results; If the unique multi-dispatch execution result exists, outputting the corresponding candidate SQL statement as a final result.
5. The method of claim 4, wherein if multiple dispatching results are not generated or it is determined that all the candidate SQL statements fail to be executed, triggering a dynamic trace-back signal, expanding a range of connection path searching in the Schema knowledge graph, re-performing table recall and SQL generation, and performing multiple iterative optimization.
6. The method of claim 5, wherein executing the plurality of candidate SQL statements and voting for a final SQL statement based on consistency of execution results comprises: Executing each candidate SQL sentence, and if the execution fails, acquiring an error log; Correcting the candidate SQL sentence by using a large language model based on the error log to generate a corrected SQL sentence; Replacing the candidate SQL sentence with the modified SQL sentence for re-execution, and recording the repairing iteration times; judging whether the repair iteration times reach a preset maximum repair time threshold value or not; And if the maximum repair time threshold is reached, judging that the candidate SQL sentence fails to be executed, and the execution result of the candidate SQL sentence does not include the consistency voting.

Description

NL2SQL method based on external key completion and knowledge graph Technical Field The disclosure belongs to the technical field of database query and natural language processing, and particularly relates to an NL2SQL method based on external key completion and knowledge graph. Background In the current wave of digital transformation, massive amounts of service data are stored in database systems with increasingly complex structures. NL2SQL (Natural Language to SQL) technology automatically converts natural language into executable SQL, aiming at lowering database usage thresholds. The current mainstream scheme relies on the "Schema linking" idea of a Large Language Model (LLM) that database Schema and user questions are spliced into a Prompt, and SQL is directly generated by the LLM. The solution has remarkable effect on a small database with a standard structure, and forms the basis of a plurality of data analysis tools. However, when applied to real, large-scale, non-standardized enterprise-level databases, the solution faces serious downslide in terms of efficiency and accuracy, which becomes a key bottleneck for technology landing. The specific technical defects include: 1. passive Schema understanding leads to reasoning bottlenecks. When the explicit foreign key is missing, LLM is difficult to infer the implicit business association, and multi-table JOIN queries are easy to illusion. 2. The context length contradicts noise dually. The full-amount Schema is ultra-long and contains a large number of irrelevant tables, the Token cost is high, the attention is dispersed, and the accuracy rate is not increased and reduced. 3. The cross-domain generalization is poor and the customization cost is high. The database needs to be redesigned or fine-tuned every time, so that the database cannot be plugged and used, and large-scale landing is prevented. Disclosure of Invention The invention provides an NL2SQL method based on external key completion and a knowledge graph, which can effectively solve the problems. The present disclosure is implemented as follows: the disclosure provides an NL2SQL method based on external key completion and knowledge graph, comprising the following steps: Extracting candidate external key relations from the metadata and the data content of the database, pruning and complementing the candidate external key relations through a three-layer funnel verification mechanism, and constructing a Schema knowledge graph; Determining a set of seed tables directly related to query intent based on the natural language query; Calculating connection paths among the nodes of the seed table based on the Schema knowledge graph, complementing an intermediate table on the paths, and expanding the seed table into a sub-graph set containing complete connection relations; Generating a plurality of candidate SQL sentences based on the sub-graph set; Executing the plurality of candidate SQL sentences, and voting to select a final SQL sentence according to the consistency of the execution results. Compared with the prior art, the beneficial effects of the present disclosure are: the NL2SQL method based on the external key completion and the knowledge graph actively extracts and accurately completes the hidden external key relation from the metadata and the data content of the database through a three-layer funnel verification mechanism, builds a high-precision Schema knowledge graph, effectively overcomes the dependence of the existing method on external key deficiency, improves the accuracy of multi-table associated query and the implicit service logic recognition capability, accurately recalls intention related seed table based on natural language query, avoids noise and redundancy of full-scale Schema input, greatly reduces the context complexity, improves the query efficiency and cross-domain generalization, automatically calculates a connection path through the knowledge graph and completes a middle bridging table to form a complete sub-graph, solves the problem of path omission in a complex JOIN scene, enables SQL to be more reliably and reliably executed in a non-normalized database, inputs the refined sub-graph into a large language model to generate a plurality of candidate SQL, reduces random errors generated in a single time, finally executes candidate sentences and votes through result consistency, forms a closed loop verification and optimization mechanism, remarkably improves the accuracy, the executable and the system robustness of the final SQL, and is particularly suitable for a complex-level large-scale query scene of enterprises. Drawings Fig. 1 is a flowchart of a complexity-priority dual-drive intelligence processing task scheduling method S100 provided in an embodiment of the present disclosure. Fig. 2 is a flowchart for constructing a database Schema knowledge graph according to an embodiment of the present disclosure. FIG. 3 is a flow chart of SQL generation and reasoning based on dynam