CN-122019563-A - Modularized NL2SQL controllable conversion method and system for complex query
Abstract
The invention provides a modular, controllable NL2SQL conversion method and system for complex queries, belonging to the technical field at the intersection of natural language processing and database querying. The method comprises the following steps: S1, decomposing a natural language question into atomic subtasks and constructing a task dependency graph; S2, dynamically retrieving and constructing a sub-schema for each atomic subtask; S3, dispatching an SQL generation expert model, for each atomic subtask and its corresponding sub-schema, to generate an atomic SQL query statement; S4, scheduling and executing the atomic SQL query statements according to the task dependency graph, storing the execution results of atomic subtasks without dependencies as intermediate results, and, for atomic subtasks with dependencies, inputting the intermediate results they depend on as parameters to a result integrator to generate a complex SQL query statement; and S5, executing the complex SQL query statement. The method greatly improves the accuracy and logical controllability of SQL generation, the capability to process multi-step tasks, and the interpretability and maintainability of the overall system.
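As a rough illustration of steps S1 and S4, the sketch below runs atomic subtasks in dependency order and hands the stored intermediate results to the subtask that depends on them. It is only a sketch under stated assumptions: it uses Python's standard graphlib module, and the subtask names, the canned figures, and the run_atomic stub (which stands in for SQL generation and execution) are hypothetical and do not come from the patent.

```python
from graphlib import TopologicalSorter

# Hypothetical decomposition of a compound question ("compare sales this month
# and last month, then find the fastest-growing category").
# Each key maps to the set of subtasks it depends on.
dependency_graph = {
    "sales_this_month": set(),
    "sales_last_month": set(),
    "fastest_growth": {"sales_this_month", "sales_last_month"},
}

def run_atomic(task, inputs):
    """Stand-in for generating and executing one atomic SQL query (assumption)."""
    canned = {
        "sales_this_month": {"toys": 120.0, "books": 90.0},
        "sales_last_month": {"toys": 100.0, "books": 60.0},
    }
    if task in canned:
        # Independent subtask: its result is stored as an intermediate result.
        return canned[task]
    # Dependent subtask: integrate the intermediate results passed in as parameters.
    now, prev = inputs["sales_this_month"], inputs["sales_last_month"]
    growth = {cat: (now[cat] - prev[cat]) / prev[cat] for cat in now}
    return max(growth, key=growth.get)

def execute(graph):
    """Run every subtask after all of its dependencies have produced results."""
    results = {}
    for task in TopologicalSorter(graph).static_order():
        inputs = {dep: results[dep] for dep in graph[task]}
        results[task] = run_atomic(task, inputs)
    return results

results = execute(dependency_graph)
print(results["fastest_growth"])
```

A topological order is one simple way to respect the "definite data flow and execution sequence" of the dependency graph; a real scheduler could also run independent subtasks in parallel.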
Inventors
- MIAO XIAOLI
- WANG LEI
- JU YANFENG
Assignees
- 福建新大陆软件工程有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20251218
Claims (10)
- 1. A modular NL2SQL controllable conversion method for complex queries, characterized by comprising the following steps: Step S1, receiving a natural language question Q input by a user, identifying a core analysis entity in the natural language question Q through an intent analysis model, decomposing the natural language question Q, based on the core analysis entity, into a series of atomic subtasks with logical dependency relationships, and constructing, based on the atomic subtasks, a task dependency graph with a definite data flow and execution sequence; Step S2, for each atomic subtask, dynamically retrieving and constructing a sub-schema from a global schema library through a schema matcher based on keywords of the atomic subtask; Step S3, dispatching an SQL generation expert model, for each atomic subtask and its corresponding sub-schema, to generate an atomic SQL query statement, and performing instant verification of the syntax, schema, and feasibility of the atomic SQL query statement to form a generation-verification closed loop; Step S4, scheduling and executing the atomic SQL query statements according to the task dependency graph to obtain execution results, storing the execution result of each atomic subtask without dependencies as an intermediate result, and, for each atomic subtask with dependencies, inputting the intermediate results it depends on as parameters to a result integrator to generate a complex SQL query statement capable of referencing or internalizing the intermediate results; and Step S5, executing the complex SQL query statement to obtain a final query result, and generating, through a business interpretation generation model, a natural language answer containing business insight by combining the natural language question Q with a log of the task decomposition and execution process.
- 2. The modular NL2SQL controllable conversion method for complex queries according to claim 1, wherein in step S1 the intent analysis model is an instruction-fine-tuned specialized large language model, and the operations it performs comprise: identifying the core analysis entity; decomposing the natural language question Q into logically coherent and independently processable atomic subtasks; and constructing a task dependency graph for guiding the subsequent step-by-step execution and result merging logic.
- 3. The method of claim 1, wherein in step S2 the dynamic retrieval is implemented by vector similarity search or a metadata index to precisely locate, from the global schema library, the data tables, fields, and primary-key/foreign-key relationships most relevant to the atomic subtask, thereby constructing the sub-schema.
- 4. The modular NL2SQL controllable conversion method for complex queries according to claim 1, characterized in that in step S3 the instant verification comprises: performing syntax verification through an SQL parser; checking whether the table names and field names referenced in the atomic SQL query statement exist in the sub-schema; performing feasibility verification by generating a query execution plan; and, when verification fails, feeding the error information back to the SQL generation expert model to request correction, or triggering manual intervention rules.
- 5. The method of claim 1, wherein in step S4, the result integrator is a large language model or rules engine that is adept at processing data comparison and filtering logic.
- 6. A modular NL2SQL controllable conversion system for complex queries, characterized by comprising the following modules: a natural language question decomposition module, used for receiving a natural language question Q input by a user, identifying a core analysis entity in the natural language question Q through an intent analysis model, decomposing the natural language question Q, based on the core analysis entity, into a series of atomic subtasks with logical dependency relationships, and constructing, based on the atomic subtasks, a task dependency graph with a definite data flow and execution sequence; a sub-schema construction module, used for dynamically retrieving and constructing a sub-schema from a global schema library through a schema matcher based on keywords of each atomic subtask; an atomic SQL query statement generation module, used for dispatching an SQL generation expert model, for each atomic subtask and its corresponding sub-schema, to generate an atomic SQL query statement, and performing instant verification of the syntax, schema, and feasibility of the atomic SQL query statement to form a generation-verification closed loop; a complex SQL query statement generation module, used for scheduling and executing the atomic SQL query statements according to the task dependency graph to obtain execution results, storing the execution result of each atomic subtask without dependencies as an intermediate result, and, for each atomic subtask with dependencies, inputting the intermediate results it depends on as parameters to a result integrator to generate a complex SQL query statement capable of referencing or internalizing the intermediate results; and a natural language answer generation module, used for executing the complex SQL query statement to obtain a final query result, and generating, through a business interpretation generation model, a natural language answer containing business insight by combining the natural language question Q with a log of the task decomposition and execution process.
- 7. The modular NL2SQL controllable conversion system of claim 6, wherein the intent analysis model is a specialized large language model with instruction hints, and the operations it performs comprise: identifying the core analysis entity; decomposing the natural language question Q into logically coherent and independently processable atomic subtasks; and constructing a task dependency graph for guiding the subsequent step-by-step execution and result merging logic.
- 8. The modular NL2SQL controllable conversion system of claim 6, wherein the sub-schema construction module is configured to perform the dynamic retrieval by vector similarity search or metadata indexing to precisely locate, from the global schema library, the data tables, fields, and primary-key/foreign-key relationships most relevant to the atomic subtask, thereby constructing the sub-schema.
- 9. The modular NL2SQL controllable conversion system of claim 6, wherein, in the atomic SQL query statement generation module, the instant verification comprises: performing syntax verification through an SQL parser; checking whether the table names and field names referenced in the atomic SQL query statement exist in the sub-schema; performing feasibility verification by generating a query execution plan; and, when verification fails, feeding the error information back to the SQL generation expert model to request correction, or triggering manual intervention rules.
- 10. The modular NL2SQL controllable conversion system of claim 6, wherein the result integrator is a large language model or rules engine that is adept at processing data comparison and filtering logic.
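The generation-verification closed loop of claims 4 and 9 can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the claims do not name a parser or database, so the sketch swaps in Python's standard sqlite3 module and materializes the sub-schema as empty tables in an in-memory database. Preparing EXPLAIN QUERY PLAN against that database then rejects syntax errors, references to tables or fields outside the sub-schema, and statements for which no plan can be produced. The names verify_atomic_sql, generate_with_verification, fake_expert, and the sales table are hypothetical.

```python
import sqlite3

def verify_atomic_sql(sql, sub_schema_ddl):
    """Return (ok, message) for one candidate atomic SQL statement.

    Assumption: the sub-schema is given as CREATE TABLE statements, so only
    its tables and fields exist in the throwaway in-memory database.
    """
    conn = sqlite3.connect(":memory:")
    try:
        for ddl in sub_schema_ddl:
            conn.execute(ddl)
        # Planning the query checks syntax, names, and plan feasibility at once.
        conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()

def generate_with_verification(expert, sub_schema_ddl, max_attempts=3):
    """Ask the expert for SQL, feeding back each verification error.

    `expert` is a hypothetical stand-in for the SQL generation expert model:
    a callable that takes the previous error message (or None) and returns SQL.
    """
    feedback = None
    for _ in range(max_attempts):
        sql = expert(feedback)
        ok, feedback = verify_atomic_sql(sql, sub_schema_ddl)
        if ok:
            return sql
    # Out of attempts: this is where manual intervention rules would apply.
    raise RuntimeError("verification failed: " + str(feedback))

sub_schema = ["CREATE TABLE sales (region TEXT, category TEXT, amount REAL, month TEXT)"]

def fake_expert(error):
    # The first attempt references a field that is not in the sub-schema.
    if error is None:
        return "SELECT category, SUM(revenue) FROM sales GROUP BY category"
    return "SELECT category, SUM(amount) FROM sales GROUP BY category"

final_sql = generate_with_verification(fake_expert, sub_schema)
print(final_sql)
```

Checking against a database that contains only the sub-schema keeps the name check and the plan check in a single step; a production system would more likely run the plan check on the target database engine and use a dedicated SQL parser for syntax.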
Description
Modularized NL2SQL controllable conversion method and system for complex query
Technical Field
The invention relates to the technical field at the intersection of natural language processing and database querying, and in particular to a modularized NL2SQL controllable conversion method and system for complex queries.
Background
As the big data era advances, enterprises and organizations have accumulated massive amounts of structured data, stored in relational databases (e.g., MySQL, PostgreSQL) or data warehouses. Enabling business personnel without a technical background to query data directly in natural language has become a key challenge for improving the efficiency of data-driven decision-making. Natural language to structured query language (NL2SQL) conversion technology has emerged in response; its core goal is to build a system that understands a user's natural language question and automatically generates the corresponding SQL query. The technology lies at the intersection of natural language processing (NLP) and database query technology and is an important bridge for promoting data democratization.
NL2SQL technology has evolved from traditional methods to methods based on large models, but significant bottlenecks remain in complex query scenarios.
1. Early methods based on rules and statistical models:
The rule-template method relies on manually predefined mappings between question patterns and SQL templates. Its output is controllable, but it is extremely inflexible, cannot adapt to the varied expressions of natural language, and requires a huge development and maintenance effort.
The statistical machine learning method trains a model on historical question-answer pairs to learn the mapping, which improves generalization to a certain extent.
However, the performance of this method is severely limited by the coverage of the training data: it generalizes poorly to unseen table structures, field relationships, or complex logic, so the accuracy of SQL generation is difficult to ensure.
2. Modern methods based on large language models (LLMs):
LLMs represented by GPT, ChatGLM, religion and the like have markedly improved the accuracy of NL2SQL by virtue of their strong semantic understanding and generation capabilities. Through prompt engineering or fine-tuning, an LLM can generate largely syntactically correct SQL, and such models perform well on benchmarks such as WikiSQL and Spider. However, in a real enterprise environment, when facing complex queries that require deep business-logic reasoning, existing LLM schemes expose a series of drawbacks:
- Hallucination and controllability: when a single LLM generates complex SQL in one pass, it is prone to fabricating table names, field names, or relationships that do not exist; because of its black-box nature, the generation logic is uncontrollable and business correctness is difficult to guarantee.
- Context length and information overload: enterprise database schemas typically contain a large number of tables and fields, and feeding in the full schema tends to distract the model, producing redundant or erroneous queries.
- Interpretability and debugging difficulty: when the query result is wrong, the root cause (such as a misunderstanding of the intent or a missing join condition) cannot be traced, so the optimization process lacks a basis.
- Insufficient capability for multi-step analysis requests: for compound questions that must be executed step by step (such as 'compare the sales of each region this month and last month and find the category that is growing fastest'), existing schemes either generate a single SQL statement that is too deeply nested and error-prone, or cannot handle the request at all; they lack a systematic mechanism for task decomposition and result integration.
In summary, the main contradiction in current NL2SQL technology is that large models have strong semantic understanding capability, but their end-to-end generation mode struggles to meet the controllability, interpretability, and multi-step logic processing requirements of complex queries. Enterprise-level applications require that the system not only generate accurate SQL, but also provide verifiable intermediate processes, pluggable integration of domain knowledge, and visual explanations oriented to business users. The prior art lacks modular design and cannot achieve 'divide and conquer' fine-grained processing, which limits its reliability in real scenarios.
Therefore, how to provide a modularized NL2SQL controllable conversion method and system for complex query, which can improve the accuracy, logic controllability, multi-step task processing capability and overall system interpretability and maintainability of SQL generat