CN-122019580-A - Grammar tree constraint decoding optimization method and system of NL2SQL in ChatDB
Abstract
The invention relates to the technical field of natural language processing and databases, in particular to a method and a system for optimizing the grammar tree constraint decoding of NL2SQL in ChatDB, comprising the following steps: S1, receiving a natural language query and matching it against database metadata to generate an initial candidate field set; S2, starting a large language model with a preset SQL grammar tree as the decoding constraint and, when decoding reaches a grammar tree node whose field is to be filled, calculating a semantic confidence score for each candidate field in the initial candidate field set; S3, calculating a dynamic pruning threshold; S4, generating a refined candidate field subset; S5, providing the refined candidate field subset to the large language model, completing the filling of the current node under the constraint of the SQL grammar tree, and continuing to decode until a complete SQL query statement is generated.
Inventors
- TANG KEWEI
- CHEN SHENGHONG
- QIU PENGFEI
Assignees
- 浙江孚临科技有限公司
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-04-13
Claims (8)
- The grammar tree constraint decoding optimization method of NL2SQL in ChatDB is characterized by comprising the following steps: S1, receiving a natural language query, and matching it with database metadata to generate an initial candidate field set; S2, starting a large language model, adopting a preset SQL grammar tree as the decoding constraint, and calculating a semantic confidence score for each candidate field in the initial candidate field set when decoding reaches a grammar tree node of a field to be filled; S3, calculating a dynamic pruning threshold according to the query nesting depth during decoding; S4, screening the initial candidate field set based on the semantic confidence scores and the dynamic pruning threshold to generate a refined candidate field subset; S5, providing the refined candidate field subset to the large language model, completing the filling of the current node under the constraint of the SQL grammar tree, and continuing to decode until a complete SQL query statement is generated.
- 2. The method for optimizing grammar tree constraint decoding of NL2SQL in ChatDB according to claim 1, wherein S1 specifically comprises: S11, carrying out vector similarity matching between entity mention expressions in the natural language query and the metadata of all tables in the database by means of a preset fuzzy column name mapper; S12, taking high recall as the target, incorporating all potentially relevant database fields into the initial candidate field set.
- 3. The method for optimizing grammar tree constraint decoding of NL2SQL in ChatDB according to claim 2, wherein S2 specifically comprises: S21, combining the initial semantic similarity of each candidate field in the initial candidate field set with a quantized value of its dynamic logic adaptation degree within the current SQL structure; S22, taking the combined result as the semantic confidence score of the candidate field in the current decoding context.
- 4. The method for optimizing grammar tree constraint decoding of NL2SQL in ChatDB according to claim 3, wherein the calculating of the semantic confidence score specifically comprises: S221, deriving the initial semantic similarity of a candidate field from the vector cosine similarity computed by the fuzzy column name mapper; S222, determining the quantized value of the dynamic logic adaptation degree by means of a pre-trained classification model that outputs probabilistic context-suitability weights for fields of different data types appearing in a specific SQL clause.
- 5. The method for optimizing grammar tree constraint decoding of NL2SQL in ChatDB according to claim 4, wherein S3 specifically comprises: S31, analyzing the decoding state of the large language model in real time and obtaining the nesting depth of the SQL statement currently being generated; S32, adopting a bounded-growth threshold model that takes query complexity as input to calculate the dynamic pruning threshold, so that the severity of the pruning strategy increases linearly with query complexity.
- 6. The method for optimizing grammar tree constraint decoding of NL2SQL in ChatDB according to claim 5, wherein S4 specifically comprises: S41, traversing the initial candidate field set and retaining all candidate fields whose semantic confidence score is greater than or equal to the dynamic pruning threshold, to generate the refined candidate field subset; S42, after the screening is executed, if the refined candidate field subset is empty, abandoning the threshold filtering and directly selecting, from the initial candidate field set, the single candidate field with the highest semantic confidence score as the unique candidate.
- 7. The method for optimizing grammar tree constraint decoding of NL2SQL in ChatDB according to claim 6, wherein S5 specifically comprises: S51, providing the refined candidate field subset as a restricted candidate range to the large language model decoder; S52, the decoder completing the filling of the current node within this restricted range under the constraint of the SQL grammar tree, and continuing the decoding process for subsequent nodes.
- 8. A grammar tree constraint decoding optimization system of NL2SQL in ChatDB, based on the grammar tree constraint decoding optimization method of NL2SQL in ChatDB of any one of claims 1-7, comprising: a natural language input interface for receiving a user's natural language query; a candidate field generation module for generating an initial candidate field set based on the natural language query and database metadata; a grammar tree constraint decoding engine embedding the large language model and responsible for the structured generation of SQL according to a preset SQL grammar tree; a pruning parameter calculation module for calculating the semantic confidence scores and the dynamic pruning threshold when the decoding engine decodes to a grammar tree node of a field to be filled; a dynamic semantic pruning module for receiving the semantic confidence scores and the dynamic pruning threshold, screening the initial candidate field set to generate a refined candidate field subset, and returning the subset to the decoding engine; and an SQL output module for outputting the complete SQL query statement generated by the decoding engine.
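The decoding loop claimed above can be illustrated with a minimal, self-contained sketch. This is not the patented implementation: the grammar "skeleton", the `<field>` slot convention, the parenthesis-based depth tracking, and the stub decoder that simply picks the highest-scoring survivor are all illustrative assumptions.

```python
def decode_sql(skeleton, candidates, scores, base_tau=0.3, step=0.1, cap=0.8):
    """Toy walk over a fixed grammar skeleton: every "<field>" slot is
    filled from candidates that survive a depth-dependent pruning
    threshold (cf. claims 1, 5 and 6). Nesting depth is tracked via
    parenthesis tokens; a stub 'decoder' picks the best survivor."""
    parts, depth = [], 0
    for token in skeleton:
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        if token == "<field>":
            tau = min(base_tau + step * depth, cap)   # bounded linear growth
            kept = [f for f in candidates if scores[f] >= tau]
            if not kept:                              # claim 6 fallback
                kept = [max(candidates, key=scores.get)]
            parts.append(max(kept, key=scores.get))   # stub decoder choice
        else:
            parts.append(token)
    return " ".join(parts)

sql = decode_sql(
    ["SELECT", "<field>", "FROM", "orders", "WHERE", "<field>", "=", "?"],
    candidates=["amount", "order_id", "note"],
    scores={"amount": 0.9, "order_id": 0.8, "note": 0.2},
)
```

A real system would let the language model choose among the surviving fields token by token; the point of the sketch is only that pruning narrows the choice set before the decoder commits to a field.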
Description
Grammar tree constraint decoding optimization method and system of NL2SQL in ChatDB
Technical Field
The invention relates to the technical field of natural language processing and database technology, in particular to a method and a system for grammar tree constraint decoding optimization of NL2SQL in ChatDB.
Background
With the increasingly wide application of natural language processing technology in the field of database querying, accurately converting a user's natural language into Structured Query Language (SQL) has become a key technology. In existing conversion techniques, the system typically matches and introduces a large number of possibly relevant database fields in order to guarantee a high recall of field intent identification. While this strategy ensures that potentially correct fields are not missed, it introduces a significant amount of semantic noise. The problem has limited impact when processing simple queries, but when processing queries containing complex logic such as nesting, the number of candidate fields grows explosively, so that the accuracy of the generated SQL drops markedly and the model may even hallucinate logic errors. Therefore, how to resolve the contradiction whereby semantic noise, introduced to ensure a high recall rate, reduces SQL generation accuracy in complex query scenarios through a combinatorial explosion effect has become a technical problem to be solved in this field.
Disclosure of Invention
In order to solve the above technical problems, the invention discloses a grammar tree constraint decoding optimization method and system of NL2SQL in ChatDB. Specifically, the technical scheme of the invention is as follows. The grammar tree constraint decoding optimization method of NL2SQL in ChatDB comprises the following steps: S1, receiving a natural language query and matching it with database metadata to generate an initial candidate field set; S2, starting a large language model, adopting a preset SQL grammar tree as the decoding constraint, and calculating a semantic confidence score for each candidate field in the initial candidate field set when decoding reaches a grammar tree node of a field to be filled; S3, calculating a dynamic pruning threshold according to the query nesting depth during decoding; S4, screening the initial candidate field set based on the semantic confidence scores and the dynamic pruning threshold to generate a refined candidate field subset; S5, providing the refined candidate field subset to the large language model, completing the filling of the current node under the constraint of the SQL grammar tree, and continuing to decode until a complete SQL query statement is generated. Preferably, S1 specifically includes: S11, carrying out vector similarity matching between entity mention expressions in the natural language query and the metadata of all tables in the database by means of a preset fuzzy column name mapper; S12, taking high recall as the target, incorporating all potentially relevant database fields into the initial candidate field set.
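The description leaves the fuzzy column name mapper of step S11 unspecified beyond "vector similarity matching". One minimal, self-contained way to realize it is character-trigram vectors compared by cosine similarity; the trigram representation and the permissive recall floor below are assumptions, not part of the patent.

```python
from collections import Counter
import math

def trigrams(s):
    """Character-trigram multiset of a padded, lowercased string."""
    s = f"  {s.lower()}  "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a, b):
    """Cosine similarity between two Counter-based sparse vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def initial_candidates(mention, schema_columns, floor=0.1):
    """S11/S12: match a query mention against all schema columns and keep
    everything above a very permissive floor, favouring recall."""
    mv = trigrams(mention)
    scored = [(col, cosine(mv, trigrams(col))) for col in schema_columns]
    return sorted([(c, s) for c, s in scored if s >= floor],
                  key=lambda x: -x[1])
```

With the low floor, both `customer_name` and the abbreviated `cust_name` survive into the initial candidate set for the mention "customer name", while unrelated columns such as `order_id` are excluded; this is exactly the high-recall, noisy set that the later pruning steps refine.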
Preferably, S2 specifically includes: S21, combining the initial semantic similarity of each candidate field in the initial candidate field set with a quantized value of its dynamic logic adaptation degree within the current SQL structure; S22, taking the combined result as the semantic confidence score of the candidate field in the current decoding context. Preferably, the calculating of the semantic confidence score specifically includes: S221, deriving the initial semantic similarity of a candidate field from the vector cosine similarity computed by the fuzzy column name mapper; S222, determining the quantized value of the dynamic logic adaptation degree by means of a pre-trained classification model that outputs probabilistic context-suitability weights for fields of different data types appearing in a specific SQL clause. Preferably, S3 specifically includes: S31, analyzing the decoding state of the large language model in real time and obtaining the nesting depth of the SQL statement currently being generated; S32, adopting a bounded-growth threshold model that takes quantified indicators of query complexity, such as the nesting depth, as input to calculate the dynamic pruning threshold, so that the severity of the pruning strategy increases linearly with query complexity. Preferably, S4 specifically includes: S41, traversing the initial candidate field set and retaining all candidate fields whose semantic confidence score is greater than or equal to the dynamic pruning threshold, to generate the refined candidate field subset; S42, after the screening is executed, if the refined candidate field subset is empty, abandoning the threshold filtering and directly selecting, from the initial candidate field set, the single candidate field with the highest semantic confidence score as the unique candidate.
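The confidence combination (S21-S22), the bounded-growth threshold (S31-S32), and the pruning with fallback (S41-S42) can be sketched as three small functions. The weighted-sum combination, the mixing weight `alpha`, and the `base`/`step`/`cap` parameters are illustrative assumptions; the patent specifies only that the combination exists and that the threshold grows linearly with complexity up to a bound.

```python
def confidence(sim, fit, alpha=0.7):
    """S21-S22: combine initial semantic similarity with the quantified
    logic-fit value. A weighted sum is one plausible combination; the
    mixing weight alpha is an assumption, not given by the patent."""
    return alpha * sim + (1 - alpha) * fit

def pruning_threshold(nesting_depth, base=0.3, step=0.1, cap=0.8):
    """S31-S32: bounded-growth threshold that rises linearly with the
    nesting depth of the query and saturates at `cap`."""
    return min(base + step * nesting_depth, cap)

def prune(scored, tau):
    """S41-S42: keep fields scoring >= tau; if nothing survives the
    filter, fall back to the single best-scoring field."""
    kept = {f: s for f, s in scored.items() if s >= tau}
    if not kept:
        best = max(scored, key=scored.get)
        kept = {best: scored[best]}
    return kept
```

The fallback in `prune` guarantees the decoder always receives at least one candidate, so the grammar-tree-constrained generation can never stall at a field node even under an aggressive threshold.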