CN-121996663-A - Text2SQL generation method and system based on mixed view retrieval

CN121996663ACN 121996663 ACN121996663 ACN 121996663ACN-121996663-A

Abstract

The invention discloses a Text2SQL generating method and system based on mixed view retrieval, which firstly performs data portrait mining on a database, generating an enhancement Schema description document describing list business semantics, enumeration value meanings and implicit association relations. And secondly, receiving natural language query of a user, generating a hypothetical SQL sentence, and acquiring a suggested retrieval set based on the enhanced Schema description document. And then constructing input constraint information containing the suggested retrieval set, and generating a plurality of paths of candidate SQL sentences. And finally, executing multiple paths of candidate SQL sentences in the database environment, verifying, outputting a result if the verification is passed, and triggering a correction flow if the verification is failed to generate corrected SQL sentences. The invention reduces the illusion of selecting list of large language model, and improves the recall rate of Schema element and the accuracy and robustness of SQL generation in complex inquiry.

Inventors

QIN ZHIWEI
SONG YANG
Wang Hezhidong

Assignees

杭州电子科技大学

Dates

Publication Date: 20260508
Application Date: 20260410

Claims (8)

1. The Text2SQL generating method based on the mixed view retrieval is characterized by comprising the following steps of: S1, carrying out data portrait mining on a database to generate an enhancement Schema description document for describing list business semantics, enumeration value meanings and implicit association relations; S2, receiving natural language query of a user, generating a hypothetical SQL sentence, describing a document based on enhanced Schema, and acquiring a suggested retrieval set by combining vector similarity retrieval and graph structure propagation; s3, constructing input constraint information containing a suggestion retrieval set, generating an intermediate decision result for guiding an SQL generation model to perform Schema element screening, and generating a plurality of paths of candidate SQL sentences based on the intermediate decision result; And S4, executing multiple paths of candidate SQL sentences in the database environment, verifying based on consistency of execution results and non-empty states, outputting results if verification is passed, and triggering SQL correction flow based on error information and result states to generate corrected SQL sentences if verification is failed.
2. The Text2SQL generation method based on the mixed view search according to claim 1, wherein the step S1 is specifically implemented as follows: S11, traversing all tables and columns in a database, performing duplicate removal query operation to extract non-empty sample data, and constructing a column-level data fingerprint, wherein the column-level data fingerprint comprises a non-empty sample value set, sample value length distribution characteristics and the maximum value and the minimum value of a numerical column; s12, taking all list names in a database as global contexts, and inputting the global contexts and column-level data fingerprints into a semantic reasoning big language model; S13, writing the column name string similarity, the value distribution characteristics of column sampling data and the co-occurrence relation of table names in a database into an enhancement Schema description document; S14, identifying list names with meaning similarity larger than a set threshold on the basis of S13, generating comparison description containing mutual exclusion characteristics by using a semantic reasoning large language model through the identified list names, and writing the comparison description into an enhancement Schema description document to form a final enhancement Schema description document.
3. The Text2SQL generation method based on the mixed view search according to claim 2, wherein the step S2 is specifically implemented as follows: s21, inputting natural language query of a user into SQL to generate a large language model, and generating an hypothesized SQL sentence containing field keywords and logical operators; s22, splicing natural language query and hypothetical SQL sentences into query text, mapping the query text and the enhanced Schema description document into vector representations respectively, and obtaining initial relevance scores of Schema elements by calculating similarity among vectors to obtain semantic scores; s23, constructing a database Schema diagram based on a final enhanced Schema description document, wherein a table and a column are used as nodes, and a table column attribution relation and an external key relation are used as edges; S24, using the semantic score as initial energy of the node, running a PageRank algorithm on the Schema graph to perform score propagation, and finally obtaining graph propagation scores of all the nodes; s25, merging the semantic score and the graph propagation score, reordering the Schema elements, intercepting Top-K candidate elements, and defining the obtained candidate elements as a suggested retrieval set.
4. The Text2SQL generating method based on the mixed view search according to claim 3, wherein the PageRank algorithm is configured into a Schema graph propagation strategy combining semantics and structure, wherein initial weights of nodes are determined by semantic scores between natural language queries and Schema elements, and edge weights are dynamically adjusted according to relationship types among the Schema elements in the propagation process, so that joint screening of the Schema elements is realized.
5. The Text2SQL generation method based on the mixed view search according to claim 3, wherein the step S3 is specifically implemented as follows: S31, dividing candidate Schema elements into a main selection set and a search suggestion set, inputting SQL (structured query language) to generate a large language model, wherein the main selection set is derived from a set generated based on table names, column names and structural relations thereof; S32, constructing a prompt word containing a source tag, guiding SQL to generate a large language model priority trust main selection set, and judging the relevance of the retrieval suggestion set to obtain a Schema element retention mark and an exclusion mark; s33, the SQL generates a large language model to output an intermediate screening result containing a Schema element retention mark and an exclusion mark, and generates a first candidate SQL based on the intermediate screening result; Screening a table list related to the Schema description document and the natural language query based on a screening large language model according to the Schema description document and the natural language query, inputting the screened information and the Schema description document into the SQL generating large language model to generate a second candidate, inputting the screened information into the large language model to generate a third candidate, and combining the first candidate SQL, the second candidate SQL and the third candidate SQL as a multi-path candidate SQL sentence.
6. The method for generating Text2SQL based on mixed view retrieval according to claim 5, wherein the source tag is used for constructing a credibility level of the Schema elements, the priority identification of the credibility level of the Schema elements in the main selection set in the source tag is higher than the priority identification of the Schema elements in the suggested retrieval set, and the SQL generates a large language model to select and combine the Schema elements according to the credibility level.
7. The Text2SQL generation method based on the mixed view search according to claim 6, wherein the step S4 is specifically implemented as follows: s41, outputting the SQL if the execution result sets of the first candidate SQL, the second candidate SQL and the third candidate SQL are the same and are not empty; S42, if any SQL execution errors are reported or the result is null, extracting error information or null result state, and constructing an anti-thinking prompt word by combining natural language query; S43, determining the error type based on the anti-thinking prompt word and the result state, and generating a corrected SQL sentence by combining the anti-thinking prompt word and the error type through the SQL generation large language model.
8. A Text2SQL generation system based on mixed view retrieval, for implementing the Text2SQL generation method of any one of claims 1 to 7, comprising the following modules: The Schema description document generation module is used for carrying out data portrait mining on the database and generating an enhanced Schema description document for describing list business semantics, enumeration value meanings and implicit association relations; The suggestion search set generation module is used for receiving natural language query of a user, generating a hypothetical SQL sentence, describing a document based on enhanced Schema, and combining vector similarity search and graph structure propagation to obtain a suggestion search set; The multi-path candidate SQL sentence generating module is used for constructing input constraint information containing a suggested search set, generating an intermediate decision result for guiding the SQL generating model to carry out Schema element screening, and generating multi-path candidate SQL sentences based on the intermediate decision result; The SQL sentence correction and output module is used for executing multiple paths of candidate SQL sentences in the database environment, verifying based on the consistency of the execution results and the non-empty state, outputting the result if the verification is passed, and triggering the SQL correction flow based on the error information and the result state to generate a corrected SQL sentence if the verification is failed.

Description

Text2SQL generation method and system based on mixed view retrieval Technical Field The invention relates to the technical field of artificial intelligence and databases, in particular to a Text2SQL generating method and system based on mixed view retrieval. Background With the popularity of large language models, many complex tasks may be handled by large language models. SQL statements are a language for manipulating databases, however, due to the structuring and complexity of SQL statements, users are required to have some knowledge of the database, thereby increasing the threshold for non-technicians to access the database. The Text-to-SQL technology aims at automatically converting natural language problems of users into executable SQL sentences by using a large language model, improving the efficiency and accuracy of writing the SQL sentences by the users and enabling non-technicians to easily access databases. However, existing Text-to-SQL techniques, when faced with industry-complex real databases, face the following challenges. First, existing methods typically rely on sophisticated database documents (e.g., notes, foreign key constraints). However, in a real scene, the database often lacks documents, the column names have a fuzzy meaning, and a large number of undefined implicit foreign keys exist, so that a large language model is difficult to understand the structure of the database. Secondly, in order to cover the columns required by the complex query, the conventional method generally enlarges the search range, so that a large amount of irrelevant noise is introduced, and the simple selection of the list by means of a large language model tends to cause miss-selection. Thirdly, when the generating model processes multi-table connection and nested query, table selection errors or unmatched logic conditions are easy to generate, so that the generated SQL statement can obtain an error result or a null result after execution. The existing self-repairing mechanism depends on SQL grammar error reporting, but lacks effective detection means for SQL sentences which are successful in running and have error results. Disclosure of Invention The invention provides a Text2SQL generating method and system based on mixed view retrieval, which aim to solve the technical problem of the existing Text-to-SQL mentioned in the background. In order to achieve the above object, the present invention adopts the following technical scheme: In one aspect of the invention, a Text2SQL generating method based on mixed view retrieval is provided, and the method comprises the following steps: S1, carrying out data portrait mining on the database, namely combining a global topological structure and column sampling data of the database, and generating an enhanced Schema description document for describing list business semantics, enumeration value meanings and implicit association relations based on the database list structure information, column name characteristics and column sampling data. S2, receiving natural language query of a user, generating a hypothetical SQL sentence, and based on the enhanced Schema description document, combining vector similarity retrieval and a graph structure propagation algorithm to obtain Top-K candidate Schema elements and constructing a suggestion retrieval set. S3, constructing input constraint information containing a suggestion retrieval set, generating an intermediate decision result for guiding an SQL generation model to conduct Schema element screening, and generating a multi-path candidate SQL statement based on the intermediate decision result. And S4, executing the multi-path candidate SQL statement in a database environment, verifying based on consistency of an execution result and a non-empty state, outputting a result if the verification is passed, and triggering an SQL correction flow based on error information and a result state to generate a corrected SQL statement if the verification is failed. In one possible embodiment, the step S1 of collecting database data, deriving the meaning of table names, and generating an enhanced Schema description document includes: S11, traversing all tables and columns in the database, performing a duplicate removal query operation to extract non-empty sample data, and constructing a column-level data fingerprint, wherein the column-level data fingerprint comprises a non-empty sample value set, sample value length distribution characteristics and the maximum value and the minimum value of a numerical column. S12, taking all list names in the database as global contexts, and inputting the global contexts and column-level data fingerprints into a semantic reasoning big language model. And S13, writing the column name string similarity, the value distribution characteristics of the column sampling data and the co-occurrence relation of table names in the database into the enhancement Schema description document. S14, identifying list names with