CN-121996683-A - Method and device for distilling private data from natural language to structured query language
Abstract
The application discloses a method and a device for private-domain data distillation from natural language to structured query language. The method comprises: assigning a first large model with tool-calling capability the identity of a question-generation agent, the question-generation agent calling pre-built tool functions and generating a question list based on a private-domain database; updating the identity of the first large model to a structured-query-statement generation agent, the structured-query-statement generation agent calling the tool functions and generating a corresponding structured query statement list based on the question list and the private-domain database; and training a second large model with small-scale parameters based on the question list and the corresponding structured query statement list to obtain a private-domain data distillation model. In this way, natural language can be converted to structured query language by the private-domain data distillation model, which reduces the computing power required and converts natural language more accurately into structured query statements suited to the private-domain database, thereby improving accuracy and efficiency.
Inventors
- LI JIAXIANG
- GUO JIANLIN
- JIAN WEIDONG
- WANG CHAO
Assignees
- 深圳市有方科技股份有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-11
Claims (10)
- 1. A method of private-domain data distillation from natural language to structured query language, comprising: assigning a first large model with tool-calling capability the identity of a question-generation agent, the question-generation agent calling pre-built tool functions and generating a question list based on a private-domain database; updating the identity of the first large model to a structured-query-statement generation agent, the structured-query-statement generation agent calling the tool functions and generating a corresponding structured query statement list based on the question list and the private-domain database; and training a second large model with small-scale parameters based on the question list and the corresponding structured query statement list to obtain a private-domain data distillation model.
- 2. The method of claim 1, wherein the question-generation agent calling the pre-built tool functions and generating a question list based on a private-domain database comprises: in response to an input scenario role prompt, parsing the scenario role prompt by the question-generation agent and extracting scenario role information; and, based on the scenario role information, autonomously calling the tool functions to search the private-domain database and generating the question list corresponding to the scenario role prompt.
- 3. The method of claim 1, further comprising, prior to generating the question list: for a plurality of initial questions produced by the question-generation agent in a single generation, calculating a first similarity between the initial questions; deleting the corresponding initial question in response to the first similarity being higher than a first preset similarity threshold; and, in response to the initial questions obtained over multiple generations by the question-generation agent reaching a preset number, constructing the question list based on the preset number of initial questions.
- 4. The method of claim 1, further comprising, after generating the question list: converting each initial question in the question list into a semantic vector, and obtaining first clusters based on a density clustering algorithm; iteratively optimizing the first clusters based on a binary search framework combined with semantic verification by a large model, to obtain a clustering threshold; updating the first clusters based on the clustering threshold to obtain a plurality of second clusters; and deduplicating each second cluster based on a preset semantic filtering algorithm and a preset expression filtering algorithm to obtain the deduplicated question list.
- 5. The method of claim 3, further comprising, after generating the structured query statement list: updating the identity of the first large model to an adversarial agent, the adversarial agent calling the tool functions for each structured query statement in the structured query statement list and generating, based on the private-domain database, a plurality of simulated questions corresponding to the structured query statement; calculating a second similarity between each simulated question and the corresponding initial question; in response to the second similarity being higher than a second preset similarity threshold, taking the structured query statement and the corresponding initial question as a final question-answer pair; and constructing a training dataset based on all of the final question-answer pairs to train the second large model.
- 6. The method of claim 1, further comprising, after obtaining the private-domain data distillation model: in response to an input natural language query question directed at the private-domain database, parsing the natural language query question by the private-domain data distillation model and outputting the corresponding structured query statement.
- 7. The method of claim 1, wherein the tool functions are encapsulated based on database interface tools, the tool functions comprising one or more of a database schema generation tool, a JSON field parsing tool, a SELECT execution tool, a database switching tool, a database information acquisition tool, a database name acquisition tool, and a number of common tools; and wherein, before the first large model with tool-calling capability is assigned the identity of the question-generation agent, the method further comprises: declaring each of the tool functions at a call interface of the first large model, so that the first large model can understand and call each of the tool functions.
- 8. A natural language to structured query language private-domain data distillation apparatus, comprising: a first generation module, configured to assign a first large model with tool-calling capability the identity of a question-generation agent, the question-generation agent calling pre-built tool functions and generating a question list based on a private-domain database; a second generation module, configured to update the identity of the first large model to a structured-query-statement generation agent, the structured-query-statement generation agent calling the tool functions and generating a corresponding structured query statement list based on the question list and the private-domain database; and a training module, configured to train a second large model with small-scale parameters based on the question list and the corresponding structured query statement list to obtain a private-domain data distillation model.
- 9. A computer device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the natural language to structured query language private data distillation method of any one of claims 1-7.
- 10. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the natural language to structured query language private data distillation method of any one of claims 1 to 7.
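Claims 3 and 5 both describe threshold filters on a similarity score, which can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the claims do not name a similarity measure, so token-set Jaccard similarity is used as a stand-in for both the first and second similarity; the function names, the default thresholds, and the choice to keep a pair when the best simulated question clears the threshold are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (assumed stand-in for the claimed similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def dedup_single_generation(batch: Iterable[str], threshold: float) -> List[str]:
    """Claim 3: drop a question that is too similar to an earlier one in the same batch."""
    kept: List[str] = []
    for q in batch:
        if all(jaccard(q, k) <= threshold for k in kept):
            kept.append(q)
    return kept


def build_question_list(generate: Callable[[], List[str]], target: int,
                        threshold: float = 0.8, max_rounds: int = 20) -> List[str]:
    """Claim 3: repeat generation until `target` questions have accumulated."""
    questions: List[str] = []
    for _ in range(max_rounds):
        questions.extend(dedup_single_generation(generate(), threshold))
        if len(questions) >= target:
            return questions[:target]
    return questions  # fewer than `target` if the round budget ran out


def keep_verified_pairs(pairs: Iterable[Tuple[str, str]],
                        simulate: Callable[[str], List[str]],
                        threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Claim 5: keep (question, sql) if a question simulated from the SQL is close enough."""
    final = []
    for question, sql in pairs:
        best = max((jaccard(s, question) for s in simulate(sql)), default=0.0)
        if best > threshold:
            final.append((question, sql))
    return final


# Toy stand-ins for the question-generation agent and the adversarial agent.
batches = iter([
    ["how many devices are online", "how many devices are online now", "list all customers"],
    ["total orders last month"],
])
question_list = build_question_list(lambda: next(batches), target=3)

simulated = {
    "SELECT COUNT(*) FROM devices WHERE status = 'online'": ["how many devices are currently online"],
    "SELECT * FROM orders": ["show every order"],
}
verified = keep_verified_pairs(
    [("how many devices are online", "SELECT COUNT(*) FROM devices WHERE status = 'online'"),
     ("list all customers", "SELECT * FROM orders")],
    lambda sql: simulated[sql],
)
print(question_list)
print(verified)
```

In this toy run the near-duplicate "how many devices are online now" is removed within its batch, and the pair whose SQL does not describe the original question is discarded. A production system would more plausibly compute similarity on semantic vectors, as claim 4 suggests for the later clustering step.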
Description
Method and device for distilling private data from natural language to structured query language
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a method and a device for private-domain data distillation from natural language to structured query language.
Background
Currently, natural language to structured query language (NL2SQL) techniques based on large language models (LLMs) have demonstrated some ability to generate SQL statements in many fields. However, in real engineering applications, the data structures of each computer application scenario have their own characteristics, and private-domain data differs markedly from public data in terminology and storage structure. As a result, models trained on public data are difficult to use directly in a real production environment. In addition, large models with ultra-large-scale parameters have extremely high computing power requirements, so deploying them privately across an enterprise's data application scenarios faces huge cost challenges. Even when techniques such as model quantization are used for deployment, the business requirement for efficient and accurate data queries still cannot be met, which further limits the wide application of NL2SQL technology in real business scenarios.
Disclosure of Invention
The application mainly provides a method and a device for private-domain data distillation from natural language to structured query language, to solve the problems that prior-art large models for converting natural language to structured query language cannot adapt to a private-domain database and that large models with large-scale parameters have high computing power requirements.
To solve the above technical problems, the technical solution adopted by the application is to provide a private-domain data distillation method from natural language to structured query language. The method comprises: assigning a first large model with tool-calling capability the identity of a question-generation agent, the question-generation agent calling pre-built tool functions and generating a question list based on a private-domain database; updating the identity of the first large model to a structured-query-statement generation agent, the structured-query-statement generation agent calling the tool functions and generating a corresponding structured query statement list based on the question list and the private-domain database; and training a second large model with small-scale parameters based on the question list and the corresponding structured query statement list to obtain a private-domain data distillation model.
In an optional implementation of the embodiment of the present application, the question-generation agent calling the pre-built tool functions and generating a question list based on a private-domain database includes: in response to an input scenario role prompt, parsing the scenario role prompt by the question-generation agent and extracting scenario role information; and, based on the scenario role information, autonomously calling the tool functions to search the private-domain database and generating the question list corresponding to the scenario role prompt.
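The role-switching flow described above can be sketched as follows. This is only an illustration under stated assumptions: `StubModel` is a hypothetical placeholder for the first large model (a real system would call a tool-calling LLM there), the identity strings and the two tool functions `list_tables` and `run_select` are invented for the example, and an in-memory SQLite database stands in for the private-domain database. Training the second, smaller model on the resulting pairs is out of scope for the sketch.

```python
import json
import sqlite3
from typing import Callable, Dict, List

Tools = Dict[str, Callable[..., object]]


def make_tools(conn: sqlite3.Connection) -> Tools:
    """Two hypothetical tool functions wrapping the database interface."""
    def list_tables() -> List[str]:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [r[0] for r in rows]

    def run_select(sql: str) -> List[tuple]:
        if not sql.lstrip().lower().startswith("select"):
            raise ValueError("only SELECT statements are allowed")
        return conn.execute(sql).fetchall()

    return {"list_tables": list_tables, "run_select": run_select}


class StubModel:
    """Placeholder for the first large model; behavior depends on its current identity."""

    def __init__(self, tools: Tools) -> None:
        self.tools = tools  # the same declared tools are shared by every identity
        self.identity = ""

    def set_identity(self, identity: str) -> None:
        self.identity = identity

    def run(self, task: str) -> List[str]:
        if self.identity == "question_generator":
            # Inspect the database through a tool, then emit one question per table.
            return [f"How many rows are in {t}?" for t in self.tools["list_tables"]()]
        if self.identity == "sql_generator":
            table = task.rstrip("?").split()[-1]
            sql = f"SELECT COUNT(*) FROM {table}"
            self.tools["run_select"](sql)  # execute once to check the statement runs
            return [sql]
        raise ValueError(f"unknown identity: {self.identity!r}")


def build_pairs(model: StubModel) -> List[dict]:
    """Step 1: generate questions. Step 2: switch identity and generate SQL for each."""
    model.set_identity("question_generator")
    questions = model.run("generate questions")
    model.set_identity("sql_generator")
    return [{"question": q, "sql": model.run(q)[0]} for q in questions]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE devices (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
pairs = build_pairs(StubModel(make_tools(conn)))
print(json.dumps(pairs, indent=2))
```

The point of the design is that one model instance and one set of declared tools serve both roles; only the identity changes between steps, and the resulting question and SQL pairs form the supervised data for the smaller model.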
In an optional implementation of the embodiment of the present application, before generating the question list, the method further includes: for a plurality of initial questions produced by the question-generation agent in a single generation, calculating a first similarity between the initial questions; deleting the corresponding initial question in response to the first similarity being higher than a first preset similarity threshold; and, in response to the initial questions obtained over multiple generations by the question-generation agent reaching a preset number, constructing the question list based on the preset number of initial questions.
In an optional implementation of the embodiment of the present application, after generating the question list, the method further includes: converting each initial question in the question list into a semantic vector, and obtaining first clusters based on a density clustering algorithm; iteratively optimizing the first clusters based on a binary search framework combined with semantic verification by a large model, to obtain a clustering threshold; updating the first clusters based on the clustering threshold to obtain a plurality of second clusters; and deduplicating each second cluster based on a preset semantic filtering algorithm and a preset expression filtering algorithm to obtain the deduplicated question list.
In an optional implementation manner of the embodiment of the p