US-20260127163-A1 - STRUCTURED QUERY LANGUAGE STATEMENT VALIDATION BASED ON MACHINE LEARNING

Abstract

An example operation may include one or more of receiving a natural language input, executing a generative machine learning (ML) model on the natural language input to generate a structured query language (SQL) statement, executing the SQL statement on fake mockup data to generate a first query result on the fake mockup data, executing the generative ML model on the natural language input and the fake mockup data to generate a second query result on the fake mockup data, and determining whether the SQL statement is valid based on a comparison of the first query result and the second query result.

Inventors

  • Qi Liang Zhou
  • Rui Han
  • Yuan Yuan Ding
  • Huan Da Wang
  • Qiu Ming Zhu
  • Ya Juan Dang
  • Ying Wei

Assignees

  • INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date: 2026-05-07
Application Date: 2024-11-03

Claims (20)

  1. An apparatus comprising: a memory; and at least one processor, communicatively coupled to the memory, the at least one processor configured to: receive a natural language input; execute a generative machine learning (ML) model on the natural language input to generate a structured query language (SQL) statement; generate, by the ML model, fake mockup data based on a database that stores productive data; execute the SQL statement on fake mockup data to generate a first query result comprising a first subset of data of the fake mockup data; execute the generative ML model on the natural language input and the fake mockup data to generate a second query result comprising a second subset of data of the fake mockup data; determine whether the SQL statement is valid by determining whether the first subset of data includes all content in the second subset of data; and execute the SQL statement on the productive data stored within the database to generate a productive query result.
  2. The apparatus of claim 1, wherein the at least one processor is configured to determine that the SQL statement is valid when content included in the first query result includes all content included in the second query result.
  3. The apparatus of claim 1, wherein the at least one processor is configured to determine that the SQL statement is invalid when content included in the first query result does not include all content included in the second query result.
  4. The apparatus of claim 1, wherein the at least one processor is configured to output the productive query result to a software application, in response to a determination that the SQL statement is valid.
  5. The apparatus of claim 1, wherein the at least one processor is configured to execute the generative ML model on a schema of the database and a data type associated with the SQL statement to generate the fake mockup data, prior to execution of the SQL statement on the fake mockup data.
  6. The apparatus of claim 1, wherein the at least one processor is configured to simultaneously execute the SQL statement on the fake mockup data to generate the first query result on the fake mockup data and execute the generative ML model on the natural language input and the fake mockup data to generate the second query result on the fake mockup data.
  7. The apparatus of claim 1, wherein the first query result comprises a first subset of tabular data extracted from the fake mockup data and the second query result comprises a second subset of tabular data extracted from the fake mockup data, wherein the at least one processor is configured to validate the SQL statement based on a comparison of the first subset of tabular data to the second subset of tabular data.
  8. A method comprising: receiving a natural language input; executing a generative machine learning (ML) model on the natural language input to generate a structured query language (SQL) statement; generating, by the ML model, fake mockup data based on a database that stores productive data; executing the SQL statement on fake mockup data to generate a first query result comprising a first subset of data of the fake mockup data; executing the generative ML model on the natural language input and the fake mockup data to generate a second query result comprising a second subset of data of the fake mockup data; determining whether the SQL statement is valid by determining whether the first subset of data includes all content in the second subset of data; and executing the SQL statement on the productive data stored within the database to generate a productive query result.
  9. The method of claim 8, wherein the determining comprises determining that the SQL statement is valid when content included in the first query result includes all content included in the second query result.
  10. The method of claim 8, wherein the determining comprises determining that the SQL statement is invalid when content included in the first query result does not include all content included in the second query result.
  11. The method of claim 8, comprising outputting the productive query result to a software application, in response to a determination that the SQL statement is valid.
  12. The method of claim 8, comprising executing the generative ML model on a schema of the database and a data type associated with the SQL statement to generate the fake mockup data, prior to execution of the SQL statement on the fake mockup data.
  13. The method of claim 8, wherein the executing the SQL statement comprises simultaneously executing the SQL statement on the fake mockup data to generate the first query result on the fake mockup data and executing the generative ML model on the natural language input and the fake mockup data to generate the second query result on the fake mockup data.
  14. The method of claim 8, wherein the first query result comprises a first subset of tabular data extracted from the fake mockup data and the second query result comprises a second subset of tabular data extracted from the fake mockup data, and the determining comprises validating the SQL statement based on a comparison of the first subset of tabular data to the second subset of tabular data.
  15. A computer-readable storage medium comprising instructions which when executed by a processor cause the processor to perform: receiving a natural language input; executing a generative machine learning (ML) model on the natural language input to generate a structured query language (SQL) statement; generating, by the ML model, fake mockup data based on a database that stores productive data; executing the SQL statement on fake mockup data to generate a first query result comprising a first subset of data of the fake mockup data; executing the generative ML model on the natural language input and the fake mockup data to generate a second query result comprising a second subset of data of the fake mockup data; determining whether the SQL statement is valid by determining whether the first subset of data includes all content in the second subset of data; and executing the SQL statement on the productive data stored within the database to generate a productive query result.
  16. The computer-readable storage medium of claim 15, wherein the determining comprises determining that the SQL statement is valid when content included in the first query result includes all content included in the second query result.
  17. The computer-readable storage medium of claim 15, wherein the determining comprises determining that the SQL statement is invalid when content included in the first query result does not include all content included in the second query result.
  18. The computer-readable storage medium of claim 15, wherein the processor is configured to perform outputting the productive query result to a software application, in response to a determination that the SQL statement is valid.
  19. The computer-readable storage medium of claim 15, wherein the processor is configured to perform executing the generative ML model on a schema of the database and a data type associated with the SQL statement to generate the fake mockup data, prior to execution of the SQL statement on the fake mockup data.
  20. The computer-readable storage medium of claim 15, wherein the executing the SQL statement comprises simultaneously executing the SQL statement on the fake mockup data to generate the first query result on the fake mockup data and executing the generative ML model on the natural language input and the fake mockup data to generate the second query result on the fake mockup data.
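
The validity test recited in the claims (the first query result must include all content of the second) could be sketched roughly as follows. This is only an illustration, not the claimed implementation: the function name `includes_all` and the sample rows are hypothetical, and treating each result as a multiset of rows (so that duplicate rows must also be matched and row order is ignored) is an assumption the claims do not spell out.

```python
from collections import Counter
from typing import Iterable, Sequence

Row = Sequence[object]


def includes_all(first_result: Iterable[Row], second_result: Iterable[Row]) -> bool:
    """Return True when every row of second_result also appears in first_result.

    Assumption: rows are compared as whole tuples and duplicates count,
    i.e. the comparison uses multiset semantics.
    """
    first = Counter(tuple(row) for row in first_result)
    second = Counter(tuple(row) for row in second_result)
    # Counter subtraction keeps only positive counts, so an empty difference
    # means no row of the second result is missing from the first.
    return not (second - first)


# Hypothetical rows: the SQL path returned three rows, the model path two.
sql_rows = [("alice", 30), ("bob", 25), ("carol", 41)]
model_rows = [("bob", 25), ("alice", 30)]

print(includes_all(sql_rows, model_rows))  # True: SQL result covers the model result
print(includes_all(model_rows, sql_rows))  # False: ("carol", 41) is missing
```

Under this reading the check is one-directional: extra rows in the first result do not make the statement invalid, whereas a row that appears only in the second result does.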

Description

BACKGROUND

One of the most common mechanisms for accessing large amounts of structured data, such as tabular data, is through structured query language (SQL) commands. Recently, generative machine learning models have been used to produce an executable SQL command, which is then used to fetch data from the database. However, SQL commands generated by machine learning models are not completely accurate and often require manual work to verify their accuracy. In addition, it is difficult for non-expert users to tell whether the generated SQL commands are accurate.

SUMMARY

One example embodiment provides an apparatus that includes a memory and at least one processor communicatively coupled to the memory, where the at least one processor may perform one or more of receive a natural language input, execute a generative machine learning (ML) model on the natural language input to generate a structured query language (SQL) statement, execute the SQL statement on fake mockup data to generate a first query result on the fake mockup data, execute the generative ML model on the natural language input and the fake mockup data to generate a second query result on the fake mockup data, and determine whether the SQL statement is valid based on a comparison of the first query result and the second query result.
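
The operations listed in this summary might be wired together along the following lines. This is a minimal sketch under stated assumptions, not the disclosed system: `generate_sql` and `answer_from_data` are hypothetical stand-ins that return fixed answers where a real generative ML model would be called, the `people` table and its rows are invented mockup data, and an in-memory SQLite database plays the role of the mockup store.

```python
import sqlite3


def generate_sql(question: str) -> str:
    """Hypothetical stand-in for the model's text-to-SQL step."""
    return "SELECT name, age FROM people WHERE age > 28"


def answer_from_data(question: str, rows: list[tuple]) -> list[tuple]:
    """Hypothetical stand-in for the model answering directly from the mockup rows."""
    return [row for row in rows if row[1] > 28]


def validate(question: str, mock_rows: list[tuple]) -> tuple[str, bool]:
    """Generate SQL, run it on mockup data, and compare with the model's own answer."""
    sql = generate_sql(question)

    # First query result: execute the generated SQL on the mockup data.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
        conn.executemany("INSERT INTO people VALUES (?, ?)", mock_rows)
        first = conn.execute(sql).fetchall()
    finally:
        conn.close()

    # Second query result: the model reads the question and the mockup data.
    second = answer_from_data(question, mock_rows)

    # Comparison: here, the statement is accepted when the first result
    # contains every row of the second result.
    is_valid = all(row in first for row in second)
    return sql, is_valid


mock = [("alice", 30), ("bob", 25), ("carol", 41)]
sql, ok = validate("Who is older than 28?", mock)
print(sql, ok)
```

Because both paths run only against mockup rows, the comparison never touches the real database contents.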

Another example embodiment provides a method that may include one or more of receiving a natural language input, executing a generative machine learning (ML) model on the natural language input to generate a structured query language (SQL) statement, executing the SQL statement on fake mockup data to generate a first query result on the fake mockup data, executing the generative ML model on the natural language input and the fake mockup data to generate a second query result on the fake mockup data, and determining whether the SQL statement is valid based on a comparison of the first query result and the second query result.

A further example embodiment provides a computer-readable storage medium with instructions which when executed by a processor cause the processor to perform one or more of receiving a natural language input, executing a generative machine learning (ML) model on the natural language input to generate a structured query language (SQL) statement, executing the SQL statement on fake mockup data to generate a first query result on the fake mockup data, executing the generative ML model on the natural language input and the fake mockup data to generate a second query result on the fake mockup data, and determining whether the SQL statement is valid based on a comparison of the first query result and the second query result.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram illustrating a computing environment according to an embodiment of the instant solution.
FIG. 2 is a diagram illustrating a process of generating and evaluating the accuracy of an SQL query statement according to the examples and features of the instant solution.
FIG. 3A is a diagram illustrating a process of generating an SQL command using a generative ML model according to the examples and features of the instant solution.
FIG. 3B is a diagram illustrating a process of generating fake mockup data according to the examples and features of the instant solution.
FIG. 3C is a diagram illustrating a process of querying the fake mockup data based on the SQL command to generate a first query result according to the examples and features of the instant solution.
FIG. 3D is a diagram illustrating a process of querying the fake mockup data with the generative ML model to generate a second query result according to the examples and features of the instant solution.
FIG. 3E is a diagram illustrating a process of verifying the SQL command based on the first and second query results according to the examples and features of the instant solution.
FIG. 3F is a diagram illustrating a process of executing the verified SQL command on productive data according to the examples and features of the instant solution.
FIGS. 4A-4C are diagrams illustrating different examples of verification results according to the examples and features of the instant solution.
FIG. 5A is a diagram illustrating a flow diagram, according to example embodiments.
FIG. 5B is a diagram illustrating a flow diagram, according to example embodiments.

DETAILED DESCRIPTION

It is to be understood that although this disclosure includes a detailed description of cloud computing, implementation of the teachings recited herein is not limited to a cloud computing environment. Rather, embodiments of the instant solution can be implemented in conjunction with any other type of computing environment now known or later developed.

The example embodiments are directed to an evaluation system that automatically validates the accuracy of an SQL statement generated with a machine learning model, such as a large language model (LLM) with