US-12619620-B2 - Self-healing generative AI/ML pipeline for generating complex data queries leveraging semantic data model

Abstract

A method includes providing a user query to an AI/ML pipeline. The user query requests a response based on data stored in a data topology, and the data topology is modeled using a semantic data model. The method also includes generating an initial data access query for retrieving the data from the data topology using the AI/ML pipeline and the semantic data model. The method further includes determining that the initial data access query includes a hallucination or error and performing an automatic loop one or more times. The automatic loop includes generating an updated data access query for retrieving the data; determining whether the updated data access query includes a hallucination or error; and, if so, repeating the automatic loop. In addition, the method includes using a final data access query with no hallucination or error to retrieve the data from the data topology in order to generate the response.
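
The loop described in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `generate` and `validate` are hypothetical callables standing in for the AI/ML pipeline and for the hallucination/error check, the `feedback` string stands in for the "additional information" recited in claim 1, and the `max_attempts` cap is an added safeguard that the source does not mention.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Validation:
    ok: bool
    feedback: str = ""  # hypothetical: information fed back to the pipeline


def self_healing_loop(
    user_query: str,
    generate: Callable[[str, Optional[str]], str],
    validate: Callable[[str], Validation],
    max_attempts: int = 5,  # assumption: the source states no upper bound
) -> str:
    """Return a data access query that passes validation, or raise."""
    query = generate(user_query, None)  # initial data access query
    result = validate(query)
    attempts = 1
    while not result.ok:
        if attempts >= max_attempts:
            raise RuntimeError(
                f"no valid query after {attempts} attempts: {result.feedback}"
            )
        # Regenerate using the feedback from the failed check, then re-check.
        query = generate(user_query, result.feedback)
        result = validate(query)
        attempts += 1
    return query  # final data access query with no detected error


# Stub pipeline: misspells a property first, fixes it once given feedback.
def stub_generate(user_query: str, feedback: Optional[str]) -> str:
    if feedback:
        return "Trade.filter(notional > 100)"
    return "Trade.filter(notionl > 100)"


def stub_validate(query: str) -> Validation:
    if "notionl" in query:
        return Validation(False, "unknown property 'notionl'")
    return Validation(True)


final_query = self_healing_loop("trades above 100", stub_generate, stub_validate)
```

In this sketch the stub generator stands in for prompting one or more AI/ML models; a real pipeline would pass the semantic data model as context on every attempt.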

Inventors

  • Jaimita Bansal
  • PIERRE DE BELEN

Assignees

  • Goldman Sachs & Co. LLC

Dates

Publication Date
2026-05-05
Application Date
2025-09-02

Claims (20)

  1. A method comprising: providing a user query to a self-healing multi-agent artificial intelligence/machine learning (AI/ML) pipeline, the user query requesting a response based on data stored in a data topology, the data topology modeled using a semantic data model; generating an initial data access query for retrieving the data from the data topology using the AI/ML pipeline and the semantic data model, the AI/ML pipeline prompting one or more AI/ML models to generate the initial data access query; determining that the initial data access query includes a hallucination or error; performing an automatic loop one or more times, wherein the automatic loop includes: providing additional information to the AI/ML pipeline; generating an updated data access query for retrieving the data from the data topology using the AI/ML pipeline and the semantic data model, the AI/ML pipeline prompting at least one of the one or more AI/ML models based on the additional information; determining whether the updated data access query includes a hallucination or error; and if the updated data access query includes a hallucination or error, repeating the automatic loop; using a final data access query with no hallucination or error to retrieve the data from the data topology; and generating the response using the data retrieved from the data topology.
  2. The method of claim 1, wherein: the semantic data model represents the data topology and identifies dataspaces, classes, and properties associated with the data topology; agents of the AI/ML pipeline use the semantic data model to identify a specific dataspace, one or more specific classes, and one or more specific properties associated with the user query; and each data access query is generated based on the specific dataspace, the one or more specific classes, and the one or more specific properties.
  3. The method of claim 2, wherein: the semantic data model provides context to the agents of the AI/ML pipeline; the agents comprise the one or more AI/ML models that generate responses when prompted by the agents; and the responses from the one or more AI/ML models identify the specific dataspace, the one or more specific classes, the one or more specific properties, and the data access queries.
  4. The method of claim 2, wherein: the data topology includes tabular data; and the semantic data model allows the AI/ML pipeline to understand columns of data in the tabular data.
  5. The method of claim 1, wherein the semantic data model models the data topology using multiple classes and associated properties that are semantically aligned with natural language on which the AI/ML pipeline is trained.
  6. The method of claim 5, wherein at least some of the classes are associated with multiple associations in the semantic data model.
  7. The method of claim 1, wherein: the AI/ML pipeline comprises a dataspace agent, a class agent, a property agent, a query agent, and a self-healing agent; the dataspace agent identifies one of multiple dataspaces associated with the user query; the class agent identifies at least one of multiple classes associated with the identified dataspace, the at least one identified class mapped to the data topology; the property agent identifies at least one of multiple properties within the at least one identified class; and the query agent generates each data access query based on at least one of: the at least one identified class and the at least one identified property.
  8. The method of claim 7, wherein the self-healing agent determines, for each data access query, whether: a syntax of the data access query has one or more errors; at least one property in the data access query exists; one or more values in the data access query are proper; and a data type of a value in the data access query matches an expected data type.
  9. The method of claim 1, wherein at least one of the data access queries is based on one or more of: filtering of at least one of classes and properties defined in the semantic data model based on the user query; and joining of at least one of classes and properties defined at multiple levels in the semantic data model based on the user query.
  10. An apparatus comprising: at least one processing device configured to: provide a user query to a self-healing multi-agent artificial intelligence/machine learning (AI/ML) pipeline, the user query requesting a response based on data stored in a data topology, the data topology modeled using a semantic data model; generate an initial data access query for retrieving the data from the data topology using the AI/ML pipeline and the semantic data model, the AI/ML pipeline configured to prompt one or more AI/ML models to generate the initial data access query; determine that the initial data access query includes a hallucination or error; perform an automatic loop one or more times, wherein, to perform the automatic loop, the at least one processing device is configured to: provide additional information to the AI/ML pipeline; generate an updated data access query for retrieving the data from the data topology using the AI/ML pipeline and the semantic data model, the AI/ML pipeline configured to prompt at least one of the one or more AI/ML models based on the additional information; determine whether the updated data access query includes a hallucination or error; and if the updated data access query includes a hallucination or error, repeat the automatic loop; use a final data access query with no hallucination or error to retrieve the data from the data topology; and generate the response using the data retrieved from the data topology.
  11. The apparatus of claim 10, wherein: the semantic data model represents the data topology and identifies dataspaces, classes, and properties associated with the data topology; agents of the AI/ML pipeline are configured to use the semantic data model to identify a specific dataspace, one or more specific classes, and one or more specific properties associated with the user query; and the AI/ML pipeline is configured to generate each data access query based on the specific dataspace, the one or more specific classes, and the one or more specific properties.
  12. The apparatus of claim 11, wherein: the semantic data model provides context to the agents of the AI/ML pipeline; the agents comprise the one or more AI/ML models configured to generate responses when prompted by the agents; and the responses from the one or more AI/ML models identify the specific dataspace, the one or more specific classes, the one or more specific properties, and the data access queries.
  13. The apparatus of claim 10, wherein the semantic data model models the data topology using multiple classes and associated properties that are semantically aligned with natural language on which the AI/ML pipeline is trained.
  14. The apparatus of claim 13, wherein at least some of the classes are associated with multiple associations in the semantic data model.
  15. The apparatus of claim 10, wherein: the AI/ML pipeline comprises a dataspace agent, a class agent, a property agent, a query agent, and a self-healing agent; the dataspace agent is configured to identify one of multiple dataspaces associated with the user query; the class agent is configured to identify at least one of multiple classes associated with the identified dataspace, the at least one identified class mapped to the data topology; the property agent is configured to identify at least one of multiple properties within the at least one identified class; and the query agent is configured to generate each data access query based on at least one of: the at least one identified class and the at least one identified property.
  16. The apparatus of claim 15, wherein the self-healing agent is configured to determine, for each data access query, whether: a syntax of the data access query has one or more errors; at least one property in the data access query exists; one or more values in the data access query are proper; and a data type of a value in the data access query matches an expected data type.
  17. A non-transitory computer readable medium containing instructions that when executed cause at least one processor to: provide a user query to a self-healing multi-agent artificial intelligence/machine learning (AI/ML) pipeline, the user query requesting a response based on data stored in a data topology, the data topology modeled using a semantic data model; generate an initial data access query for retrieving the data from the data topology using the AI/ML pipeline and the semantic data model, the AI/ML pipeline configured to prompt one or more AI/ML models to generate the initial data access query; determine that the initial data access query includes a hallucination or error; perform an automatic loop one or more times, wherein the instructions when executed cause the at least one processor, during the automatic loop, to: provide additional information to the AI/ML pipeline; generate an updated data access query for retrieving the data from the data topology using the AI/ML pipeline and the semantic data model, the AI/ML pipeline configured to prompt at least one of the one or more AI/ML models based on the additional information; determine whether the updated data access query includes a hallucination or error; and if the updated data access query includes a hallucination or error, repeat the automatic loop; use a final data access query with no hallucination or error to retrieve the data from the data topology; and generate the response using the data retrieved from the data topology.
  18. The non-transitory computer readable medium of claim 17, wherein: the semantic data model represents the data topology and identifies dataspaces, classes, and properties associated with the data topology; agents of the AI/ML pipeline are configured to use the semantic data model to identify a specific dataspace, one or more specific classes, and one or more specific properties associated with the user query; and the AI/ML pipeline is configured to generate each data access query based on the specific dataspace, the one or more specific classes, and the one or more specific properties.
  19. The non-transitory computer readable medium of claim 18, wherein: the semantic data model provides context to the agents of the AI/ML pipeline; the agents comprise the one or more AI/ML models configured to generate responses when prompted by the agents; and the responses from the one or more AI/ML models identify the specific dataspace, the one or more specific classes, the one or more specific properties, and the data access queries.
  20. The non-transitory computer readable medium of claim 17, wherein: the AI/ML pipeline comprises a dataspace agent, a class agent, a property agent, a query agent, and a self-healing agent; the dataspace agent is configured to identify one of multiple dataspaces associated with the user query; the class agent is configured to identify at least one of multiple classes associated with the identified dataspace, the at least one identified class mapped to the data topology; the property agent is configured to identify at least one of multiple properties within the at least one identified class; and the query agent is configured to generate each data access query based on at least one of: the at least one identified class and the at least one identified property.
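
The four checks that claims 8 and 16 attribute to the self-healing agent (syntax, property existence, proper values, and data type match) can be illustrated with a small Python sketch. Everything concrete here is assumed for illustration: the `SEMANTIC_MODEL` dictionary, the `trading` dataspace, the `Trade` class and its properties, and the structured `DataAccessQuery` form are all hypothetical, and the claims do not specify a query language or how each check is carried out.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Hypothetical semantic model: dataspace -> class -> property -> expected type.
SEMANTIC_MODEL: Dict[str, Dict[str, Dict[str, type]]] = {
    "trading": {
        "Trade": {"notional": float, "currency": str, "quantity": int},
    },
}

ALLOWED_OPS = {"==", "!=", "<", "<=", ">", ">="}  # assumed operator set


@dataclass
class Filter:
    prop: str
    op: str
    value: Any


@dataclass
class DataAccessQuery:
    dataspace: str
    cls: str
    select: List[str]
    filters: List[Filter]


def check_query(q: DataAccessQuery) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    props = SEMANTIC_MODEL.get(q.dataspace, {}).get(q.cls)
    if props is None:
        return [f"unknown class {q.dataspace}.{q.cls}"]
    problems: List[str] = []
    # Check 1: syntax (here reduced to structural well-formedness).
    if not q.select:
        problems.append("syntax: empty select list")
    for f in q.filters:
        if f.op not in ALLOWED_OPS:
            problems.append(f"syntax: unsupported operator {f.op!r}")
    # Check 2: every referenced property exists in the identified class.
    for name in q.select + [f.prop for f in q.filters]:
        if name not in props:
            problems.append(f"property: {name!r} not in {q.cls}")
    # Checks 3 and 4: values are present and match the expected data type.
    for f in q.filters:
        expected = props.get(f.prop)
        if expected is None:
            continue  # already reported by the property check
        if f.value is None or f.value == "":
            problems.append(f"value: missing value for {f.prop!r}")
        elif not isinstance(f.value, expected):
            problems.append(
                f"type: {f.prop!r} expects {expected.__name__}, "
                f"got {type(f.value).__name__}"
            )
    return problems


bad = DataAccessQuery("trading", "Trade", ["notionl"], [Filter("quantity", ">", "ten")])
good = DataAccessQuery("trading", "Trade", ["notional"], [Filter("quantity", ">", 10)])
bad_problems = check_query(bad)
good_problems = check_query(good)
```

The problem strings returned for a failing query are the kind of additional information that could be fed back to the pipeline on the next pass of the automatic loop.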

Description

CROSS-REFERENCE TO RELATED APPLICATIONS AND PRIORITY CLAIM

This application claims priority under 35 U.S.C. § 120 as a continuation of U.S. patent application Ser. No. 19/067,566 filed on Feb. 28, 2025, which claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 63/665,979 filed on Jun. 28, 2024. Both of these prior applications are hereby incorporated by reference in their entirety.

TECHNICAL FIELD

This disclosure is generally directed to generative artificial intelligence/machine learning (AI/ML). More specifically, this disclosure is directed to a self-healing generative AI/ML pipeline for generating complex data queries leveraging a semantic data model.

BACKGROUND

Generative artificial intelligence/machine learning (AI/ML) models today can respond to users' queries from a large text corpus, such as by using a statistical similarity vector search with retrieval-augmented generation (RAG) or fine-tuning models. Problems with generative AI/ML models and these methods include (i) hallucinations in which an AI/ML model makes up an incorrect response and (ii) context length limitations in which an AI/ML model can only use a limited amount of contextual information. AI/ML models can also fail to satisfactorily generate responses for queries or cannot directly understand/process certain types of data, such as relational data (like big tabular data with columns having relationships and patterns) or other data topologies. Traditionally, SQL queries are created to fetch relational data, but SQL queries are often limited to responding to queries of users who know the physical schema of the data. Users who are unfamiliar with how relational data is stored are unable to create SQL queries in a simple manner, such as by using natural language.

SUMMARY

This disclosure relates to a self-healing generative artificial intelligence/machine learning (AI/ML) pipeline for generating complex data queries leveraging a semantic data model.
In a first embodiment, a method includes providing a user query to a self-healing multi-agent AI/ML pipeline. The user query requests a response based on data stored in a data topology, and the data topology is modeled using a semantic data model. The method also includes generating an initial data access query for retrieving the data from the data topology using the AI/ML pipeline and the semantic data model. The method further includes determining that the initial data access query includes a hallucination or error and performing an automatic loop one or more times. The automatic loop includes generating an updated data access query for retrieving the data from the data topology using the AI/ML pipeline and the semantic data model; determining whether the updated data access query includes a hallucination or error; and, if the updated data access query includes a hallucination or error, repeating the automatic loop. In addition, the method includes using a final data access query with no hallucination or error to retrieve the data from the data topology in order to generate the response.

In a second embodiment, an apparatus includes at least one processing device configured to provide a user query to a self-healing multi-agent AI/ML pipeline. The user query requests a response based on data stored in a data topology, and the data topology is modeled using a semantic data model. The at least one processing device is also configured to generate an initial data access query for retrieving the data from the data topology using the AI/ML pipeline and the semantic data model. The at least one processing device is further configured to determine that the initial data access query includes a hallucination or error and perform an automatic loop one or more times. To perform the automatic loop, the at least one processing device is configured to generate an updated data access query for retrieving the data from the data topology using the AI/ML pipeline and the semantic data model; determine whether the updated data access query includes a hallucination or error; and, if the updated data access query includes a hallucination or error, repeat the automatic loop. In addition, the at least one processing device is configured to use a final data access query with no hallucination or error to retrieve the data from the data topology in order to generate the response.

In a third embodiment, a non-transitory computer readable medium contains instructions that when executed cause at least one processor to provide a user query to a self-healing multi-agent AI/ML pipeline. The user query requests a response based on data stored in a data topology, and the data topology is modeled using a semantic data model. The non-transitory computer readable medium also contains instructions that when executed cause the at least one processor to generate an initial data access query for retrieving the data from the data topology using the AI/ML pipeline and the semantic data model. The non-trans