US-12625869-B2 - Generative AI-driven multi-source data query system


Abstract

Embodiments of the disclosed technologies include, in response to receiving a query, matching the query to metadata from a plurality of heterogeneous data sources, and selecting one or more data sources from the plurality of heterogeneous data sources for answering the query by sending the query and embeddings of the matched metadata to a generative artificial intelligence (GAI) and prompting the GAI to select matching data sources. Embodiments further include, based on the data from the GAI, generating one or more custom queries targeted to the matching data sources selected by the GAI, the custom queries formatted to be sent to the selected data sources; executing the one or more custom queries across the selected data sources; summarizing results from the executing; and providing a response to the query.
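
As a rough illustration of the metadata-matching step summarized above, the following Python sketch shortlists candidate data sources by comparing a query against per-source metadata text. It is a minimal sketch under stated assumptions, not the disclosed implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, and the names `shortlist_sources`, `sales_sql`, `support_logs`, and `hr_graph` are hypothetical. The final selection by the GAI is not shown.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shortlist_sources(query: str, metadata: Dict[str, str], top_k: int = 2) -> List[str]:
    """Rank data sources by similarity of their metadata to the query."""
    q = embed(query)
    ranked = sorted(
        metadata,
        key=lambda name: cosine(q, embed(metadata[name])),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical metadata describing three heterogeneous sources.
metadata = {
    "sales_sql": "relational table of orders with customer, total, and order date",
    "support_logs": "text logs of support tickets with error codes and timestamps",
    "hr_graph": "graph of employees, managers, and teams",
}
candidates = shortlist_sources("total orders per customer last month", metadata)
```

In a fuller pipeline, the shortlisted metadata would be passed along with the query to the GAI, which makes the final choice of data source.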

Inventors

Sagar Ketan Shah
Amir Jalali

Assignees

MICROSOFT TECHNOLOGY LICENSING, LLC

Dates

Publication Date: 20260512
Application Date: 20231109

Claims (19)

1 . A method comprising: training a generative artificial intelligence (GAI) model based on pre-processing data sources, wherein the pre-processing of each of the data sources comprises: generating a supplemental dataset for a data source described by a first metadata, the generating comprising: extracting a schema of the data source; extracting representative sample data from the data source; prompting the GAI model to generate synthetic queries for the data source based on the schema and the representative sample data to query across the data source, wherein the schema, the representative sample data, and the synthetic queries comprise the supplemental dataset; running each of the synthetic queries generated by the GAI model against the data source to validate the synthetic queries; and storing the validated synthetic queries as part of the supplemental dataset; storing the supplemental dataset in a vector store, wherein the supplemental dataset supplements the first metadata to describe the data source; generating embeddings of the first metadata and the supplemental dataset; receiving a query; prompting the GAI model to select a matching data source for responding to the query, by sending the query and the embeddings of the first metadata and the supplemental data sent to the GAI model and prompting the GAI model to select the matching data source; prompting the GAI model to generate a custom query targeted to the matching data source selected by the GAI model by sending the embeddings and a subset of the synthetic queries; executing the custom query at the matching data source; summarizing results from executing the custom query; and providing a response to the query based on the results.
2 . The method of claim 1 , further comprising filtering the supplemental dataset, wherein the filtering comprises: identifying one or more text chunks in the query; matching each text chunk to the metadata and the supplemental data representing data types and relationships in the data sources; customizing the custom query by embedding the matched metadata associated with the text chunks to the query to generate a dynamic query.
3 . The method of claim 1 , further comprising: generating single subject questions from the query, using the GAI model.
4 . The method of claim 3 , further comprising: identifying any dependency between the single subject questions returned by the GAI model; and sequencing the single subject questions based on the identified dependency.
5 . The method of claim 3 , further comprising: after receiving a first response to a first custom query from a first data source, utilizing the first response to update a second custom query, prior to sending the second query to a data source.
6 . The method of claim 1 , wherein the selecting the data source comprises: matching a query to the synthetic queries associated with each of the data sources.
7 . The method of claim 6 , wherein the pre-processing further extracts data relationships in the data source.
8 . The method of claim 7 , wherein the schema, the data relationships, and the sample data extracted from the data source is formatted into a descriptive text and stored in the vector store.
9 . The method of claim 1 , wherein generating a custom query comprises: identifying a language associated with the data source; identifying a data format for the data source; translating the query into the language and data format of the data source; and attaching metadata associated with the data source.
10 . The method of claim 1 , wherein the query is a natural language query.
11 . A system comprising: at least one processor; at least one memory coupled to the at least one processor, wherein the at least one memory comprises instructions that, when executed by the at least one processor, cause the at least one processor to perform at least one operation comprising: training a generative artificial intelligence (GAI) model based on pre-processing data sources, wherein the pre-processing of each of the data sources comprises: generating a supplemental dataset for a data source described by a first metadata, the generating comprising: extracting a schema of the data source; extracting representative sample data from the data source; prompting the GAI model to generate synthetic queries for the data source based on the schema and the representative sample data to query across the data source wherein the schema, the representative sample data, and the synthetic queries comprise the supplemental dataset; running each of the synthetic queries generated by the GAI model against the data source to validate the synthetic queries; and storing the validated synthetic queries as part of the supplemental dataset; storing the supplemental dataset in a vector store, wherein the supplemental dataset supplements the first metadata to describe the data source; generating embeddings of the first metadata and the supplemental dataset; receiving a query; prompting the GAI model to select a matching data source for answering the query, by sending the query and the embeddings of the first metadata and the supplemental data to the GAI model and prompting the GAI model to select the matching data source; prompting the GAI to generate a custom query targeted to the matching data source selected by the GAI model by sending the embeddings and a subset of the synthetic queries, the custom query formatted to be sent to the matching data source; executing the custom query at the matching data source; and summarizing results from the executing and providing a response to the query.
12 . The system of claim 11 , further comprising: identifying one or more text chunks in the query; matching each text chunk to the metadata and the supplemental data representing data types and relationships in the data sources; customizing the query by embedding the matched metadata and supplemental data associated with the text chunks to the query to generate a dynamic query.
13 . The system of claim 11 , further comprising: generating single subject questions from the query, using the GAI model.
14 . The system of claim 13 , further comprising: identifying any dependency between the single subject questions returned by the GAI model; sequencing the single subject questions based on the identified dependency; and after receiving a first response to a first custom query from a first data source, utilizing the first response to update a second custom query, prior to sending the second query to a data source.
15 . The system of claim 11 , wherein identifying the data sources comprises, for each question: matching the question to the synthetic queries associated with each of the data sources.
16 . The system of claim 15 , wherein the pre-processing extracts data relationships in the data source.
17 . The system of claim 16 , wherein the schema, the data relationships, and the sample data is formatted into a descriptive text and stored in a vector store as the metadata for the data source.
18 . The system of claim 11 , wherein generating a custom query comprises: identifying a language associated with the data source; identifying a data format for the data source; translating the query into the language and data format of the data source; and attaching metadata associated with the data source.
19 . A non-transitory computer readable medium containing program instructions for causing a computer to perform a method comprising: training a generative artificial intelligence (GAI) model based on pre-processing heterogeneous data sources, wherein the pre-processing of each of the data sources comprises: generating a supplemental dataset for a data source described by a first metadata, the generating comprising: extracting a schema of the data source; extracting representative sample data from the data source; prompting the GAI model to generate synthetic queries for the data source based on the schema and the representative sample data to query across the data source, the representative sample data, and the synthetic queries comprise the supplemental dataset; running each of the synthetic queries generated by the GAI model against the data source to validate the synthetic queries; and storing the validated synthetic queries as part of the supplemental dataset; storing the supplemental dataset in a vector store, wherein the supplemental dataset supplements the first metadata to describe the data source; generating embeddings of the first metadata and the supplemental dataset; receiving a query; prompting the GAI to select a matching data source from the heterogeneous data sources for answering the query, by sending the query and the embeddings of the first metadata and the supplemental data to the GAI model and prompting the GAI model to select the matching data source; prompting the GAI to generate a custom query targeted to the matching data source selected by the GAI model by sending the GAI model the embeddings and a subset of the synthetic queries, the custom query formatted to be sent to the selected data sources; executing the custom query at the matching data source; summarizing results from the executing; and providing a response to the query.
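
The pre-processing recited in claims 1, 11, and 19 (extract a schema, extract sample data, generate synthetic queries, validate them by running them, store the validated ones) can be sketched in Python as follows. This is an illustrative sketch only, assuming a SQLite data source. The function `build_supplemental_dataset`, the stub `fake_gai`, and the `orders` table are hypothetical names introduced here. The stub returns fixed strings in place of a real GAI call, and taking the first few rows as the "representative" sample is a simplifying assumption.

```python
import sqlite3
from typing import Callable, Dict, List


def build_supplemental_dataset(
    conn: sqlite3.Connection,
    table: str,
    generate_queries: Callable[[str, List[tuple]], List[str]],
    sample_size: int = 3,
) -> Dict[str, object]:
    """Extract schema and sample rows, then keep only synthetic queries that run."""
    # Schema: the CREATE TABLE statement SQLite recorded for this table.
    row = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown table: {table}")
    schema = row[0]

    # Sample data: simply the first few rows (an assumption of this sketch).
    samples = conn.execute(
        f'SELECT * FROM "{table}" LIMIT ?', (sample_size,)
    ).fetchall()

    # Validation: a synthetic query is kept only if it executes without error.
    validated = []
    for query in generate_queries(schema, samples):
        try:
            conn.execute(query).fetchall()
        except sqlite3.Error:
            continue
        validated.append(query)

    return {"schema": schema, "samples": samples, "synthetic_queries": validated}


def fake_gai(schema: str, samples: List[tuple]) -> List[str]:
    """Hypothetical stand-in for prompting a GAI model; the last query is invalid."""
    return [
        "SELECT COUNT(*) FROM orders",
        "SELECT customer, SUM(total) FROM orders GROUP BY customer",
        "SELECT missing_column FROM orders",
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("ada", 10.0), ("bob", 5.5), ("ada", 2.5)],
)
dataset = build_supplemental_dataset(conn, "orders", fake_gai)
```

Here the third generated query references a column that does not exist, so it fails validation and is left out of the stored set. Because validation executes model-generated SQL, a practical system would likely run it over a read-only connection.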

Description

TECHNICAL FIELD

Embodiments of the invention relate to the field of data access, and more specifically, to search across multiple heterogeneous data sources.

BACKGROUND ART

Obtaining data from a data source generally requires formulating a query in the appropriate format and language for the particular data source. If the data may be in one of a set of data sources, each data source must be individually queried, complying with the format and language requirements for each data source.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the invention. In the drawings:

FIG. 1 illustrates an example computing system that includes a querying system, according to some embodiments of the invention.
FIG. 2 illustrates the querying system in more detail, according to some embodiments of the invention.
FIG. 3 illustrates the training system, which generates data for the vector store, according to some embodiments of the invention.
FIG. 4 is a flow diagram of an example method to generate responses across a plurality of data sources, in accordance with some embodiments of the present disclosure.
FIG. 5 is a detailed flow diagram of an example method to generate responses across a plurality of data sources, in accordance with some embodiments of the present disclosure.
FIG. 6 is a flow diagram of generating individual questions from a natural language query, in accordance with some embodiments of the present disclosure.
FIG. 7 is a flow diagram of pre-processing a data source, in accordance with some embodiments of the present disclosure.
FIG. 8 is a flow diagram of training the NPL systems, in accordance with some embodiments of the present disclosure.
FIG. 9 is a block diagram of an example computer system in which embodiments of the present disclosure can operate.

DETAILED DESCRIPTION

A method and apparatus are described in which a generative artificial intelligence (GAI) is used to generate questions and identify a target data source to respond to the questions. The process receives a query, which may be a natural language query, and analyzes the query to identify one or more single subject questions using a GAI. The process then identifies one or more data sources from a plurality of heterogeneous data sources for answering the one or more questions, using the GAI. The process generates one or more custom queries targeted and formatted to the one or more data sources and executes the one or more custom queries across the one or more data sources. The process then summarizes the response, using the GAI, and provides the response.

Aspects of the present disclosure are directed to providing an integrated front end to get responses to a query across multiple data sources. The process allows a user to enter a natural language query, and processes the query into one or more simple questions, using a generative AI large language model (GAI). The process then uses the GAI to identify one or more of a plurality of heterogeneous data sources that could be used to answer each of the one or more simple questions. The process generates custom queries for each of the data sources, using the GAI. The custom queries are in the language and format of the data source.

Conventional systems require users to know the content of all of the available data sources, their schema and language, and the query methods for each of those data sources. They then require the user to select the appropriate data source, query the data source, and gather information from it, to answer a question. Furthermore, conventional systems that use artificial intelligence systems to answer queries require training a model with large amounts of data, often terabytes of data.
That technique isn't scalable when there are a significant number of disparate data sources that would be used to obtain a response to the query. The present system overcomes the issues with training a GAI to obtain data from a large number of data sources that have a large amount of rapidly changing data, by utilizing the GAI to construct the query and select the data source to answer the query, rather than having the GAI respond directly to the query. This enables the system to be updated when additional data sources become available, without extensive retraining of the GAI. Additionally, some data sources cannot be used for training due to privacy reasons, e.g., data sets which include personal information, or other restrictions, e.g., security or privilege restrictions. By utilizing the present approach, the system does not need to use such data to train the GAI. Furthermore, this avoids the need for the GAI to access personal data from the data sources during query construction or data source selection. The present system also enables the GAI to direct a query to a data set to which the GAI does not have access privileges. Thus, the present system also addr