US-20260129061-A1 - AGENT-BASED SEARCHING OVER HETEROGENEOUS DATA SOURCES
Abstract
Searching heterogeneous cybersecurity data sources is described. A natural language search enables searching across available data sources, independent of underlying storage architectures or query languages. This capability leverages multiple task-specific agents, large language models, and data map representation-based approaches, and enables integration of new data sources into the search process with minimal additional computing resources. The searching includes determining a data map representation for available data sources (which might have different storage architectures or query languages), which indicates relationships between the available data sources. Query intent associated with a query is determined. A query intent task is determined by mapping the query intent to a relevant data source according to the data map representation. A search agent parameterized for searching the relevant data source executes the query intent task. The result of this executed query intent task is used to augment a response to the query.
Inventors
- Andrew White WICKER
- Max Piasevoli
- Quang Minh Nguyen
- Srisuma MOVVA
- Kadri Tahsildoost
- Haijun ZHAI
- Anand Mudgerikar
Assignees
- MICROSOFT TECHNOLOGY LICENSING, LLC
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-11-05
Claims (20)
- 1. A method for querying heterogeneous cybersecurity data, comprising: determining a data map representation for available data sources, the data map representation indicating relationships between the available data sources, the available data sources having at least one of different storage architectures or different query languages; determining a query intent associated with an input query by identifying entities and context associated with the input query; determining a query intent task by mapping the query intent to a relevant data source in the available data sources according to the data map representation and the entities and context; executing the query intent task with a search agent parameterized for searching the relevant data source; and outputting a result of an executed query intent task.
- 2. The method of claim 1, further comprising: determining multiple query intent tasks by mapping the query intent to multiple relevant data sources in the available data sources according to the data map representation, and the entities and context; and executing the multiple query intent tasks with one or more search agents, the one or more search agents including the search agent.
- 3. The method of claim 1, further comprising: determining multiple query intent tasks by mapping the query intent to multiple relevant data sources in the available data sources according to the data map representation, and the entities and context; executing the multiple query intent tasks with multiple search agents individually parameterized for searching the multiple relevant data sources; enhancing results of the multiple query intent tasks by causing communication of information relevant to the query intent among the multiple search agents, the information relevant to the query intent determined by the multiple search agents as part of executing the multiple query intent tasks; aggregating the results of the multiple query intent tasks; and outputting the result based on an aggregation of the results of the multiple query intent tasks.
- 4. The method of claim 3, further comprising: determining multiple sub-query intents associated with the input query by identifying the entities and context associated with the input query; and determining the multiple query intent tasks by mapping the multiple sub-query intents to one or more relevant data sources according to the data map representation, and the entities and context.
- 5. The method of claim 3, wherein the information relevant to the query intent comprises at least one of insights conveying relevant data for satisfying the input query or a result from a given search agent's query intent task.
- 6. The method of claim 3, wherein causing the communication of information relevant to the query intent among the multiple search agents comprises providing output from one search agent as additional input to another search agent for performing that search agent's query intent task.
- 7. The method of claim 1, wherein the relationships between available data sources comprise at least one of semantic similarities, joinable fields, common labels, or metadata indicative of commonalities between data in the available data sources.
- 8. The method of claim 1, wherein determining the data map representation comprises generating at least one of descriptions or metadata for the available data sources, the descriptions or metadata configured to enhance semantic understanding of data in an available data source.
- 9. The method of claim 1, wherein the data map representation is determined using various semantic functions, wherein a semantic function: comprises an algorithm configured to capture meaning behind data in the available data sources, rather than surface-level structure of the data in the available data sources; is configured to provide information about how to query the relevant data source; and is configured to provide information about how data between two or more different available data sources is related.
- 10. The method of claim 1, wherein entities comprise tokens associated with at least one of specific, identifiable items or concepts that have meaning within the input query, and wherein the entities are associated with an application of interest.
- 11. The method of claim 1, wherein context comprises tokens associated with at least one of surrounding information, background, or situational factors present in the input query that influence determining the query intent associated with the input query.
- 12. The method of claim 1, wherein determining the query intent associated with an input query comprises decomposing the input query by evaluating qualifiers in the input query related to at least one of the entities or the context.
- 13. A system for querying heterogeneous cybersecurity data, comprising: a processor; and memory storing instructions that, when executed by the processor, cause the processor to: determine a query intent associated with an input query by identifying entities and context associated with the input query; determine a query intent task by mapping the query intent to a relevant data source according to a data map representation, and the entities and context and the data map representation indicating relationships between the relevant data source and other available data sources; execute the query intent task with a search agent parameterized for searching the relevant data source; and output a result of an executed query intent task.
- 14. The system of claim 13, wherein the instructions further cause the processor to: determine multiple query intent tasks by mapping the query intent to multiple relevant data sources in the available data sources according to the data map representation, and the entities and context; and execute the multiple query intent tasks with one or more search agents, the one or more search agents including the search agent.
- 15. The system of claim 13, wherein the instructions further cause the processor to: determine multiple query intent tasks by mapping the query intent to multiple relevant data sources in the available data sources according to the data map representation, and the entities and context; execute the multiple query intent tasks with multiple search agents individually parameterized for searching the multiple relevant data sources; enhance results of the multiple query intent tasks by causing communication of information relevant to the query intent among the multiple search agents, the information relevant to the query intent determined by the multiple search agents as part of executing the multiple query intent tasks; aggregate the results of the multiple query intent tasks; and output the result based on an aggregation of the results of the multiple query intent tasks.
- 16. The system of claim 15, wherein the instructions further cause the processor to: determine multiple sub-query intents associated with the input query by identifying the entities and context associated with the input query; and determine the multiple query intent tasks by mapping the multiple sub-query intents to one or more relevant data sources according to the data map representation, and the entities and context.
- 17. A non-transitory computer readable medium having instructions thereon, the instructions, when executed by a computer, causing the computer to perform operations for querying heterogeneous cybersecurity data comprising: determining a data map representation for available data sources, the data map representation indicating relationships between the available data sources, the available data sources having at least one of different storage architectures or different query languages; determining multiple sub-query intents associated with an input query by identifying entities and context associated with the input query; determining multiple query intent tasks by mapping the multiple sub-query intents to multiple relevant data sources in the available data sources according to the data map representation and the entities and context; executing the multiple query intent tasks with multiple search agents individually parameterized for searching the multiple relevant data sources; enhancing results of the multiple query intent tasks by causing communication of information relevant to the multiple sub-query intents among the multiple search agents, the information relevant to the multiple sub-query intents determined by the multiple search agents as part of executing the multiple query intent tasks; aggregating the results of the multiple query intent tasks; and outputting an aggregated result.
- 18. The medium of claim 17, wherein the information relevant to the multiple sub-query intents comprises at least one of insights conveying relevant data for satisfying the input query or a result from a given search agent's query intent task.
- 19. The medium of claim 17, wherein causing the communication of information relevant to the multiple sub-query intents among the multiple search agents comprises providing output from one search agent as additional input to another search agent for performing that search agent's query intent task.
- 20. The medium of claim 17, wherein the available data sources comprise a data estate of a user with at least one of databases, data tables, or columns of data.
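The agent communication and aggregation recited in claims 3 and 6 (and their system and medium counterparts) can be illustrated with a minimal, non-limiting Python sketch. Everything named here is hypothetical and not taken from the publication: the `SearchAgent` class, the `run_tasks` and `aggregate` helpers, and the two in-memory lists standing in for real data sources. The sketch assumes agents run sequentially and that each agent receives the rows found by earlier agents as additional input; an actual embodiment may exchange information differently.

```python
from dataclasses import dataclass
from typing import Callable

Rows = list[dict[str, str]]

# Hypothetical in-memory stand-ins for two heterogeneous data sources.
SIGN_INS: Rows = [{"user": "alice", "ip": "10.0.0.5"},
                  {"user": "bob", "ip": "10.0.0.9"}]
DEVICES: Rows = [{"user": "alice", "device": "laptop-17"},
                 {"user": "bob", "device": "desktop-02"}]


@dataclass
class SearchAgent:
    """An agent parameterized for one data source (illustrative only)."""
    source: str
    # search(entities, shared) -> rows; `shared` carries other agents' output.
    search: Callable[[dict[str, str], Rows], Rows]


def search_sign_ins(entities: dict[str, str], shared: Rows) -> Rows:
    # Uses only the entities identified in the input query.
    return [r for r in SIGN_INS if r["ip"] == entities["ip"]]


def search_devices(entities: dict[str, str], shared: Rows) -> Rows:
    # Uses output from the earlier agent (user names) as additional input.
    users = {r["user"] for r in shared if "user" in r}
    return [r for r in DEVICES if r["user"] in users]


def run_tasks(agents: list[SearchAgent], entities: dict[str, str]) -> dict[str, Rows]:
    """Run each agent in order, passing along rows found by earlier agents."""
    shared: Rows = []
    results: dict[str, Rows] = {}
    for agent in agents:
        rows = agent.search(entities, shared)
        results[agent.source] = rows
        shared = shared + rows
    return results


def aggregate(results: dict[str, Rows]) -> Rows:
    """Flatten per-source results into one list, tagging each row's source."""
    return [{"source": s, **row} for s, rows in results.items() for row in rows]


agents = [SearchAgent("sign_ins", search_sign_ins),
          SearchAgent("devices", search_devices)]
results = run_tasks(agents, {"ip": "10.0.0.5"})
combined = aggregate(results)
```

In this sketch the device agent cannot answer on its own, because the input query names only an IP address; it depends on the user name surfaced by the sign-in agent, which is the kind of cross-agent enhancement the claims describe.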
Description
BACKGROUND
This disclosure generally relates to agent-based searching over heterogeneous data sources.
Retrieval-Augmented Generation (RAG) is a technique that combines the strengths of information retrieval and natural language generation. In RAG, when a user provides a query or input to a large language model or other model, relevant information from an external database, document, or other source is retrieved. The retrieved information is used to guide or enhance the generation of a more accurate, contextually relevant response to the query. By leveraging both retrieval and generation, RAG can handle complex queries, ensuring that generated outputs are coherent and grounded in relevant, up-to-date information from external sources. This approach improves the quality and accuracy of responses in tasks requiring deep knowledge or current information, such as question answering or summarization.
SUMMARY
For many applications, the vast number of available data sources (e.g., external databases, documents, or other sources) makes it difficult to know where to find relevant information (for RAG or other techniques) needed to answer a query. In addition, users are often unable to make optimal decisions to provide guidance (for RAG or other techniques) as to where or what to search due to their own limited knowledge of available data sources, such as when a new external database of information has only recently been created. There is also a wide variety of storage architectures used for these external data sources, each of which might require a different query language.
These challenges result in a burdensome search requiring significant computing resources (e.g., a plurality of singularly programmed search agents, sometimes one for each data source architecture or query language) and computing effort (e.g., computing resources used for review and identification of a relevant data source from a large number of available data sources) that is time-consuming and error-prone, among other issues.
Advantageously, dynamic agent-based searching over heterogeneous data sources is described. A natural language search capability enables users and autonomous agents to search across available data sources, independent of the underlying storage architectures or query languages. This capability leverages multiple task-specific agents, multimodal models, and data map representation-based approaches. Reasoning based on a data map representation directs searches by the task-specific agents. This enables integration of new data sources into the search process with minimal additional computing resources, among other advantages.
Some embodiments include a method for dynamic searching over heterogeneous data sources. Querying heterogeneous cybersecurity data is one example practical application. The method comprises determining a data map representation for available data sources. The data map representation indicates relationships between the available data sources. The available data sources have at least one of different storage architectures or different query languages. The method comprises determining a query intent associated with an input query by identifying entities and context associated with the input query. A query intent comprises an inference as to why a user asked a question or made a request in the input query, for example.
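The two determinations just described, a data map representation and a query intent, can be sketched in Python under stated assumptions. The class names (`DataSource`, `DataMap`, `QueryIntent`), the sample source names and fields, and the query-language labels are all hypothetical. Shared field names stand in for just one kind of relationship (joinable fields), and simple regular expressions stand in for the model-driven identification of entities and context that the disclosure contemplates.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DataSource:
    name: str
    query_language: str   # illustrative label, e.g. "KQL" or "SQL"
    description: str      # free-text metadata to aid semantic understanding
    fields: set[str] = field(default_factory=set)


@dataclass
class DataMap:
    """Toy data map: sources plus relationships derived from shared fields."""
    sources: dict[str, DataSource] = field(default_factory=dict)

    def add(self, source: DataSource) -> None:
        self.sources[source.name] = source

    def relationships(self) -> dict[tuple[str, str], set[str]]:
        # Record joinable fields for every pair of sources that share any.
        names = sorted(self.sources)
        related: dict[tuple[str, str], set[str]] = {}
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                shared = self.sources[a].fields & self.sources[b].fields
                if shared:
                    related[(a, b)] = shared
        return related


@dataclass
class QueryIntent:
    entities: dict[str, list[str]]
    context: list[str]


# Regex stand-ins (an assumption) for entity and context identification.
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
TIME_RE = re.compile(r"\b(?:last|past)\s+\d+\s+(?:hours?|days?|weeks?)\b",
                     re.IGNORECASE)


def determine_query_intent(query: str) -> QueryIntent:
    """Identify entities (IP addresses) and context (time windows) in a query."""
    return QueryIntent(entities={"ip_address": IP_RE.findall(query)},
                       context=TIME_RE.findall(query))


data_map = DataMap()
data_map.add(DataSource("sign_in_logs", "KQL", "User sign-in events",
                        {"user_id", "ip_address", "timestamp"}))
data_map.add(DataSource("network_flows", "SQL", "Network flow records",
                        {"device_id", "ip_address", "timestamp"}))

intent = determine_query_intent(
    "Which devices contacted 10.0.0.5 in the last 7 days?")
```

Here the data map records that the two sources can be joined on `ip_address` and `timestamp`, and the intent carries one IP address entity and one time-window context, which is the information a later mapping step would use to pick relevant sources.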
The method comprises determining a query intent (or search) task by mapping the query intent to a relevant data source in the available data sources according to the data map representation, and identified entities and context (e.g., reasoning). The query intent task is executed with a search agent parameterized for searching the relevant data source (e.g., the reasoning directing the search agent). Output comprises a result of an executed query intent task. The result of this executed query intent task is used to augment a response to the query for a user.
In some embodiments, multiple query intent tasks are determined by mapping the query intent to multiple relevant data sources in the available data sources according to the data map representation, and the identified entities and context. The multiple query intent tasks are executed with one or more search agents.
In some embodiments, the multiple query intent tasks are executed with multiple search agents individually parameterized for searching the multiple relevant data sources. Results of the multiple query intent tasks are enhanced by causing communication of information relevant to the query intent among the multiple search agents. Causing the communication of information relevant to the query intent among the multiple search agents comprises providing output from one search agent as additional input to another search agent for performing that search agent's query intent task. The information relevant to the query intent is determined by the multiple search agents as part of executing the multiple query intent tasks. The inform