US-12619501-B2 - Data retrieval using embeddings for data in backup systems

US 12619501 B2

Abstract

In general, techniques for efficient data retrieval from a backup system are described. An example computing system includes one or more storage devices and processing circuitry having access to the one or more storage devices and configured to: process an input to generate a filter, wherein the input indicates a context for one or more queries; apply the filter to backup data to obtain filtered data from the backup data; generate an index of embeddings from the filtered data; process, based on the index of embeddings, a query to generate a response for the query; and output the response.

Inventors

  • Gregory Statton
  • Sanjay Poonen
  • Mohit Aron
  • Apurv Gupta

Assignees

  • Cohesity, Inc.

Dates

Publication Date
2026-05-05
Application Date
2024-03-27
Priority Date
2023-05-04

Claims (15)

  1. A computing system comprising: one or more storage devices; and processing circuitry having access to the one or more storage devices and configured to: process an input received from a user or application, the input comprising a natural language query and indicative of a context for one or more queries subsequently expected from a user or application, to dynamically generate a filter; apply the dynamically generated filter to backup data to obtain filtered data from the backup data; encode the filtered data to generate an embedding for each item of the filtered data; generate, from the generated embeddings, an on-demand index of embeddings for the filtered data filtered according to the dynamically generated filter; process, based on the on-demand index of embeddings, a subsequent retrieval augmented generation (RAG) query to generate, by a language model, a context-aware response for the subsequent RAG query; output the context-aware response; and based on at least one of a determination that a period of time has elapsed since the generation of the on-demand index of embeddings or a determination that a number of times the on-demand index of embeddings is used over a period of time is below a threshold, delete the on-demand index of the embeddings.
  2. The computing system of claim 1, wherein the dynamically generated filter specifies one or more of a file type, an association with an entity, a date, a time, or a topic.
  3. The computing system of claim 1, wherein to dynamically generate the filter the processing circuitry is configured to: tokenize the backup data to generate preprocessed text data; apply a CountVectorizer to the preprocessed text data to compute a matrix of token counts; and process the matrix of token counts with a machine learning model to classify items of the preprocessed text data to any of a plurality of classes.
  4. The computing system of claim 3, wherein to apply the dynamically generated filter the processing circuitry is configured to: include, in the filtered data, items of preprocessed text data that are assigned a class that matches the filter.
  5. The computing system of claim 1, wherein to dynamically generate the filter the processing circuitry is configured to: apply a role-based access control to at least one of the subsequent RAG query or to the filter generation.
  6. The computing system of claim 1, wherein to process the subsequent RAG query the processing circuitry is configured to: apply retrieval augmented generation using the subsequent RAG query and the on-demand index of embeddings to generate the context-aware response for the subsequent RAG query.
  7. The computing system of claim 1, wherein the input further comprises the subsequent RAG query.
  8. The computing system of claim 1, wherein the processing circuitry is configured to receive the subsequent RAG query after the on-demand index of embeddings is generated.
  9. The computing system of claim 1, wherein the processing circuitry is configured to store at least a portion of the backup data to a cache.
  10. The computing system of claim 9, wherein the processing circuitry is configured to generate or modify an embedding of the on-demand index of embeddings with a reference to corresponding backup data for the embedding stored to the cache.
  11. The computing system of claim 9, wherein the processing circuitry is configured to process, based on the on-demand index of embeddings and the cache, a second subsequent RAG query to generate a context-aware response for the second subsequent RAG query.
  12. A method comprising: processing, by a computing system, an input received from a user or application, the input comprising a natural language query and indicative of a context for one or more queries subsequently expected from a user or application, to dynamically generate a filter; applying the dynamically generated filter to backup data to obtain filtered data from the backup data; encoding the filtered data to generate an embedding for each item of the filtered data; generating, from the generated embeddings, an on-demand index of embeddings for the filtered data filtered according to the dynamically generated filter; processing, based on the on-demand index of embeddings, a subsequent retrieval augmented generation (RAG) query to generate, by a language model, a context-aware response for the subsequent RAG query; outputting the context-aware response; and based on at least one of a determination that a period of time has elapsed since the generation of the on-demand index of embeddings or a determination that a number of times the on-demand index of embeddings is used over a period of time is below a threshold, deleting the on-demand index of the embeddings.
  13. The method of claim 12, wherein processing the subsequent RAG query comprises applying retrieval augmented generation using the subsequent RAG query and the on-demand index of embeddings to generate the context-aware response for the subsequent RAG query.
  14. The method of claim 12, further comprising: storing at least a portion of the backup data to a cache; generating or modifying an embedding of the on-demand index of embeddings with a reference to corresponding backup data for the embedding stored to the cache; and processing, based on the on-demand index of embeddings and the cache, a second subsequent RAG query to generate a context-aware response for the second subsequent RAG query.
  15. Non-transitory computer-readable media comprising instructions that, when executed by processing circuitry, cause the processing circuitry to: process an input received from a user or application, the input comprising a natural language query and indicative of a context for one or more queries subsequently expected from a user or application, to dynamically generate a filter; apply the dynamically generated filter to backup data to obtain filtered data from the backup data; encode the filtered data to generate an embedding for each item of the filtered data; generate, from the generated embeddings, an on-demand index of embeddings for the filtered data filtered according to the dynamically generated filter; process, based on the on-demand index of embeddings, a subsequent retrieval augmented generation (RAG) query to generate, by a language model, a context-aware response for the subsequent RAG query; output the context-aware response; and based on at least one of a determination that a period of time has elapsed since the generation of the on-demand index of embeddings or a determination that a number of times the on-demand index of embeddings is used over a period of time is below a threshold, delete the on-demand index of the embeddings.
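As a rough illustration of the index lifecycle recited in claim 1 — an on-demand index of embeddings created from filtered data, then deleted when a period of time has elapsed or usage falls below a threshold — the following Python sketch uses a trivial hash-based stand-in for an embedding model. All names, thresholds, and the hashing "encoder" are hypothetical, not the patent's implementation.

```python
import time
import hashlib

class OnDemandIndex:
    """Toy on-demand index of embeddings with the eviction policy of claim 1:
    delete when a TTL elapses or usage over a window falls below a threshold."""

    def __init__(self, items, ttl_seconds=3600, min_uses=1):
        self.created_at = time.time()
        self.ttl = ttl_seconds
        self.min_uses = min_uses
        self.use_count = 0
        # "Encode" each filtered item; a real system would call an embedding model.
        self.embeddings = {item: self._embed(item) for item in items}

    @staticmethod
    def _embed(text):
        # Stand-in embedding: bytes of a hash, NOT a semantic vector.
        digest = hashlib.sha256(text.encode()).digest()[:8]
        return [b / 255 for b in digest]

    def query(self, q):
        self.use_count += 1
        # Nearest item by a toy distance between stand-in embeddings.
        qe = self._embed(q)
        return min(self.embeddings,
                   key=lambda item: sum((a - b) ** 2
                                        for a, b in zip(qe, self.embeddings[item])))

    def should_delete(self, now=None):
        now = time.time() if now is None else now
        expired = (now - self.created_at) > self.ttl
        underused = self.use_count < self.min_uses
        return expired or underused

idx = OnDemandIndex(["quarterly report.pdf", "team photos.zip"], ttl_seconds=10)
idx.query("report")
print(idx.should_delete())                       # used once, within TTL -> False
print(idx.should_delete(now=time.time() + 60))   # TTL elapsed -> True
```

The two `should_delete` conditions mirror the claim's disjunctive test: either branch alone is enough to trigger deletion of the index.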

Description

RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application No. 63/503,631, filed 22 May 2023, and of India Provisional Patent Application No. 202341031783, filed 4 May 2023; the entire content of each application is incorporated herein by reference.

TECHNICAL FIELD

This disclosure relates to data platforms for computing systems and, more particularly, to data retrieval from backup systems.

BACKGROUND

Data platforms that support computing applications rely on primary storage systems to support latency-sensitive applications. A secondary storage system is often relied upon to support secondary use cases such as backup and archive. Backup data is commonly queried to retrieve specific information or datasets from storage systems, enabling data analysis, data recovery, data mining, forensic analysis, and compliance with regulatory requirements.

Many data platform solutions maintain an index or catalog of backed-up data, which facilitates efficient querying of backup data. The data platform enables users to search the backup index based on query criteria, and the data platform executes a query against the backup index, where the query specifies the search criteria and any additional parameters required. The query may involve searching for specific files, folders, databases, email messages, or other types of data stored in a backup. Based on the query results, which can include metadata information describing the backup data, such as file names, sizes, timestamps, and backup versions, the user can select specific data or datasets to retrieve from the backup. This may involve selecting individual objects or entire backups.

SUMMARY

In general, techniques for artificial intelligence (AI)-enhanced and efficient data retrieval from a backup system are described. In some examples, a data platform produces an index of embeddings for filtered backup data stored on a backup system.
This index of embeddings may be effectively “scoped” to a context for a set of one or more queries expected from a user or application and, in some cases, may be generated in an on-demand manner based on received inputs. A response generation platform receives an input indicative of context for queries to the response generation platform. A filter generator processes the input to determine types of data relevant to the queries. For example, the filter generator may analyze the input using a machine learning model to decode the types of data the user is interested in (e.g., Email data, File Share data, Databases, or other unstructured data). The filter generator may generate a filter unique to the input, based on the decoded data types, and the response generation platform applies the filter to data of backups to create an index of embeddings based on the data that is filtered using the filter generated based on the input. This index of embeddings is then available to drive retrieval augmented generation (RAG) queries of the backup data. The techniques may provide one or more technical advantages. For example, the techniques may allow customers or other users to make stored backup, archive, or other data “AI-ready” by creating an index of advanced metadata/embeddings for the stored data and, in some aspects, securing that index through fine-grained role-based access controls. The customers and other users that store backup or other data on a storage system may re-leverage that data using artificial intelligence and machine learning models to gain efficiencies elsewhere in their workflows, while keeping the data securely associated with the data platform. In some examples, the response generation platform is a retrieval-augmented response generation platform that accepts a user or application input, such as a question or a query.
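The filter-generation step described above (and recited in claim 3 as tokenizing the data, computing a matrix of token counts, and classifying items with a machine learning model) can be sketched as follows. The keyword-scoring classifier is a hypothetical stand-in for the machine learning model, and the sparse count matrix mirrors what a tool such as scikit-learn's CountVectorizer would produce; the class labels and sample data are invented for illustration.

```python
from collections import Counter

# Candidate classes the filter generator might assign (hypothetical labels).
CLASS_KEYWORDS = {
    "Email": {"mailbox", "inbox", "message", "sender"},
    "FileShare": {"folder", "share", "document", "spreadsheet"},
    "Database": {"table", "row", "schema", "query"},
}

def tokenize(text):
    # Minimal preprocessing: lowercase and split on whitespace.
    return text.lower().split()

def token_count_matrix(items):
    # One Counter per item: a sparse form of the token-count matrix that
    # a CountVectorizer-style tool would compute.
    return [Counter(tokenize(item)) for item in items]

def classify(counts):
    # Stand-in for the machine learning model: score each class by how many
    # of its keywords appear, weighted by token counts.
    scores = {cls: sum(counts[w] for w in kws) for cls, kws in CLASS_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unknown"

def apply_filter(backup_items, wanted_class):
    # Keep only items whose assigned class matches the dynamically generated filter.
    matrix = token_count_matrix(backup_items)
    return [item for item, counts in zip(backup_items, matrix)
            if classify(counts) == wanted_class]

backup = [
    "inbox message from sender alice",
    "shared folder with one spreadsheet document",
    "schema for orders table with row counts",
]
print(apply_filter(backup, "Email"))  # -> ['inbox message from sender alice']
```

Per claim 4, the filtered output is then what gets encoded into the on-demand index of embeddings.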
The input may be tokenized, with some keywords extracted that are used to filter the large amount of data included in the backup data down to a smaller subset of data. The response generation platform then selects representations from within those documents or objects that are most relevant to the user or machine query as an index of embeddings. The index of embeddings is provided, along with the original query, to a language model to enable a query processor to provide a context-aware response. One or more additional queries may be received that are relevant to the context indicated by the input, and the query processor may also use an index of embeddings to generate corresponding responses for those queries. This approach allows generated responses to be not only knowledgeable but also diverse and relevant to domain-specific content. The techniques leverage AI and machine learning, in particular generative AI, to inspect data managed by a data platform and produce new and original content based on that data. Generative AI tools use sophisticated algorithms to assess data and derive novel and unique insights, thereby improving decision-making and streamlining
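The retrieval step of the summary above — selecting the representations most relevant to a query from the index of embeddings, then handing them to a language model with the original query — can be sketched as follows. The bag-of-words "embeddings" and cosine similarity are stand-ins for a real embedding model, and the prompt format and sample documents are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in embedding: bag-of-words counts; a real system would use a
    # learned embedding model over the filtered backup data.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(index, query, k=2):
    # Select the k items most relevant to the query from the index of embeddings.
    qe = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qe, index[item]), reverse=True)
    return ranked[:k]

def build_prompt(query, context_items):
    # The retrieved items and the original query are handed to a language model.
    context = "\n".join(f"- {c}" for c in context_items)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "invoice for may 2023 from acme corp",
    "meeting notes about backup retention policy",
    "backup retention schedule for databases",
]
index = {d: embed(d) for d in docs}
top = retrieve(index, "backup retention policy", k=2)
print(build_prompt("backup retention policy", top))
```

Because the index is scoped by the dynamically generated filter, the retrieval runs over the relevant subset rather than the full backup corpus, which is the efficiency the techniques aim for.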