US-20260127227-A1 - INTELLIGENT, CUSTOMIZABLE RAG WITH CONTEXTUAL COMPRESSION
Abstract
In one implementation, a device retrieves a set of documents based on their relevancy to an input query from a user interface. The device extracts excerpts of varying sizes from the set of documents that are relevant to the input query. The device performs a ranking of the excerpts based on their relevancy to the input query. The device augments, based on the ranking, the input query based on one or more of the excerpts to form a prompt for input to a language model.
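For illustration only, the four steps summarized in the abstract (retrieve, extract, rank, augment) can be sketched in Python. The function names and the toy word-overlap relevancy score below are hypothetical stand-ins, not part of the disclosure, which does not prescribe any particular scoring model.

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercase word tokens, used by the toy relevancy score."""
    return set(re.findall(r"[a-z]+", text.lower()))

def relevancy(query: str, text: str) -> int:
    """Toy relevancy: number of words shared between query and text."""
    return len(_tokens(query) & _tokens(text))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Step 1: retrieve the k documents most relevant to the query."""
    return sorted(corpus, key=lambda d: relevancy(query, d), reverse=True)[:k]

def extract_excerpts(query: str, docs: list[str]) -> list[str]:
    """Step 2: keep only sentence-sized excerpts relevant to the query,
    filtering out irrelevant text from the retrieved documents."""
    return [
        s.strip(". ")
        for doc in docs
        for s in doc.split(". ")
        if relevancy(query, s) > 0
    ]

def build_prompt(query: str, corpus: list[str], top_n: int = 2) -> str:
    docs = retrieve(query, corpus)
    excerpts = extract_excerpts(query, docs)
    # Step 3: rank the excerpts by relevancy to the query.
    ranked = sorted(excerpts, key=lambda e: relevancy(query, e), reverse=True)
    # Step 4: augment the query with the top-ranked excerpts to form a prompt.
    context = "\n".join(ranked[:top_n])
    return f"Context:\n{context}\n\nQuestion: {query}"
```

In a real system the word-overlap score would be replaced by embedding similarity or a learned re-ranker, and the resulting prompt would be submitted to a language model.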
Inventors
- Ali Payani
- Mahesh Viswanathan
- Andrea Morandi
- Ramin Pishehvar
Assignees
- Cisco Technology, Inc.
Dates
- Publication Date
- 2026-05-07
- Application Date
- 2024-11-04
Claims (20)
- 1. A method, comprising: retrieving, by a device, a set of documents based on their relevancy to an input query from a user interface; extracting, by the device, excerpts of varying sizes from the set of documents that are relevant to the input query, wherein the extracting filters information irrelevant to the input query from the set of documents; performing, by the device, a ranking of the excerpts based on their relevancy to the input query; and augmenting, by the device and based on the ranking, the input query based on one or more of the excerpts to form a prompt for input to a language model.
- 2. The method as in claim 1, wherein the language model is a large language model (LLM).
- 3. The method as in claim 1, further comprising: providing, by the device and to a user interface, an output generated by the language model in response to the prompt.
- 4. The method as in claim 1, further comprising: ranking the set of documents based on their relevancy to the input query, wherein the device extracts the excerpts based on this ranking.
- 5. The method as in claim 1, further comprising: providing, by the device, the set of documents to a user interface for review; and receiving, by the device and from the user interface, a selection of the set of documents, prior to extracting the excerpts.
- 6. The method as in claim 1, further comprising: generating summaries of the set of documents based on their excerpts, wherein the device uses the summaries to augment the input query.
- 7. The method as in claim 1, wherein the varying sizes comprise at least one of: a singular sentence, a paragraph, or a plurality of paragraphs.
- 8. The method as in claim 1, wherein the device extracts the excerpts from the set of documents based in part on a request associated with the input query to augment it using context-aware retrieval augmented generation (RAG).
- 9. The method as in claim 1, wherein the device retrieves the set of documents from a larger set of documents based on their relevancy to the input query.
- 10. The method as in claim 1, further comprising: storing, by the device, the excerpts for future augmentation of another input query.
- 11. An apparatus, comprising: one or more network interfaces; a processor coupled to the one or more network interfaces and configured to execute one or more processes; and a memory configured to store a process that is executable by the processor, the process when executed configured to: retrieve a set of documents based on their relevancy to an input query from a user interface; extract excerpts of varying sizes from the set of documents that are relevant to the input query, wherein the extracting filters information irrelevant to the input query from the set of documents; perform a ranking of the excerpts based on their relevancy to the input query; and augment, based on the ranking, the input query based on one or more of the excerpts to form a prompt for input to a language model.
- 12. The apparatus as in claim 11, wherein the language model is a large language model (LLM).
- 13. The apparatus as in claim 11, wherein the process when executed is further configured to: provide, to a user interface, an output generated by the language model in response to the prompt.
- 14. The apparatus as in claim 11, wherein the process when executed is further configured to: rank the set of documents based on their relevancy to the input query, wherein the apparatus extracts the excerpts based on this ranking.
- 15. The apparatus as in claim 11, wherein the process when executed is further configured to: provide the set of documents to a user interface for review; and receive, from the user interface, a selection of the set of documents, prior to extracting the excerpts.
- 16. The apparatus as in claim 11, wherein the process when executed is further configured to: generate summaries of the set of documents based on their excerpts, wherein the apparatus uses the summaries to augment the input query.
- 17. The apparatus as in claim 11, wherein the varying sizes comprise at least one of: a singular sentence, a paragraph, or a plurality of paragraphs.
- 18. The apparatus as in claim 11, wherein the apparatus extracts the excerpts from the set of documents based in part on a request associated with the input query to augment it using context-aware retrieval augmented generation (RAG).
- 19. The apparatus as in claim 11, wherein the apparatus retrieves the set of documents from a larger set of documents based on their relevancy to the input query.
- 20. A tangible, non-transitory, computer-readable medium storing program instructions that cause a device to execute a process comprising: retrieving, by a device, a set of documents based on their relevancy to an input query from a user interface; extracting, by the device, excerpts of varying sizes from the set of documents that are relevant to the input query, wherein the extracting filters information irrelevant to the input query from the set of documents; performing, by the device, a ranking of the excerpts based on their relevancy to the input query; and augmenting, by the device and based on the ranking, the input query based on one or more of the excerpts to form a prompt for input to a language model.
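For illustration only, claims 1 and 7 recite excerpts of varying sizes (a singular sentence, a paragraph, or a plurality of paragraphs). One hypothetical way to generate candidates at those three granularities, using names not drawn from the specification:

```python
def candidate_excerpts(document: str) -> list[str]:
    """Generate candidate excerpts at three granularities: single
    sentences, whole paragraphs, and pairs of adjacent paragraphs."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    excerpts = []
    # Granularity 1: singular sentences.
    for p in paragraphs:
        excerpts.extend(s.strip() for s in p.split(". ") if s.strip())
    # Granularity 2: whole paragraphs.
    excerpts.extend(paragraphs)
    # Granularity 3: sliding windows of two adjacent paragraphs.
    for a, b in zip(paragraphs, paragraphs[1:]):
        excerpts.append(a + "\n\n" + b)
    return excerpts
```

Each candidate would then be scored and ranked against the input query, so that the prompt can mix small and large excerpts depending on where the relevant information sits.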
Description
TECHNICAL FIELD
The present disclosure relates generally to retrieval augmented generation (RAG) systems and, more particularly, to intelligent, customizable RAG with contextual compression.
BACKGROUND
Recent advancements in artificial intelligence models (e.g., language models such as large language models (LLMs)) have opened new possibilities across various industries. Specifically, the ability of these models to follow instructions enables their integration with tools (e.g., plugins) that are able to perform tasks such as searching the web, executing code, etc. Many LLM-based solutions utilize some form of document storage they can query against, e.g., vector databases. This allows for the retrieval of information specific and relevant to the query. For example, in Retrieval Augmented Generation (RAG) systems, the model responds to user queries with reference to a specified set of documents stored in a vector database and uses this information in preference to information drawn from its own large, static training data. Semantic search is customarily used for this type of information retrieval to select the most relevant documents, which are then used to augment the query. However, one challenge with semantic search is that the designer of the search system often does not know the specific queries that users will invoke for retrieval. This means that the information most relevant to a query may be buried in a document along with a great deal of irrelevant text. In addition, the retrieved documents may also contain other topics that are somewhat related but not pertinent to the query. If all of these document sections are passed along to the LLM, the LLM may become confused and fail to provide the specific information that the user desires, reducing the accuracy of the system.
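For illustration only, the semantic search described above can be sketched as nearest-neighbor lookup over document vectors. The bag-of-words "embedding" below is a toy stand-in for a learned embedding model, and all names are hypothetical rather than taken from the disclosure:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def semantic_search(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k documents whose vectors are closest to the query vector,
    as a vector database would for a RAG retrieval step."""
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]
```

The limitation described in the background follows directly: this retrieval returns whole documents, so relevant sentences arrive bundled with irrelevant surrounding text unless a contextual compression step extracts and ranks excerpts afterward.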
BRIEF DESCRIPTION OF THE DRAWINGS
The implementations herein may be better understood by referring to the following description in conjunction with the accompanying drawings, in which like reference numerals indicate identical or functionally similar elements, of which:
FIG. 1 illustrates an example computer network;
FIG. 2 illustrates an example computing device/node;
FIG. 3 illustrates an example of a user interfacing with a language model;
FIG. 4 illustrates an example architecture for an artificial intelligence (AI) agent;
FIG. 5 illustrates an example workflow for a retrieval augmented generation (RAG) system;
FIG. 6 illustrates an example of contextual compression in a RAG system;
FIG. 7 illustrates an example user interface for entry of a query; and
FIG. 8 illustrates an example of a simplified procedure for generating an output in a RAG system, in accordance with one or more implementations described herein.
DESCRIPTION OF EXAMPLE IMPLEMENTATIONS
Overview
According to one or more implementations of the disclosure, a device retrieves a set of documents based on their relevancy to an input query from a user interface. The device extracts excerpts of varying sizes from the set of documents that are relevant to the input query. The device performs a ranking of the excerpts based on their relevancy to the input query. The device augments, based on the ranking, the input query based on one or more of the excerpts to form a prompt for input to a language model. Other implementations are described below, and this overview is not meant to limit the scope of the present disclosure.
Description
A computer network is a geographically distributed collection of nodes interconnected by communication links and segments for transporting data between end nodes, such as personal computers and workstations, or other devices, such as sensors, etc. Many types of networks are available, ranging from local area networks (LANs) to wide area networks (WANs).
LANs typically connect the nodes over dedicated private communications links located in the same general physical location, such as a building or campus. WANs, on the other hand, typically connect geographically dispersed nodes over long-distance communications links, such as common carrier telephone lines, optical lightpaths, synchronous optical networks (SONET), synchronous digital hierarchy (SDH) links, and others. The Internet is an example of a WAN that connects disparate networks throughout the world, providing global communication between nodes on various networks. Other types of networks, such as field area networks (FANs), neighborhood area networks (NANs), personal area networks (PANs), enterprise networks, etc., may also make up the components of any given computer network. In addition, a Mobile Ad-Hoc Network (MANET) is a kind of wireless ad-hoc network, which is generally considered a self-configuring network of mobile routers (and associated hosts) connected by wireless links, the union of which forms an arbitrary topology.
FIG. 1 is a schematic block diagram of an example simplified computing system (e.g., the computing system 100), which includes client devices 102 (e.g., a fi