CN-121981251-A - Multi-level cache and self-feedback driven retrieval-augmented generation (RAG) inference acceleration system
Abstract
The invention discloses a multi-level cache and self-feedback driven retrieval-augmented generation (RAG) inference acceleration system comprising an input processing module, a multi-level cache management module, a retrieval-augmented generation module, and a self-feedback optimization module. The multi-level cache management module adopts a three-level cache structure that pre-fills and caches exactly matched historical interaction data, similarity-matched inference results, and knowledge-base hotspot knowledge; reusing cached data during inference reduces redundant computation. The retrieval-augmented generation module serves a dual role: when a user request arrives, it searches the cache or the knowledge base with the query vector generated from the request, fuses the retrieval results with the user input into an augmented prompt, and drives the large language model to produce a targeted output; when no user request is pending, it proactively identifies hotspot knowledge in the knowledge base, generates knowledge-organization prompts from preset templates, drives the large language model to produce preprocessed results (such as core summaries and preset question-answer pairs), and stores them in the third-level cache. The self-feedback optimization module evaluates the accuracy, relevance, and timeliness of generated results and dynamically adjusts the retrieval parameters, generation parameters, and caching strategy, forming a closed optimization loop that reduces invalid computation and redundant output, improves inference speed and output quality, and markedly improves overall large-language-model collaboration efficiency.
Inventors
- CHEN ZHIHAO
- ZHU FUSHENG
- LAI WENBIN
- LIAO SHUJING
- FU JIEWEI
Assignees
- 广东省新一代通信与网络创新研究院
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-09
Claims (10)
- 1. A multi-level cache and self-feedback driven retrieval-augmented generation inference acceleration system, characterized by comprising an input processing module, a multi-level cache management module, a retrieval-augmented generation module, and a self-feedback optimization module, wherein: the input processing module is configured to receive multimodal data input by a user and, after preprocessing, convert it into a query vector bearing semantic features; the multi-level cache management module comprises a first-level cache, a second-level cache, and a third-level cache, which store, in layers prioritized from high to low, historical interaction data and hotspot knowledge of different matching degrees: the first-level cache stores historical prompts that exactly match the current query vector together with their generated results, the second-level cache stores historical prompts whose similarity to the current query vector exceeds a preset threshold together with their generated results, and the third-level cache stores hotspot information retrieved from a knowledge base; the retrieval-augmented generation module comprises a knowledge base, a retrieval module, a generation module, and a large-language-model inference engine, and serves a dual role: when user input exists, it searches a cache or the knowledge base with the query vector generated by the input processing module, fuses the retrieval result with the user input into an augmented prompt, and drives the large language model to generate an output for the user request; when no user input exists, the retrieval module proactively identifies hotspot knowledge segments in the knowledge base, the generation module generates knowledge-organization prompts from a preset template, and the large language model is driven to generate preprocessed results (such as core summaries and preset question-answer pairs) that are stored in the third-level cache to accelerate responses to subsequent related requests; the self-feedback optimization module performs multidimensional evaluation of the accuracy, relevance, and timeliness of generated results and dynamically adjusts the retrieval parameters, generation parameters, and caching strategy based on the evaluation, realizing closed-loop optimization without human intervention.
- 2. The system of claim 1, wherein the preprocessing of the input processing module comprises performing word segmentation, stop-word removal, punctuation cleaning, and entity normalization on text input, and semantic completion on speech-to-text results, and wherein the query vector is generated by a pre-trained embedding model so as to capture the core semantics of the user input.
- 3. The system of claim 1, wherein the hit-determination logic of the multi-level cache management module is: a first-level cache hit requires a cosine similarity of 1 (a perfect match) between the current query vector and a historical query vector; a second-level cache hit requires a similarity greater than a preset threshold between the current query vector and a historical query vector; and a third-level cache hit requires a similarity greater than a preset threshold between the current query vector and a hotspot-information association vector (a minimal lookup sketch appears after the claims).
- 4. The system of claim 1, wherein the update and eviction policy of the multi-level cache management module comprises the first-level cache and the second-level cache evicting data not accessed for a long time based on a least-recently-used (LRU) policy (see the LRU sketch after the claims).
- 5. The system of claim 1, wherein the knowledge base supports both networked and non-networked states and is adaptively matched with the enabling state of the third-level cache: when the knowledge base is networked, the third-level cache is enabled and synchronizes hotspot-information updates of the knowledge base in real time; when the knowledge base is non-networked, the third-level cache is disabled to avoid local resource redundancy.
- 6. The system of claim 1, wherein the retrieval logic of the retrieval-augmented generation module is that the retrieval module searches sequentially in the priority order of the first-level cache, the second-level cache, the third-level cache (if enabled), and the knowledge base; if no cache is hit, semantically related knowledge segments are retrieved from the knowledge base through a vector database and re-ranked using authority and timeliness labels; the generation module fuses the user's original input with the retrieved cache results or knowledge segments to construct an augmented prompt; and the large-language-model inference engine loads a pre-trained large language model to generate an output based on the augmented prompt (see the retrieval-flow sketch after the claims).
- 7. The system of claim 1, wherein the evaluation dimensions of the self-feedback optimization module comprise the degree to which a generated result matches knowledge-base facts, the semantic relevance of the generated result to the user's original input, and the validity of dynamic knowledge in the generated result.
- 8. The system of claim 1, wherein the dynamic adjustment strategy of the self-feedback optimization module comprises: when a generated result is evaluated as high quality, writing the current interaction data into the corresponding cache layer and keeping the current retrieval and generation parameters; when it is evaluated as medium quality, widening the retrieval range, lowering the second-level cache similarity threshold, and lowering the generation temperature to increase output rigor; and when it is evaluated as low quality, automatically triggering a cache refresh (clearing low-confidence data), adjusting the query-vector generation strategy (re-encoding the user input), and enabling the large language model's repetition-penalty mechanism (see the adjustment sketch after the claims).
- 9. The system of claim 6, wherein the augmented prompt is constructed by fusing the user's original input with the retrieved knowledge segments or cached results and includes explicit instructions, such as "generate based on the knowledge content" and "prefer the most recent data", to constrain the generation logic of the large language model.
- 10. The system of claim 1, wherein the third-level cache of the multi-level cache management module is further configured to store preprocessed results proactively generated from hotspot knowledge, including core information summaries and preset question-answer pairs, the preprocessed results being generated by driving the large language model with knowledge-organization prompts and being used to accelerate responses to subsequent related user requests.
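The hit logic of claims 1 and 3 maps naturally onto a small lookup routine. The following is a minimal Python sketch; the class layout, threshold values, and helper names (`MultiLevelCache`, `cosine`) are illustrative assumptions rather than anything specified by the claims.

```python
# Minimal sketch of the three-level cache hit logic of claims 1 and 3.
# Data structures and threshold values are illustrative.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two query vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class MultiLevelCache:
    def __init__(self, l2_threshold: float = 0.92, l3_threshold: float = 0.85,
                 l3_enabled: bool = True):
        self.l1 = []  # (query_vec, result): exactly matched history
        self.l2 = []  # (query_vec, result): similarity-matched history
        self.l3 = []  # (assoc_vec, preprocessed): hotspot knowledge
        self.l2_threshold = l2_threshold
        self.l3_threshold = l3_threshold
        self.l3_enabled = l3_enabled  # disabled for non-networked knowledge bases (claim 5)

    def lookup(self, q: np.ndarray):
        """Check caches in the priority order L1 -> L2 -> L3 (claim 6)."""
        # L1 hit: cosine similarity of 1 means a perfect match (claim 3).
        for vec, result in self.l1:
            if np.isclose(cosine(q, vec), 1.0):
                return "L1", result
        # L2 hit: similarity above the (tunable) preset threshold.
        best = max(self.l2, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) > self.l2_threshold:
            return "L2", best[1]
        # L3 hit: similarity with a hotspot-information association vector.
        if self.l3_enabled:
            best = max(self.l3, key=lambda e: cosine(q, e[0]), default=None)
            if best and cosine(q, best[0]) > self.l3_threshold:
                return "L3", best[1]
        return None, None  # miss: fall through to knowledge-base retrieval
```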
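Claim 4's eviction policy is the classic least-recently-used scheme. A minimal sketch, assuming an `OrderedDict`-backed layer with an illustrative capacity:

```python
# LRU eviction for the first- and second-level caches (claim 4).
from collections import OrderedDict

class LRUCacheLayer:
    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry
```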
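Claims 6 and 9 together describe a retrieve-then-prompt pipeline. The sketch below assumes generic `vector_db.search` and `llm.generate` interfaces and an illustrative prompt template; only the priority order, the reranking labels, and the explicit prompt constraints come from the claims.

```python
# Sketch of the retrieval-then-prompt flow of claims 6 and 9.
def answer(query_text, query_vec, cache, vector_db, llm):
    level, hit = cache.lookup(query_vec)  # L1 -> L2 -> L3 priority (claim 6)
    if level in ("L1", "L2"):
        return hit  # matched historical result: reuse without regeneration
    if level == "L3":
        context = hit  # preprocessed hotspot knowledge used as context
    else:
        # Cache miss: search the knowledge base via the vector database, then
        # rerank the segments by their authority and timeliness labels (claim 6).
        segments = vector_db.search(query_vec, top_k=5)
        segments.sort(key=lambda s: (s["authority"], s["timeliness"]), reverse=True)
        context = "\n".join(s["text"] for s in segments)
    # Augmented prompt with explicit constraints (claim 9).
    prompt = (
        "Generate based on the knowledge content below, preferring the most "
        f"recent data.\n\nKnowledge:\n{context}\n\nQuestion: {query_text}"
    )
    return llm.generate(prompt)
```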
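Claim 8 defines three quality grades with distinct reactions. A minimal sketch of that branching, where the concrete parameter deltas and the `cache` helper methods are illustrative (the claims name the actions but not their magnitudes):

```python
# Closed-loop parameter adjustment of claim 8. Deltas are illustrative.
def adjust(params: dict, quality: str, cache) -> dict:
    if quality == "high":
        # Write the interaction into the matching cache layer and keep the
        # current retrieval and generation parameters.
        cache.write_back()
    elif quality == "medium":
        params["top_k"] += 2                  # widen the retrieval range
        params["l2_threshold"] -= 0.02        # lower the L2 similarity threshold
        params["temperature"] = max(0.1, params["temperature"] - 0.1)  # tighten output
    else:  # low quality
        cache.refresh(drop_low_confidence=True)  # clear low-confidence entries
        params["reencode_input"] = True          # regenerate the query vector
        params["repetition_penalty"] = 1.2       # enable the repetition penalty
    return params
```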
Description
Technical Field
The invention relates to the technical field of artificial-intelligence inference acceleration, and in particular to a multi-level cache and self-feedback driven retrieval-augmented generation inference acceleration system.
Background
Large language models (LLMs) are widely used in intelligent question answering, dialogue interaction, and similar fields by virtue of their powerful semantic understanding and generation capabilities. However, LLM inference suffers from high computational complexity, large response latency, and significant hardware resource consumption; it is difficult to deploy efficiently in resource-constrained scenarios such as edge devices and mobile terminals, which severely restricts its range of application. Retrieval-augmented generation (RAG) compensates for the factual-accuracy weaknesses of LLMs by introducing external knowledge-base retrieval to assist generation, and has become a key technique for improving output reliability. Existing RAG techniques, however, still have the following prominent problems. Redundant computation is significant: repeated or similar user requests easily trigger identical knowledge-base retrieval and model generation, causing large amounts of duplicate computation, increased response latency, and wasted compute resources. The system also cannot autonomously diagnose inference deviations or repair output defects, so content stability and reliability are limited and the system struggles to adapt to dynamically changing, complex scenarios (such as real-time knowledge updates and diversified user demands). How to reduce redundant computation and build a RAG system with self-feedback optimization capability has therefore become a key problem for improving the applicability of LLMs in resource-constrained scenarios.
Disclosure of Invention
The invention aims to provide a multi-level cache and self-feedback driven RAG inference acceleration system that, through hierarchical cache reuse and a dynamic optimization mechanism, reduces the probability of repeated computation and the response latency of the system while improving the accuracy of LLM content generation. The system is suitable for resource-constrained scenarios such as edge devices and mobile terminals, and addresses the heavy redundant computation, reliance on manual tuning, and poor adaptability of existing RAG techniques. To achieve these goals, the invention adopts the following technical scheme.
The input processing module 101 is responsible for converting unstructured user input into structured semantic vectors usable for retrieval, providing an accurate basis for subsequent cache matching and knowledge-base retrieval. The module supports multimodal user requests such as text and speech (processed via speech-to-text conversion), is compatible with input forms such as natural-language questions and instruction-style queries, and adapts to diversified interaction scenarios.
For different types of input data, the input processing module performs targeted processing. For text input, word segmentation, stop-word removal, punctuation cleaning, and entity normalization are performed in sequence so that the extracted core information is unambiguous. For speech-to-text results, semantic completion is additionally applied (for example, correcting sentence-segmentation errors caused by speech-recognition mistakes and filling in omitted components) to prevent recognition deviations from affecting subsequent semantic understanding. After preprocessing, the input processing module converts the text into a high-dimensional query vector through a pre-trained embedding model; the numerical distribution of this vector in the embedding space captures the deep semantics of the user input, laying the foundation for accurate cache matching and knowledge-base retrieval. The output of the input processing module directly determines the efficiency of subsequent retrieval and inference: the more accurately the vector captures the semantics, the higher the cache hit probability, the lower the redundancy of knowledge-base retrieval, and the faster the overall response of the system.
The multi-level cache management module 102 maximizes reuse of historical computation results and reduces invalid repeated computation by storing different types of inference data hierarchically. The module adopts a three-level cache structure.
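A minimal sketch of the text path through module 101, preprocessing followed by embedding. The stop-word list is illustrative, and `embed` is a toy stand-in for the pre-trained embedding model, which the description does not name:

```python
# Text preprocessing and query-vector generation (module 101). The stop-word
# set and the embed() stub are placeholders for illustration only.
import re
import numpy as np

STOP_WORDS = {"the", "a", "an", "of", "to", "is"}  # illustrative only

def preprocess(text: str) -> str:
    text = re.sub(r"[^\w\s]", " ", text.lower())               # punctuation cleaning
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # stop-word removal
    return " ".join(tokens)

def embed(text: str, dim: int = 384) -> np.ndarray:
    """Toy stand-in for a pre-trained embedding model producing a unit query vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))  # deterministic within a run
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

query_vector = embed(preprocess("What is the latest 5G spectrum policy?"))
```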