CN-121980066-A - Multi-knowledge-base dynamic routing method and system based on reinforcement learning

Abstract

The invention discloses a multi-knowledge-base dynamic routing method and system based on reinforcement learning, belonging to the technical field of artificial intelligence. Existing retrieval-augmented generation (RAG) techniques suffer from high resource consumption and heavy noise when every library is searched, and static routing cannot adapt to knowledge base updates. To address these problems, the method constructs a knowledge base environment model and obtains the static attribute features and dynamic state features of each knowledge base; constructs a state feature vector for each user query through multi-modal feature fusion; calculates a selection score for every knowledge base using an improved upper confidence bound (UCB) algorithm, which contains an exploitation term and an exploration term to balance exploitation against exploration; determines the target knowledge base from the scores, executes retrieval there and generates an answer; and finally updates the model parameters based on a reward function. The invention aims to automatically select the optimal knowledge source according to the user's intent and the real-time state of the knowledge bases, improving the response speed and accuracy of the question-answering system, reducing resource consumption, and providing cold-start adaptability.
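
The score, select and update loop summarized in the abstract can be sketched as follows. This is a minimal, non-contextual illustration that assumes the textbook UCB1 form (historical average reward plus c * sqrt(ln T / N_i)); the invention's "improved" formula is not reproduced in this text and may differ, and the state feature vector is deliberately left out. The class name `UCBRouter`, its fields, the default value of `c`, and the sample knowledge base names are all hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class UCBRouter:
    """Tracks per-knowledge-base selection counts and running average rewards."""
    kb_names: list[str]
    c: float = 1.0  # exploration constant (assumed value, not from the source)
    counts: dict[str, int] = field(default_factory=dict)
    avg_reward: dict[str, float] = field(default_factory=dict)
    total_requests: int = 0

    def __post_init__(self) -> None:
        for name in self.kb_names:
            self.counts.setdefault(name, 0)
            self.avg_reward.setdefault(name, 0.0)

    def score(self, name: str) -> float:
        """Exploitation term plus exploration term (assumed UCB1 form)."""
        n = self.counts[name]
        if n == 0:
            # A knowledge base that was never selected is tried first (cold start).
            return math.inf
        exploit = self.avg_reward[name]
        explore = self.c * math.sqrt(math.log(self.total_requests) / n)
        return exploit + explore

    def select(self) -> str:
        """Pick the knowledge base with the largest selection score."""
        return max(self.kb_names, key=self.score)

    def update(self, name: str, reward: float) -> None:
        """Incrementally update the selection count and the average reward."""
        self.total_requests += 1
        self.counts[name] += 1
        self.avg_reward[name] += (reward - self.avg_reward[name]) / self.counts[name]


if __name__ == "__main__":
    router = UCBRouter(["ops_docs", "code_repo", "finance"])
    # Fixed rewards stand in for the real feedback evaluation step.
    simulated_reward = {"ops_docs": 0.2, "code_repo": 0.8, "finance": 0.4}
    for _ in range(200):
        kb = router.select()
        router.update(kb, simulated_reward[kb])
    print(router.counts)
```

The incremental average keeps only two numbers per knowledge base, so the router needs no reward history, and a newly added knowledge base starts with a count of zero and is therefore explored immediately.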

Inventors

ZHANG BO
Lu Guangbing
LIU XIAOHONG
Xiong Zhue
CHI PEI
LUO JIAYAN
CHEN YUN
LIU XINFENG
ZHU JIANKUN
WEI YI
CUI LEIMING

Assignees

Chongqing Polycomp International Corp. (重庆国际复合材料股份有限公司)

Dates

Publication Date: 2026-05-05
Application Date: 2026-04-07

Claims (10)

1. A multi-knowledge-base dynamic routing method based on reinforcement learning, characterized by comprising the following steps: defining a multi-source knowledge base set, and acquiring static attribute features and dynamic state features of each knowledge base; for a current user query request, constructing a state feature vector through multi-modal feature fusion, wherein the state feature vector comprises query semantic features, knowledge base semantic coverage features, knowledge base timeliness features and historical interaction features, and the historical interaction features comprise, for each knowledge base, the total number of times it has been selected up to the current moment and its historical average reward; constructing a routing decision model using an improved upper confidence bound (UCB) algorithm within a contextual multi-armed bandit framework, and calculating a selection score for each knowledge base through the model, wherein the improved UCB algorithm comprises an exploitation term and an exploration term; selecting a target knowledge base according to the selection scores, and executing a retrieval operation in the target knowledge base to generate an answer; and calculating a reward value according to the retrieval result and the generated answer, and updating parameters of the routing decision model based on the reward value.
2. The reinforcement learning-based multi-knowledge-base dynamic routing method of claim 1, wherein the knowledge base semantic coverage feature characterizes the degree of overlap between a user query request q and the content feature space of a knowledge base k_i, and is calculated by a formula [formula not reproduced in this text] whose terms denote the semantic coverage, the query semantic vector of the user query request q, and the feature centroid vector of the knowledge base k_i; wherein the feature centroid vector of the knowledge base k_i is calculated as follows: document feature encoding, namely using a pre-trained language model to map all documents or document slices in the knowledge base k_i, one by one, into d-dimensional dense semantic vectors, thereby constructing the vector set of the knowledge base; and centroid aggregation, namely computing the arithmetic mean of all vectors in the vector set along each feature dimension to obtain the central coordinate vector of the knowledge base in the overall feature space, which is the feature centroid vector.
3. The reinforcement learning-based multi-knowledge-base dynamic routing method of claim 2, wherein the knowledge base timeliness feature uses an exponential decay model to quantify the freshness of the knowledge in the knowledge base k_i, and is calculated by a formula [formula not reproduced in this text] whose terms denote the timeliness feature, the decay coefficient, and the average update interval, that is, the average number of days between the last update of the documents in the knowledge base k_i related to the user query request and the current moment; wherein the average update interval is obtained by extracting the N candidate documents in the knowledge base that rank highest by matching score in a preliminary retrieval against the query request, obtaining the last-update timestamp of each document, calculating the number of days from each timestamp to the current moment, and taking the arithmetic mean of the N day counts as the final value of the average update interval.
4. The reinforcement learning-based multi-knowledge-base dynamic routing method of claim 1, wherein the improved upper confidence bound (UCB) algorithm computes the selection score by a formula [formula not reproduced in this text] whose terms denote: the selection score of the i-th knowledge base at time t; the exploitation term, namely the historical average reward value of the i-th knowledge base before time t; and the exploration term, in which T is the total number of requests the system has handled so far, the selection count is the historical total number of times the i-th knowledge base has been selected, and c is the exploration constant, a hyperparameter that adjusts the exploration weight.
5. The reinforcement learning-based multi-knowledge-base dynamic routing method of claim 1, wherein the reward value is calculated by a formula [formula not reproduced in this text] whose terms denote: the text overlap score between the generated answer and the standard reference answer; the time overhead from the start of the routing request to the generation of the final answer; the factual accuracy score; the penalty term applied when a routing error yields no result; and three weight coefficients, one for each index.
6. The reinforcement learning-based multi-knowledge-base dynamic routing method of claim 1, wherein constructing the state feature vector through multi-modal feature fusion comprises: concatenating the query semantic vector of the user query request with the metadata features of the knowledge base and mapping the result through a multi-layer perceptron (MLP), so that high-dimensional sparse features are compressed into a low-dimensional dense routing vector that is merged into the state feature vector, wherein the metadata features of the knowledge base comprise the domain label, the document type and the access permission level of the knowledge base.
7. The reinforcement learning-based multi-knowledge-base dynamic routing method of claim 1, wherein selecting a target knowledge base according to the selection scores specifically comprises selecting the knowledge base with the largest selection score as the target knowledge base.
8. The reinforcement learning-based multi-knowledge-base dynamic routing method of claim 1 or claim 7, wherein when the maximum selection score is below a preset safety threshold, a multi-library joint retrieval mode is triggered, comprising: parallel retrieval and de-duplication, namely selecting the knowledge bases whose selection scores rank in the top m, executing retrieval in them in parallel to obtain multiple paths of candidate document fragments, and de-duplicating the fragments; cross re-ranking and relevance score calculation, namely either inputting the de-duplicated document fragments into a cross-encoder re-ranking model and calculating the deep semantic matching probability between each document fragment and the query request q as its relevance score, or using a reciprocal rank fusion algorithm to calculate a combined relevance score from each document's original rank in the retrieval results of each library; and feedback signal calculation, namely extracting the average relevance score Rel_avg of the Top-k documents after cross re-ranking, taking Rel_avg as a positive reward compensation term, and adding it linearly into the reward value calculation, so that the reward value is updated by a formula [formula not reproduced in this text] whose terms denote: the text overlap score between the generated answer and the standard reference answer; the time overhead from the start of the routing request to the generation of the final answer; the factual accuracy score; the penalty term applied when a routing error yields no result; three weight coefficients, one for each index; and the relevance compensation weight.
9. The reinforcement learning-based multi-knowledge-base dynamic routing method of claim 1, wherein the parameters of the routing decision model are updated based on the reward value in an incremental manner, using two formulas [formulas not reproduced in this text] that yield, respectively, the updated cumulative selection count of the knowledge base and the updated historical average reward value of the knowledge base.
10. A reinforcement learning-based multi-knowledge-base dynamic routing system for implementing the method of any one of claims 1 to 9, comprising: a feature engineering module, configured to acquire the static attribute features and dynamic state features of the multi-source knowledge bases and to construct a state feature vector through multi-modal feature fusion according to a user query request; a routing decision agent, which contains a routing decision model based on the improved upper confidence bound (UCB) algorithm and is configured to calculate the selection score of each knowledge base according to the state feature vector and to output a target knowledge base selection instruction; a retrieval execution module, configured to respond to the selection instruction, execute a retrieval operation in the designated target knowledge base, return candidate documents and generate a corresponding answer; and a feedback evaluation module, configured to calculate the reward value of the current round according to the retrieval result and the generated answer, and to feed the reward value back to the routing decision agent to complete the parameter update of the routing decision model.
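
The feature and feedback computations named in claims 2, 3, 5 and 8 can be sketched as follows. Because the claim formulas are not reproduced in this text, every concrete form here is an assumption: cosine similarity stands in for the semantic coverage measure, exp(-decay * mean_age) for the exponential decay model, and the reward adds the weighted quality scores while subtracting weighted latency and the penalty. The reciprocal rank fusion function uses the standard definition with the conventional constant k = 60. All function names, parameter names and default weights are hypothetical.

```python
import math
from collections import defaultdict


def centroid(vectors: list[list[float]]) -> list[float]:
    """Arithmetic mean of the document vectors along each feature dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def semantic_coverage(query_vec: list[float], kb_centroid: list[float]) -> float:
    """Cosine similarity between query vector and centroid (assumed measure)."""
    dot = sum(a * b for a, b in zip(query_vec, kb_centroid))
    norm = math.hypot(*query_vec) * math.hypot(*kb_centroid)
    return dot / norm if norm else 0.0


def timeliness(days_since_update: list[float], decay: float = 0.01) -> float:
    """Exponential decay over the mean age, in days, of the top-N candidates."""
    mean_age = sum(days_since_update) / len(days_since_update)
    return math.exp(-decay * mean_age)


def reward(
    overlap: float,
    latency_s: float,
    accuracy: float,
    penalty: float = 0.0,
    weights: tuple[float, float, float] = (0.5, 0.1, 0.4),
) -> float:
    """Weighted reward; the signs and default weights are assumptions."""
    w_overlap, w_latency, w_accuracy = weights
    return w_overlap * overlap - w_latency * latency_s + w_accuracy * accuracy - penalty


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Standard RRF: each ranking adds 1 / (k + rank) per document, rank from 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


if __name__ == "__main__":
    kb_vectors = [[1.0, 0.0], [3.0, 4.0]]
    print(semantic_coverage([1.0, 1.0], centroid(kb_vectors)))
    print(timeliness([10.0, 30.0, 50.0]))
    print(reciprocal_rank_fusion([["doc_a", "doc_b"], ["doc_b", "doc_c"]]))
```

Reciprocal rank fusion needs only each document's rank in each library's result list, so it can merge results whose raw scores are not comparable, whereas the cross-encoder alternative in claim 8 requires an additional model pass over every de-duplicated fragment.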

Description

Multi-knowledge-base dynamic routing method and system based on reinforcement learning

Technical Field

The invention relates to the technical field of artificial intelligence and natural language processing, in particular to a multi-knowledge-base dynamic routing method and system based on reinforcement learning.

Background

With the rapid development of large language model technology, retrieval-augmented generation (RAG) has become a mainstream technical scheme for mitigating model hallucination and supplementing private-domain knowledge. In actual enterprise-level application scenarios, knowledge data is typically dispersed among multiple heterogeneous knowledge bases, such as structured databases, unstructured operation and maintenance document bases, code repositories, meeting summary bases, financial statement bases, and the like. How to accurately and efficiently retrieve information related to user queries from these multi-source heterogeneous knowledge bases is a core challenge faced by current knowledge base question-answering systems. When an existing knowledge base question-answering system processes multi-source knowledge bases, one of the following two retrieval strategies is generally adopted.

The first is a full-library retrieval strategy, which does not distinguish the user's query intent and searches all connected knowledge bases directly in parallel.
Although this approach theoretically ensures the highest recall, it has serious drawbacks. First, it consumes a great deal of computing resources, and system response latency grows linearly with the number of knowledge bases, making real-time question answering difficult. Second, knowledge bases in different domains often contain homonyms, that is, terms with the same name but different meanings (for example, "apple" means entirely different things in a fruit supply chain library and an electronic equipment library), so full-library retrieval introduces a large number of cross-domain noise documents, which confuses the large model when it generates answers and seriously reduces answer accuracy.

The second is a routing strategy based on rules or a static classifier. This strategy typically selects the target knowledge base through predefined keyword matching rules or a trained supervised learning classifier (e.g., a BERT classifier). Although it narrows the retrieval range, it has clear limitations. First, the routing logic is statically fixed: it cannot perceive real-time updates to knowledge base content or newly added knowledge bases, so rules must be repeatedly modified by hand and models retrained, which makes maintenance costly and inflexible. For example, when documents in a specific field are added to a knowledge base, or when a knowledge base goes without updates for so long that its information becomes outdated, the static classifier cannot automatically perceive the change and route related questions accordingly, and it adapts poorly to newly added knowledge bases. Second, the traditional method cannot optimize the routing logic according to users' implicit feedback (such as likes and follow-up questions) or explicit feedback, so the system cannot continuously improve its decision quality from historical interactions.

Therefore, there is a need for a dynamic routing mechanism that can deeply understand the user's semantics, perceive the real-time state of the knowledge bases (e.g., content freshness and semantic coverage), and continuously optimize itself by interacting with the environment, so as to achieve accurate and efficient automatic knowledge base selection.

Disclosure of Invention

In view of the defects in the prior art, the invention provides a multi-knowledge-base dynamic routing method and system based on reinforcement learning. It addresses the following technical defects of the prior art in multi-knowledge-base retrieval scenarios: full-library retrieval consumes a large amount of resources; strong cross-domain noise interference lowers answer accuracy; static routing strategies cannot perceive real-time updates to knowledge base content and have no capacity for autonomous evolution; cold start and long-tail knowledge base coverage are handled poorly; and traditional routing that relies only on keyword matching leaves a semantic gap, so that answer accuracy and response efficiency cannot both be achieved.

According to an embodiment of the invention, a multi-knowledge-base dynamic routing method based on reinforcement learning comprises the following steps: defining a multi-source knowledge base set, and acquiring static attribute features and dynamic state features of each knowledge base; for a current user query request, constructing a state feature vector through multi-modal feature fusion, wherein t