CN-121787593-B - Uncertainty-driven segmented hybrid inference method for large language models
Abstract
The invention relates to an uncertainty-driven segmented hybrid inference method for large language models. The method acquires the current text data and historical state features and estimates an uncertainty index for the current segment; minimizes a unified scheduling objective function according to the uncertainty index to obtain a target inference mode; performs inference according to the target inference mode and computes the information contribution degree of each generated key-value pair; computes dynamic merge control probabilities from the uncertainty index and the information contribution degrees, and performs weighted merging or pruning on the key-value pairs to be merged to obtain compressed key-value pairs; defines a deviation measure and requires that it not exceed a preset upper bound determined by the set of dynamic merge control probabilities and the uncertainty index, triggering rollback processing when the bound is exceeded and otherwise feeding the compressed key-value-pair state back to the next segment; and iterates this loop until inference over all segments is complete, thereby reducing video memory occupation and inference latency while guaranteeing accuracy.
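Purely as an illustration of that control flow, the toy Python skeleton below walks one request through the per-segment loop. Every function body is a trivial stand-in invented so that the sketch runs end to end; none of it reproduces the patent's actual uncertainty estimators, scheduler or compression operators.

```python
def estimate_uncertainty(segment, state):
    # Stand-in: pretend the first segment is maximally uncertain.
    return 1.0 if not state else 0.5

def select_mode(u):
    # Stands in for minimising the unified scheduling objective (claim 1).
    return "full" if u >= 0.8 else "hybrid" if u >= 0.4 else "incremental"

def infer_and_compress(segment, u):
    kv = [(tok, u) for tok in segment]        # toy key-value pairs
    kept = kv[: max(1, int(len(kv) * u))]     # keep more pairs when uncertain
    deviation = 1.0 - len(kept) / len(kv)     # toy deviation measure
    return kept, deviation

def process_request(segments, dev_bound=0.6):
    state = []                                # compressed KV state fed forward
    for seg in segments:
        u = estimate_uncertainty(seg, state)
        mode = select_mode(u)
        kept, dev = infer_and_compress(seg, u)
        if dev > dev_bound:                   # rollback: soften compression
            kept = [(tok, u) for tok in seg]
        state.extend(kept)                    # feed the state to the next segment
        print(f"mode={mode:11s} u={u:.1f} kept={len(kept)}/{len(seg)} dev={dev:.2f}")
    return state

process_request([list("abcdef"), list("ghij"), list("kl")])
```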
Inventors
- Zhu Yi
- Su Tingjun
Assignees
- Xiamen University
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-03-03
Claims (7)
- 1. An uncertainty-driven segmented hybrid inference method for a large language model, characterized by comprising the following steps: acquiring the text data and historical state features of a current input request, and estimating an uncertainty index for the current segment from the historical state features, wherein the historical state features comprise the attention distribution, hidden-state statistical features and system running-state features of the previous inference time step, or a feature vector computed by a lightweight prior network; constructing a unified scheduling objective function comprising a memory occupation cost, an inference delay cost and an uncertainty risk, and minimizing the unified scheduling objective function according to the uncertainty index so as to select a target inference mode for the current segment from a preset inference mode set, wherein the inference mode set comprises full inference, hybrid inference and incremental inference; performing inference on the text data of the current segment according to the target inference mode, generating key-value pairs, and computing the information contribution degree of each generated key-value pair; computing the dynamic merge control probability of each key-value pair from the uncertainty index and the information contribution degree, and performing weighted merging or pruning on the key-value pairs to be merged according to a sampling or deterministic selection result to obtain compressed key-value pairs; defining a deviation measure between the compressed output distribution and the full-inference output distribution, and requiring that the deviation measure not exceed a preset upper bound determined by the set of dynamic merge control probabilities and the uncertainty index; triggering rollback processing to adjust the merging intensity and/or the scheduling weights when the deviation measure exceeds the preset upper bound, and otherwise feeding the compressed key-value-pair state back to the next segment for cyclic iteration until inference over all segments is complete; wherein the uncertainty index adopts one or more of attention entropy, logit entropy, top-k margin or self-consistency divergence; and wherein the unified scheduling objective function is given by $J(m) = \lambda_1\,\mathrm{VRAM}(m)/C_V + \lambda_2\,\mathrm{Latency}(m)/C_L + \lambda_3\, u \cdot \mathrm{Risk}(m)$, where $m$ denotes a candidate inference mode, $\mathrm{VRAM}(m)$ the memory footprint of executing mode $m$, $\mathrm{Latency}(m)$ the inference delay, $\mathrm{Risk}(m)$ a risk measure due to low accuracy or low context coverage, $C_V$ and $C_L$ normalization constants, $\lambda_1$, $\lambda_2$ and $\lambda_3$ weight coefficients, and $u$ the uncertainty index (a worked numerical sketch of this objective follows the claims).
- 2. The uncertainty-driven segmented hybrid inference method for a large language model according to claim 1, wherein the information contribution degree is obtained by accumulating the attention weights, gradient-approximated sensitivities or information gains of the $i$-th token over the current segment and subsequent segments, so as to characterize the degree of influence of the key-value pair on subsequent inference output.
- 3. The uncertainty-driven segmented hybrid inference method for a large language model according to claim 1, wherein the dynamic merge control probability is obtained according to the formula $p_i = \sigma\!\left(a \cdot c_i/(u + \varepsilon) + b\right)$, where $\sigma$ denotes the Sigmoid function, $c_i$ the information contribution degree, $u$ the uncertainty index, $\varepsilon$ a small positive number preventing the denominator from being zero, and $a$ and $b$ adjustable parameters whose signs and values set the monotonic relationship of $p_i$ with $c_i$ and $u$ (see the sketch following the claims).
- 4. The uncertainty-driven segmented hybrid inference method for a large language model according to claim 1, wherein the Kullback-Leibler divergence or the Jensen-Shannon divergence is adopted to measure the deviation between the compressed output distribution and the full-inference output distribution.
- 5. The uncertainty-driven segmented hybrid inference method for a large language model according to claim 1, wherein, when the deviation measure is detected to exceed the preset threshold, the weight $\lambda_3$ of the uncertainty risk term in the unified scheduling objective function is increased and an inference mode switch is triggered, so as to favor selection of the more robust full inference mode.
- 6. A computer-readable storage medium having stored thereon an uncertainty-driven segmented hybrid inference program for a large language model which, when executed by a processor, implements the uncertainty-driven segmented hybrid inference method for a large language model according to any one of claims 1-5.
- 7. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the uncertainty-driven segmented hybrid inference method for a large language model according to any one of claims 1-5.
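To make the two formulas recited above concrete, the following runnable sketch evaluates them on invented numbers. Everything here is an assumption standing in for details the published text omits: the per-mode cost table, the normalization constants $C_V$ and $C_L$, the weights $\lambda_1$-$\lambda_3$ and the parameters $a$ and $b$ were chosen only so the example behaves plausibly.

```python
import math

# Claim 1 (reconstructed): J(m) = l1*VRAM(m)/C_V + l2*Latency(m)/C_L + l3*u*Risk(m)
MODES = {
    #            VRAM(GB)  latency(s)  risk
    "full":        (24.0,   2.0,       0.02),
    "hybrid":      (12.0,   0.9,       0.25),
    "incremental": ( 6.0,   0.4,       0.60),
}
C_V, C_L = 24.0, 2.0          # assumed normalization constants
l1, l2, l3 = 0.2, 0.2, 0.6    # assumed weight coefficients

def objective(mode: str, u: float) -> float:
    vram, lat, risk = MODES[mode]
    return l1 * vram / C_V + l2 * lat / C_L + l3 * u * risk

def select_mode(u: float) -> str:
    """Pick the target inference mode by minimising the objective."""
    return min(MODES, key=lambda m: objective(m, u))

# Claim 3 (reconstructed): p_i = sigmoid(a * c_i / (u + eps) + b)
def merge_probability(c_i: float, u: float, a: float = 5.0,
                      b: float = -2.5, eps: float = 1e-6) -> float:
    """Dynamic merge control probability; the signs of a and b fix the monotonicity."""
    return 1.0 / (1.0 + math.exp(-(a * c_i / (u + eps) + b)))

for u in (0.1, 1.0, 2.0):     # low-, medium- and high-uncertainty segments
    print(f"u={u}: mode={select_mode(u)}, p_merge(c=0.4)={merge_probability(0.4, u):.3f}")
```

With these numbers, a low-uncertainty segment selects the cheap incremental mode and merges key-value pairs aggressively, while a high-uncertainty segment is pushed toward full inference with conservative merging, which matches the behavior claim 5 describes.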
Description
Uncertainty-Driven Segmented Hybrid Inference Method for Large Language Models

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to an uncertainty-driven segmented hybrid inference method for large language models, a computer-readable storage medium and a computer device.

Background

As the use of large language models in natural language processing and generative artificial intelligence continues to expand, model inference typically comprises a prompt "prefill" phase (full inference) and a "token-by-token generation" phase (incremental inference). In long-context scenarios, the prefill phase can significantly lengthen latency and raise memory pressure, thereby affecting inference throughput and service stability. In Transformer decoder models, the self-attention mechanism stores a Key-Value Cache (KV Cache) during inference; its size grows approximately linearly with the sequence length, making it the main video memory bottleneck of long-sequence inference.
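As a back-of-the-envelope illustration of that linear growth, the short sketch below estimates the KV cache footprint of a hypothetical decoder; the layer count, head configuration and fp16 precision are assumed figures, not taken from the patent.

```python
def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2,
                   batch: int = 1) -> int:
    """Keys + values: two [batch, seq_len, n_kv_heads, head_dim] tensors per layer."""
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * bytes_per_elem

# One fixed-length segment vs. the full request from the example below.
for n in (2_000, 8_000):
    print(f"{n:>5} tokens -> {kv_cache_bytes(n) / 2**30:.2f} GiB")
```

Quadrupling the sequence quadruples the cache, which is why segmenting a long request lowers the peak footprint.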
To alleviate the video memory bottleneck and the long waits of long-text inference, the prior art includes a class of segmented inference schemes based on a preset length: a long text is split into several segments of the preset length, and full, hybrid and/or incremental inference is performed on the segments in batches, so as to reduce peak video memory occupation and improve throughput. For example, in one common implementation, a longer request is divided into multiple segments of a fixed length (e.g., a request about 8000 tokens long is split into 4 segments of about 2000 tokens); the earlier segments are inferred first to produce a history result, and the subsequent segments are then combined with newly arrived request information for hybrid inference, so that short inputs are served in parallel as far as possible while the long input is being processed.

However, this "preset-length/fixed-rule" segmentation strategy has obvious disadvantages. First, it lacks content awareness: fixed segmentation ignores differences in the semantic density and logical dependencies of the input text, and easily breaks key context at logically complex or strongly dependent boundaries, inducing inference drift or hallucination. Second, it wastes resources: on low-difficulty, low-entropy segments the fixed strategy cannot further shrink the resource overhead, so there remains large room to improve inference efficiency. Third, it lacks closed-loop control: existing schemes focus on how to segment and how to arrange full and hybrid inference across segments, but do not bring the underlying KV cache compression or merging into a unified decision framework, so errors become uncontrollable when the two are superimposed.

Meanwhile, another line of prior art attempts to compress or merge the KV cache, for example by discarding the key-value pairs of some tokens based on a fixed threshold, a fixed ratio or a random mask, and then performing mean or weighted-mean merging to reduce the cache size. For KV cache compression, such prior art generally decides the retention or discarding of key-value pairs based only on local Attention scores: the attention of the current Query to each historical Key is computed, a fixed threshold is set, and KV pairs whose attention falls below the threshold are deleted directly. However, this compression strategy based on local attention has an essential shortcoming: it is "locally greedy" and ignores the global confidence (i.e., the uncertainty) of the model's generation. When the model is processing a highly ambiguous, difficult passage (i.e., is in a high-uncertainty state), even "edge information" with low attention scores can play a key error-correcting role in subsequent inference; mechanically applying aggressive compression by the attention threshold at such moments easily destroys semantic robustness and causes the model to hallucinate or suffer logical collapse. Conversely, when the model is highly confident (low uncertainty), the strategy is often too conservative to release video memory to the maximum extent.

Disclosure of Invention

The present invention aims to solve, at least to some extent, one of the technical problems in the above technology. The invention therefore provides an uncertainty-driven segmented hybrid inference method for large language models that incorporates segment inference mode selection and KV cache compression/merging into the same closed-loop scheduling framework, so that the system can dynamically adjust the inference mode and the compression strength according to the current uncertainty.
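To make the closed loop's error control concrete, here is a minimal runnable sketch of the deviation gate recited in claims 4 and 5, assuming the Kullback-Leibler divergence as the metric; the functional form of the upper bound and all numbers are invented for illustration, since the published text does not give them.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) between the full and compressed output distributions (claim 4)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def deviation_bound(merge_probs, u, base=0.02):
    """Invented upper bound: tighter when merging was aggressive and when the
    segment's uncertainty index u is high."""
    mean_merge = sum(merge_probs) / len(merge_probs)
    return base * (2.0 - mean_merge) / (1.0 + u)

full_dist       = [0.70, 0.20, 0.10]   # full-inference next-token distribution
compressed_dist = [0.60, 0.28, 0.12]   # after KV merging/pruning (made up)
merge_probs     = [0.9, 0.7, 0.3]      # dynamic merge control probabilities
u               = 0.8                  # uncertainty index of this segment

d = kl_divergence(full_dist, compressed_dist)
if d > deviation_bound(merge_probs, u):
    print(f"KL={d:.4f} exceeds the bound: roll back, soften merging / re-weight scheduling")
else:
    print(f"KL={d:.4f} within the bound: feed the compressed state to the next segment")
```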