
CN-121996396-A - MULTI-LoRA-oriented large-model memory optimal configuration system

CN 121996396 A

Abstract

A MULTI-LoRA-oriented large-model memory optimal configuration system comprises a tree-structure-based memory block manager and a benefit-based memory block scheduler. The memory block manager divides GPU memory and main memory uniformly into memory blocks of equal size that fill the entire memory space, and stores LoRA weights and KV cache at memory-block granularity. The memory block scheduler continuously evaluates the benefit of keeping a specific memory block in GPU memory or in main memory, and LoRA weights and KV cache are dynamically migrated according to these benefits. For multi-LoRA inference scenarios, the invention efficiently manages and schedules LoRA weights and KV cache across GPU memory and main memory, thereby improving overall resource utilization efficiency.

Inventors

  • Zhang Xing
  • Shi Jiuchen
  • Chen Quan
  • Guo Minyi

Assignees

  • Shanghai Jiao Tong University (上海交通大学)

Dates

Publication Date
2026-05-08
Application Date
2024-11-04

Claims (10)

  1. A MULTI-LoRA-oriented large-model memory optimal configuration system, characterized by comprising a tree-based memory block manager and a benefit-based memory block scheduler, wherein the memory block manager divides GPU memory and main memory uniformly into memory blocks of equal size that fill the entire memory space, and stores LoRA weights and KV cache at memory-block granularity; in the tree structure, the first layer holds the LoRA weights and each subtree is a prefix tree generated from the requests of that LoRA, each node in the tree corresponds to an actual memory block in the memory system and to the KV cache of one sequence block, and each path corresponds to a prefix of a request sequence.
  2. The MULTI-LoRA-oriented large-model memory optimal configuration system according to claim 1, wherein the memory block manager comprises an executor and a unified memory block tree; the executor divides an incoming user request into blocks, the unified memory block tree matches the block-division result layer by layer against the cached prefix history, returns the matching result to the executor and swaps the required history cache from main memory into GPU memory, the executor updates the request information according to the matching result and passes the updated request to the LLM inference engine for inference, the inference result and the new KV cache produced by the LLM inference engine are inserted into the unified memory block tree, and the unified memory block tree periodically collects usage statistics for each memory block and sends them to the load monitor in the memory block scheduler.
  3. The MULTI-LoRA-oriented large-model memory optimal configuration system according to claim 1, wherein the memory block scheduler comprises a load monitor and a benefit estimator; the load monitor decides, from the memory block status reported by the memory block manager, whether memory blocks need to be swapped in or out and the corresponding transfer volume, and the benefit estimator computes a final swap-in or swap-out plan from the transfer volume, the transfer cost and access frequency of each memory block, and the two-stage scheduling strategy, and returns the plan to the memory block manager.
  4. A large-model memory configuration optimization method based on the system of any one of claims 1 to 3, comprising: step one, initializing the memory management system, namely partitioning GPU memory and main memory into memory blocks of the unified block size, organizing them into a unified index system, constructing a unified memory block tree that manages the global memory blocks, and starting the memory block scheduler; step two, the memory block manager matches LoRA weights and historical KV cache against a request initiated by a user, swaps all required memory blocks into GPU memory, and returns the matched memory blocks to the upper layer for large-model inference; step three, when inference of the request finishes, the memory block manager inserts the newly generated KV cache into the unified memory block tree for later reuse; step four, continuously monitoring the load of GPU memory and main memory and scheduling dynamically according to the current memory state (see the scheduling sketch following the claims), specifically: a) when GPU memory usage is below the lower threshold low_threshold, CPU memory blocks are swapped in according to the two-stage memory management, namely LoRA weights are swapped in first, most recently used blocks first, until the number of LoRA weights in GPU memory reaches N_L; if GPU memory usage is still below low_threshold, every memory block is evaluated by the benefit estimator and the CPU memory blocks with the largest benefit are swapped in until low_threshold is reached; b) when GPU memory usage is above the upper threshold up_threshold, cache blocks are swapped out according to the two-stage memory management, namely memory blocks are swapped out of GPU memory, those with the smallest benefit first, until GPU memory usage falls to up_threshold; c) when main memory usage is above the upper threshold up_threshold, a destruction operation is started, namely the scheduler repeatedly selects the CPU block with the smallest benefit and destroys it until up_threshold is reached.
  5. The large-model memory configuration optimization method according to claim 4, wherein in step 1), after a new request is received, the tree-based memory block manager divides the request into blocks and uses the executor to query the unified memory block tree, i.e. the LoRA weight required by the request is first matched in the first layer of the tree, and once the weight is matched, the KV cache required by the request is matched layer by layer in turn until no further block can be matched; specifically, after a request arrives, it is first divided according to the preset memory block size and then matched against the memory block manager.
  6. The large-model memory configuration optimization method according to claim 4, wherein the layer-by-layer sequential matching proceeds as follows: the LoRA weight in the first layer is matched first, and if the LoRA weight resides in main memory, the corresponding memory block is put into the swap stream and the matching finishes, otherwise the match succeeds; the first memory block in the second layer is then matched, and if it does not match, the matching finishes, otherwise the state of the matched block is checked and, if the matched memory block resides in main memory, the block is put into the swap stream; all subsequent blocks are processed in the same way, repeating the process until no further memory block can be matched; through this matching process a request reuses as many cache blocks already computed in the system as possible, which improves memory utilization efficiency and reduces redundant computation; a request comprises the operation instruction sent by the user, the text to be processed and the name or serial number of the LoRA to be used, and the contiguous memory blocks that are matched are reused directly in the subsequent inference without recomputation.
  7. The large-model memory configuration optimization method according to claim 4, wherein in step 2), after inference of the request finishes, the tree-based memory block manager obtains the new KV cache computed by the LLM inference engine under its autoregressive inference mechanism and uses the executor to insert the newly generated KV cache blocks into the unified memory block tree layer by layer; specifically, the input and output are concatenated into a final sequence, prefix matching is performed layer by layer until a certain layer of the tree is reached, and all remaining memory blocks that cannot be matched are inserted into the tree with that layer as the root.
  8. The method of claim 4, wherein the benefit evaluation includes (i) the benefit of keeping a memory block in GPU memory compared with storing it in main memory, i.e. the GPU retention benefit, and (ii) the benefit of keeping a memory block in main memory compared with destroying it, i.e. the main-memory retention benefit; the GPU retention benefit of memory block i is cost_band_i × frequency_i × (1 − Sigmoid(Tlru_i)), wherein cost_band_i is the transfer overhead of memory block i between main memory and GPU memory, frequency_i is the access frequency of memory block i, Tlru_i is the most recent use time of memory block i, Sigmoid is the activation function commonly used in neural networks, and cost_band_i = M_i / Band_k, where M_i is the size of memory block i and Band_k is the transfer rate between GPU memory and main memory in the system; the main-memory retention benefit takes the same form with cost_band_i replaced by cost_recompute_i, where cost_recompute_i is the extra overhead of recomputing memory block i, specifically cost_recompute_i = T_prefill(context_i + len_i) − T_prefill(context_i), T_prefill(l) is the overhead of processing (prefilling) a sequence of length l, obtained by pre-sampling and modeling, context_i is the length of the prefix preceding memory block i, len_i is the length of memory block i, and the remaining variables have the same meanings as above (a worked sketch of these formulas follows the claims).
  9. The large-model memory configuration optimization method according to claim 4, wherein the two-stage memory management comprises: i) sampling the usage of LoRA weights over the most recent period and recording the average number of LoRAs used per inference round as N_L; N_L serves as the sampled lower bound on the number of LoRAs in GPU memory, and the number of LoRA weights in GPU memory should exceed N_L at all times, so that the number of LoRA weights does not become the bottleneck limiting throughput; ii) once N_L is satisfied and spare space remains in memory, the memory blocks in main memory are evaluated and swapped in uniformly, and the evaluation at this stage mainly considers the extra cost of swapping memory blocks in before inference, so as to reduce TTFT as much as possible; through this two-stage memory management, TTFT is reduced as much as possible and the related request metrics are improved while high throughput is maintained.
  10. The method of claim 4, wherein the periodic operation is performed at a predetermined monitoring interval; the monitoring interval is set by the inference service provider and is generally close to the latency of a single inference request, so as to rebalance memory in a timely and rapid manner.
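
The benefit formulas in claim 8 can be summarized in the following minimal Python sketch. It is an illustrative reading of the claim, not the patented implementation: the claim states the GPU retention formula explicitly, while the main-memory retention formula is assumed here to take the same form with the transfer cost replaced by the recomputation cost, as the claim's parallel wording suggests; all function and parameter names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used in the claim to weight the benefit by block recency."""
    return 1.0 / (1.0 + math.exp(-x))


def transfer_cost(block_size: float, bandwidth: float) -> float:
    """cost_band_i = M_i / Band_k: time to move block i between GPU and main memory."""
    return block_size / bandwidth


def recompute_cost(t_prefill, context_len: int, block_len: int) -> float:
    """cost_recompute_i = T_prefill(context_i + len_i) - T_prefill(context_i),
    where t_prefill is a latency model obtained by pre-sampling and fitting."""
    return t_prefill(context_len + block_len) - t_prefill(context_len)


def gpu_retention_benefit(cost_band: float, frequency: float, t_lru: float) -> float:
    """Benefit of keeping a block in GPU memory instead of main memory (claim 8)."""
    return cost_band * frequency * (1.0 - sigmoid(t_lru))


def mainmem_retention_benefit(cost_recomp: float, frequency: float, t_lru: float) -> float:
    """Benefit of keeping a block in main memory instead of destroying it.
    Assumed to mirror the GPU formula, with recomputation cost in place of transfer
    cost; the claim text does not spell this formula out."""
    return cost_recomp * frequency * (1.0 - sigmoid(t_lru))
```

The Sigmoid term scales the benefit by the block's recency statistic Tlru_i exactly as written in the claim; how Tlru_i is normalized is not specified there.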
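The threshold-driven scheduling of claims 4 and 9 can likewise be sketched as one monitoring round. This is a simplified illustration under assumed bookkeeping: Block, swap_in, swap_out and destroy are hypothetical helpers supplied by the memory block manager, and the usage values are fractions of capacity.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Per-block bookkeeping assumed to be reported by the memory block manager."""
    benefit: float        # retention benefit from the benefit estimator
    is_lora_weight: bool  # LoRA weight block (first tree layer) vs. KV-cache block
    last_used: float      # most recent use time, for the MRU-first LoRA stage


def schedule_once(gpu_blocks, cpu_blocks, gpu_usage, cpu_usage,
                  low_threshold, up_threshold, n_l,
                  swap_in, swap_out, destroy):
    """One monitoring round of the two-stage scheduler (claims 4 and 9).

    Usage values are fractions of capacity in [0, 1]; swap_in, swap_out and
    destroy move or free a block and return the updated usage."""
    # a) GPU memory under-used: bring blocks in from main memory.
    if gpu_usage < low_threshold:
        moved = set()
        # Stage 1: keep at least n_l LoRA weights resident, most recently used first.
        lora_in_gpu = sum(1 for b in gpu_blocks if b.is_lora_weight)
        lora_in_cpu = sorted((b for b in cpu_blocks if b.is_lora_weight),
                             key=lambda b: b.last_used, reverse=True)
        while lora_in_gpu < n_l and lora_in_cpu and gpu_usage < low_threshold:
            blk = lora_in_cpu.pop(0)
            gpu_usage = swap_in(blk)
            moved.add(id(blk))
            lora_in_gpu += 1
        # Stage 2: fill the remaining slack with the highest-benefit CPU blocks.
        for blk in sorted(cpu_blocks, key=lambda b: b.benefit, reverse=True):
            if gpu_usage >= low_threshold:
                break
            if id(blk) not in moved:
                gpu_usage = swap_in(blk)
    # b) GPU memory over-used: evict the lowest-benefit blocks to main memory.
    elif gpu_usage > up_threshold:
        for blk in sorted(gpu_blocks, key=lambda b: b.benefit):
            if gpu_usage <= up_threshold:
                break
            gpu_usage = swap_out(blk)
    # c) Main memory over-used: destroy the lowest-benefit CPU blocks.
    if cpu_usage > up_threshold:
        for blk in sorted(cpu_blocks, key=lambda b: b.benefit):
            if cpu_usage <= up_threshold:
                break
            cpu_usage = destroy(blk)
```

The description below suggests a low_threshold of around 70%; up_threshold and the monitoring interval are set by the inference service provider (claim 10).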

Description

MULTI-LoRA-oriented large-model memory optimal configuration system

Technical Field

The invention relates to a technology in the field of neural networks, in particular to a large-model memory optimal configuration system for multiple low-rank adapters (Multiple Low-Rank Adaptation, MULTI-LoRA).

Background

A large language model (Large Language Model, LLM) is an artificial-intelligence technology based on deep learning that is trained on large amounts of text data to understand and generate human language. The low-rank adapter (LoRA) is currently a popular LLM fine-tuning method: on top of the same pre-trained large model, adding LoRA modules allows the model to be fine-tuned for several specific tasks at the same time, and each task obtains its own LoRA module once fine-tuning completes. Existing multi-LoRA inference service technology has the following drawbacks: 1. it cannot support large-scale multi-LoRA inference services, because once a specific LoRA is fused with the original LLM no other LoRA inference can be served, which greatly limits the efficiency of each inference and affects the overall end-to-end inference performance; 2. during actual inference, LoRA weights and KV caches must contend for the limited GPU memory, and the two affect the inference process in different ways and therefore affect the final quality of service.

Disclosure of the Invention

Aiming at the problem of memory resource management and scheduling for large models in multi-LoRA inference scenarios in the prior art, the invention provides a MULTI-LoRA-oriented large-model memory optimal configuration system, which efficiently manages and schedules LoRA weights and KV cache across GPU memory and main memory in multi-LoRA inference scenarios, thereby improving overall resource utilization efficiency. The invention is realized by the following technical scheme.

The invention relates to a MULTI-LoRA-oriented large-model memory optimal configuration system comprising a tree-structure-based memory block manager and a benefit-based memory block scheduler, wherein the memory block manager divides GPU memory and main memory uniformly into memory blocks of equal size that fill the entire memory space, stores LoRA weights and KV cache at memory-block granularity, and dynamically migrates LoRA weights and KV cache according to benefits by continuously evaluating the benefit of keeping a specific memory block in GPU memory or in main memory.

The invention further relates to a large-model memory configuration optimization method based on the system, comprising the following steps.

Step one, initializing the memory management system, namely partitioning GPU memory and main memory into memory blocks of the unified block size, organizing them into a unified index system, constructing a unified memory block tree that manages the global memory blocks, and starting the memory block scheduler. A sketch of this unified memory block tree and its layer-by-layer matching is given below.
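
The following minimal Python sketch illustrates the unified memory block tree constructed in step one and queried in the later steps, following the structure given in claims 1, 5, 6 and 7: the first layer holds one node per LoRA adapter, each subtree is a prefix tree of fixed-size token blocks, matching walks layer by layer until a block misses, and new KV-cache blocks are inserted under the deepest matched node. Class and method names are illustrative assumptions, not the patented implementation.

```python
from dataclasses import dataclass, field


@dataclass
class BlockNode:
    """One node of the unified memory block tree: a fixed-size memory block holding a
    LoRA weight (first layer) or the KV cache of one token block (deeper layers)."""
    key: tuple
    in_gpu: bool = True                       # False if the block lives in main memory
    children: dict = field(default_factory=dict)


class UnifiedBlockTree:
    def __init__(self, block_size: int):
        self.block_size = block_size          # unified memory block size (in tokens)
        self.root = BlockNode(key=())

    def _split(self, tokens):
        """Divide a request's token sequence into fixed-size block keys."""
        return [tuple(tokens[i:i + self.block_size])
                for i in range(0, len(tokens), self.block_size)]

    def match(self, lora_name, tokens):
        """Layer-by-layer prefix matching (claims 5 and 6): returns the matched nodes
        and those among them that must first be swapped in from main memory."""
        matched, to_swap_in = [], []
        node = self.root.children.get(lora_name)        # layer 1: LoRA weight
        if node is None:
            return matched, to_swap_in
        matched.append(node)
        if not node.in_gpu:
            to_swap_in.append(node)
        for key in self._split(tokens):                 # deeper layers: KV-cache blocks
            node = node.children.get(key)
            if node is None:
                break                                   # first miss ends the matching
            matched.append(node)
            if not node.in_gpu:
                to_swap_in.append(node)
        return matched, to_swap_in

    def insert(self, lora_name, tokens):
        """Insert the blocks of a finished request (input spliced with output) layer by
        layer, reusing existing prefix nodes and adding the unmatched tail (claim 7)."""
        node = self.root.children.setdefault(lora_name, BlockNode(key=(lora_name,)))
        for key in self._split(tokens):
            node = node.children.setdefault(key, BlockNode(key=key))
```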
Step two, the memory block manager matches LoRA weights and historical KV cache against a request initiated by a user, swaps all required memory blocks into GPU memory, and returns the matched memory blocks to the upper layer for large-model inference. Step three, when inference of the request finishes, the memory block manager inserts the newly generated KV cache into the unified memory block tree for later reuse. Step four, continuously monitoring the load of GPU memory and main memory and scheduling dynamically according to the current memory state, specifically: a) when GPU memory usage is below the lower threshold low_threshold, CPU memory blocks are swapped in according to the two-stage memory management, namely LoRA weights are swapped in first, most recently used blocks first, until the number of LoRA weights in GPU memory reaches N_L; if GPU memory usage is still below low_threshold, all memory blocks are evaluated by the benefit estimator and the CPU memory blocks with the largest benefit are swapped in until low_threshold is reached; the lower threshold low_threshold is typically set to 70%. b) When GPU memory usage is above the upper threshold up_threshold, cache blocks are swapped out according to the two-stage memory management; specifically, memory blocks are swapped out of GPU memory, those with the smallest benefit first, until GPU memory usage falls to up_threshold. However, when during swap-out the number of LoRA weights is found to be no greater than N_L, the scheduler does not