
CN-121560567-B - LoRA fine-tuned dynamic computing power resource allocation method

CN121560567B

Abstract

The invention relates to the field of resource optimization, and in particular to a method for dynamically allocating computing power resources for LoRA fine-tuning. The method acquires the video memory state of a computing cluster and the load data stream of its computing units, analyzes historical time-series data to obtain prior features characterizing the degree of video memory fragmentation and the resource load level, and sets bottleneck labels for the various computing tasks according to these prior features. During task execution, scheduling is differentiated by label type and real-time resource features: for video memory bottleneck tasks, the real-time video memory state features are monitored and the task is held back until those features fall to a preset threshold, after which it is executed; for computation bottleneck tasks, the real-time load state features are monitored and the task is started once the load satisfies the condition. The invention achieves perception of and response to the cluster's dynamic resource state and targeted scheduling by task bottleneck type, alleviates the blocking and idling caused by video memory fragmentation and uneven resource distribution, and improves resource utilization balance and the overall execution efficiency of LoRA fine-tuning tasks.
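The dispatch rule summarized above can be sketched as a simple gate: memory-bottleneck tasks wait until the real-time memory feature falls to an unlock threshold, compute-bottleneck tasks wait until the real-time load feature falls to a computation threshold. The function name and parameters below are illustrative assumptions, not part of the patent text:

```python
def may_start(task_label, mem_feature, load_feature,
              unlock_threshold, compute_threshold):
    """Gate a suspended task per its bottleneck label.

    A memory-bottleneck task starts once the real-time video memory
    state feature has fallen to its unlock threshold; a
    compute-bottleneck task starts once the real-time load state
    feature has fallen to the computation threshold.
    """
    if task_label == "memory":
        return mem_feature <= unlock_threshold
    return load_feature <= compute_threshold
```

In practice the thresholds would be derived per task (the patent ties the unlock threshold to the task's expected video memory footprint), and the features would be refreshed from the cluster's real-time data streams between checks.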

Inventors

  • ZHANG MINGYUE

Assignees

  • 北京基流科技股份有限公司

Dates

Publication Date
2026-05-08
Application Date
2026-01-20

Claims (5)

  1. A method for dynamically allocating computing power resources for LoRA fine-tuning, comprising:
     acquiring a video memory state data stream and a computing unit load data stream recorded while various computing tasks were executed during a historical operation period of a computing cluster;
     parsing prior video memory state features from the video memory state data stream to characterize the fragmentation degree of the cluster's video memory space, and parsing prior load state features from the computing unit load data stream to characterize the load level of the cluster's computing resources;
     setting bottleneck labels for the various computing tasks based on the prior video memory state features and the prior load state features, including: determining, from the historical operation data, the relative magnitude of the prior video memory state feature and the prior load state feature for each type of computing task; if the prior video memory state feature value for a task type is smaller than the corresponding prior load state feature value, setting a video memory bottleneck label for that task type; if the prior load state feature value is smaller than or equal to the corresponding prior video memory state feature value, setting a computation bottleneck label for that task type;
     receiving and parsing a queue of pending computing tasks, and extracting the resource demand features of each computing task;
     performing resource optimization during queue execution based on the prior resource demand features of each computing task and its bottleneck label, including: temporarily suspending the current task, monitoring the real-time video memory state features of the cluster, and executing the current task once those features fall to a preset threshold; and determining whether to execute the current task based on the real-time load state features of the cluster; wherein the preset threshold is determined from the video memory state features corresponding to the computing task;
     wherein the video memory state data stream comprises time-series data of the number of nodes holding large contiguous free video memory, and time-series data of the inter-node video memory access delay between such nodes;
     wherein parsing the prior video memory state features comprises: determining the average number of free nodes within a preset statistical period from the node-count time series; determining the average access delay within the preset statistical period from the inter-node access delay time series; taking the ratio of the free-node statistic to the total number of cluster computing nodes as a first video memory parameter; taking the ratio of a preset delay standard value to the average access delay as a second video memory parameter; and taking the weighted sum of the first and second video memory parameters as the prior video memory state feature;
     and wherein the suspending step comprises: in response to the current task carrying a video memory bottleneck label, suspending its execution, acquiring the cluster's real-time video memory state data stream, and parsing the real-time video memory state features from it; if the real-time video memory state features are smaller than or equal to a preset unlock threshold, allocating video memory resources to the current task and starting its execution; wherein the preset unlock threshold is determined from the expected video memory footprint of the current task.
  2. The method of claim 1, wherein the computing unit load data stream comprises: time-series data of the GPU computing core utilization of each computing node in the computing cluster, and time-series data of the number of available computing cores.
  3. The method of claim 2, wherein parsing the prior load state features from the computing unit load data stream comprises: determining the average computing core utilization within a preset statistical period from the GPU computing core utilization time series; determining the average number of available cores within the preset statistical period from the available-core-count time series; taking the average computing core utilization as a first load parameter; taking the ratio of the average number of available cores to the total number of cluster computing cores as a second load parameter; and taking the weighted sum of the first and second load parameters as the prior load state feature.
  4. The method of claim 1, wherein extracting the resource demand features of each computing task comprises: parsing the configuration parameters of the computing task to obtain its expected video memory footprint and expected computing core occupancy, and taking these as the resource demand features of the computing task.
  5. The method of claim 1, wherein determining whether to execute the current task based on the real-time load state features comprises: in response to the current task carrying a computation bottleneck label, acquiring the cluster's real-time computing unit load data stream, and parsing the real-time load state features from it; determining a preset computation threshold from a preset mapping between the expected computing core occupancy and the real-time load state features; and executing the current task if the real-time load state features are smaller than or equal to the preset computation threshold.
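Read together, claims 1 and 3 define both prior features as weighted sums of two normalized parameters, and claim 1's labeling rule is a simple comparison between them. A minimal Python sketch follows; the function names, the equal weights, and all numeric values are illustrative assumptions, as the patent fixes no concrete values:

```python
def prior_memory_feature(free_node_counts, access_delays,
                         total_nodes, delay_standard,
                         w1=0.5, w2=0.5):
    """Prior video memory state feature (claim 1): weighted sum of
    (avg free-node count / total nodes) and
    (preset delay standard / avg inter-node access delay)."""
    free_avg = sum(free_node_counts) / len(free_node_counts)
    delay_avg = sum(access_delays) / len(access_delays)
    p1 = free_avg / total_nodes        # first video memory parameter
    p2 = delay_standard / delay_avg    # second video memory parameter
    return w1 * p1 + w2 * p2

def prior_load_feature(core_utilizations, available_cores,
                       total_cores, w1=0.5, w2=0.5):
    """Prior load state feature (claim 3): weighted sum of the average
    core utilization and (avg available cores / total cores)."""
    util_avg = sum(core_utilizations) / len(core_utilizations)
    avail_avg = sum(available_cores) / len(available_cores)
    return w1 * util_avg + w2 * (avail_avg / total_cores)

def bottleneck_label(mem_feature, load_feature):
    """Claim 1: video memory bottleneck if the memory feature is
    strictly smaller than the load feature; otherwise (load feature
    smaller than or equal) a computation bottleneck."""
    return "memory" if mem_feature < load_feature else "compute"
```

Note that the tie case (equal feature values) falls to the computation bottleneck label, matching the "smaller than or equal" clause of claim 1.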

Description

LoRA fine-tuned dynamic computing power resource allocation method

Technical Field

The invention relates to the field of resource optimization, and in particular to a method for dynamically allocating computing power resources for LoRA fine-tuning.

Background

As deep learning models continue to grow in scale, parameter-efficient fine-tuning techniques represented by LoRA (Low-Rank Adaptation) are widely applied on large computing clusters to run massive numbers of model adaptation tasks. In such distributed training scenarios, dynamic scheduling of computing tasks and efficient allocation of cluster resources are key to guaranteeing overall throughput and hardware utilization. LoRA fine-tuning tasks differ significantly in video memory demand, computation intensity, and multi-card communication overhead, and task execution dynamically produces video memory fragmentation and load imbalance across nodes. Nevertheless, common cluster task schedulers mostly adopt static allocation strategies based on first-come-first-served order, fixed priorities, or simple resource reservation: they generally schedule once, according only to a task's preset resource request, and lack continuous perception of, and dynamic response to, the cluster's real-time resource state. How to dynamically adapt the resource allocation policy to changes in the cluster's resource state, and to schedule differentially for different task bottleneck types, is therefore a key problem for improving the execution efficiency of LoRA fine-tuning tasks and the utilization efficiency of cluster resources.
CN117785482A discloses a computing power scheduling system and method for a computing power network. The system performs predictive analysis on the data requests of a target data source over a preset future period to obtain a trend prediction index; upon receiving a data request from the target data source, it sends the request to a computing power resource scheduling module, which computes a scheduling allocation from the trend prediction index, the data request, and the specific state of the computing power network module; the network module then allocates computing power and performs the computation according to the result. By bringing the trend prediction index, derived from historical request patterns, into the allocation process, and by considering the computing power network equipment and the specific category of each data request, it improves the efficiency and capability of computing power resource allocation. However, the prior art has the following problems:

1. When scheduling LoRA fine-tuning tasks, resources are generally allocated once, according only to the static resource demand declared by the task. Changes in the video memory state caused by dynamic release and fragmentation accumulation during cluster operation are not continuously perceived, so large amounts of video memory cannot be used effectively due to fragmentation; memory-intensive tasks are often blocked for long periods waiting for large contiguous video memory, and the cluster's overall task throughput is reduced.

2. There is no mechanism for recognizing a computing task's resource bottleneck type and scheduling differentially. Existing scheduling methods often apply the same resource check and execution strategy to all tasks and cannot distinguish memory-sensitive tasks from compute-sensitive tasks, so the effective supply of the bottleneck resource cannot be prioritized when resources are tight; computing resources and video memory resources are used unevenly, with some resources idle while others are overloaded.

3. The influence of the topology of resource distribution in multi-node environments on task execution efficiency is not fully considered. Cross-node resource states such as inter-node video memory access delay and communication bandwidth occupancy are not included among the scheduling decision factors, so a task may be scheduled onto a node combination with sufficient local video memory but high communication overhead, increasing task completion time and reducing the efficiency of multi-card parallel execution.