
CN-122019199-A - One-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management

CN 122019199 A

Abstract

The invention discloses a one-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management, in the technical field of artificial intelligence. The platform comprises: a heterogeneous computing-power resource pool containing a plurality of inference nodes, each of which generates node-level computing-power state summary records; a computing-power scheduling control node, which collects the node-level summary records of all inference nodes into a global computing-power state table, and computes an instant reward signal as the difference between a scheduling-throughput component and a penalty component based on the queue depths of resource-constraint virtual queues; and an adaptive conformal-prediction confidence checking module, which, according to whether the confidence-interval radius exceeds a preset radius threshold, either adopts the task allocation scheme or executes deterministic fallback scheduling based on the global computing-power state table. The platform effectively alleviates the memory-bandwidth bottleneck in ultra-long-context inference scenarios and improves the adaptive responsiveness of scheduling decisions to real-time load changes and environmental drift.

Inventors

  • Liu Deyong
  • Chen Yanzhang
  • Wang Bo
  • Hao Juntang
  • Bu Caixia
  • Hou Yanhai
  • Yang Yichao
  • Yu Yimin
  • Su Wenhui

Assignees

  • 山东环球软件股份有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-04-15

Claims (10)

  1. A one-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management, characterized by comprising: a heterogeneous computing-power resource pool comprising a plurality of inference nodes, wherein each inference node is provided with a main computing chip and a high-bandwidth memory, an in-memory computing unit is deployed in each memory die of the high-bandwidth memory, and the in-memory computing unit is used to execute near-data attention computation for large-model inference and to periodically collect local computing-power state indicators; a computing-power scheduling control node, used to collect the node-level computing-power state summary records of all inference nodes to obtain a global computing-power state table, to maintain a resource-constraint virtual queue for each inference node, and to deploy a scheduling agent comprising a policy network, a value network and a throughput-prediction network, wherein the policy network receives a scheduling state vector and outputs a task allocation scheme, and an instant reward signal is derived from the scheduling outcome and the resource-constraint virtual queues; and an adaptive conformal-prediction confidence checking module, running in the computing-power scheduling control node, which maintains a sliding calibration-sample buffer, performs adaptive conformal prediction based on the absolute difference between the total number of inference tasks actually completed in each scheduling cycle and the number of completed tasks predicted by the throughput-prediction network to obtain a confidence-interval radius, and, according to whether the confidence-interval radius exceeds a preset radius threshold, either adopts the task allocation scheme or executes deterministic fallback scheduling based on the global computing-power state table.
  2. The one-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management according to claim 1, wherein the in-memory computing unit comprises one dot-product operation array, one exponential-approximation function unit implemented with a piecewise-linear lookup table, and one group of local feature registers.
  3. The one-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management according to claim 2, wherein near-data attention computation is carried out in three stages: in stage 1, the dot-product operation array of each in-memory computing unit reads the key-vector fragments resident in its local memory bank, multiplies the query vector element-wise with each key vector and accumulates along the vector dimension to obtain a local attention-score sequence, traverses the local attention-score sequence to take its maximum as a local extremum scalar, and returns the local extremum scalar to the main computing chip; the main computing chip takes the maximum of all local extremum scalars as a global extremum scalar and broadcasts it to all in-memory computing units; in stage 2, each in-memory computing unit subtracts the global extremum scalar from each score in its local attention-score sequence to obtain an offset-score sequence, applies the exponential-approximation function unit to the offset-score sequence to obtain a local exponential-value sequence, sums that sequence to obtain a local exponential-accumulation scalar, multiplies each exponential value element-wise with the value vector at the corresponding position in the local memory bank and accumulates to obtain a local weighted-value vector, and returns the local exponential-accumulation scalar and the local weighted-value vector to the main computing chip; in stage 3, the main computing chip accumulates all local weighted-value vectors element-wise to obtain a global weighted-value vector, sums all local exponential-accumulation scalars to obtain a global exponential-accumulation scalar, and divides each element of the global weighted-value vector by the global exponential-accumulation scalar to obtain the final output vector of the current attention head.
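The three stages of claim 3 are the bank-partitioned form of a numerically stable softmax attention: the global-max exchange in stage 1 prevents exponential overflow, and the per-bank exponential sums and weighted values in stage 2 let the host chip finish with a single normalization. A minimal NumPy sketch (all function and variable names here are illustrative, not taken from the patent; real hardware would replace `np.exp` with the piecewise-linear lookup table of claim 2):

```python
import numpy as np

def near_data_attention(q, k_banks, v_banks):
    """Three-stage near-data attention as described in claim 3.
    q: query vector of shape (d,); k_banks / v_banks: per-memory-bank
    key and value matrices, each of shape (n_i, d)."""
    # Stage 1: each bank computes local scores and a local extremum scalar;
    # the host chip reduces them to a global extremum and broadcasts it.
    local_scores = [kb @ q for kb in k_banks]          # dot-product array
    global_max = max(s.max() for s in local_scores)
    # Stage 2: offset scores, local exponential sums and weighted values.
    exp_sums, weighted = [], []
    for s, vb in zip(local_scores, v_banks):
        e = np.exp(s - global_max)                     # exp via LUT in hardware
        exp_sums.append(e.sum())
        weighted.append(e @ vb)
    # Stage 3: host chip accumulates and normalizes.
    return sum(weighted) / sum(exp_sums)
```

Because the reduction is exact, partitioning the key/value cache across banks changes nothing in the result; only the raw key and value data stay resident in memory, and each bank returns two small quantities per head.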
  4. The one-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management according to claim 1, wherein the local computing-power state indicators comprise three items: item 1 is a memory-row activation count, i.e. the accumulated number of memory-row activation operations triggered by the in-memory computing unit in the current collection period; item 2 is a compute-channel occupancy ratio, i.e. the ratio of the number of clock cycles each compute channel of the in-memory computing unit spends in an active state during the current collection period to the total number of clock cycles in that period; item 3 is an interface transfer byte count, i.e. the total number of bytes the in-memory computing unit sends to the main computing chip over the high-bandwidth memory interface in the current collection period; and a node-level computing-power state summary record is generated by taking the arithmetic mean of each of the three indicator values over all in-memory computing units under an inference node.
  5. The one-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management according to claim 1, wherein the resource-constraint virtual queues comprise a GPU-memory occupancy virtual queue and a response-latency virtual queue; the GPU-memory occupancy virtual queue is updated by dividing the actual GPU-memory occupancy of an inference node in the current scheduling cycle by the platform's preset GPU-memory capacity threshold to obtain a normalized occupancy rate, and subtracting 1 from the normalized occupancy rate to obtain a difference; when the difference is greater than 0, it is added to the current queue depth of the GPU-memory occupancy virtual queue, and when it is not greater than 0, its absolute value is subtracted from the current queue depth, with the lower bound truncated at 0; the response-latency virtual queue is updated in the same manner, using the 95th-percentile response latency of the inference node in the current scheduling cycle divided by the platform's preset response-latency upper limit.
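The update rule of claim 5 is the standard Lyapunov virtual-queue construction from drift-plus-penalty scheduling: the queue grows when normalized usage exceeds 1, drains when it falls below, and a persistently positive depth signals a long-term constraint violation. A one-function sketch (names are illustrative):

```python
def update_virtual_queue(depth, measured, threshold):
    """Virtual-queue update per claim 5.
    depth: current queue depth; measured: this cycle's actual GPU-memory
    occupancy (or p95 latency); threshold: the platform's preset limit.
    Adding the signed overshoot and clamping at 0 is equivalent to the
    add-when-positive / subtract-absolute-value-when-not rule in the claim."""
    excess = measured / threshold - 1.0   # normalized overshoot
    return max(depth + excess, 0.0)
```

The same function serves both queues; only the measurement and threshold differ.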
  6. The one-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management according to claim 1, wherein the scheduling state vector is a fixed-dimension vector constructed by: arranging the node-level computing-power state summary records of all inference nodes in the global computing-power state table in node-number order; sorting the current task queue to be scheduled in descending order of context-token count and taking the context-token count and concurrent-request count of the first M tasks, with vacant positions filled with 0 when there are fewer than M tasks; and concatenating the current queue depths of the resource-constraint virtual queues of all inference nodes, where M is the platform's preset upper limit on the number of task slots.
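Claim 6's construction can be sketched directly; the zero-padding to M slots is what keeps the policy network's input dimension fixed regardless of queue length (names and tuple layouts below are illustrative):

```python
import numpy as np

def build_state_vector(node_summaries, tasks, queue_depths, M):
    """Fixed-dimension scheduling state vector per claim 6.
    node_summaries: per-node 3-tuples of summary indicators, in node-number
    order; tasks: (context_tokens, concurrent_requests) pairs for the queue
    to be scheduled; queue_depths: virtual-queue depths of all nodes;
    M: the platform's preset task-slot limit."""
    top = sorted(tasks, key=lambda t: t[0], reverse=True)[:M]
    task_feats = [x for t in top for x in t]
    task_feats += [0.0] * (2 * M - len(task_feats))   # zero-fill empty slots
    node_feats = [x for s in node_summaries for x in s]
    return np.array(node_feats + task_feats + list(queue_depths))
```

With N nodes, Q virtual queues in total and M task slots, the vector always has 3N + 2M + Q entries.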
  7. The one-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management according to claim 1, wherein the policy network receives the scheduling state vector and outputs, for each task to be scheduled, a probability distribution over the inference nodes; the computing-power scheduling control node performs one random sampling from that distribution for each task to determine its target inference node, and a task allocation scheme is formed once all tasks have been sampled; the scheduling-throughput component is the total number of inference tasks actually completed in the current scheduling cycle; the penalty component is the sum, over all inference nodes, of the current queue depths of their resource-constraint virtual queues, multiplied by a fixed adjustment coefficient; the instant reward signal is the scheduling-throughput component minus the penalty component; and the scheduling agent performs gradient updates on the policy network and the value network according to the clipped update rule of the proximal policy optimization algorithm.
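The reward of claim 7 and the clipped PPO objective it names can be sketched in a few lines (the coefficient name `beta` and the clipping range `eps` are illustrative, not specified by the patent):

```python
import numpy as np

def instant_reward(completed, queue_depths, beta):
    """Instant reward per claim 7: the scheduling-throughput component
    (tasks completed this cycle) minus beta times the summed virtual-queue
    depths, where beta is the fixed adjustment coefficient."""
    return completed - beta * sum(queue_depths)

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate loss of proximal policy optimization, the update
    rule named in claim 7 (negated because optimizers minimize).
    ratio: pi_new(a|s) / pi_old(a|s); advantage: estimated advantage."""
    return -np.minimum(ratio * advantage,
                       np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)
```

Subtracting the queue penalty couples throughput maximization to the long-term resource constraints of claim 5: depths stay near 0 while constraints hold, so the penalty only bites when GPU memory or latency budgets are persistently violated.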
  8. The one-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management according to claim 1, wherein the calibration-sample sliding buffer has a fixed capacity N; after each scheduling cycle completes, the absolute difference between the total number of actually completed inference tasks and the number of completed tasks predicted by the throughput-prediction network is written to the tail of the buffer as a nonconformity score, and the oldest record is evicted; adaptive conformal prediction is executed by assigning exponential-decay coefficients to the N nonconformity scores in the buffer in time order so that recent samples receive larger coefficients, sorting all nonconformity scores in ascending order, accumulating the corresponding exponential-decay coefficients one by one along that order until the accumulated sum first reaches or exceeds the product of a preset confidence level and the sum of all exponential-decay coefficients, taking the nonconformity score at that point as the conformal-prediction critical value, and taking the conformal-prediction critical value as the confidence-interval radius of the current scheduling cycle.
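The procedure of claim 8 is a time-weighted quantile of the nonconformity scores: the exponential decay lets the calibration set track distribution drift. A sketch (the default `confidence` and `decay` values are illustrative assumptions, not from the patent):

```python
import numpy as np

def conformal_radius(residuals, confidence=0.9, decay=0.95):
    """Time-weighted adaptive conformal critical value per claim 8.
    residuals: buffer of nonconformity scores (absolute prediction errors),
    oldest first; recent samples receive larger exponential-decay weights."""
    residuals = np.asarray(residuals, dtype=float)
    n = len(residuals)
    weights = decay ** np.arange(n - 1, -1, -1)   # newest sample weighs 1.0
    order = np.argsort(residuals)                 # ascending score order
    cum = np.cumsum(weights[order])               # accumulate weights in that order
    target = confidence * weights.sum()
    idx = np.searchsorted(cum, target)            # first point where cum >= target
    return residuals[order][min(idx, n - 1)]
```

With `decay=1.0` the weights are uniform and this reduces to the ordinary split-conformal quantile; decay below 1 shifts mass toward recent cycles.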
  9. The one-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management according to claim 1, wherein deterministic fallback scheduling is executed by: reading, from the global computing-power state table, the mean memory-row activation count in the node-level computing-power state summary record of each inference node as that node's current load; sorting the tasks to be scheduled in descending order of context-token count; assigning each task in turn to the inference node with the smallest current load; and, after each assignment, adding the context-token count of the assigned task to the receiving node's current load, until all tasks are assigned and then issued for execution.
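Claim 9 is the classic longest-processing-time greedy for load balancing, with hardware-reported row activations seeding the load estimate; being fully deterministic, it is a safe substitute when the learned policy's confidence interval is too wide. A sketch (task and node identifiers are illustrative):

```python
def fallback_schedule(tasks, node_loads):
    """Deterministic fallback scheduling per claim 9.
    tasks: (task_id, context_tokens) pairs; node_loads: dict mapping each
    node to its current load, seeded from the mean memory-row activation
    count in the global state table. Returns task_id -> node assignments."""
    plan = {}
    loads = dict(node_loads)                      # do not mutate caller state
    for task_id, tokens in sorted(tasks, key=lambda t: t[1], reverse=True):
        node = min(loads, key=loads.get)          # least-loaded node first
        plan[task_id] = node
        loads[node] += tokens                     # bump load by task size
    return plan
```

Placing the largest-context tasks first keeps the final load spread tight, since small tasks fill the remaining gaps.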
  10. The one-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management according to claim 1, wherein, when the confidence-interval radius exceeds the preset radius threshold, the computing-power scheduling control node marks the current scheduling cycle as a low-confidence cycle; in the training batch collected under the scheduling agent's current policy version, the state-transition samples generated by low-confidence cycles are multiplied by an increased loss-scaling factor when the policy gradient is computed, so that the policy-adjustment magnitude for high-uncertainty state regions is increased in the current batch gradient update of the policy network.

Description

One-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to industrial Internet platforms, and specifically to a one-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management.

Background

With the continuing growth of large-model inference services, efficient resource scheduling and operation and maintenance management in heterogeneous computing clusters has become a core concern in the industry. Existing large-model inference cluster scheduling schemes fall broadly into three categories. The first is static rule-based scheduling, including round-robin, weighted round-robin and consistent hashing; these schemes are simple and cheap to implement but cannot sense the real-time load of each inference node, and under load fluctuation or unbalanced node performance they easily leave some nodes overloaded while others sit idle. The second is dynamic scheduling based on reinforcement learning, in which a policy network generates task allocation decisions from the real-time cluster state, giving the system adaptive adjustment capability. However, existing reinforcement-learning schedulers generally treat scheduling-throughput maximization as the single optimization objective and lack a long-term stability guarantee for resource constraints such as GPU-memory occupancy and response latency, so the policy network frequently triggers resource overruns while pursuing short-term throughput, and system stability suffers.
The third is threshold-rule-based operation and maintenance monitoring, which triggers scaling or task migration when a resource indicator exceeds a preset threshold. This is a passive response mechanism: the credibility of a scheduling decision cannot be judged in advance, and the risk of a scheduling plan cannot be quantitatively evaluated before it is issued and executed. In addition, at the hardware level, existing large-model inference architectures generally rely on the main computing chip to execute all attention computation, so the key-value cache must be transferred from the high-bandwidth memory to the main computing chip to participate in the operation, and the transfer bandwidth of the memory interface becomes the main bottleneck of inference latency in ultra-long-context scenarios. Although academic research on in-memory computing has shown the potential of sinking operations to the memory side to reduce data movement, existing schemes have not formed a closed loop between the hardware-level computing-power state indicators generated by in-memory computing units and the upper-layer scheduling decision system, so software-level scheduling policies cannot effectively exploit the hardware level's real-time load-sensing capability. In summary, the prior art still has shortcomings in computing-power perception precision, long-term stability guarantees for resource constraints, and confidence quantification of scheduling decisions.
Disclosure of Invention

The invention aims to provide a one-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management, which can effectively alleviate the memory-bandwidth bottleneck in ultra-long-context inference scenarios and improve the adaptive responsiveness of scheduling decisions to real-time load changes and environmental drift. To solve the above technical problems, the invention provides a one-stop large-model agent development, operation and maintenance platform integrating computing-power scheduling and model management, comprising: a heterogeneous computing-power resource pool comprising a plurality of inference nodes, wherein each inference node is provided with a main computing chip and a high-bandwidth memory, an in-memory computing unit is deployed in each memory die of the high-bandwidth memory, and the in-memory computing unit is used to execute near-data attention computation for large-model inference and to periodically collect local computing-power state indicators; a computing-power scheduling control node, used for collecting node-level