
CN-121411943-B - Elastic resource arrangement method and system of distributed intelligent parallel deduction engine

CN121411943B

Abstract

The invention discloses an elastic resource orchestration method and system for a distributed intelligent parallel deduction engine. The method comprises: performing dynamic load prediction on the simulation deduction data generated in each simulation period of each simulation process in a distributed simulation deduction system, to obtain predicted load state parameters for each subsequent simulation period of each process, and dynamically marking the periods accordingly; dynamically setting optimized scheduling instructions based on the period marking results and simulation event rules; ordering the dynamically set scheduling instructions according to the topology constraint features of the distributed simulation deduction system and executing them; and, during the simulation deduction, evaluating container state in real time from the running state parameters of each simulation container and performing dynamic fault handling based on the container state evaluation results and fault types. Through dynamic load prediction and intelligent scheduling, periodic resource demand peaks in distributed simulation are effectively smoothed and overall resource utilization is markedly improved.

Inventors

  • CAO DAOGANG
  • GUO ZHILI
  • WANG BIN

Assignees

  • 北京流深数据科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-10-09

Claims (9)

  1. An elastic resource orchestration method for a distributed intelligent parallel deduction engine, characterized by comprising the following steps: performing dynamic load prediction based on the simulation deduction data generated in each simulation period of each simulation process in a distributed simulation deduction system, to obtain predicted load state parameters for each subsequent simulation period of each simulation process, and dynamically marking the periods according to the predicted load state parameters, wherein the distributed simulation deduction system is a large-scale simulation platform supporting parallel deduction of multiple simulation processes, and the simulation deduction data comprise load state parameters and business semantic parameters; the step of dynamically marking the periods comprises: the predicted load state parameters comprise basic resource parameters and business efficiency parameters, the basic resource parameters comprising CPU utilization, GPU utilization, memory occupation, message queue depth and network I/O throughput; calculating a load state evaluation index for each subsequent simulation period of each simulation process based on the business efficiency parameters of that period; if the load state evaluation index of a subsequent simulation period of a simulation process is lower than a preset first load state threshold, or any basic resource parameter is lower than the corresponding first basic resource parameter threshold, marking that simulation period as a first period; if the load state evaluation index is not lower than the preset first load state threshold but does not exceed a preset second load state threshold, or the basic resource parameters are not lower than the corresponding first basic resource parameter thresholds and do not exceed the corresponding second basic resource parameter thresholds, marking that simulation period as a second period; if the load state evaluation index exceeds the preset second load state threshold, or any basic resource parameter exceeds the corresponding second basic resource parameter threshold, marking that simulation period as a third period; dynamically setting optimized scheduling instructions based on the period marking results and simulation event rules, ordering the dynamically set scheduling instructions based on the topology constraint features of the distributed simulation deduction system, and executing the optimized scheduling instructions according to the ordering result, wherein the topology constraint features define the multidimensional association relations and running state attributes of the simulation nodes at the structural and behavioural levels; and, during the simulation deduction, evaluating container state in real time based on the running state parameters of each simulation container and performing dynamic fault handling based on the container state evaluation results and fault types, wherein the fault types comprise performance-level, container-level, node-level and region-level abnormality, and dynamic fault handling means dynamically executing the corresponding fault handling flow so as to improve the reliability of the elastic resource orchestration.
  2. The elastic resource orchestration method of a distributed intelligent parallel deduction engine according to claim 1, wherein the predicted load state parameters of each subsequent simulation period of each simulation process are obtained as follows: S1, aggregating the load state parameters of the historical simulation periods and the real-time load state parameters of each simulation process by simulation period to obtain a time-series data record for each simulation process; S2, if a load state parameter in the time-series data record exhibits a periodic characteristic across the simulation periods, calculating the predicted value of that load state parameter from the periodic characteristic; S3, if the slope change of the load state parameter between adjacent simulation periods does not reach a preset slope change threshold, or reaches the preset slope change threshold but the number of occurrences does not reach a preset count threshold, taking the average of the load state parameter over the simulation periods as the predicted value of that load state parameter for the simulation process; S4, if the slope change of the load state parameter between adjacent simulation periods reaches the preset slope change threshold and the number of occurrences reaches the preset count threshold, extracting feature data of the load state parameter in each simulation period and feeding the feature data into a preloaded data prediction model to obtain the predicted value of that load state parameter for the simulation process; and S5, if the load state parameter satisfies both the periodicity condition of S2 and the slope change condition of S3, performing weighted fusion of the periodic-characteristic predicted value and the average predicted value with the preset periodic-characteristic weight and average-prediction weight, respectively, to obtain the predicted value of that load state parameter for the simulation process.
  3. The elastic resource orchestration method of a distributed intelligent parallel deduction engine according to claim 1, wherein the load state evaluation index of each subsequent simulation period of each simulation process is obtained as follows: computing the ratios of the stepping rate, the entity interaction event frequency and the activity intensity of each region for each subsequent simulation period of each simulation process to the corresponding stepping rate reference value, entity interaction event frequency reference value and activity intensity reference value; weighting each ratio with the corresponding business efficiency parameter weight factor; and coupling the weighted results to obtain the load state evaluation index of each subsequent simulation period of each simulation process; the business efficiency parameter weight factors comprise a stepping rate weight factor, an entity interaction event frequency weight factor and an activity intensity weight factor; the load state evaluation index is a numerical index that comprehensively quantifies the load state based on integrated processing of the business efficiency parameters.
  4. The elastic resource orchestration method of a distributed intelligent parallel deduction engine according to claim 1, wherein the step of dynamically setting scheduling instructions based on the period marking results and simulation event rules comprises: the scheduling instructions comprise a vertical scale-down instruction, a horizontal scale-down instruction, a vertical scale-up instruction and a horizontal scale-up instruction; processing the scheduling instruction set with a multi-objective resource optimization algorithm to obtain an optimized scheduling instruction set; setting the corresponding optimized scheduling instructions according to preset simulation event rules based on the simulation events in each subsequent simulation period of each simulation process; if the number of third periods among the subsequent simulation periods of a simulation process exceeds a preset period count threshold, setting an optimized vertical scale-up instruction for each subsequent simulation period of that process; if third periods exist among the subsequent simulation periods of a simulation process but their number does not exceed the preset period count threshold, setting an optimized horizontal scale-up instruction for each subsequent simulation period of that process; if no third period exists among the subsequent simulation periods of a simulation process but first periods exist and not all periods are first periods, setting an optimized vertical scale-down instruction for each subsequent simulation period of that process; and if all subsequent simulation periods of a simulation process are first periods, setting an optimized horizontal scale-down instruction for each subsequent simulation period of that process.
  5. The elastic resource orchestration method of a distributed intelligent parallel deduction engine according to claim 4, wherein the step of ordering the dynamically set scheduling instructions based on the topology constraint features of the distributed simulation deduction system comprises: the topology constraint features comprise the physical connections, logical dependencies and load states among the simulation nodes; classifying the optimized scheduling instructions to be executed by subsequent simulation period to obtain the set of optimized scheduling instructions to be executed for each subsequent simulation period; within the instruction set of the same subsequent simulation period, ordering the instructions of simulation nodes that have logical dependency relations according to the logical dependency rules; obtaining the load state evaluation index of each simulation node in each subsequent simulation period based on the mapping between the simulation nodes and the simulation processes; for the optimized scale-up instructions in the instruction set of the same subsequent simulation period, first sorting the instructions of physically connected simulation nodes in ascending order of the bandwidth utilization of the links between those nodes, then sorting all optimized scale-up instructions in ascending order of load state evaluation index to obtain a first instruction ordering sequence; sorting all optimized scale-down instructions in the instruction set of the same subsequent simulation period in descending order of load state evaluation index to obtain a second instruction ordering sequence; and merging the first and second instruction ordering sequences of the same subsequent simulation period and sorting the merged sequence in ascending order of load state evaluation index to obtain the final instruction execution sequence for each subsequent simulation period.
  6. The elastic resource orchestration method of a distributed intelligent parallel deduction engine according to claim 5, wherein the step of executing the optimized scheduling instructions based on the ordering result comprises: if the number of optimized scheduling instructions in the same subsequent simulation period does not exceed a preset parallel instruction threshold, executing all the optimized scheduling instructions in parallel; if the number of optimized scheduling instructions in the same subsequent simulation period exceeds the preset parallel instruction threshold, executing the optimized scheduling instructions sequentially according to the corresponding final instruction execution sequence.
  7. The elastic resource orchestration method of a distributed intelligent parallel deduction engine according to claim 1, wherein the running state parameters comprise the average response time, the service response success rate, the check failure count and the historical average load state evaluation index of the corresponding simulation process; the step of evaluating container state based on the running state parameters of each simulation container comprises: computing the ratios of the average response time, service response success rate and check failure count of each simulation container to the corresponding average response time reference value, service response success rate reference value and check failure count reference value, to obtain the response time influence parameter, service response success rate influence parameter and check failure count influence parameter of each simulation container; performing a deviation operation, within the allowed deviation, between the historical average load state evaluation index of each simulation container and the average load state evaluation index reference value to obtain the average load state influence parameter of each simulation container; weighting the response time influence parameter, service response success rate influence parameter, check failure count influence parameter and average load state influence parameter of each simulation container with the running state parameter weight factors, and coupling the weighted results to obtain the running state evaluation index of each simulation container; if the running state evaluation index of a simulation container does not reach a preset running state evaluation threshold, marking the simulation container as an abnormal container; if the running state evaluation index of a simulation container reaches the preset running state evaluation threshold, marking the simulation container as a normal container; the running state parameter weight factors comprise a response time weight factor, a service response success rate weight factor, a check failure count weight factor and a historical average load weight factor; the running state evaluation index is a numerical index that comprehensively quantifies the health and efficiency of a container based on integrated processing of the running state parameters.
  8. The elastic resource orchestration method of a distributed intelligent parallel deduction engine according to claim 7, wherein the step of performing dynamic fault handling based on the container state evaluation results and fault types comprises: marking a simulation container whose check failure count exceeds a preset failure count threshold as a container-level abnormal container; marking a simulation container whose associated simulation node is abnormal as a node-level abnormal container; marking a simulation container whose data center or availability zone has failed as a region-level abnormal container; marking a simulation container that meets none of the container-level, node-level and region-level abnormality conditions but whose running state evaluation index does not reach the preset running state evaluation threshold as a performance-level abnormal container; if the simulation container is a container-level abnormal container, starting the container-level exception handling flow; if the simulation container is a node-level abnormal container, starting the node-level exception handling flow; if the simulation container is a region-level abnormal container, starting the region-level exception handling flow; and if the simulation container is a performance-level abnormal container, starting the performance-level exception handling flow.
  9. An elastic resource orchestration system of a distributed intelligent parallel deduction engine, characterized by comprising a period marking module, a resource scheduling module and a fault handling module: the period marking module is used for performing dynamic load prediction based on the simulation deduction data generated in each simulation period of each simulation process in the distributed simulation deduction system, to obtain predicted load state parameters for each subsequent simulation period of each simulation process, and dynamically marking the periods according to the predicted load state parameters, the step of dynamically marking the periods comprising: the predicted load state parameters comprise basic resource parameters and business efficiency parameters, the basic resource parameters comprising CPU utilization, GPU utilization, memory occupation, message queue depth and network I/O throughput; calculating a load state evaluation index for each subsequent simulation period of each simulation process based on the business efficiency parameters of that period; if the load state evaluation index of a subsequent simulation period of a simulation process is lower than a preset first load state threshold, or any basic resource parameter is lower than the corresponding first basic resource parameter threshold, marking that simulation period as a first period; if the load state evaluation index is not lower than the preset first load state threshold but does not exceed a preset second load state threshold, or the basic resource parameters are not lower than the corresponding first basic resource parameter thresholds and do not exceed the corresponding second basic resource parameter thresholds, marking that simulation period as a second period; if the load state evaluation index exceeds the preset second load state threshold, or any basic resource parameter exceeds the corresponding second basic resource parameter threshold, marking that simulation period as a third period; the resource scheduling module is used for dynamically setting scheduling instructions based on the period marking results and simulation event rules, ordering the dynamically set scheduling instructions based on the topology constraint features of the distributed simulation deduction system, and executing the optimized scheduling instructions based on the ordering result; and the fault handling module is used for evaluating container state in real time based on the running state parameters of each simulation container during the simulation deduction, and performing dynamic fault handling based on the container state evaluation results and fault types.
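The three-way period marking of claims 1 and 9 amounts to a threshold classifier over the load state evaluation index and the basic resource parameters. The sketch below is illustrative only; the function name, argument layout and tie-breaking order (checking the third-period condition before the first) are assumptions, not taken from the patent.

```python
def mark_period(load_index, resources, idx_low, idx_high, res_low, res_high):
    """Mark one subsequent simulation period as 1 (first), 2 (second) or 3 (third).

    load_index       : load state evaluation index for the period
    resources        : basic resource parameters, e.g. {"cpu": 0.5, "gpu": 0.4}
    idx_low/idx_high : first/second load state thresholds
    res_low/res_high : per-parameter first/second basic resource thresholds
    """
    # Third period: the index, or any basic resource parameter,
    # exceeds its second threshold.
    if load_index > idx_high or any(v > res_high[k] for k, v in resources.items()):
        return 3
    # First period: the index, or any basic resource parameter,
    # falls below its first threshold.
    if load_index < idx_low or any(v < res_low[k] for k, v in resources.items()):
        return 1
    # Everything sits between the two thresholds: second period.
    return 2
```

In practice the thresholds would be tuned per deployment; here they are plain numbers to keep the classification rule visible.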
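Steps S1 to S5 of claim 2 combine three predictors: a periodicity-based value, a simple mean, and a preloaded prediction model for volatile series, fused by fixed weights when both the periodic and the mean estimate apply. The sketch below is a minimal interpretation with assumed semantics: the periodicity test, the fusion weights and the `model` callback are all hypothetical stand-ins for the unspecified details in the claim.

```python
def predict_load(series, period_len, slope_thresh, count_thresh,
                 w_periodic=0.6, w_mean=0.4, model=None):
    """Predict the next value of one load-state parameter (claim 2 sketch).

    series     : time-ordered samples, one per historical simulation period
    period_len : assumed cycle length when testing for periodicity (S2)
    """
    # S3 baseline: mean over all recorded periods.
    mean_pred = sum(series) / len(series)

    # Count adjacent-period slope changes exceeding the threshold (S3/S4).
    big_changes = sum(
        1 for a, b in zip(series, series[1:]) if abs(b - a) >= slope_thresh
    )

    # S2: naive periodicity check - does the series repeat every period_len steps?
    periodic = len(series) >= 2 * period_len and all(
        abs(series[i] - series[i - period_len]) < slope_thresh
        for i in range(period_len, len(series))
    )

    # S4: volatile series are handed to the preloaded prediction model.
    if big_changes >= count_thresh and model is not None:
        return model(series)
    # S5: weighted fusion of the periodic estimate and the mean estimate.
    if periodic:
        return w_periodic * series[-period_len] + w_mean * mean_pred
    # S3: otherwise the mean stands as the prediction.
    return mean_pred
```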
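Claim 3 forms the load state evaluation index by taking each business efficiency parameter's ratio to its reference value, weighting the ratios, and "coupling" the results. The coupling operator is not fixed by the claim; the sketch below assumes a weighted sum, and the weight values are illustrative.

```python
def load_state_index(step_rate, event_freq, activity,
                     step_ref, event_ref, activity_ref,
                     w_step=0.4, w_event=0.35, w_activity=0.25):
    """Load state evaluation index from business efficiency ratios (claim 3 sketch)."""
    # Ratio of each observed business efficiency parameter to its reference value.
    ratios = (step_rate / step_ref, event_freq / event_ref, activity / activity_ref)
    weights = (w_step, w_event, w_activity)
    # Coupling is modelled here as a weighted sum; the patent leaves the
    # exact coupling operation open.
    return sum(w * r for w, r in zip(weights, ratios))
```

With all parameters at twice their reference values the index is 2.0, i.e. the index scales linearly with uniform load growth under this assumed coupling.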
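Claims 4 and 5 describe, respectively, a rule table that maps a process's period marks to one scaling instruction (the patent's "longitudinal/transverse capacity expansion/shrinkage", i.e. vertical/horizontal scale-up/scale-down), and a three-stage ordering of the resulting instructions. The sketch below is illustrative: instruction names are hypothetical, and the bandwidth-utilization pre-sort of physically connected nodes in claim 5 is omitted for brevity.

```python
def set_scaling_instruction(period_marks, count_thresh):
    """Choose one scaling instruction from a process's period marks (claim 4 sketch)."""
    thirds = period_marks.count(3)
    if thirds > count_thresh:
        return "vertical-scale-up"       # many peak (third) periods
    if thirds > 0:
        return "horizontal-scale-up"     # a few peak periods
    if all(m == 1 for m in period_marks):
        return "horizontal-scale-down"   # uniformly idle (all first periods)
    if 1 in period_marks:
        return "vertical-scale-down"     # partly idle, no peaks
    return None                          # all second periods: no change

def order_instructions(scale_ups, scale_downs):
    """Claim 5 sketch: scale-ups ascending and scale-downs descending by
    load state evaluation index, then merged and re-sorted ascending.
    Each instruction is a (name, load_index) pair."""
    first = sorted(scale_ups, key=lambda i: i[1])                  # ascending
    second = sorted(scale_downs, key=lambda i: i[1], reverse=True) # descending
    # Final execution sequence: merged, ascending by load state index.
    return sorted(first + second, key=lambda i: i[1])
```

Claim 6 then executes such a sequence in parallel when its length stays under the parallel instruction threshold, and strictly in this order otherwise.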
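Claims 7 and 8 evaluate each container's running state as a weighted coupling of influence parameters, then route abnormal containers to a fault level. The sketch below makes several assumptions the claims leave open: the influence ratios are oriented so that higher is healthier (reference over observed for response time and failure count, observed over reference for success rate), the deviation operation on the historical average load is reduced to a plain ratio, and the coupling is a weighted sum.

```python
def container_state_index(resp_time, success_rate, fail_count, avg_load,
                          refs, weights):
    """Running state evaluation index of one container (claim 7 sketch).

    refs / weights : dicts keyed "resp", "success", "fail", "load";
    all values are illustrative, not from the patent.
    """
    parts = {
        "resp": refs["resp"] / resp_time,           # slower responses lower the index
        "success": success_rate / refs["success"],  # higher success raises it
        "fail": refs["fail"] / max(fail_count, 1),  # more check failures lower it
        "load": refs["load"] / max(avg_load, 1e-9), # heavier history lowers it
    }
    return sum(weights[k] * v for k, v in parts.items())

def classify_fault(c):
    """Claim 8 sketch: fault level chosen most-specific first."""
    if c["fail_count"] > c["fail_thresh"]:
        return "container-level"
    if c["node_abnormal"]:
        return "node-level"
    if c["zone_faulty"]:
        return "region-level"
    if c["state_index"] < c["state_thresh"]:
        return "performance-level"
    return "normal"
```

A dispatcher would then start the exception handling flow matching the returned level, as claim 8 prescribes.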

Description

Elastic resource arrangement method and system of distributed intelligent parallel deduction engine

Technical Field

The invention relates to the technical field of elastic resource data management, and in particular to an elastic resource orchestration method and system for a distributed intelligent parallel deduction engine.

Background

In a distributed simulation deduction system, deduction is carried out under the premise that the scenario is complex, the number of entities is large and interactions are frequent, so the computational load shows significant dynamic fluctuation. A conventional static resource allocation mode struggles to cope with load that alternates between peaks and valleys, and easily causes resource bottlenecks or large-scale idling. Elastic resource scheduling for distributed intelligent parallel deduction engines arose to address this core challenge: the deduction load is monitored in real time, and intelligent decision algorithms dynamically allocate and scale the underlying computing resources on demand, greatly improving cluster resource utilization while guaranteeing the real-time performance and continuity of the deduction, enabling modern large-scale simulation deduction applications with high concurrency, low latency and high availability. Existing elastic resource orchestration methods mainly comprise continuously collecting the performance indicators of clusters and applications through a monitor, analyzing the time-series data through an analyzer to judge the current state, having a decision-maker compute scale-up and scale-down decisions according to a preset elasticity strategy, and finally having an executor perform the dynamic allocation of resources by calling the interfaces of the underlying infrastructure.
While this technology is evolving from reactive scaling on static thresholds to predictive scaling based on machine learning, aiming to improve resource efficiency, significant challenges remain in handling such specific loads as distributed simulation deductions. For example, the virtual machine resource allocation method, device, system and storage medium of invention patent publication CN112860370B comprises the steps that a network function virtualization orchestrator (NFVO) calculates the resources required by a VNF, the NFVO queries the idle resources of a virtualized infrastructure manager (VIM), the NFVO determines the reserved resources of the VNF according to the idle resources of the VIM and the pre-recorded reserved resources of other VNFs, and, when the reserved resources of the VNF exceed the resources it requires, the NFVO reserves the required resources for the VNF and records the correspondence between the VNF and the reserved resources. As another example, the self-adaptive cloud management platform system based on intelligent resource scheduling and container orchestration disclosed in invention patent publication CN120429116A comprises a refined resource scheduling and self-adaptive optimization module, a containerized application life cycle management and dynamic container orchestration module, a high-precision operation and maintenance monitoring and self-healing mechanism module based on big data analysis, and a dynamic resource allocation and elastic scaling strategy module of an intelligent scheduling engine.
However, in implementing the technical scheme of the embodiments of the application, the applicant found that the above technologies have at least the following technical problem: in the prior art, the demand of a single deduction sample for computing resources fluctuates severely across simulation stages and resource demand peaks occur periodically; when a large number of samples are deduced in parallel, their simulation progress is independent, so the peak periods inside the samples are randomly distributed in time and are very likely to overlap at some moment, with multiple samples applying to the resource pool for peak resources simultaneously, causing serious resource contention and low resource utilization.

Disclosure of Invention

The embodiments of the application solve the problem of low resource utilization in the prior art, caused by the fact that the demand of a single deduction sample for computing resources fluctuates severely across simulation stages, resource demand peaks appear periodically, and, when a large number of samples are deduced in parallel, the peak periods inside the samples are randomly distributed in time due to independent simulation progress, are likely to overlap at some moment, and apply for peak resources from the resource pool simultaneously, causing serious resource contention, and the periodic reso