CN-122020639-A - Dynamic monitoring and optimizing method and system for artificial intelligence sandbox environment


Abstract

The invention provides a dynamic monitoring and optimizing method and system for an artificial intelligence sandbox environment, relating to the technical field of artificial intelligence safety. Root cause information is extracted from the anomaly feature vectors, historical records are clustered to predict resource requirements and generate adjustment requirements, and the requirements are corrected through loop dependency analysis and divided into adjustment components. Finally, the resource allocation proportion is solved by integrating information from multiple models, and an environment tuning instruction is generated. The method and system achieve accurate localization of abnormal behavior in the sandbox, root cause analysis, and dynamic resource optimization, improving monitoring efficiency and resource utilization.

Inventors

MENG PANQIANG
CAO XU

Assignees

北京尚云数智科技有限公司

Dates

Publication Date: 20260512
Application Date: 20260211

Claims (10)

1. A dynamic monitoring and optimizing method for an artificial intelligence sandbox environment, characterized by comprising the following steps: acquiring execution state data of an artificial intelligence model running in a sandbox environment, wherein the execution state data comprises resource occupation information and output behavior characteristics; performing anomaly identification on the execution state data, constructing a directed dependency graph with each monitoring dimension as a node, determining an anomaly starting node through reverse tracing, determining an influence propagation link through forward traversal, and generating an anomaly feature vector; extracting root cause dimension information from the anomaly feature vector, performing segmented clustering on historical resource adjustment records according to the root cause dimension information, identifying transfer rules among clusters, predicting a resource demand state according to the cluster to which the current root cause dimension information belongs, and generating a resource adjustment demand; parsing anomaly information from the anomaly feature vector, correcting the resource adjustment demand based on loop dependency analysis, and dividing the corrected resource adjustment demand into an instant adjustment component and a reserved adjustment component; and, for the root cause dimension information and propagation depth information of the anomaly feature vectors respectively corresponding to a plurality of artificial intelligence models in the sandbox environment, solving a resource allocation proportion and generating an environment tuning instruction.
2. The method of claim 1, wherein performing anomaly identification on the execution state data, constructing a directed dependency graph with each monitoring dimension as a node, determining an anomaly starting node through reverse tracing, determining an influence propagation link through forward traversal, and generating an anomaly feature vector comprises: taking each monitoring dimension in the execution state data as a node, calculating the time at which the value of each monitoring dimension changes between adjacent time windows, sorting the nodes by value change time, establishing a directed edge from the preceding node to the subsequent node for each pair of nodes adjacent in the sorted order, and constructing the directed dependency graph; identifying nodes in the directed dependency graph whose values deviate from a preset range as candidate anomalous nodes, determining the corresponding anomaly type information according to the type of constraint each candidate anomalous node deviates from, performing reverse tracing from each candidate anomalous node along the directed edges to obtain root cause candidate nodes, calculating an anomaly elimination rate and an influence range coefficient through value replacement deduction, and determining the anomaly starting node; and performing forward traversal from the anomaly starting node along the directed edges, recording the sequence of nodes passed during the traversal to determine the influence propagation link, extracting the monitoring dimension identifier corresponding to each node, counting the path length from the anomaly starting node to the termination node to determine the propagation depth information, taking the monitoring dimension identifier corresponding to the anomaly starting node as the root cause dimension information, and combining the propagation depth information, the root cause dimension information, and the anomaly type information to form the anomaly feature vector.
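The graph construction and forward traversal recited in claim 2 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the dimension names, the change times, and the function names `build_dependency_edges` and `propagation_link` are all hypothetical, and the graph is the simple chain that results from linking only adjacent nodes in the sorted order.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]

def build_dependency_edges(change_times: Dict[str, float]) -> List[Edge]:
    """Sort dimensions by value-change time and link each one to its successor."""
    ordered = sorted(change_times, key=lambda dim: change_times[dim])
    return list(zip(ordered, ordered[1:]))

def propagation_link(edges: List[Edge], start: str) -> List[str]:
    """Follow directed edges forward from the starting node.
    In this chain-shaped graph every node has at most one successor."""
    successor = dict(edges)
    link = [start]
    while link[-1] in successor:
        link.append(successor[link[-1]])
    return link

# Hypothetical change times (seconds) for four monitoring dimensions.
change_times = {"cpu": 3.0, "memory": 1.0, "io_latency": 2.0, "output_entropy": 4.0}
edges = build_dependency_edges(change_times)
link = propagation_link(edges, "io_latency")
propagation_depth = len(link) - 1  # path length from start node to terminal node
```

Here the starting node is simply given; in the claimed method it would come from the reverse tracing and value replacement deduction step.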
3. The method of claim 2, wherein performing reverse tracing from each candidate anomalous node along the directed edges to obtain root cause candidate nodes, calculating an anomaly elimination rate and an influence range coefficient through value replacement deduction, and determining the anomaly starting node comprises: performing reverse tracing from each candidate anomalous node along the directed edges, and marking all nodes on the reverse path as root cause candidate nodes; replacing the value state of each root cause candidate node with the historical mean value of its corresponding monitoring dimension while keeping the value states of the other nodes unchanged, and performing forward value propagation to the downstream nodes along the directed edges according to the dependency relationships to obtain a deduced value for each candidate anomalous node; calculating the deviation between the deduced value and the actually observed value of the candidate anomalous node, and comparing the deviation with the anomaly deviation amplitude of the candidate anomalous node to obtain the anomaly elimination rate of each root cause candidate node; based on the topology of the directed dependency graph, performing forward traversal from each root cause candidate node along the directed edges, counting the number of downstream nodes covered by the traversal, and quantifying the influence range coefficient of each root cause candidate node; and calculating a causal contribution value by combining the anomaly elimination rate and the influence range coefficient, and selecting the root cause candidate node with the largest causal contribution value as the anomaly starting node.
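The value replacement deduction of claim 3 can be sketched as follows. The claim does not specify how values propagate along an edge or how the two scores are combined, so this sketch assumes a linear gain per edge, takes the anomaly deviation amplitude as the distance from the historical mean, and combines the scores by multiplication. The data, the `gain` table, and all function names are hypothetical, and the propagation routine is only valid for a chain-shaped graph.

```python
from typing import Dict, List, Tuple

# Hypothetical chain memory -> io_latency -> cpu with assumed linear edge gains.
edges: Dict[str, List[str]] = {"memory": ["io_latency"], "io_latency": ["cpu"], "cpu": []}
gain: Dict[Tuple[str, str], float] = {("memory", "io_latency"): 0.5, ("io_latency", "cpu"): 2.0}
observed = {"memory": 90.0, "io_latency": 40.0, "cpu": 95.0}
hist_mean = {"memory": 50.0, "io_latency": 20.0, "cpu": 55.0}

def downstream(node: str) -> List[str]:
    """All nodes reachable from `node` along directed edges."""
    seen: List[str] = []
    stack = list(edges[node])
    while stack:
        nxt = stack.pop()
        if nxt not in seen:
            seen.append(nxt)
            stack.extend(edges[nxt])
    return seen

def deduce(candidate: str, target: str) -> float:
    """Set `candidate` to its historical mean and push the change down the chain."""
    delta = {candidate: hist_mean[candidate] - observed[candidate]}
    stack = [candidate]
    while stack:
        src = stack.pop()
        for dst in edges[src]:
            delta[dst] = gain[(src, dst)] * delta[src]
            stack.append(dst)
    return observed[target] + delta.get(target, 0.0)

def causal_contribution(candidate: str, anomalous: str) -> float:
    amplitude = abs(observed[anomalous] - hist_mean[anomalous])
    elimination_rate = abs(deduce(candidate, anomalous) - observed[anomalous]) / amplitude
    influence = len(downstream(candidate))  # influence range coefficient
    return elimination_rate * influence     # assumed combination: product

candidates = ["io_latency", "memory"]  # nodes on the reverse path from "cpu"
scores = {c: causal_contribution(c, "cpu") for c in candidates}
start_node = max(scores, key=lambda c: scores[c])
```

In this example both candidates fully eliminate the anomaly at `cpu`, so the wider influence range of `memory` decides the selection.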
4. The method of claim 1, wherein extracting root cause dimension information from the anomaly feature vector, performing segmented clustering on historical resource adjustment records according to the root cause dimension information, identifying transfer rules among clusters, predicting a resource demand state according to the cluster to which the current root cause dimension information belongs, and generating a resource adjustment demand comprises: extracting the root cause dimension information from the anomaly feature vector, and screening from the historical resource adjustment records a history record subset having the same root cause dimension identifier; forming time series samples from the resource adjustment amount and adjustment time of each record in the history record subset, calculating the feature similarity among the samples, and grouping samples whose feature similarity exceeds a preset clustering threshold into the same cluster to obtain a plurality of clusters; extracting the propagation depth information from the anomaly feature vector, determining the division granularity of the time window, and aligning and marking the distribution intervals of the plurality of clusters on the time axis according to the division granularity; counting the distribution position of each cluster on the time axis, identifying the transfer probability of transferring from a first cluster to a second cluster, determining the current cluster to which the resource adjustment record corresponding to the current root cause dimension information belongs, and predicting the target cluster of the next period by combining the transfer probabilities from the current cluster to the other clusters; and extracting the resource adjustment amount of each history record in the target cluster, calculating a distribution characteristic value of the resource adjustment amounts, determining the resource demand state, and generating the resource adjustment demand.
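The transfer probability and target cluster prediction of claim 4 can be sketched in Python as below. The sketch starts from cluster labels that are already assigned and ordered in time, so the similarity-based clustering and the time window alignment are not shown. The labels, the adjustment amounts, and the use of the mean as the "distribution characteristic value" are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from statistics import mean
from typing import Dict, List

def transfer_probabilities(labels: List[str]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next cluster | current cluster) from consecutive labels."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for current, nxt in zip(labels, labels[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for current, c in counts.items()
    }

def predict_target(probs: Dict[str, Dict[str, float]], current: str) -> str:
    """Pick the cluster with the highest transfer probability from `current`."""
    row = probs[current]
    return max(row, key=lambda cluster: row[cluster])

# Hypothetical cluster labels of historical adjustment records, in time order.
labels = ["low", "low", "high", "low", "high", "high", "low", "high"]
# Hypothetical resource adjustment amounts recorded in each cluster.
adjustments = {"low": [1.0, 2.0], "high": [4.0, 6.0]}

probs = transfer_probabilities(labels)
target = predict_target(probs, "low")
demand = mean(adjustments[target])  # assumed characteristic value: the mean
```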
5. The method of claim 1, wherein parsing anomaly information from the anomaly feature vector, correcting the resource adjustment demand based on loop dependency analysis, and dividing it into an instant adjustment component and a reserved adjustment component comprises: parsing the anomaly type information and the propagation depth information from the anomaly feature vector, determining a resource type identifier according to the anomaly type information, identifying the set of loop-dependent nodes that form a closed loop in the influence propagation link, counting the in-degree and out-degree of each node in the loop-dependent node set, calculating the ratio of in-degree to out-degree to obtain a coupling strength coefficient, and multiplying the resource adjustment demand by the coupling strength coefficient to obtain a corrected resource adjustment demand; acquiring the current quota amount and the reserved quota amount of each resource type in a preset resource quota pool, calculating a time decay factor according to the propagation depth information, reducing the corrected resource adjustment demand segment by segment according to the time decay factor to obtain a plurality of segmented resource demands, extracting the rate of decrease of each segmented resource demand, classifying segmented resource demands whose rate of decrease is higher than a preset demarcation threshold as the instant adjustment component, and classifying segmented resource demands whose rate of decrease is lower than the demarcation threshold as the reserved adjustment component; and calculating priority weights according to the time periods corresponding to the instant adjustment component, allocating the current quota amount proportionally, transferring any portion that cannot be allocated into the reserved adjustment component, and performing segmented locking within the reserved quota amount according to the time periods corresponding to the reserved adjustment component to generate a resource allocation scheme.
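The demand correction and the split into instant and reserved components in claim 5 can be sketched as follows. Several details are not fixed by the claim and are assumed here: the coupling strength is computed over the whole loop-dependent node set as total in-degree divided by total out-degree, the time decay factor is taken as 1 / (1 + depth), each segment is the previous one multiplied by that factor, and the rate of decrease is the absolute drop to the next segment. The quota allocation and segment locking steps are not shown.

```python
from typing import List, Set, Tuple

def coupling_strength(loop_nodes: Set[str], edges: List[Tuple[str, str]]) -> float:
    """Ratio of edges entering loop nodes to edges leaving loop nodes."""
    in_degree = sum(1 for _, dst in edges if dst in loop_nodes)
    out_degree = sum(1 for src, _ in edges if src in loop_nodes)
    return in_degree / out_degree

def split_demand(corrected: float, decay: float, segments: int,
                 threshold: float) -> Tuple[List[float], List[float]]:
    """Decay the demand segment by segment and classify each segment."""
    instant: List[float] = []
    reserved: List[float] = []
    demand = corrected
    for _ in range(segments):
        rate = demand - demand * decay  # assumed: absolute drop to the next segment
        (instant if rate > threshold else reserved).append(demand)
        demand *= decay
    return instant, reserved

# Hypothetical graph: a and b form a loop, x and y feed into it.
edges = [("a", "b"), ("b", "a"), ("x", "a"), ("y", "b")]
coupling = coupling_strength({"a", "b"}, edges)
corrected = 4.0 * coupling            # hypothetical base demand of 4.0 units
decay = 1.0 / (1.0 + 1)               # assumed decay form, propagation depth 1
instant, reserved = split_demand(corrected, decay, segments=4, threshold=1.5)
```

With these numbers the early, fast-shrinking segments land in the instant component and the later, slowly shrinking ones in the reserved component.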
6. The method of claim 1, wherein, for the root cause dimension information and propagation depth information of the anomaly feature vectors respectively corresponding to the plurality of artificial intelligence models in the sandbox environment, solving the resource allocation proportion and generating the environment tuning instruction comprises: extracting the root cause dimension information and the propagation depth information from the anomaly feature vector corresponding to each artificial intelligence model, constructing a causal correlation graph according to the root cause dimension information, identifying a model subset sharing the same root cause node, calculating, for each artificial intelligence model in the model subset, the accumulated path weight from the root cause node to the model's own node, and determining a response sensitivity; multiplying the instant adjustment component by the response sensitivity to obtain a corrected instant demand, performing variance analysis on the corrected instant demands in the model subset, identifying a fluctuation model group whose variance exceeds a preset dispersion threshold, time-aligning the corrected instant demands of the artificial intelligence models in the fluctuation model group according to the propagation depth information, calculating a competition strength index by fusing a time conflict coefficient and a resource overlap coefficient, and identifying synchronous competition models; constructing a competition strength coefficient according to the competition strength index, and performing weighted allocation of the instant demand according to the reciprocal of the competition strength coefficient to obtain an instant resource allocation proportion; and, for non-synchronous competition models, expanding the reserved adjustment component along the time axis, calculating the slope of growth of the accumulated amount to determine priority, combining the instant resource allocation proportion and the reserved resource allocation proportion, and generating the environment tuning instruction.
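The weighted allocation by the reciprocal of the competition strength coefficient in claim 6 can be sketched as below. The claim does not give the exact weighting formula, so this sketch assumes that each model's weight is its corrected instant demand divided by its competition strength coefficient, normalized so the proportions sum to one. Model names, values, and the function name `instant_allocation` are hypothetical.

```python
from typing import Dict

def instant_allocation(instant: Dict[str, float],
                       sensitivity: Dict[str, float],
                       competition: Dict[str, float]) -> Dict[str, float]:
    """Scale each instant component by response sensitivity, weight it by the
    reciprocal of the competition strength coefficient, then normalize."""
    corrected = {m: instant[m] * sensitivity[m] for m in instant}
    weights = {m: corrected[m] / competition[m] for m in corrected}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

# Hypothetical values for two models sharing one root cause node.
instant = {"model_a": 2.0, "model_b": 4.0}
sensitivity = {"model_a": 2.0, "model_b": 1.0}
competition = {"model_a": 1.0, "model_b": 3.0}
proportions = instant_allocation(instant, sensitivity, competition)
```

Both models have the same corrected instant demand here, so the model facing stronger competition receives the smaller share.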
7. The method of claim 6, wherein time-aligning the corrected instant demands of the artificial intelligence models in the fluctuation model group according to the propagation depth information, calculating a competition strength index by fusing a time conflict coefficient and a resource overlap coefficient, and identifying synchronous competition models comprises: time-aligning the corrected instant demands of the artificial intelligence models in the fluctuation model group according to the propagation depth information, constructing a time series curve, performing time domain decomposition on the time series curve, and extracting the demand peak time and the peak duration; combining the artificial intelligence models in the fluctuation model group in pairs, calculating the time interval between the demand peak times of each pair, and marking pairs whose time interval is smaller than a preset time threshold as time-overlapping model pairs; for each time-overlapping model pair, calculating the length of the intersection of the peak durations of the two artificial intelligence models, and dividing the length of the intersection by the smaller of the two peak durations to obtain the time conflict coefficient; projecting the instant adjustment components of the time-overlapping model pair into the resource quota space to form resource demand vectors, and calculating the cosine of the angle between the resource demand vectors to obtain the resource overlap coefficient; and fusing the time conflict coefficient and the resource overlap coefficient to obtain the competition strength index, and identifying the synchronous competition models according to the competition strength index.
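The two coefficients of claim 7 follow directly from the claim wording and can be sketched in Python. The fusion step is not specified, so multiplying the two coefficients is an assumption, as are the peak intervals, the two-dimensional resource demand vectors, and the function names.

```python
import math
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (peak start, peak end)

def time_conflict(a: Interval, b: Interval) -> float:
    """Intersection length of two peak intervals over the shorter peak duration."""
    overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    return overlap / min(a[1] - a[0], b[1] - b[0])

def resource_overlap(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two resource demand vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def competition_index(a: Interval, b: Interval,
                      u: Sequence[float], v: Sequence[float]) -> float:
    # Assumed fusion: product of the two coefficients.
    return time_conflict(a, b) * resource_overlap(u, v)

# Hypothetical peaks and (cpu, memory) demand vectors for one model pair.
peak_a, peak_b = (0.0, 10.0), (6.0, 14.0)
demand_a, demand_b = (3.0, 4.0), (4.0, 3.0)
conflict = time_conflict(peak_a, peak_b)       # 4 / 8
overlap = resource_overlap(demand_a, demand_b) # 24 / 25
index = competition_index(peak_a, peak_b, demand_a, demand_b)
```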
8. A dynamic monitoring and optimizing system for an artificial intelligence sandbox environment, for implementing the method of any of the preceding claims 1 to 7, comprising: a state monitoring unit for acquiring execution state data of an artificial intelligence model running in the sandbox environment, wherein the execution state data comprises resource occupation information and output behavior characteristics; an anomaly tracing unit for performing anomaly identification on the execution state data, constructing a directed dependency graph with each monitoring dimension as a node, determining an anomaly starting node through reverse tracing, determining an influence propagation link through forward traversal, and generating an anomaly feature vector; a root cause prediction unit for extracting root cause dimension information from the anomaly feature vector, performing segmented clustering on historical resource adjustment records according to the root cause dimension information, identifying transfer rules among clusters, predicting a resource demand state according to the cluster to which the current root cause dimension information belongs, and generating a resource adjustment demand; a demand correction unit for parsing anomaly information from the anomaly feature vector, correcting the resource adjustment demand based on loop dependency analysis, and dividing the resource adjustment demand into an instant adjustment component and a reserved adjustment component; and a resource allocation unit for solving the resource allocation proportion and generating an environment tuning instruction for the root cause dimension information and propagation depth information of the anomaly feature vectors corresponding to the plurality of artificial intelligence models in the sandbox environment.
9. An electronic device, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 7.

Description

Dynamic monitoring and optimizing method and system for artificial intelligence sandbox environment

Technical Field

The invention relates to the technical field of artificial intelligence safety, and in particular to a dynamic monitoring and optimizing method and system for an artificial intelligence sandbox environment.

Background

In artificial intelligence model development and testing, sandbox environments are widely used to run models in isolation so that their performance, safety, and resource consumption can be evaluated. The prior art generally relies on static or threshold-triggered monitoring of the sandbox environment. The conventional approach deploys an independent monitoring agent that periodically collects basic resource metrics of the model, such as central processing unit occupancy, memory usage, and input/output latency, and compares them with preset static thresholds. Once a metric exceeds its threshold, the system triggers an alarm or performs a predefined resource adjustment action, such as allocating more memory or compute cores to the model instance. In addition, some schemes record the model's output log and detect obvious errors or abnormal output behavior through simple keyword matching or a rule engine.

However, these conventional monitoring and tuning methods have significant drawbacks. Static threshold monitoring adapts poorly to the dynamic, variable load of an artificial intelligence model and readily produces false alarms or missed alarms, so resource adjustment actions are either triggered too frequently, wasting resources, or lag so far behind that model performance suffers. More importantly, prior art methods generally treat each monitored metric as an isolated data point and lack any analysis of the inherent correlations and causal dependencies among metrics.
When an anomaly occurs, the system cannot effectively distinguish the root cause from its associated symptoms, and it is difficult to precisely locate the origin of the anomaly and its propagation path within the system. Subsequent resource adjustment decisions are therefore poorly targeted and cannot fundamentally relieve the cascading problems caused by complex dependencies, so adjustment efficiency is low and the effect is unstable.

Disclosure of Invention

Embodiments of the invention provide a dynamic monitoring and optimizing method and system for an artificial intelligence sandbox environment, which can solve the above problems in the prior art.

In a first aspect of the embodiments of the invention, a dynamic monitoring and optimizing method for an artificial intelligence sandbox environment is provided, comprising: acquiring execution state data of an artificial intelligence model running in a sandbox environment, wherein the execution state data comprises resource occupation information and output behavior characteristics; performing anomaly identification on the execution state data, constructing a directed dependency graph with each monitoring dimension as a node, determining an anomaly starting node through reverse tracing, determining an influence propagation link through forward traversal, and generating an anomaly feature vector; extracting root cause dimension information from the anomaly feature vector, performing segmented clustering on historical resource adjustment records according to the root cause dimension information, identifying transfer rules among clusters, predicting a resource demand state according to the cluster to which the current root cause dimension information belongs, and generating a resource adjustment demand; parsing anomaly information from the anomaly feature vector, correcting the resource adjustment demand based on loop dependency analysis, and dividing the corrected resource adjustment demand into an instant adjustment component and a reserved adjustment component; and, for the root cause dimension information and propagation depth information of the anomaly feature vectors respectively corresponding to a plurality of artificial intelligence models in the sandbox environment, solving a resource allocation proportion and generating an environment tuning instruction.

In an alternative embodiment, performing anomaly identification on the execution state data, constructing a directed dependency graph with each monitoring dimension as a node, determining an anomaly starting node through reverse tracing, determining an influence propagation link through forward traversal, and generating an anomaly feature vector comprises: taking each monitoring dimension in the execution state data as a node, calculating the time at which the value of each monitoring dimension changes between adjacent time windows, sorting the nodes by value change time, establishing a directed edge from a preceding node to a subsequent node