CN-121441805-B - Computing node ghost resource detection method and system based on network traffic
Abstract
The invention provides a method and a system for detecting ghost resources on a computing node based on network traffic, belonging to the field of data center resource management and network observability. The method comprises: collecting the communication traffic between a computing node and a management scheduling node in real time; performing protocol classification and traffic feature extraction to identify and extract heartbeat traffic and business I/O traffic separately; presetting a sliding time window and counting the number of heartbeat packets and business I/O packets in each window; calculating the difference between the two counts in each window; setting a difference threshold θ; and judging the node to be a ghost resource node when the smoothed difference exceeds θ for N consecutive windows, the duration T for which the node has been occupied satisfies T > T0, the CPU utilization rate of the node is < 5, and its disk I/O rate is < 1 MB/s, whereupon a ghost resource node alarm is generated and the node is handled. The invention improves the real-time performance, accuracy and universality of detection.
Inventors
- Hai Wanxue
- LIAO SHUIPING
- Rong Zengjun
- FENG GUANGWEI
Assignees
- 北京网深科技有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20251112
Claims (10)
- 1. A method for detecting ghost resources of a computing node based on network traffic, characterized by comprising the following steps: step S1, collecting the communication traffic between a computing node and a management scheduling node in real time, wherein the communication traffic at least comprises heartbeat traffic and business I/O traffic; step S2, performing protocol classification and traffic feature extraction on the communication traffic, and identifying and extracting the heartbeat traffic and the business I/O traffic separately; step S3, presetting a sliding time window, and counting the numbers of data packets H(t) and D(t) of the heartbeat traffic and the business I/O traffic, respectively, within the sliding time window, wherein t is a time index variable; step S4, calculating the heartbeat-to-business packet difference in each sliding time window, and applying time-series smoothing to the difference and storing it; step S5, setting a difference threshold θ and a duration threshold T0, judging the node to be a ghost resource node when the smoothed difference Δ'(t) exceeds the threshold θ for N consecutive windows and the duration T for which the current computing node has been occupied exceeds T0, and proceeding to step S6, otherwise judging that the node is not a ghost resource node; and step S6, generating a ghost resource node alarm and handling the ghost resource node.
- 2. The method according to claim 1, wherein step S2 comprises: step S21, extracting traffic features and protocol types from the collected communication traffic, wherein the traffic features comprise average packet length APS, period stability PS and data directivity DR; step S22, setting a control plane protocol type library and a data transmission protocol type library, and setting a first traffic feature condition and a second traffic feature condition; step S23, marking the current data packet as heartbeat traffic when its traffic features meet the first traffic feature condition and its protocol belongs to the control plane protocol type library, marking the current data packet as business I/O traffic when its traffic features meet the second traffic feature condition and its protocol belongs to the data transmission protocol type library, and otherwise marking the data packet as abnormal traffic.
- 3. The method of claim 2, wherein the control plane protocol type library at least comprises the Slurm RPC, MPI Rank Heartbeat, TCP Keepalive, ICMP Echo and gRPC Health Check communication protocols, and the data transmission protocol type library at least comprises MPI, RDMA, NCCL, NFS, HTTP, gRPC and database access protocols; the first traffic feature condition comprises APS < 128 B, PS > 0.8 and DR < 0.3, and the second traffic feature condition comprises APS > 512 B, PS ≤ 0.8 and DR ≥ 0.3.
- 4. The method according to claim 1, wherein, when the heartbeat traffic and the business I/O traffic are identified and classified in step S2, a machine-learning-based classification model is constructed, the model input features are the traffic feature indicators, the model output labels are {heartbeat, business, unknown}, and classification is performed automatically by the machine learning model.
- 5. The method according to claim 1, wherein step S4 calculates the packet difference Δ(t) as follows: Δ(t) = H(t) - α·D(t), wherein α is a weight coefficient with a value of 0.1 to 0.5.
- 6. The method according to claim 1, wherein the time-series smoothing of step S4 applies a moving average or exponentially weighted smoothing to the sequence of packet differences Δ(t) according to the following formula: Δ'(t) = β·Δ(t) + (1-β)·Δ'(t-1), wherein β takes a value in the range 0.2 to 0.5.
- 7. The method according to claim 1, wherein setting the difference threshold θ in step S5 uses a moving average and a standard deviation, comprising: dynamically calculating the historical difference mean μ and standard deviation σ of the node; setting the threshold θ to θ = μ + 3σ; and supporting node-type adaptation, with separate configuration for computing nodes, storage nodes and GPU nodes.
- 8. The method of claim 1, wherein, when handling the ghost resource node, step S6 performs one of: automatically releasing resources, restarting the node service, or notifying an administrator to perform a manual review.
- 9. The method of claim 1, wherein step S6 further comprises visually displaying the detection result, highlighting the ghost node in red in the monitoring system interface, providing a trend curve showing the change of Δ(t) or Δ'(t), and logging the alarm event.
- 10. A system for detecting ghost resources of a computing node based on network traffic, comprising a traffic acquisition module, a feature extraction module, a classification statistics module, a difference calculation module, a decision module, and an alarm and handling module, wherein: the traffic acquisition module is used for collecting the communication traffic between a computing node and a management scheduling node in real time, wherein the communication traffic at least comprises heartbeat traffic and business I/O traffic; the feature extraction module is used for performing protocol classification and traffic feature extraction on the communication traffic, and identifying and extracting the heartbeat traffic and the business I/O traffic separately; the classification statistics module is used for presetting a sliding time window and counting the numbers of data packets H(t) and D(t) of the heartbeat traffic and the business I/O traffic, respectively, within the sliding time window, wherein t is a time index variable; the difference calculation module is used for calculating the heartbeat-to-business packet difference in each sliding time window, and applying time-series smoothing to the difference and storing it; the decision module is used for setting a difference threshold θ and a duration threshold T0, judging the node to be a ghost resource node when the smoothed difference Δ'(t) exceeds the threshold θ for N consecutive windows and the duration T for which the current computing node has been occupied exceeds T0, and starting the alarm and handling module, otherwise judging that the node is not a ghost resource node; and the alarm and handling module is used for generating a ghost resource node alarm and handling the ghost resource node.
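The rule-based classification described in claims 2 and 3 might be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the `FlowFeatures` type, the `classify` function and the lowercase protocol name strings are hypothetical, and because the claims do not say how APS, PS and DR are computed, the sketch takes them as already-extracted inputs.

```python
from dataclasses import dataclass

# Protocol libraries listed in claim 3. The string spellings are
# illustrative assumptions; the patent does not define identifiers.
CONTROL_PLANE = {"slurm_rpc", "mpi_rank_heartbeat", "tcp_keepalive",
                 "icmp_echo", "grpc_health_check"}
DATA_TRANSFER = {"mpi", "rdma", "nccl", "nfs", "http", "grpc", "db_access"}


@dataclass
class FlowFeatures:
    protocol: str  # protocol type detected for the packet or flow
    aps: float     # average packet length, in bytes
    ps: float      # period stability (assumed dimensionless score)
    dr: float      # data directivity (assumed dimensionless score)


def classify(f: FlowFeatures) -> str:
    """Label traffic as heartbeat, business, or abnormal (step S23)."""
    # First condition (claim 3): small, highly periodic packets.
    first = f.aps < 128 and f.ps > 0.8 and f.dr < 0.3
    # Second condition (claim 3): large, less periodic packets.
    second = f.aps > 512 and f.ps <= 0.8 and f.dr >= 0.3
    if first and f.protocol in CONTROL_PLANE:
        return "heartbeat"
    if second and f.protocol in DATA_TRANSFER:
        return "business"
    # Anything matching neither rule is marked as abnormal traffic.
    return "abnormal"
```

Note that both the feature condition and the protocol library membership must hold, so a small periodic packet carried over a data transmission protocol is labelled abnormal rather than heartbeat.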
Description
Computing node ghost resource detection method and system based on network traffic
Technical Field
The invention belongs to the field of data center resource management and network observability, and particularly relates to a computing node ghost resource detection method and system based on network traffic.
Background
In a large data center or computing cluster, jobs are distributed by a scheduling system to computing nodes for execution. After a job completes, the node should release the resources it occupied, such as compute, memory and GPU, so that new jobs can use them. In practice, however, node resources are often not released normally, which is known as the "ghost resource" problem. In this situation the job is marked as "completed" at the scheduling layer and the node shows no actual computation, storage or network I/O activity, yet the node still maintains its heartbeat or daemon processes, so the system mistakenly believes the resources are still occupied. The computing node thus becomes a ghost resource, and the job scheduling efficiency and quality of the system are seriously affected. To release ghost resources as soon as possible, they must first be detected. In the prior art, ghost resources on computing nodes are detected by comparing scheduling logs, collecting CPU/memory utilization statistics, or manually checking node states.
However, these methods have the following drawbacks: they rely only on CPU/memory metrics, so nodes that are lightly loaded but still occupied are difficult to identify; the detection granularity is coarse; they depend heavily on periodic scheduling logs, resulting in large delays and poor timeliness; they cannot link with the scheduling system to clean up resources automatically, so no automated closed loop is achieved; and some heartbeat processes are misjudged as active jobs, leading to a high false-positive rate and low detection efficiency.
Disclosure of Invention
In view of the above drawbacks and shortcomings in the prior art, the present invention aims to provide a method and a system for detecting ghost resources of a computing node based on network traffic. The method and system analyze the network behavior pattern of a node based on the difference between its heartbeat traffic and its business traffic, and accurately identify nodes whose heartbeat is normal but which show no business activity, thereby improving the node resource utilization of the data center and reducing operation and maintenance costs. They are suitable for resource scheduling and operation and maintenance management in high-performance computing clusters, AI training platforms, cloud computing and virtualized data centers.
To achieve the above purpose, the embodiments of the present invention adopt the following technical scheme:
In a first aspect, an embodiment of the present invention provides a method for detecting ghost resources of a computing node based on network traffic, the method comprising the following steps:
Step S1, collecting the communication traffic between a computing node and a management scheduling node in real time, wherein the communication traffic at least comprises heartbeat traffic and business I/O traffic;
Step S2, performing protocol classification and traffic feature extraction on the communication traffic, and identifying and extracting the heartbeat traffic and the business I/O traffic separately;
Step S3, presetting a sliding time window, and counting the numbers of data packets H(t) and D(t) of the heartbeat traffic and the business I/O traffic, respectively, within the sliding time window, wherein t is a time index variable;
Step S4, calculating the heartbeat-to-business packet difference in each sliding time window, and applying time-series smoothing to the difference and storing it;
Step S5, setting a difference threshold θ and a duration threshold T0, judging the node to be a ghost resource node when the smoothed difference Δ'(t) exceeds the threshold θ for N consecutive windows and the duration T for which the current computing node has been occupied exceeds T0, and proceeding to step S6, otherwise judging that the node is not a ghost resource node;
Step S6, generating a ghost resource node alarm and handling the ghost resource node.
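Steps S4 and S5 can be illustrated with a short sketch that combines the difference formula of claim 5, the exponential smoothing of claim 6 and the μ + 3σ threshold of claim 7. It is a sketch under stated assumptions rather than the patented implementation: the `GhostDetector` class, its parameter names and the `adaptive_threshold` helper are hypothetical, the smoothed series is seeded with the first difference, and the population standard deviation is used, none of which is specified in the text.

```python
from statistics import mean, pstdev


class GhostDetector:
    """Per-node decision logic for steps S4 and S5 (illustrative only)."""

    def __init__(self, theta, n_windows, t0, alpha=0.3, beta=0.3):
        self.theta = theta          # difference threshold θ
        self.n_windows = n_windows  # N consecutive windows required
        self.t0 = t0                # duration threshold T0, in seconds
        self.alpha = alpha          # weight α, claimed range 0.1 to 0.5
        self.beta = beta            # smoothing β, claimed range 0.2 to 0.5
        self.smoothed = None        # Δ'(t-1); None before the first window
        self.streak = 0             # consecutive windows above θ

    def update(self, h, d, occupied_seconds):
        """Feed one window's counts H(t), D(t); return True if ghost."""
        delta = h - self.alpha * d  # Δ(t) = H(t) - α·D(t)
        if self.smoothed is None:
            # Assumption: seed the smoothed series with the first value.
            self.smoothed = delta
        else:
            # Δ'(t) = β·Δ(t) + (1-β)·Δ'(t-1)
            self.smoothed = (self.beta * delta
                             + (1 - self.beta) * self.smoothed)
        # Count consecutive windows in which Δ'(t) exceeds θ.
        self.streak = self.streak + 1 if self.smoothed > self.theta else 0
        return self.streak >= self.n_windows and occupied_seconds > self.t0


def adaptive_threshold(history):
    """θ = μ + 3σ over a node's historical differences (claim 7)."""
    # Assumption: population standard deviation; the text does not say.
    return mean(history) + 3 * pstdev(history)
```

For example, a detector created as `GhostDetector(theta=5.0, n_windows=3, t0=600)` and fed three windows with 10 heartbeat packets, no business packets and 900 seconds of occupancy flags the node on the third window; a later window with heavy business traffic pulls Δ'(t) below θ and resets the streak.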
As a preferred embodiment of the present invention, step S2 includes: step S21, extracting flow characteristics and protocol types of the acquired communication flow, wherein the flow characteristics comprise average packet length APS, period stability PS and data directivity DR; step S22, a control plane protocol type library and a data transmission protocol type library are set, and a first flow characteristic condition and a second flow characteristic condition are set at the same time; Step S23, when the flow characteristic meets the first flow characteristic condition and belongs t