CN-122019232-A - Dynamic health management method for server cluster, electronic equipment and storage medium


Abstract

The application discloses a dynamic health management method for a server cluster, an electronic device, and a storage medium, relating to the technical field of server clusters. Multi-source data is converted into a first feature sequence, which realizes multi-modal data fusion. The first feature sequence is input into a health management model to obtain a health prediction result that comprises at least a first prediction result, representing the fault type of a target server, and a second prediction result, representing the potential fault propagation path of the target server in the target cluster; the health management model thereby supports cross-layer fault localization. A dynamic adjustment strategy is determined according to the health prediction result, and the corresponding dynamic adjustment operation is executed according to that strategy to complete dynamic health management of the target cluster, so that the target server is adjusted in advance and the health state of the target cluster is ensured.

Inventors

GAO FEI
ZHANG DONG
GUO TAO

Assignees

济南浪潮数据技术有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-19

Claims (10)

1. A method for dynamic health management of a server cluster, the method comprising: collecting multi-source data on a target server; preprocessing the multi-source data to obtain a first feature sequence; inputting the first feature sequence into a health management model to obtain a health prediction result, wherein the health prediction result comprises at least a first prediction result and a second prediction result, the first prediction result represents the fault type of the target server, and the second prediction result represents the potential fault propagation path of the target server in a target cluster; and determining a dynamic adjustment strategy according to the health prediction result, and executing a corresponding dynamic adjustment operation according to the dynamic adjustment strategy, so as to complete dynamic health management of the target cluster.
2. The method of claim 1, wherein preprocessing the multi-source data to obtain the first feature sequence comprises: performing noise reduction on sensor data in the multi-source data to obtain a first feature subsequence; performing feature extraction on log data in the multi-source data to obtain a second feature subsequence; normalizing network data in the multi-source data to obtain a third feature subsequence; encoding timestamps in the multi-source data to obtain a fourth feature subsequence; encoding hardware types in the multi-source data to obtain a fifth feature subsequence; and concatenating the first, second, third, fourth and fifth feature subsequences to obtain the first feature sequence.
3. The method of claim 1, wherein inputting the first feature sequence into the health management model to obtain the health prediction result comprises: inputting the first feature sequence into a temporal convolution layer in the health management model to obtain a second feature sequence; inputting the second feature sequence into a long short-term memory (LSTM) layer in the health management model to obtain a final hidden state; and determining the first prediction result according to the final hidden state; wherein the first feature sequence comprises at least a plurality of time steps and a plurality of first feature vectors in one-to-one correspondence with the time steps, the dimension of each first feature vector being a first dimension value; the second feature sequence comprises at least the plurality of time steps and a plurality of second feature vectors in one-to-one correspondence with the time steps, the dimension of each second feature vector being a second dimension value; and the second dimension value is smaller than the first dimension value.
4. The method of claim 3, wherein inputting the first feature sequence into the temporal convolution layer in the health management model to obtain the second feature sequence comprises: inputting the first feature sequence into a first temporal convolution network of the temporal convolution layer to obtain a first intermediate sequence, wherein the size of a first convolution kernel of the first temporal convolution network is a first convolution value, a corresponding first dilation coefficient is a first dilation value, and a corresponding first channel number is a first channel value; inputting the first intermediate sequence into a second temporal convolution network of the temporal convolution layer to obtain a second intermediate sequence, wherein the size of a second convolution kernel of the second temporal convolution network is a second convolution value, a corresponding second dilation coefficient is a second dilation value, a corresponding second channel number is a second channel value, the second convolution value is smaller than the first convolution value, the second dilation value is larger than the first dilation value, and the second channel number is smaller than the first channel number; and inputting the second intermediate sequence into a third temporal convolution network of the temporal convolution layer to obtain the second feature sequence, wherein the size of a third convolution kernel of the third temporal convolution network is a third convolution value, a corresponding third dilation coefficient is a third dilation value, a corresponding third channel number is a third channel value, the third convolution value is smaller than the second convolution value, the third dilation value is larger than the second dilation value, the third channel number is equal to the second dimension value, and the third channel number is smaller than the second channel number.
5. The method of claim 3, wherein inputting the second feature sequence into the long short-term memory layer in the health management model to obtain the final hidden state comprises: transposing the second feature sequence to obtain a third feature sequence, wherein the third feature sequence comprises at least a plurality of time steps and a plurality of third feature vectors in one-to-one correspondence with the time steps; calculating a target hidden state and a target cell state according to the third feature sequence, wherein the target hidden state is the hidden state of the feature vector corresponding to the next-to-last time step, and the target cell state is the cell state of the feature vector corresponding to the next-to-last time step; obtaining a target feature vector in the third feature sequence, wherein the target feature vector is the feature vector corresponding to the last time step; and calculating the final hidden state according to the target feature vector, the target hidden state and the target cell state.
6. The method of claim 3, wherein determining the first prediction result according to the final hidden state comprises: inputting the final hidden state into a fully connected layer in the health management model to obtain a fourth feature vector, wherein the dimension of the final hidden state is a third dimension value, the dimension of the fourth feature vector is a fourth dimension value, the fourth dimension value equals the number of fault types of the target server, and the fourth dimension value is smaller than the third dimension value; converting the fourth feature vector into a probability distribution vector through an activation function, wherein the probability distribution vector comprises at least a plurality of probability values in one-to-one correspondence with the plurality of fault types of the target server; and determining the fault type with the maximum probability value as the first prediction result.
7. The method of claim 3, wherein inputting the first feature sequence into the health management model to obtain the health prediction result further comprises: constructing an association graph based on a graph neural network layer in the health management model; processing the first prediction result with a propagation algorithm over the association graph to obtain the second prediction result, wherein the second prediction result comprises at least a potential fault propagation path, the potential fault propagation path comprises at least a potential fault propagation direction and a potential fault server set, the potential fault server set comprises at least a plurality of potential fault servers, and the plurality of potential fault servers and the target server are located in the same cluster; and obtaining the health prediction result according to the first prediction result and the second prediction result.
8. The method of claim 7, wherein constructing the association graph based on the graph neural network layer in the health management model comprises: determining a plurality of servers in the target cluster as a plurality of graph nodes in the association graph, wherein the target server is located in the target cluster and the plurality of servers are in one-to-one correspondence with the plurality of graph nodes; determining a plurality of graph edges in the association graph according to network connection relations or business dependency relations among the plurality of servers; and constructing the association graph according to the graph nodes and the graph edges.
9. An electronic device, comprising: a memory for storing a computer program; and a processor for implementing, when executing the computer program, the steps of the method for dynamic health management of a server cluster according to any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, wherein the computer program, when executed by a processor, implements the steps of the method for dynamic health management of a server cluster according to any one of claims 1 to 8.
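
The three-stage temporal convolution recited in claims 3 and 4 can be pictured with a minimal pure-Python sketch. The claims fix only the ordering of the hyperparameters (kernel size decreasing, dilation increasing, channel number decreasing, output dimension smaller than input dimension). The concrete values 7/5/3, 1/2/4 and 8/4/2, the ReLU activation, the causal left zero-padding and the constant weights used below are assumptions chosen purely for illustration and do not come from the application.

```python
from typing import List, Tuple

Seq = List[List[float]]  # indexed as [time_step][channel]

# Hypothetical (kernel_size, dilation, out_channels) for the three stages.
# Only the ordering (kernel down, dilation up, channels down) follows the claims.
STAGES: List[Tuple[int, int, int]] = [(7, 1, 8), (5, 2, 4), (3, 4, 2)]


def causal_dilated_conv(x: Seq, kernel: int, dilation: int, out_ch: int) -> Seq:
    """One dilated causal 1-D convolution followed by ReLU.

    Every weight is the same constant (an assumption) so the sketch needs no
    trained parameters; inputs before time step 0 are treated as zeros.
    """
    in_ch = len(x[0])
    weight = 1.0 / (kernel * in_ch)
    out: Seq = []
    for t in range(len(x)):
        acc = 0.0
        for k in range(kernel):
            src = t - k * dilation  # only the current and earlier time steps
            if src >= 0:
                acc += weight * sum(x[src])
        out.append([max(acc, 0.0)] * out_ch)
    return out


def temporal_conv_layer(x: Seq) -> Seq:
    """Apply the three stages in order; the number of time steps is preserved."""
    for kernel, dilation, out_ch in STAGES:
        x = causal_dilated_conv(x, kernel, dilation, out_ch)
    return x


def receptive_field() -> int:
    """Number of input time steps that can influence one output time step."""
    return 1 + sum((kernel - 1) * dilation for kernel, dilation, _ in STAGES)


if __name__ == "__main__":
    first = [[float(t)] * 16 for t in range(30)]  # 30 steps, first dimension 16
    second = temporal_conv_layer(first)
    print(len(second), len(second[0]), receptive_field())
```

With these assumed values, each output vector has 2 components instead of 16, and each output step depends on a window of 1 + 6·1 + 4·2 + 2·4 = 23 input steps, which shows how growing dilation widens the temporal context even as the kernels shrink.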

Description

Dynamic health management method for server cluster, electronic equipment and storage medium

Technical Field

The present application relates to the technical field of server cluster health management, and in particular to a dynamic health management method for a server cluster, an electronic device, and a storage medium.

Background

As data centers continue to grow in scale, fault diagnosis and fault prediction for server clusters make it possible to quickly identify and resolve potential problems in the data center, preventing system crashes and data loss and ensuring service continuity and stability. The related art performs fault diagnosis and fault prediction on a server cluster by collecting real-time data from the cluster and comparing each item of data against a fixed threshold. This approach has the following problems:

1. Data fragmentation and insufficient real-time performance: real-time data acquisition in the related art depends on independent hardware sensors and log analysis, so the acquired data is scattered and lacks cross-layer correlation, which delays fault localization.
2. Weak generalization of static models: fault diagnosis and fault prediction based on fixed thresholds cannot adapt to dynamic scenarios such as hardware aging and environmental change, which leads to a high false alarm rate.
3. Lack of multi-modal data fusion: heterogeneous data such as hardware states, log semantics and network traffic is not fully integrated in the related art, which leads to low prediction accuracy.
Disclosure of Invention

The application provides a dynamic health management method for a server cluster, an electronic device, and a storage medium, which solve at least one of the following problems in the related art: insufficient real-time performance of data, weak generalization of static models, and lack of multi-modal data fusion.

The application provides a dynamic health management method for a server cluster, comprising: collecting multi-source data on a target server; preprocessing the multi-source data to obtain a first feature sequence; inputting the first feature sequence into a health management model to obtain a health prediction result, wherein the health prediction result comprises at least a first prediction result and a second prediction result, the first prediction result represents the fault type of the target server, and the second prediction result represents the potential fault propagation path of the target server in the target cluster; and determining a dynamic adjustment strategy according to the health prediction result, and executing the corresponding dynamic adjustment operation according to the dynamic adjustment strategy, so as to complete dynamic health management of the target cluster.
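
The preprocessing step, whose five sub-steps are listed in claim 2, can be sketched roughly as follows. The application does not name the concrete techniques, so everything specific here is an assumption made for illustration: a trailing moving average for noise reduction, keyword counts for log feature extraction, min-max scaling for normalization, a sine/cosine time-of-day encoding for timestamps, one-hot encoding for hardware types, and the hypothetical field names `temperature`, `net_bytes`, `log`, `timestamp` and `hardware`.

```python
import math
from typing import Dict, List

HARDWARE_TYPES = ["cpu", "disk", "memory"]  # hypothetical categories
LOG_KEYWORDS = ["error", "timeout"]         # hypothetical keywords


def moving_average(values: List[float], window: int = 3) -> List[float]:
    """Noise reduction by a trailing moving average (assumed filter)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def min_max(values: List[float]) -> List[float]:
    """Scale values into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def log_features(line: str) -> List[float]:
    """Count occurrences of each keyword in one log line."""
    text = line.lower()
    return [float(text.count(word)) for word in LOG_KEYWORDS]


def time_features(epoch_seconds: int) -> List[float]:
    """Encode the time of day as a point on the unit circle."""
    angle = 2 * math.pi * (epoch_seconds % 86400) / 86400
    return [math.sin(angle), math.cos(angle)]


def one_hot(hardware: str) -> List[float]:
    return [1.0 if hardware == name else 0.0 for name in HARDWARE_TYPES]


def build_first_feature_sequence(samples: List[Dict]) -> List[List[float]]:
    """Concatenate the five feature subsequences, one vector per time step."""
    temps = moving_average([s["temperature"] for s in samples])
    traffic = min_max([s["net_bytes"] for s in samples])
    sequence = []
    for i, s in enumerate(samples):
        sequence.append(
            [temps[i]] + log_features(s["log"]) + [traffic[i]]
            + time_features(s["timestamp"]) + one_hot(s["hardware"])
        )
    return sequence


if __name__ == "__main__":
    demo = [
        {"temperature": 40.0, "net_bytes": 100.0, "log": "ok",
         "timestamp": 0, "hardware": "cpu"},
        {"temperature": 50.0, "net_bytes": 300.0, "log": "ERROR: timeout",
         "timestamp": 21600, "hardware": "disk"},
    ]
    for vector in build_first_feature_sequence(demo):
        print(vector)
```

Each time step ends up as one fixed-length vector (9 components under these assumptions), so sensor, log, network, time and hardware information travel together into the model rather than being thresholded separately.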
The application further provides a dynamic health management system for a server cluster. The system comprises an edge intelligent agent cluster, a cloud intelligent agent platform, and a visual management platform, and the cloud intelligent agent platform comprises at least a data fusion intelligent agent, a model training intelligent agent, and a decision execution intelligent agent. The edge intelligent agent cluster converts the acquired multi-source data into a first feature sequence and transmits the first feature sequence to the cloud intelligent agent platform. The data fusion intelligent agent aggregates the feature vectors of the edge intelligent agent cluster using a federated learning framework and constructs an association graph through a graph neural network to identify potential fault propagation paths. The model training intelligent agent constructs a prediction model based on a temporal convolution layer and a long short-term memory layer, and dynamically adjusts the model parameters of the prediction model using a back-propagation algorithm. The decision execution intelligent agent pushes the health prediction result to a hardware management module or a cloud management platform, which executes the corresponding dynamic adjustment operation. The visual management platform adjusts the parameters of a policy network according to the dynamic adjustment result, and displays the health state of the target cluster based on digital twin technology; the displayed health state comprises at least the potential fault propagation path, the health prediction result, and the dynamic adjustment strategy.
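
How an association graph yields a potential fault propagation path can be illustrated with a small sketch. Servers are graph nodes and edges come from network connection or business dependency relations, as in claim 8. The application does not specify the propagation algorithm, so a bounded breadth-first traversal stands in for it here; the graph neural network itself is not modelled, and the server names, the edge direction convention (a fault at the source may spread to the destination) and the `max_hops` limit are all hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (source, destination): a fault may spread source -> destination


def build_association_graph(servers: List[str], edges: List[Edge]) -> Dict[str, Set[str]]:
    """One node per server; one directed edge per connection or dependency."""
    graph: Dict[str, Set[str]] = {name: set() for name in servers}
    for src, dst in edges:
        graph[src].add(dst)
    return graph


def potential_propagation(
    graph: Dict[str, Set[str]], target: str, max_hops: int = 2
) -> Tuple[List[Edge], List[str]]:
    """Breadth-first stand-in for the unspecified propagation algorithm.

    Returns the propagation directions (as edges) and the set of potentially
    affected servers, in the order in which they are reached.
    """
    directions: List[Edge] = []
    affected: List[str] = []
    seen = {target}
    queue = deque([(target, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in sorted(graph[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                directions.append((node, neighbour))
                affected.append(neighbour)
                queue.append((neighbour, hops + 1))
    return directions, affected


if __name__ == "__main__":
    servers = ["db-1", "app-1", "app-2", "web-1"]
    edges = [("db-1", "app-1"), ("db-1", "app-2"), ("app-1", "web-1")]
    graph = build_association_graph(servers, edges)
    print(potential_propagation(graph, "db-1"))
```

In this toy cluster a predicted fault on `db-1` reaches `app-1` and `app-2` in one hop and `web-1` in two, which corresponds to the propagation direction and the potential fault server set that make up the second prediction result.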
The application further provides electronic equipment, which comprises a memory, a processor and a dynamic health management method, wherein the memory is used for storing a computer program, the processor is used for at least realizin