CN-121560628-B - DPU fault prediction method and system
Abstract
The application discloses a DPU fault prediction method and system in the technical field of computer fault prediction. In-band and out-of-band data are collected comprehensively, then intelligently fused and cross-verified, which improves fault prediction accuracy, reduces the false alarm rate, and enables early fault identification and proactive intervention.
Inventors
- ZHANG JING
Assignees
- 四川华鲲振宇智能科技有限责任公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-22
Claims (7)
- 1. A DPU fault prediction method, characterized by comprising the following steps:
  acquiring an in-band service data set through an in-band data acquisition module, and acquiring an out-of-band basic data set through an out-of-band data acquisition module;
  inputting the in-band service data set and the out-of-band basic data set into a collaborative data preprocessing module for time alignment and intelligent fusion processing to obtain a collaborative feature data set;
  inputting the collaborative feature data set into an out-of-band model training module for model construction and training to obtain a fault prediction model;
  inputting the collaborative feature data set acquired in real time into the fault prediction model for fault prediction to obtain a fault pre-judging result;
  inputting the fault pre-judging result, the in-band service data set acquired in real time and the out-of-band basic data set acquired in real time into a collaborative fault detection module for bidirectional cross-validation to obtain a fault confirmation result; and
  executing a corresponding fault processing operation according to the fault confirmation result;
  wherein the step of inputting the in-band service data set and the out-of-band basic data set into the collaborative data preprocessing module for time alignment and intelligent fusion processing to obtain the collaborative feature data set comprises:
  performing timestamp alignment on the in-band service data set and the out-of-band basic data set through a time alignment unit in the collaborative data preprocessing module to obtain a time-aligned in-band service data set and a time-aligned out-of-band basic data set; and
  performing attention-mechanism weighted fusion on the time-aligned in-band service data set and the time-aligned out-of-band basic data set through an intelligent fusion unit in the collaborative data preprocessing module to obtain the collaborative feature data set;
  wherein the step of performing attention-mechanism weighted fusion through the intelligent fusion unit to obtain the collaborative feature data set comprises:
  calculating, through an importance analysis subunit in the intelligent fusion unit, a first importance score for each item of data in the time-aligned in-band service data set and a second importance score for each item of data in the time-aligned out-of-band basic data set, to obtain a first importance score set and a second importance score set;
  performing, through an attention weighting subunit in the intelligent fusion unit, differentiated weighting of the time-aligned in-band service data set and the time-aligned out-of-band basic data set based on the first importance score set and the second importance score set, to obtain a weighted fusion data set; and
  extracting multidimensional time-series features from the weighted fusion data set through a feature extraction subunit in the intelligent fusion unit to obtain the collaborative feature data set;
  wherein the step of inputting the fault pre-judging result, the in-band service data set acquired in real time and the out-of-band basic data set acquired in real time into the collaborative fault detection module for bidirectional cross-validation to obtain the fault confirmation result comprises:
  comparing, through an in-band verification unit in the collaborative fault detection module, the fault pre-judging result with abnormal indicators in the in-band service data set acquired in real time to obtain a first time consistency score;
  comparing, through an out-of-band verification unit in the collaborative fault detection module, the fault pre-judging result with abnormal indicators in the out-of-band basic data set acquired in real time to obtain a second time consistency score; and
  judging, through a comprehensive judging unit in the collaborative fault detection module, the reliability of the fault pre-judging result based on the first time consistency score and the second time consistency score to obtain the fault confirmation result.
- 2. The DPU fault prediction method as claimed in claim 1, wherein the step of acquiring the in-band service data set through the in-band data acquisition module and acquiring the out-of-band basic data set through the out-of-band data acquisition module comprises: acquiring network traffic data, storage access data and encryption processing data from a service processing unit of the DPU in real time through the in-band data acquisition module to obtain the in-band service data set; and acquiring temperature data, voltage data, current data and frequency data from a hardware monitoring unit and an environment sensor of the DPU in real time through the out-of-band data acquisition module to obtain the out-of-band basic data set.
- 3. The DPU fault prediction method as claimed in claim 1, wherein the step of inputting the collaborative feature data set into the out-of-band model training module for model construction and training to obtain the fault prediction model comprises: analyzing fault patterns in the collaborative feature data set through a historical data analysis unit in the out-of-band model training module to determine a fault type classification and a fault level classification; constructing a time-series neural network model structure for fault prediction through a model construction unit in the out-of-band model training module; and training the time-series neural network model structure with the collaborative feature data set labeled with fault information through a model training unit in the out-of-band model training module to obtain the fault prediction model.
- 4. The DPU fault prediction method as claimed in claim 3, wherein the method further comprises: performing, through an incremental learning unit in the out-of-band model training module, incremental training and parameter optimization on the fault prediction model based on subsequently acquired new collaborative feature data sets, to obtain an optimized fault prediction model.
- 5. The DPU fault prediction method as claimed in claim 1, wherein the step of executing the corresponding fault processing operation according to the fault confirmation result comprises: determining a fault level according to the fault severity in the fault confirmation result through a fault level judging unit in a fault response module; matching a corresponding automated processing strategy according to the fault level through a response strategy matching unit in the fault response module; and executing the corresponding fault processing operation according to the automated processing strategy through an execution unit in the fault response module.
- 6. The DPU fault prediction method as claimed in claim 5, wherein the step of executing, through the execution unit in the fault response module, the corresponding fault processing operation according to the automated processing strategy comprises: when the fault level is a first level, recording a fault log and sending an early warning notification through the execution unit; when the fault level is a second level, recording a fault log, sending an early warning notification and automatically adjusting system parameters through the execution unit; and when the fault level is a third level, recording a fault log, sending an emergency notification, automatically adjusting system parameters and starting a standby DPU switchover process through the execution unit.
- 7. A DPU fault prediction system comprising a memory, a processor, and a DPU fault prediction program stored on the memory and executable on the processor, the DPU fault prediction program configured to implement the DPU fault prediction method of any one of claims 1 to 6.
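As a reading aid for the bidirectional cross-validation of claim 1 and the graded response of claims 5 and 6, the following is a minimal Python sketch, not the patented implementation. The claims do not define how a time consistency score is computed, how the two scores are combined, or where the fault level boundaries lie. The sketch assumes a score equal to the fraction of observed anomalies that fall inside a predicted fault window, a rule that both scores must reach a threshold of 0.5, and severity cut-offs of 0.4 and 0.7. All names (`Prediction`, `consistency_score`, `confirm`, `fault_level`, `ACTIONS`, `respond`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical fault pre-judging result."""
    fault_type: str
    severity: float   # assumed scale: 0.0 (benign) to 1.0 (critical)
    window: tuple     # assumed (start, end) of the predicted fault window, in seconds


def consistency_score(prediction, anomaly_times):
    """Assumed score: fraction of observed anomalies inside the predicted window."""
    if not anomaly_times:
        return 0.0
    start, end = prediction.window
    hits = sum(1 for t in anomaly_times if start <= t <= end)
    return hits / len(anomaly_times)


def confirm(prediction, in_band_anomalies, out_of_band_anomalies, threshold=0.5):
    """Cross-check the prediction against both channels.

    Assumed decision rule: the prediction is confirmed only when both the
    in-band and the out-of-band score reach `threshold`.
    """
    s1 = consistency_score(prediction, in_band_anomalies)
    s2 = consistency_score(prediction, out_of_band_anomalies)
    return min(s1, s2) >= threshold, s1, s2


def fault_level(severity):
    """Map severity to one of three levels (cut-offs are assumptions)."""
    if severity < 0.4:
        return 1
    if severity < 0.7:
        return 2
    return 3


# Operations per level, following the wording of claim 6.
ACTIONS = {
    1: ["record_fault_log", "send_early_warning"],
    2: ["record_fault_log", "send_early_warning", "adjust_system_parameters"],
    3: ["record_fault_log", "send_emergency_notification",
        "adjust_system_parameters", "start_standby_dpu_switchover"],
}


def respond(prediction, in_band_anomalies, out_of_band_anomalies):
    """Return the operations to execute, or an empty list if unconfirmed."""
    confirmed, _, _ = confirm(prediction, in_band_anomalies, out_of_band_anomalies)
    if not confirmed:
        return []
    return list(ACTIONS[fault_level(prediction.severity)])
```

For example, a prediction with severity 0.8 and window (10.0, 20.0) that is supported by anomalies in both channels maps to the third level, while the same prediction with no out-of-band anomaly inside the window is rejected and triggers no operation. Requiring agreement from both channels is one way to lower false alarms; a weighted combination of the two scores would be an equally plausible reading of the claim.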
Description
DPU fault prediction method and system
Technical Field
The application relates to the technical field of computer fault prediction, and in particular to a DPU fault prediction method and a DPU fault prediction system.
Background
In data centers and cloud computing infrastructure, the data processing unit (DPU) is a core computing component, and its operational stability directly affects overall service continuity. As DPU integration and functional complexity increase, the risk of hardware faults and service anomalies rises significantly. The traditional operation and maintenance mode relies on periodic manual inspection and passive fault response. It suffers from large monitoring blind spots and slow response, which makes early fault identification and proactive intervention difficult.
The prior art attempts fault prediction through data acquisition, but key limitations remain. Some schemes collect only network traffic or storage access data in the service channel and ignore physical-layer information from the hardware monitoring unit and environment sensors, so fault cause analysis covers only a single dimension. Other schemes collect multi-source data synchronously but lack an effective spatio-temporal alignment mechanism and feature fusion strategy, so in-band service data and out-of-band basic data remain disconnected and cross-dimensional correlations cannot be mined. In the data fusion stage in particular, existing methods use fixed-weight superposition or simple concatenation and do not weight data adaptively according to its dynamic importance. The fused features therefore cannot accurately represent the actual running state of the DPU, which ultimately leads to low fault prediction accuracy and a high false alarm rate.
The foregoing is provided merely to facilitate understanding of the technical solutions of the present application and is not an admission that it constitutes prior art.
Disclosure of Invention
The main purpose of the application is to provide a DPU fault prediction method and a DPU fault prediction system that improve fault prediction accuracy.
To achieve the above purpose, the application provides a DPU fault prediction method, which includes: acquiring an in-band service data set through an in-band data acquisition module, and acquiring an out-of-band basic data set through an out-of-band data acquisition module; inputting the in-band service data set and the out-of-band basic data set into a collaborative data preprocessing module for time alignment and intelligent fusion processing to obtain a collaborative feature data set; inputting the collaborative feature data set into an out-of-band model training module for model construction and training to obtain a fault prediction model; inputting the collaborative feature data set acquired in real time into the fault prediction model for fault prediction to obtain a fault pre-judging result; inputting the fault pre-judging result, the in-band service data set acquired in real time and the out-of-band basic data set acquired in real time into a collaborative fault detection module for bidirectional cross-validation to obtain a fault confirmation result; and executing a corresponding fault processing operation according to the fault confirmation result.
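The time alignment and fusion step of the method above can be illustrated with a minimal Python sketch, offered under stated assumptions rather than as the applicant's implementation. The application does not specify the alignment rule or the importance score. The sketch assumes nearest-timestamp pairing within a tolerance, an importance score equal to the absolute z-score of the latest value within the aligned window, and a softmax over those scores as the attention weights. The function names (`align_nearest`, `softmax`, `fuse_latest`) and the sample values are hypothetical.

```python
import math
from bisect import bisect_left


def align_nearest(in_band, out_of_band, tolerance):
    """Pair each in-band sample with the out-of-band sample nearest in time.

    Inputs are lists of (timestamp, feature_list) sorted by timestamp.
    Pairs whose timestamps differ by more than `tolerance` are dropped.
    """
    oob_times = [t for t, _ in out_of_band]
    pairs = []
    for t, x in in_band:
        i = bisect_left(oob_times, t)
        # Only the neighbours around the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(oob_times)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(oob_times[k] - t))
        if abs(oob_times[j] - t) <= tolerance:
            pairs.append((t, x, out_of_band[j][1]))
    return pairs


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_latest(pairs):
    """Weight each feature of the latest aligned sample by its importance.

    Assumed importance score: absolute z-score of the latest value against
    the aligned window. Returns (weighted_features, attention_weights).
    """
    if not pairs:
        return [], []
    rows = [list(x) + list(y) for _, x, y in pairs]
    n = len(rows)
    z_scores = []
    for column in zip(*rows):
        mean = sum(column) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
        z_scores.append((column[-1] - mean) / (std + 1e-9))
    weights = softmax([abs(z) for z in z_scores])
    fused = [w * z for w, z in zip(weights, z_scores)]
    return fused, weights


# Hypothetical samples: one in-band feature (e.g. a traffic rate) and one
# out-of-band feature (e.g. a temperature reading).
in_band = [(0.0, [100.0]), (1.0, [102.0]), (2.0, [101.0]), (3.0, [180.0])]
out_of_band = [(0.1, [45.0]), (1.05, [45.5]), (2.1, [45.2]),
               (2.95, [46.0]), (10.0, [90.0])]
pairs = align_nearest(in_band, out_of_band, tolerance=0.2)
fused, weights = fuse_latest(pairs)
```

Because the weights are recomputed from the current window, a feature that deviates sharply from its recent behaviour receives more weight than a stable one, which is the contrast with fixed-weight superposition drawn in the Background. Extracting multidimensional time-series features from the weighted data is left out of the sketch.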
In one embodiment, the step of acquiring the in-band service data set through the in-band data acquisition module and acquiring the out-of-band basic data set through the out-of-band data acquisition module includes: acquiring network traffic data, storage access data and encryption processing data from a service processing unit of the DPU in real time through the in-band data acquisition module to obtain the in-band service data set; and acquiring temperature data, voltage data, current data and frequency data from a hardware monitoring unit and an environment sensor of the DPU in real time through the out-of-band data acquisition module to obtain the out-of-band basic data set.
In one embodiment, the step of inputting the in-band service data set and the out-of-band basic data set into the collaborative data preprocessing module for time alignment and intelligent fusion processing to obtain the collaborative feature data set includes: performing timestamp alignment on the in-band service data set and the out-of-band basic data set through a time alignment unit in the collaborative data preprocessing module to obtain a time-aligned in-band service data set and a time-aligned out-of-band basic data set; and performing attention-mechanism weighted fusion on the time-aligned in-band service data set and the time-aligned out-of-band basic data set through an intelligent fusion unit