CN-122020454-A - Method, device, equipment and medium for detecting abnormal job operation

Abstract

The application discloses a method, a device, equipment and a medium for detecting abnormal operation of a job, which are used to judge quickly and accurately whether a job is running abnormally. The method comprises: determining, based on the target job type of a currently running job, a target model and target input features applicable to the job; for any acquisition period, obtaining a time-series feature vector of each target input feature of the job in that period; inputting the time-series feature vectors into the target model to obtain normal time-series feature vectors of the target input features for that period; determining an anomaly score of the job in the period according to the degree of deviation between the time-series feature vectors and the normal time-series feature vectors; and judging whether the job is abnormal in the period according to the magnitude relation between the anomaly score and a preset anomaly threshold.

Inventors

  • QIN YUNHUI
  • ZHANG TAO
  • XU SHIXIN
  • HAO WENJING

Assignees

  • 曙光信息产业(北京)有限公司
  • 曙光信息产业股份有限公司

Dates

Publication Date
20260512
Application Date
20260104

Claims (14)

  1. A job operation abnormality detection method, characterized by comprising: identifying a target job type of a currently running job; determining a target model and target input features corresponding to the target job type according to a pre-configured mapping relation among job types, models and input features, wherein the target input features comprise at least one performance feature of the equipment running the job; for any acquisition period, collecting a feature value of each target input feature of the job at each of a plurality of preset time points in the acquisition period, and splicing the feature values of each target input feature in order of collection time to obtain a time-series feature vector of each target input feature; inputting the time-series feature vector of each target input feature into the target model, and determining, according to an output result of the target model, a normal time-series feature vector of each target input feature of the job in the acquisition period; determining a degree of deviation between the time-series feature vector of each target input feature and the corresponding normal time-series feature vector, and determining an anomaly score of the job in the acquisition period according to the degrees of deviation; and judging whether the job is abnormal in the acquisition period according to the anomaly score and a preset anomaly threshold.
  2. The method of claim 1, wherein the training process of the target model comprises: acquiring, from a sample set, a sample time-series feature vector of each target input feature of a job of the target job type in any acquisition period, wherein the sample time-series feature vector corresponds to a first time-series feature vector label; inputting the sample time-series feature vector into a target model to be trained, and determining a first identified time-series feature vector of the sample time-series feature vector through the target model to be trained; and training the target model to be trained according to the first time-series feature vector label and the first identified time-series feature vector to obtain a trained target model.
  3. The method of claim 1, wherein inputting the time-series feature vector of each target input feature into the target model comprises: inputting the time-series feature vector of each target input feature and context information of the job into the target model, wherein the context information comprises at least one of scale information and operating environment information of the equipment running the job.
  4. The method of claim 3, wherein the training process of the target model comprises: acquiring, from a sample set, a sample time-series feature vector of each target input feature and sample context information of a job of the target job type in any acquisition period, wherein the sample time-series feature vector and the sample context information correspond to a second time-series feature vector label; inputting the sample time-series feature vector and the sample context information into a target model to be trained, and determining a second identified time-series feature vector through the target model to be trained; and training the target model to be trained according to the second time-series feature vector label and the second identified time-series feature vector to obtain a trained target model.
  5. The method according to any one of claims 1-4, wherein inputting the time-series feature vector of each target input feature into the target model, and determining the normal time-series feature vector of each target input feature of the job in the acquisition period according to the output result of the target model, comprises: inputting the time-series feature vector of each target input feature, the context information of the job and log information of the job in the acquisition period into the target model, and determining, according to the output result of the target model, the normal time-series feature vector of each target input feature of the job in the acquisition period and normal progress information of the job in the acquisition period; and wherein determining the degree of deviation between the time-series feature vector of each target input feature and the corresponding normal time-series feature vector comprises: determining a first degree of deviation between the time-series feature vector of each target input feature and the corresponding normal time-series feature vector, and determining a second degree of deviation between the normal progress information and the actual progress information of the job in the acquisition period carried in the log information.
  6. The method of claim 5, wherein the training process of the target model comprises: acquiring, from a sample set, a sample time-series feature vector of each target input feature, sample context information and sample log information of a job of the target job type in any acquisition period, wherein the sample time-series feature vector, the sample context information and the sample log information correspond to a third time-series feature vector label and a normal progress information label of the job of the target job type in the acquisition period; inputting the sample time-series feature vector, the sample context information and the sample log information into a target model to be trained, and determining, through the target model to be trained, a third identified time-series feature vector and identified progress information of the job of the target job type in the acquisition period; and training the target model to be trained according to the third time-series feature vector label, the third identified time-series feature vector, the identified progress information and the normal progress information label to obtain a trained target model.
  7. The method of any one of claims 1-4, wherein the target model comprises a time-series neural network.
  8. The method according to any one of claims 1-4, further comprising: optimizing the target model based on whether the job is abnormal in the acquisition period.
  9. The method of any one of claims 1-4, wherein identifying the target job type of the currently running job comprises: identifying the program name of the program called by the job, and determining the target job type of the job according to a pre-stored correspondence between program names and job types; or identifying file information under the path of the job, and determining the target job type of the job according to a pre-stored correspondence between file information and job types; or identifying the loading and deployment mode of the application in the job, and determining the target job type of the job according to a pre-stored correspondence between loading and deployment modes and job types.
  10. The method according to any one of claims 1-4, further comprising: if the job is abnormal in the acquisition period and the target job type is a job type to which a preset state saving mechanism applies, saving, based on a Burst Buffer, the running state the job had before the acquisition period, when no abnormality was present.
  11. A job operation abnormality detection device, characterized by comprising: an identification module, configured to identify a target job type of a currently running job and to determine the target model and target input features corresponding to the target job type; and a detection module, configured to: for any acquisition period, collect a feature value of each target input feature of the job at each of a plurality of preset time points in the acquisition period, and splice the feature values of each target input feature in order of collection time to obtain a time-series feature vector of each target input feature; input the time-series feature vector of each target input feature into the target model, and determine, according to an output result of the target model, a normal time-series feature vector of each target input feature of the job in the acquisition period; determine a degree of deviation between the time-series feature vector of each target input feature and the corresponding normal time-series feature vector, and determine an anomaly score of the job in the acquisition period based on the degrees of deviation; and judge whether the job is abnormal in the acquisition period according to the anomaly score and a preset anomaly threshold.
  12. An electronic device comprising at least a processor and a memory, the processor being configured to implement the steps of the job operation abnormality detection method according to any one of claims 1 to 10 when executing a computer program stored in the memory.
  13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a computer, cause the computer to perform the job operation abnormality detection method according to any one of claims 1 to 10.
  14. A computer program product, characterized in that the computer program product comprises computer program code which, when run on a computer, causes the computer to perform the job operation abnormality detection method according to any one of claims 1 to 10.
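
The job type identification of claim 9 and the mapping lookup of claim 1 can be illustrated with a minimal Python sketch. Every table entry, job type name, model name and feature name below is hypothetical and chosen only for illustration. The claims list the three identification routes as alternatives; trying them in a fixed order, as done here, is an assumption of this sketch rather than something the claims require.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class DetectorConfig:
    """Target model and target input features configured for one job type."""
    model_name: str
    input_features: Tuple[str, ...]


# Hypothetical pre-stored correspondences (claim 9). All entries are invented.
PROGRAM_TO_JOB_TYPE = {"solver_a": "cpu_intensive", "io_copy": "io_intensive"}
FILE_TO_JOB_TYPE = {"mesh.cfg": "cpu_intensive"}
DEPLOY_MODE_TO_JOB_TYPE = {"container": "ai_training"}

# Hypothetical pre-configured mapping: job type -> (model, input features).
JOB_TYPE_TO_CONFIG = {
    "cpu_intensive": DetectorConfig("model_cpu", ("cpu_utilization", "memory_usage")),
    "io_intensive": DetectorConfig("model_io", ("io_throughput", "cpu_utilization")),
    "ai_training": DetectorConfig("model_ai", ("gpu_utilization", "memory_usage")),
}


def identify_job_type(
    program_name: Optional[str] = None,
    path_files: Iterable[str] = (),
    deploy_mode: Optional[str] = None,
) -> Optional[str]:
    """Return the job type from the first route that matches, else None."""
    # Route 1: program name called by the job.
    if program_name in PROGRAM_TO_JOB_TYPE:
        return PROGRAM_TO_JOB_TYPE[program_name]
    # Route 2: file information found under the job's path.
    for file_name in path_files:
        if file_name in FILE_TO_JOB_TYPE:
            return FILE_TO_JOB_TYPE[file_name]
    # Route 3: loading and deployment mode of the application.
    if deploy_mode in DEPLOY_MODE_TO_JOB_TYPE:
        return DEPLOY_MODE_TO_JOB_TYPE[deploy_mode]
    return None


def select_detector(job_type: str) -> DetectorConfig:
    """Look up the target model and target input features for a job type."""
    return JOB_TYPE_TO_CONFIG[job_type]
```

For example, under these invented tables, `select_detector(identify_job_type(program_name="io_copy"))` selects the I/O-oriented model and its feature list, so a job with low CPU utilization is evaluated against features that suit its type.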

Description

Method, device, equipment and medium for detecting abnormal job operation

Technical Field

The present application relates to the field of data processing technologies, and in particular to a method, an apparatus, a device and a medium for detecting abnormal operation of a job.

Background

On supercomputing internet platforms for high-performance computing (HPC), the computing scale and the number of concurrent jobs keep growing, so judging efficiently and accurately whether a job is running abnormally is an important technical means of ensuring stable system operation and efficient resource utilization.

In the related art, whether a job is running abnormally is typically judged from the magnitude relation between the utilization of the central processing unit (CPU) and a preset CPU utilization threshold. For example, if the CPU utilization is lower than 10%, the job is considered abnormal; otherwise, it is considered normal.

However, different types of jobs have different resource usage characteristics. For example, input/output (I/O) intensive tasks may show low CPU utilization for long periods. Judging every job by such a one-size-fits-all standard therefore easily causes frequent misjudgment and seriously affects the accuracy and reliability of the judgment.

A technical solution that can judge quickly and accurately whether a job is running abnormally is therefore needed.

Disclosure of Invention

The application provides a method, a device, equipment and a medium for detecting abnormal operation of a job, which are used to judge quickly and accurately whether a job is running abnormally.
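
The misjudgment described in the Background can be made concrete with a small Python sketch. The 10% threshold comes from the example in the related art; the CPU utilization samples are invented values for a hypothetical healthy I/O intensive job and are not measurements.

```python
# Fixed threshold from the related-art example: below 10% means "abnormal".
CPU_UTILIZATION_THRESHOLD = 0.10


def fixed_threshold_is_abnormal(cpu_utilization: float) -> bool:
    """Related-art rule: judge a job only by its CPU utilization."""
    return cpu_utilization < CPU_UTILIZATION_THRESHOLD


# Hypothetical samples from a healthy I/O intensive job, which spends most of
# its time waiting on storage and therefore keeps the CPU nearly idle.
healthy_io_job_cpu = [0.04, 0.06, 0.05, 0.03]

# Every sample is flagged, although the job is behaving normally for its type.
false_alarms = [fixed_threshold_is_abnormal(u) for u in healthy_io_job_cpu]
```

Under these assumed values every sample is flagged as abnormal, which is the kind of frequent misjudgment that motivates selecting the model and input features per job type.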
In a first aspect, the present application provides a job operation abnormality detection method, the method including: identifying a target job type of a currently running job; determining a target model and target input features corresponding to the target job type according to a pre-configured mapping relation among job types, models and input features, wherein the target input features comprise at least one performance feature of the equipment running the job; for any acquisition period, collecting a feature value of each target input feature of the job at each of a plurality of preset time points in the acquisition period, and splicing the feature values of each target input feature in order of collection time to obtain a time-series feature vector of each target input feature; inputting the time-series feature vector of each target input feature into the target model, and determining, according to an output result of the target model, a normal time-series feature vector of each target input feature of the job in the acquisition period; determining a degree of deviation between the time-series feature vector of each target input feature and the corresponding normal time-series feature vector, and determining an anomaly score of the job in the acquisition period according to the degrees of deviation; and judging whether the job is abnormal in the acquisition period according to the anomaly score and a preset anomaly threshold.
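
The detection steps of the first aspect can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of the application: the text shown here does not fix a deviation metric or a scoring rule, so the sketch assumes mean absolute error per feature, an anomaly score equal to the mean of the per-feature deviations, and "score greater than threshold" as the abnormal condition. The target model is represented by a hypothetical callable that returns the normal time-series feature vectors.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Model = Callable[[Dict[str, Vector]], Dict[str, Vector]]


def build_time_series_vectors(
    samples: Sequence[Dict[str, float]], features: Sequence[str]
) -> Dict[str, Vector]:
    """Splice per-time-point feature values in order of collection time.

    `samples` holds one dict per preset time point, already ordered by time.
    """
    return {name: [sample[name] for sample in samples] for name in features}


def deviation(actual: Vector, normal: Vector) -> float:
    """Assumed metric: mean absolute error between actual and normal vectors."""
    if len(actual) != len(normal):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - n) for a, n in zip(actual, normal)) / len(actual)


def anomaly_score(
    actual: Dict[str, Vector], normal: Dict[str, Vector]
) -> float:
    """Assumed scoring rule: mean of the per-feature deviations."""
    deviations = [deviation(actual[name], normal[name]) for name in actual]
    return sum(deviations) / len(deviations)


def is_abnormal(
    samples: Sequence[Dict[str, float]],
    features: Sequence[str],
    model: Model,
    threshold: float,
) -> bool:
    """Judge one acquisition period; abnormal if the score exceeds the threshold."""
    actual = build_time_series_vectors(samples, features)
    normal = model(actual)  # the target model outputs the normal vectors
    return anomaly_score(actual, normal) > threshold


def stand_in_model(actual: Dict[str, Vector]) -> Dict[str, Vector]:
    """Hypothetical stand-in for a trained model: constant normal behaviour."""
    expected = {"cpu_utilization": 0.05, "io_throughput": 0.8}
    return {name: [expected[name]] * len(vec) for name, vec in actual.items()}
```

With the stand-in model, a period whose samples match the expected values yields a score of zero, while a period in which `io_throughput` collapses yields a score well above a threshold such as 0.2. In practice the model, the metric and the threshold would all be chosen per job type.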
With this method, the target job type of the currently running job can be identified, and the target model and target input features applicable to the job can be determined according to the pre-configured mapping relation among job types, models and input features. Then, for any acquisition period, the time-series feature vector of each target input feature of the job in the acquisition period can be obtained and input into the target model, and the normal time-series feature vector of each target input feature of the job in the acquisition period is determined according to the output result of the target model. The anomaly score of the job in the acquisition period can be determined according to the degree of deviation between the time-series feature vector of each target input feature and the normal time-series feature vector. Finally, whether the job is abnormal can be judged according to the magnitude relation between the anomaly score and the preset anomaly threshold. On this basis, the method avoids judging whether the job is abnormal by the one-size-fits-all standard of the related art, the judgment of w