US-12625971-B2 - Method and apparatus of anomaly detection of system logs based on self-supervised learning

Abstract

Disclosed is a method and apparatus for detecting anomalies in a system log on the basis of self-supervised learning, using a language model. The method comprises performing preprocessing on the system log, generating a normal token sequence having a preset length by concatenating tokenized log lines of the system log, generating an abnormal token sequence using the normal token sequence, calculating an anomaly score for a determination target token sequence using a sentence classification model, and determining the token sequence as an abnormal system log when the calculated anomaly score is greater than a threshold value.
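
As a rough illustration of the preprocessing and concatenation steps summarized above, the following Python sketch cleans raw log lines, tokenizes them, and packs them into fixed-budget token sequences. It is only a sketch under stated assumptions: the regular expression, the whitespace split standing in for a real tokenizer, the `[SEP]` separator string, and the default values of `first_length` and `second_length` are all hypothetical choices made for illustration, not details taken from the disclosure.

```python
import re

SEP = "[SEP]"  # hypothetical separator token placed between log lines


def preprocess_line(line, min_word_len=2):
    """Keep letters only, then drop words shorter than min_word_len."""
    letters_only = re.sub(r"[^A-Za-z]+", " ", line)
    words = [w for w in letters_only.split() if len(w) >= min_word_len]
    return " ".join(words)


def tokenize(line, first_length=8):
    """Whitespace split stands in for a real tokenizer (assumption).

    The token subsequence of one log line is cut to first_length tokens.
    """
    return line.lower().split()[:first_length]


def build_normal_sequences(log_lines, first_length=8, second_length=16):
    """Concatenate per-line token subsequences into sequences whose
    length never exceeds second_length (assumes first_length <= second_length)."""
    sequences, current = [], []
    for raw in log_lines:
        sub = tokenize(preprocess_line(raw), first_length)
        if not sub:
            continue  # nothing left after cleaning
        needed = len(sub) + (1 if current else 0)  # +1 for the separator
        if current and len(current) + needed > second_length:
            sequences.append(current)
            current = []
        if current:
            current.append(SEP)
        current.extend(sub)
    if current:
        sequences.append(current)
    return sequences
```

For example, calling `build_normal_sequences` on two raw lines returns a single token sequence in which the cleaned lines are joined by the separator; a real system would substitute its own tokenizer and length limits.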

Inventors

  • Duk Soo Kim
  • Eui Seok Kim
  • Sang Gyoo SIM
  • Ki Ho JOO
  • Jung Won Lee
  • Jong Guk LEE
  • Jung Wook Kim
  • Sang Seok Lee
  • Seung Young Park

Assignees

  • AUTOCRYPT CO., LTD.

Dates

Publication Date: 2026-05-12
Application Date: 2023-08-02
Priority Date: 2022-09-07

Claims (16)

  1. A method of detecting anomalies in a system log on the basis of self-supervised learning, the method comprising the steps of: performing preprocessing on the system log; generating a normal token sequence having a preset length by concatenating tokenized log lines of the system log; generating an abnormal token sequence using the normal token sequence; calculating an anomaly score for a determination target token sequence using a sentence classification model; determining, when the calculated anomaly score is greater than a threshold value, the token sequence as an abnormal system log, wherein the determining includes sequentially selecting and masking token subsequences in units of log lines from the determination target token sequence; and inputting the normal token sequence and the abnormal token sequence into the sentence classification model, and training the sentence classification model to minimize a preset loss.
  2. The system log anomaly detection method according to claim 1, wherein the determining step further includes the step of calculating a loss for each sequentially selected and masked token subsequence.
  3. The system log anomaly detection method according to claim 2, wherein the determining step further includes the step of determining an abnormal token sequence on the basis of an anomaly score obtained by summing the calculated loss of each token subsequence.
  4. The system log anomaly detection method according to claim 1, wherein the step of generating an abnormal token sequence uses a method of deleting some log lines among the log lines of the normal token sequence.
  5. The system log anomaly detection method according to claim 1, wherein the step of generating an abnormal token sequence uses a method of adding log lines generated at other arbitrary time points to the normal token sequence.
  6. The system log anomaly detection method according to claim 1, wherein the step of generating an abnormal token sequence uses a method of swapping some log lines of the normal token sequence with log lines generated at different time points.
  7. The system log anomaly detection method according to claim 1, wherein the step of calculating an anomaly score includes the step of calculating the anomaly score using a negative log likelihood (NLL) of a Gaussian density estimation (GDE) or a kernel density estimation (KDE) for a feature vector obtained from the sentence classification model.
  8. The system log anomaly detection method according to claim 1, wherein the step of performing preprocessing includes the steps of: removing numbers and special characters from the system log; and removing single alphabets with leading and trailing spaces or alphabets of a specific length or shorter.
  9. The system log anomaly detection method according to claim 1, wherein the step of performing preprocessing includes the step of converting each log line, from which the alphabets are removed, into a token subsequence using a tokenizer.
  10. The system log anomaly detection method according to claim 9, wherein the step of performing preprocessing further includes the step of, when a length of the token subsequence generated through the converting step is greater than an arbitrary first length, allowing the length of the generated token subsequence to be only as long as the first length.
  11. The system log anomaly detection method according to claim 1, wherein the step of generating a normal token sequence includes the step of generating a normal token sequence, of which a maximum length is a second length, by concatenating token subsequences.
  12. The system log anomaly detection method according to claim 11, wherein the step of generating a normal token sequence further includes the step of inserting a special token between two adjacent token subsequences to distinguish the two token subsequences.
  13. An apparatus for detecting anomalies in a system log on the basis of self-supervised learning, the apparatus comprising: a processor; and a memory for storing at least one command executed by the processor, wherein the processor is configured to perform, by the at least one command, the steps of: performing preprocessing on the system log; generating a normal token sequence having a preset length by concatenating tokenized log lines of the system log; generating an abnormal token sequence using the normal token sequence; calculating an anomaly score for a determination target token sequence using a sentence classification model; determining, when the calculated anomaly score is greater than a threshold value, the token sequence as an abnormal system log, wherein the determining includes sequentially selecting and masking token subsequences in units of log lines from the determination target token sequence, calculating a loss for each sequentially selected and masked token subsequence, and determining an abnormal token sequence on the basis of an anomaly score obtained by summing the calculated loss of each token subsequence; and inputting the normal token sequence and the abnormal token sequence into the sentence classification model, and training the sentence classification model to minimize a preset loss.
  14. The system log anomaly detection apparatus according to claim 13, wherein at the step of generating an abnormal token sequence, the processor deletes some log lines among the log lines of the normal token sequence.
  15. The system log anomaly detection apparatus according to claim 13, wherein at the step of generating an abnormal token sequence, the processor adds log lines generated at other arbitrary time points to the normal token sequence.
  16. The system log anomaly detection apparatus according to claim 13, wherein at the step of generating an abnormal token sequence, the processor swaps some log lines of the normal token sequence with log lines generated at different time points.
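
The masking-based determination recited in the claims (select one log line at a time, mask it, compute a loss, sum the losses, compare with a threshold) can be sketched as follows. This is a minimal sketch under stated assumptions: the `[SEP]` and `[MASK]` token strings, the `loss_fn` callback signature, and the toy loss used in the usage example are hypothetical stand-ins. In a real system, `loss_fn` would query the trained model, and how that loss is computed is not specified by this sketch.

```python
SEP = "[SEP]"    # hypothetical separator between log lines
MASK = "[MASK]"  # hypothetical mask token


def split_log_lines(tokens):
    """Return (start, end) index pairs, one per log line, split on SEP."""
    spans, start = [], 0
    for i, tok in enumerate(tokens):
        if tok == SEP:
            if i > start:
                spans.append((start, i))
            start = i + 1
    if start < len(tokens):
        spans.append((start, len(tokens)))
    return spans


def anomaly_score(tokens, loss_fn):
    """Mask each log line in turn and sum the per-line losses.

    loss_fn(masked, original, start, end) is a caller-supplied function
    (assumption) that returns the loss for the masked span.
    """
    total = 0.0
    for start, end in split_log_lines(tokens):
        masked = tokens[:start] + [MASK] * (end - start) + tokens[end:]
        total += loss_fn(masked, tokens, start, end)
    return total


def is_abnormal(tokens, loss_fn, threshold):
    """Flag the sequence when the summed score is greater than the threshold."""
    return anomaly_score(tokens, loss_fn) > threshold


# Usage with a toy loss: count masked-span tokens outside a known vocabulary.
KNOWN = {"session", "opened", "closed"}


def toy_loss(masked, original, start, end):
    return float(sum(tok not in KNOWN for tok in original[start:end]))


sequence = ["session", "opened", SEP, "disk", "failure", SEP, "session", "closed"]
score = anomaly_score(sequence, toy_loss)
```

Summing per-line losses means a single badly predicted log line can push the whole sequence over the threshold, while the per-line values themselves indicate which line contributed most.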

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Korean Patent Application No. 10-2022-0113456, filed on Sep. 7, 2022, with the Korean Intellectual Property Office (KIPO), the entire contents of which are hereby incorporated by reference.

BACKGROUND

1. Technical Field

The present disclosure relates to a method and apparatus for detecting anomalies in a system log on the basis of self-supervised learning, using a language model.

2. Related Art

A log file is a file that records events, or communication messages exchanged with the outside, that are generated when an operating system or software of a computing device is executed. An event log is a record of events generated while a system runs; it is used to monitor and track the operations or actions of a computer system, to understand system activities, and to diagnose problems. Event logs are also referred to as system logs.

Anomaly detection is a technique for detecting values that are not ‘normal’, i.e., values defined as abnormal for the given data or system. Log anomaly detection is a form of anomaly detection that tracks the causes of a failure by examining the logs recorded by the system; it generally deals with text data and has relied on simple techniques based on pattern matching. However, when failure messages are continuously added or removed, pattern matching is difficult to apply. For this reason, log anomaly detection has recently been performed using methodologies based on deep learning.

Anomaly detection based on deep learning is classified into anomaly detection based on supervised learning, semi-supervised (one-class) learning, and unsupervised learning, according to whether abnormal samples are used and whether labels are available. For unsupervised anomaly detection, methodologies based on an autoencoder are mainly used. Although deep learning has increased the coverage and reliability of anomaly detection, there is still a demand for a new lightweight and fast methodology for deep-learning-based anomaly detection, and for improved performance and higher reliability.

SUMMARY

The present disclosure has been derived to meet the needs of the prior art described above, and an object of the present disclosure is to provide a method and apparatus for detecting anomalies in a system log, in which a cut-and-paste technique previously used for abnormal image detection is applied to anomaly detection based on a language model.

Another object of the present disclosure is to provide a method and apparatus for detecting anomalies in a system log on the basis of self-supervised learning that, when the cut-and-paste technique is applied to anomaly detection based on a language model, can effectively determine system log anomalies by training a sentence classification model using a preprocessed normal token sequence and an abnormal token sequence generated from the normal token sequence, calculating an anomaly score for a determination target token sequence input into the sentence classification model, and comparing the calculated anomaly score with a threshold value.

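
To make the cut-and-paste idea concrete for log data, the sketch below derives abnormal sequences from a normal one using the three variants named in the claims: deleting log lines, adding log lines from other time points, and swapping log lines with ones from other time points. It is only an illustration under assumptions: the function names, the number of lines changed (`k`), the seeded random generator, and the 0/1 labels are hypothetical choices, and each log line is assumed to be represented as its own token list.

```python
import random


def delete_lines(lines, rng, k=1):
    """Drop k randomly chosen log lines, keeping the original order."""
    keep = sorted(rng.sample(range(len(lines)), len(lines) - k))
    return [lines[i] for i in keep]


def add_lines(lines, other_lines, rng, k=1):
    """Insert k log lines taken from other time points at random positions."""
    out = list(lines)
    for extra in rng.sample(other_lines, k):
        out.insert(rng.randrange(len(out) + 1), extra)
    return out


def swap_lines(lines, other_lines, rng, k=1):
    """Replace k log lines with log lines taken from other time points."""
    out = list(lines)
    positions = rng.sample(range(len(out)), k)
    replacements = rng.sample(other_lines, k)
    for pos, new in zip(positions, replacements):
        out[pos] = new
    return out


def make_training_pairs(normal, other_lines, seed=0):
    """Return (sequence, label) pairs: label 0 = normal, 1 = abnormal."""
    rng = random.Random(seed)
    abnormal = [
        delete_lines(normal, rng),
        add_lines(normal, other_lines, rng),
        swap_lines(normal, other_lines, rng),
    ]
    return [(normal, 0)] + [(seq, 1) for seq in abnormal]


# Usage: three normal log lines plus a pool of lines from other time points.
normal = [["session", "opened"], ["job", "started"], ["session", "closed"]]
pool = [["disk", "failure"], ["link", "down"]]
pairs = make_training_pairs(normal, pool)
```

The resulting labeled pairs are the kind of input a binary sentence classifier could be trained on; no manually labeled anomalies are needed, which is the sense in which the training is self-supervised.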
According to a first exemplary embodiment of the present disclosure, a method of detecting anomalies in a system log on the basis of self-supervised learning may comprise: performing preprocessing on the system log; generating a normal token sequence having a preset length by concatenating tokenized log lines of the system log; generating an abnormal token sequence using the normal token sequence; calculating an anomaly score for a determination target token sequence using a sentence classification model; and determining, when the calculated anomaly score is greater than a threshold value, the token sequence as an abnormal system log.

The system log anomaly detection method may further comprise inputting the normal token sequence and the abnormal token sequence into the sentence classification model, and training the sentence classification model to minimize a preset loss. The determining may include sequentially selecting and masking token subsequences in units of log lines from the determination target token sequence. The determining may further include calculating a loss for each sequentially selected and masked token subsequence. The determining may further include determining an abnormal token sequence on the basis of an anomaly score obtained by summing the calculated loss of each token subsequence. The generating of an abnormal token sequence may use a method of deleting some log lines among the log lines of the normal token sequence. The generating of an abnormal token sequence may use a meth