CN-122019782-A - Log extraction method based on BERT technology

CN 122019782 A

Abstract

The invention belongs to the technical field of log extraction management and discloses a log extraction method based on BERT technology. The method comprises: collecting unlabeled log samples and identifying their log levels; predicting a label sequence for each sample with a BERT model and calculating the structural matching degree between the predicted label sequence and a known log template set; predicting each sample multiple times to evaluate the model's prediction stability on it; extracting semantic vectors and clustering them to obtain similar sample groups and to evaluate group-level prediction stability; fusing the matching degree, the prediction stability, and the log level to construct a labeling requirement priority for each sample; sorting by priority and applying semantic diversity screening to obtain a target labeling sample set; and labeling that sample set and using it for incremental training of the BERT model. The method accurately identifies novel log patterns about which the model is uncertain, directs labeling resources toward the model's cognitive blind areas, and thereby improves both labeling efficiency and model generalization capability.

Inventors

  • WANG RONG
  • XIANG NANGANG
  • CHANG MING
  • WANG YAQIN
  • LI NE
  • SHEN ZHUOQING
  • BAO LINHUI
  • LI XIAOQING

Assignees

  • State Grid Shanxi Electric Power Co., Ltd. Changzhi Power Supply Branch (国网山西省电力有限公司长治供电分公司)

Dates

Publication Date
2026-05-12
Application Date
2026-02-02

Claims (10)

  1. A log extraction method based on BERT technology, characterized by comprising the following steps: collecting unlabeled log samples, identifying their log levels, and constructing an unlabeled log pool; predicting each sample in the unlabeled log pool with the BERT model, and calculating the matching degree between each sample and a known log template set based on the predicted tag sequence; predicting each sample multiple times, and evaluating the model's prediction stability on each sample based on the predicted tag sequence of each prediction; extracting a semantic representation vector for each sample through the BERT model's encoding layer, clustering the samples based on these vectors to obtain similar sample groups, and evaluating the model's prediction stability on each similar sample group; fusing the matching degree, the prediction stability, and the log level to construct a labeling requirement priority for each sample, and screening target labeling requirement samples according to that priority; and updating the BERT model based on the target labeling requirement samples, and performing structured extraction on unlabeled logs with the updated model.
  2. The method for extracting logs based on BERT technology according to claim 1, wherein calculating the matching degree between each sample and the known log template set comprises: mapping each tag in the predicted tag sequence to its corresponding entity type, thereby converting the predicted tag sequence into a predicted type sequence; for each template in the known log template set, converting its tag sequence into a template type sequence; for each sample, performing a sequence alignment operation between the predicted type sequence and each template type sequence, and calculating the structural similarity between the predicted type sequence and the template type sequence of each template in the known log template set; and selecting the highest structural similarity as the matching degree between the corresponding sample and the known log template set.
  3. The method for extracting logs based on BERT technology according to claim 2, wherein the sequence alignment operation is performed by a dynamic programming algorithm, specifically comprising: constructing a dynamic programming score matrix according to the lengths of the predicted type sequence and the template type sequence; calculating the dynamic programming score at each position of the score matrix through a state transition equation, and recording the state transition direction that yields each score; and starting from the end point of the score matrix, backtracking to the start point according to the recorded state transition directions, and aligning the predicted type sequence and the template type sequence according to the gap insertions indicated by the backtracking path.
  4. The method for extracting logs based on BERT technology according to claim 1, wherein evaluating the model's prediction stability on each sample comprises: comparing the predicted tag sequences from the multiple predictions, marking the predicted tag sequence with the highest occurrence count as the consensus sequence, and calculating the ratio of the consensus sequence's occurrence count to the total number of predictions to obtain a consensus ratio; calculating the similarity between each prediction's tag sequence and the consensus sequence to construct a similarity sequence; and taking the average of the similarity sequence as the reference similarity, and taking the product of the reference similarity and the consensus ratio as the model's prediction stability on the corresponding sample.
  5. The method for extracting logs based on BERT technology according to claim 1, wherein clustering the samples to obtain similar sample groups comprises: calculating the pairwise cosine similarity between samples based on their semantic representation vectors, and constructing a sample similarity matrix; and dividing the samples into similar sample groups through a clustering algorithm, based on the sample similarity matrix and a preset similarity threshold.
  6. The method for extracting logs based on BERT technology according to claim 5, wherein evaluating the model's prediction stability on each similar sample group comprises: for each similar sample group, calculating the average prediction stability of all samples in the group as the group's reference prediction stability; for each sample, taking the predicted tag sequence with the highest occurrence count across the multiple predictions as the sample's target predicted tag sequence; calculating the pairwise Jaccard similarity between the target predicted tag sequences of all samples in the same group to construct an intra-group similarity sequence, and taking its average as the intra-group prediction consistency index; and taking the product of the intra-group prediction consistency index and the reference prediction stability as the model's prediction stability on the similar sample group.
  7. The method for extracting logs based on BERT technology according to claim 1, wherein constructing the labeling requirement priority of each sample comprises: setting log level weights based on the log levels; comparing the prediction stability of each similar sample group with a preset stability threshold, and marking any group whose prediction stability is below the threshold as a fluctuating sample group; for samples in non-fluctuating groups, assigning a labeling requirement priority of 0 if the sample's matching degree with the known log template set exceeds a preset matching degree threshold, and otherwise taking the sample's log level weight as its labeling requirement priority; and for samples in fluctuating groups, performing the following steps: calculating the absolute deviation between the sample's prediction stability and the preset stability threshold, and normalizing it to obtain a fluctuation index; subtracting the sample's matching degree with the known log template set from 1 to obtain a novelty index; and taking a linear weighted sum of the fluctuation index, the novelty index, and the sample's log level weight as the sample's labeling requirement priority.
  8. The method for extracting logs based on BERT technology according to claim 7, wherein setting the log level weight comprises: if the log level is an error level, setting the log level weight to ; if the log level is a warning level, setting the log level weight to ; if the log level is any other level, setting the log level weight to ; wherein the .
  9. The method for extracting logs based on BERT technology according to claim 1, wherein screening the target labeling requirement samples comprises the steps of: sorting all samples in descending order of labeling requirement priority to form a priority-sorted list; initializing the target labeling requirement sample set as an empty set, and selecting samples in turn from the start of the priority-sorted list as candidate samples; and performing the following steps in sequence for each candidate sample: A1, if the current target labeling requirement sample set is empty, adding the candidate sample to it; A2, if the set is not empty, calculating the cosine similarity between the semantic representation vector of the candidate sample and those of the samples already in the set; A3, if the cosine similarity between the candidate sample and every existing sample is below a preset diversity threshold, adding the candidate sample to the set, and otherwise skipping it and continuing with the next candidate; A4, repeating steps A1 to A3 until the number of samples in the set reaches the target labeling count or all candidates have been traversed, and taking the samples in the resulting set as the screened target labeling requirement samples.
  10. The method for extracting logs based on BERT technology according to claim 1, wherein the BERT model is updated as follows: manually labeling the target labeling requirement samples; combining the labeled target labeling requirement samples with the original training set to form an updated training set; and performing incremental training on the BERT model with the updated training set to obtain an updated BERT model.
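The matching-degree computation of claims 2 and 3 can be sketched as follows. This is an illustrative reading, not the patented implementation: BIO-style tags are collapsed to entity types, the predicted type sequence is aligned against each template type sequence by a Needleman–Wunsch-style dynamic program (recording transition directions and backtracking to insert gaps), and the best per-template structural similarity becomes the matching degree. The tag names, scoring constants, and the match-ratio similarity formula are all assumptions.

```python
def tags_to_types(tags):
    """Collapse BIO tags such as 'B-IP'/'I-IP' to their entity type ('IP')."""
    return [t.split("-", 1)[1] if "-" in t else t for t in tags]

def align(pred, tmpl, match=1, mismatch=-1, gap=-1):
    """Align two type sequences via DP with traceback, inserting '-' gaps."""
    m, n = len(pred), len(tmpl)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    trace = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0], trace[i][0] = i * gap, "up"
    for j in range(1, n + 1):
        score[0][j], trace[0][j] = j * gap, "left"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i-1][j-1] + (match if pred[i-1] == tmpl[j-1] else mismatch)
            up, left = score[i-1][j] + gap, score[i][j-1] + gap
            score[i][j] = max(diag, up, left)
            trace[i][j] = "diag" if score[i][j] == diag else ("up" if score[i][j] == up else "left")
    # Backtrack from the end point of the score matrix to the start point.
    a, b, i, j = [], [], m, n
    while i > 0 or j > 0:
        d = trace[i][j]
        if d == "diag":
            a.append(pred[i-1]); b.append(tmpl[j-1]); i, j = i - 1, j - 1
        elif d == "up":
            a.append(pred[i-1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(tmpl[j-1]); j -= 1
    return a[::-1], b[::-1]

def structural_similarity(a, b):
    """Fraction of aligned positions that match (gaps count as mismatches)."""
    aa, bb = align(a, b)
    return sum(x == y for x, y in zip(aa, bb)) / len(aa) if aa else 1.0

def matching_degree(pred_tags, template_tag_seqs):
    """Highest structural similarity against any known template (claim 2)."""
    pred_types = tags_to_types(pred_tags)
    return max(structural_similarity(pred_types, tags_to_types(t))
               for t in template_tag_seqs)
```

For example, a predicted sequence whose types agree with a template at three of four aligned positions yields a matching degree of 0.75.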
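Claim 4's per-sample stability can be sketched as the product of the consensus ratio and the mean similarity to the consensus sequence. Using per-position tag agreement as the similarity measure is an assumption; the claim does not fix a specific similarity function.

```python
from collections import Counter

def seq_similarity(a, b):
    """Fraction of positions where two tag sequences agree (length-padded)."""
    n = max(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 1.0

def prediction_stability(predictions):
    """predictions: list of tag sequences from repeated stochastic passes."""
    counts = Counter(tuple(p) for p in predictions)
    consensus, freq = counts.most_common(1)[0]          # most frequent sequence
    consensus_ratio = freq / len(predictions)
    ref_sim = sum(seq_similarity(p, consensus) for p in predictions) / len(predictions)
    return ref_sim * consensus_ratio                     # claim 4's product
```

Three passes where two agree exactly and the third matches at half its positions give a stability of (2.5/3) × (2/3) = 5/9.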
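For claim 5, a minimal sketch of threshold-based grouping over the cosine similarity matrix: a simple single-link union (merging any pair of samples whose similarity meets the threshold) stands in for the unspecified clustering algorithm, and the threshold value is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two semantic representation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def group_samples(vectors, threshold=0.9):
    """Union-find grouping: link any pair with similarity >= threshold."""
    n = len(vectors)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```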
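Claim 6's group-level stability can be sketched as the product of the intra-group prediction consistency index (mean pairwise Jaccard similarity of the members' target predicted tag sequences, treated here as tag sets) and the group's reference stability (mean per-sample stability). Treating each sequence as a set of tags for the Jaccard computation is an assumption.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of the tag sets of two tag sequences."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def group_stability(consensus_seqs, sample_stabilities):
    """consensus_seqs: target predicted tag sequence per group member."""
    ref = sum(sample_stabilities) / len(sample_stabilities)
    pairs = list(combinations(consensus_seqs, 2))
    consistency = (sum(jaccard(a, b) for a, b in pairs) / len(pairs)
                   if pairs else 1.0)                     # singleton group
    return consistency * ref                              # claim 6's product
```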
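Claim 7's priority fusion can be sketched as below. The level weights, the fusion weights `alpha`/`beta`/`gamma`, both thresholds, and the normalization of the fluctuation index are all illustrative assumptions; claim 8 assigns the concrete level weights, but their values are elided in the published text.

```python
LEVEL_WEIGHT = {"ERROR": 1.0, "WARN": 0.6}   # other levels -> 0.3 (assumed)

def priority(level, match_degree, sample_stab, group_stab,
             stab_threshold=0.7, match_threshold=0.8,
             alpha=0.4, beta=0.4, gamma=0.2):
    """Labeling requirement priority of one sample (claim 7 sketch)."""
    w = LEVEL_WEIGHT.get(level, 0.3)
    if group_stab >= stab_threshold:
        # Non-fluctuating group: 0 if the pattern is already well matched,
        # otherwise fall back to the log level weight.
        return 0.0 if match_degree > match_threshold else w
    # Fluctuating group: fuse fluctuation index, novelty index, level weight.
    fluctuation = min(abs(sample_stab - stab_threshold) / stab_threshold, 1.0)
    novelty = 1.0 - match_degree
    return alpha * fluctuation + beta * novelty + gamma * w
```

A well-matched sample in a stable group gets priority 0, while a poorly matched sample in a fluctuating group combines all three indexes.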
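Claim 9's greedy diversity screen (steps A1 through A4) can be sketched as: walk the priority-sorted list and admit a candidate only if its cosine similarity to every already-selected sample stays below the diversity threshold, stopping at the target count. The threshold value is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two semantic representation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_targets(sorted_samples, target_count, diversity_threshold=0.85):
    """sorted_samples: (id, vector) pairs, already in descending priority."""
    selected = []
    for sid, vec in sorted_samples:
        # A1/A2/A3: admit only if dissimilar to every selected sample.
        if all(cosine(vec, v) < diversity_threshold for _, v in selected):
            selected.append((sid, vec))
        # A4: stop once the target labeling count is reached.
        if len(selected) >= target_count:
            break
    return [sid for sid, _ in selected]
```

A near-duplicate of an already-selected sample is skipped even if it outranks more diverse candidates further down the list.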
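Claim 10's training-set update step can be sketched as a merge of the newly labeled target samples into the original training set; deduplicating by log text with the new annotation taking precedence is an assumed policy, and the actual incremental fine-tuning of the BERT model is outside this sketch.

```python
def update_training_set(original, newly_labeled):
    """Both arguments: lists of (log_text, tag_sequence) pairs."""
    merged = {}
    for text, tags in list(original) + list(newly_labeled):
        merged[text] = tags            # later entries (new labels) win
    return list(merged.items())
```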

Description

Log extraction method based on BERT technology

Technical Field

The invention belongs to the technical field of log extraction management, and particularly relates to a log extraction method based on BERT technology.

Background

Traditional log extraction methods rely mainly on manual rules or unsupervised clustering: the former has high maintenance cost and poor adaptability, while the latter has limited extraction precision. In recent years, deep learning methods based on pretrained models such as BERT have greatly improved the accuracy and generalization capability of log extraction. However, such methods are essentially supervised learning: their performance depends heavily on a large amount of high-quality annotated data for fine-tuning, creating a pronounced dependence on labeled data. To alleviate this problem, active learning strategies have been introduced that reduce labeling cost by letting the model select the most valuable samples for manual labeling. However, the sample selection strategies of existing methods generally depend only on single-dimension indexes such as model prediction confidence or entropy, and suffer from two shortcomings. First, the evaluation dimension is narrow: semantically novel samples that differ substantially from known log patterns are hard to identify effectively, so labeling resources may fail to cover the blind areas of the model's knowledge structure. Second, the evaluation depth is insufficient: prediction stability is usually measured only by the internal variance of a single sample, without consistency verification across similar sample groups, making it difficult to distinguish and precisely target the model's true cognitive blind areas.
These limitations restrict improvements in annotation efficiency and model performance, and fail to fundamentally alleviate the dependence on annotated data.

Disclosure of the Invention

In view of this, a log extraction method based on BERT technology is proposed to solve the above problems. The method comprises: collecting unlabeled log samples, identifying their log levels, and constructing an unlabeled log pool; calculating the matching degree between each sample in the unlabeled log pool and a known log template set, based on the predicted label sequence produced by the pretrained BERT model; predicting each sample multiple times and evaluating the model's prediction stability on each sample from the predicted label sequences of each prediction; extracting a semantic representation vector for each sample through the BERT model's encoding layer, clustering the samples based on these vectors to obtain similar sample groups, and evaluating the model's prediction stability on each group; fusing the matching degree, the prediction stability, and the log level to construct a labeling requirement priority for each sample, and screening target labeling requirement samples by that priority; and updating the BERT model based on the target labeling requirement samples, and performing structured extraction on unlabeled logs with the updated model.

Compared with the prior art, the method has the following beneficial effects. (1) By introducing the matching-degree evaluation between samples and the known log template set, the method can effectively identify new log patterns that differ markedly from the patterns the model has already learned, avoiding repeated investment of labeling resources in patterns already mastered.
(2) On the basis of quantifying the prediction stability of individual samples, the invention forms similar sample groups by clustering and evaluates the model's prediction stability on each group as a whole. Sample selection for labeling can thus focus on the model's true cognitive blind areas rather than superficial random prediction fluctuations, improving the model's accuracy in those blind areas. (3) The method fuses the matching degree, the per-sample prediction stability, the group prediction stability, and the log level weight to construct the labeling requirement priority, overcoming the inherent bias of single indexes such as model confidence in the prior art and making the subsequent screening of target labeling requirement samples more comprehensive and balanced. (4) When screening target labeling requirement samples, the method calculates and bounds the semantic similarity between each candidate sample and the already selected sample set, effectively ensuring that the finally selected target labeling samples have sufficient diversity and breadth in semantic feature distribution, and the