US-12626127-B2 - High dimensional dense tensor representation for log data

US 12626127 B2

Abstract

In some implementations, a device may obtain a training corpus, from a set of pre-processed log data, associated with an alphanumeric format. The device may encode the training corpus to obtain encoded data using a set of tokens. The device may calculate a sequence length based on a statistical parameter associated with the training corpus. The device may generate a set of input sequences and a set of target sequences based on the encoded data, where each input sequence and each target sequence has a length equal to the sequence length. The device may generate a training data set based on combining the set of input sequences and the set of target sequences. The device may train a deep neural network (DNN) using the training data set and based on one or more hyperparameters to obtain a set of embedding tensors associated with an embedding layer of the DNN.
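As an illustration (not part of the patent text), the character-level encoding step described in the abstract can be sketched in Python. The helper names and the sample corpus below are invented for the example; the patent itself does not specify this implementation.

```python
# Minimal sketch, assuming a character-level vocabulary: assign an integer
# token to each unique character, then map the alphanumeric corpus to a
# numeric encoding. Helper names and sample corpus are hypothetical.
def build_vocab(corpus: str) -> dict:
    """Assign an integer token to each unique character in the corpus."""
    return {ch: idx for idx, ch in enumerate(sorted(set(corpus)))}

def encode(corpus: str, vocab: dict) -> list:
    """Convert the alphanumeric training corpus to its numeric encoding."""
    return [vocab[ch] for ch in corpus]

corpus = "cmd: reboot\ncmd: status\n"
vocab = build_vocab(corpus)
encoded = encode(corpus, vocab)
```

The encoding is reversible, which is why a character-level vocabulary suffices: each token identifies exactly one character of the original log text.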

Inventors

  • Sayed TAHERI
  • Faris MUHAMMAD
  • Hamed AL-RAWESHIDY
  • Srini CHALLA

Assignees

  • VIAVI SOLUTIONS INC.

Dates

Publication Date
2026-05-12
Application Date
2022-06-28

Claims (20)

  1. A method, comprising: obtaining, by a device, a training corpus from a set of concatenated pre-processed log data, wherein the training corpus is associated with an alphanumeric format; encoding, by the device, the training corpus to obtain a set of encoded data using a set of vocabulary tokens that are based on alphanumeric characters included in the training corpus, wherein the encoded data is associated with a numeric format; detecting, by the device, a set of data blocks from the training corpus based on a plurality of indicators included in the alphanumeric characters included in the training corpus, wherein the plurality of indicators indicate breaks or partitions between portions of the training corpus, and wherein each data block, of the set of data blocks, corresponds to a set of alphanumeric characters included between two indicators, of the plurality of indicators; calculating, by the device, a statistical parameter associated with the training corpus based on sizes of data blocks included in the set of data blocks; calculating, by the device, a sequence length based on the statistical parameter associated with the training corpus; generating, by the device, a set of input sequences and a set of target sequences based on the set of encoded data, wherein each input sequence, from the set of input sequences, and each target sequence, from the set of target sequences, has a length equal to the sequence length, and wherein the set of target sequences are shifted versions of the set of input sequences; generating, by the device, a training data set based on combining the set of input sequences and the set of target sequences into a tuple, partitioning the tuple into batches based on a batch size, and shuffling information included in the batches to obtain the training data set; training, by the device, a sequential deep neural network using the training data set and based on one or more hyperparameters to obtain a set of embedding tensors associated with an embedding layer of the sequential deep neural network; and performing, by the device, an artificial intelligence operation using the set of embedding tensors to obtain information associated with log data associated with the set of concatenated pre-processed log data.
  2. The method of claim 1, further comprising: detecting one or more outlier data sets from pre-processed log data associated with the set of concatenated pre-processed log data, wherein the one or more outlier data sets are detected based on a quantity of lines associated with the one or more outlier data sets; removing the one or more outlier data sets from the pre-processed log data; and concatenating the pre-processed log data, with the one or more outlier data sets removed, to obtain the set of concatenated pre-processed log data.
  3. The method of claim 1, wherein the set of vocabulary tokens are based on unique characters included in the alphanumeric characters that are included in the training corpus.
  4. The method of claim 1, wherein the plurality of indicators include a plurality of command indicators, and wherein calculating the sequence length comprises: determining a size of each data block included in the set of data blocks; removing any data blocks, from the set of data blocks, for which the determined size is outside a range of sizes; and calculating the statistical parameter based on sizes of data blocks included in the set of data blocks to obtain the sequence length.
  5. The method of claim 4, wherein the statistical parameter is an average of the sizes of data blocks included in the set of data blocks.
  6. The method of claim 1, further comprising: detecting one or more outlier encoded data blocks from the set of encoded data based on a size of the one or more outlier encoded data blocks; and removing the one or more outlier encoded data blocks from the set of encoded data.
  7. The method of claim 1, wherein the one or more hyperparameters include at least one of: a quantity of epochs, a size associated with the set of vocabulary tokens, an embedding dimension size, or a quantity of neurons associated with a recurrent neural network layer of the sequential deep neural network.
  8. The method of claim 1, wherein performing the artificial intelligence operation comprises: classifying, using an artificial intelligence model that is trained based on the set of embedding tensors, the set of concatenated pre-processed log data into one or more categories from a set of candidate categories.
  9. A device, comprising: one or more memories; and one or more processors, coupled to the one or more memories, configured to: obtain a training corpus from a set of log data, wherein the training corpus includes alphanumeric characters; encode the training corpus to obtain a set of encoded data using vocabulary tokens that are based on unique characters included in the alphanumeric characters of the training corpus, wherein the set of encoded data is associated with a numeric format; detect a set of data blocks from the training corpus based on a plurality of indicators included in the alphanumeric characters included in the training corpus, wherein the plurality of indicators indicate breaks or partitions between portions of the training corpus, and wherein each data block, of the set of data blocks, corresponds to a set of alphanumeric characters included between two indicators, of the plurality of indicators; calculate a statistical parameter associated with the training corpus based on sizes of data blocks included in the set of data blocks; calculate a sequence length based on the statistical parameter associated with the training corpus; generate a set of input sequences and a set of target sequences based on the set of encoded data, wherein each input sequence, from the set of input sequences, and each target sequence, from the set of target sequences, has a length equal to the sequence length; generate a training data set based on combining the set of input sequences and the set of target sequences, and shuffling data included in the combined set of input sequences and set of target sequences to obtain the training data set; train a deep neural network using the training data set and based on one or more hyperparameters to obtain a set of embedding tensors associated with an embedding layer of the deep neural network, wherein the deep neural network includes the embedding layer, a recurrent neural network layer, and a dense neural network layer; and perform an artificial intelligence operation using the set of embedding tensors to obtain information associated with the set of log data.
  10. The device of claim 9, wherein the one or more processors are further configured to: detect one or more outlier data sets from the set of log data based on a size of the one or more outlier data sets; and remove the one or more outlier data sets from the set of log data.
  11. The device of claim 9, wherein the one or more processors, to generate the set of input sequences and the set of target sequences, are configured to: generate the set of input sequences based on partitioning the set of encoded data into sequences having the sequence length; and apply a window shift, based on a shift value, to the set of input sequences to generate the set of target sequences.
  12. The device of claim 9, wherein the plurality of indicators include a plurality of command indicators, and wherein the one or more processors, to calculate the sequence length, are configured to: determine a size of each data block included in the set of data blocks; remove any data blocks, from the set of data blocks, for which the determined size is outside a range of sizes; and calculate the statistical parameter based on sizes of data blocks included in the set of data blocks to obtain the sequence length.
  13. The device of claim 9, wherein the embedding layer is a first layer of the deep neural network, the recurrent neural network layer is a second layer of the deep neural network, and the dense neural network layer is a third layer of the deep neural network, and wherein the recurrent neural network layer includes at least one of a long short-term memory recurrent neural network layer or a gated recurrent unit recurrent neural network layer.
  14. The device of claim 9, wherein the one or more processors are further configured to: receive an indication of respective values for the one or more hyperparameters, wherein the one or more hyperparameters include at least one of: a quantity of epochs associated with training the deep neural network, a size associated with the vocabulary tokens, an embedding dimension size associated with the embedding layer, or a quantity of neurons associated with the recurrent neural network layer.
  15. The device of claim 9, wherein the one or more processors, to perform the artificial intelligence operation, are configured to: perform, using an artificial intelligence model that is trained based on using the set of embedding tensors as an input to the artificial intelligence model, at least one of: a categorization of the set of log data, or a similarity analysis of the set of log data to one or more other sets of log data.
  16. A non-transitory computer-readable medium storing a set of instructions, the set of instructions comprising: one or more instructions that, when executed by one or more processors of a device, cause the device to: detect one or more outlier data sets from pre-processed log data and remove the one or more outlier data sets from the pre-processed log data; obtain a training corpus based on concatenating the pre-processed log data, wherein the training corpus includes alphanumeric characters; encode the training corpus to obtain a set of encoded data using a set of vocabulary tokens that are based on unique characters included in the alphanumeric characters, wherein the encoded data is associated with a numeric format; detect a set of data blocks from the training corpus based on a plurality of indicators included in the alphanumeric characters included in the training corpus, wherein the plurality of indicators indicate breaks or partitions between portions of the training corpus, and wherein each data block, of the set of data blocks, corresponds to a set of alphanumeric characters included between two indicators, of the plurality of indicators; calculate a statistical parameter associated with the training corpus based on sizes of data blocks included in the set of data blocks; calculate a sequence length based on the statistical parameter associated with the training corpus; generate a set of input sequences and a set of target sequences based on the set of encoded data, wherein each input sequence, from the set of input sequences, and each target sequence, from the set of target sequences, have a length equal to the sequence length; generate a training data set based on combining the set of input sequences and the set of target sequences into a tuple, partitioning the tuple into batches based on a batch size, and shuffling information included in the batches to obtain the training data set; and train a sequential deep neural network using the training data set and based on one or more hyperparameters to obtain a set of embedding tensors associated with an embedding layer of the sequential deep neural network, wherein the set of embedding tensors is a numerical representation of the pre-processed log data.
  17. The non-transitory computer-readable medium of claim 16, wherein the one or more instructions further cause the device to: perform an artificial intelligence operation using the set of embedding tensors to obtain information associated with the pre-processed log data.
  18. The non-transitory computer-readable medium of claim 16, wherein the plurality of indicators include a plurality of command indicators, and wherein the one or more instructions, that cause the device to calculate the sequence length, cause the device to: determine a size of each data block included in the set of data blocks; and calculate the statistical parameter based on sizes of data blocks included in the set of data blocks to obtain the sequence length.
  19. The non-transitory computer-readable medium of claim 16, wherein the sequential deep neural network includes the embedding layer as a first layer, a recurrent neural network layer as a second layer, and a dense neural network layer as a third layer, and wherein the embedding layer is associated with input hyperparameters, from the one or more hyperparameters, including a size associated with the set of vocabulary tokens, the batch size, and an embedding dimension size.
  20. The non-transitory computer-readable medium of claim 19, wherein the recurrent neural network layer is associated with input hyperparameters, from the one or more hyperparameters, including a quantity of neurons or hidden units, and a recurrent initializer, and wherein the dense neural network layer is associated with input hyperparameters, from the one or more hyperparameters, including the size associated with the set of vocabulary tokens.
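The sequence-length and window-shift logic recited in claims 4, 5, and 11 can be sketched as follows. This is a hypothetical illustration only: the indicator string, the size range, and the helper names are assumptions, not details taken from the patent.

```python
# Sketch of claims 4, 5, and 11: derive the sequence length as the average
# size of the data blocks found between indicators (after dropping blocks
# whose size falls outside a range), then build input/target pairs where the
# target is a window-shifted copy of the input. All names are hypothetical.
def sequence_length(corpus, indicator, size_range=(1, 10_000)):
    blocks = [b for b in corpus.split(indicator) if b]
    sizes = [len(b) for b in blocks
             if size_range[0] <= len(b) <= size_range[1]]
    return sum(sizes) // len(sizes)  # statistical parameter: the average size

def make_pairs(encoded, seq_len, shift=1):
    """Partition encoded data into (input, target) pairs of length seq_len."""
    pairs = []
    for i in range(0, len(encoded) - seq_len - shift + 1, seq_len):
        chunk = encoded[i:i + seq_len + shift]
        pairs.append((chunk[:-shift], chunk[shift:]))  # target = shifted input
    return pairs
```

A shift value of 1 makes each target sequence predict the next token of its input sequence, which matches the next-character training objective implied by the claims.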

Description

BACKGROUND

Artificial neural networks, sometimes referred to as neural networks (NNs), are computing systems inspired by the biological neural networks of a biological brain. An NN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, similar to a synapse in a biological brain, can support a transmission of a signal to other neurons. An artificial neuron may receive a signal, process the signal, and/or transmit the signal to other neurons. The “signal” at a connection is a real number, and the output of each neuron is computed by some non-linear function of the sum of its inputs. The connections may be referred to as edges. Neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Neurons may have a threshold such that a signal is sent only if the aggregate signal crosses that threshold. Typically, neurons are aggregated into layers. Different layers may perform different transformations on their inputs. Signals may travel from the first layer (the input layer) to the last layer (the output layer), possibly after traversing the layers multiple times.

SUMMARY

Some implementations described herein relate to a method. The method may include obtaining, by a device, a training corpus from a set of concatenated pre-processed log data, wherein the training corpus is associated with an alphanumeric format. The method may include encoding, by the device, the training corpus to obtain a set of encoded data using a set of vocabulary tokens that are based on alphanumeric characters included in the training corpus, wherein the encoded data is associated with a numeric format. The method may include calculating, by the device, a sequence length based on a statistical parameter associated with the training corpus.
The method may include generating, by the device, a set of input sequences and a set of target sequences based on the set of encoded data, wherein each input sequence, from the set of input sequences, and each target sequence, from the set of target sequences, has a length equal to the sequence length, and wherein the set of target sequences are shifted versions of the set of input sequences. The method may include generating, by the device, a training data set based on combining the set of input sequences and the set of target sequences into a tuple, partitioning the tuple into batches based on a batch size, and shuffling information included in the batches to obtain the training data set. The method may include training, by the device, a sequential deep neural network using the training data set and based on one or more hyperparameters to obtain a set of embedding tensors associated with an embedding layer of the sequential deep neural network. The method may include performing, by the device, an artificial intelligence operation using the set of embedding tensors to obtain information associated with log data associated with the set of concatenated pre-processed log data. Some implementations described herein relate to a device. The device may include one or more memories and one or more processors coupled to the one or more memories. The one or more processors may be configured to obtain a training corpus from a set of log data, wherein the training corpus includes alphanumeric characters. The one or more processors may be configured to encode the training corpus to obtain a set of encoded data using vocabulary tokens that are based on unique characters included in the alphanumeric characters of the training corpus, wherein the encoded data is associated with a numeric format. The one or more processors may be configured to calculate a sequence length based on a statistical parameter associated with the training corpus. 
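The assembly of the training data set described above (combine into a tuple, partition into batches, shuffle) can be sketched as follows; the function name and the fixed seed are assumptions made for the example, not details from the patent.

```python
import random

# Sketch of building the training data set: pair each input sequence with its
# target sequence, partition the pairs into fixed-size batches, and shuffle
# the batches. Name and seed are hypothetical.
def make_dataset(inputs, targets, batch_size, seed=0):
    pairs = list(zip(inputs, targets))               # combine into tuples
    batches = [pairs[i:i + batch_size]               # partition by batch size
               for i in range(0, len(pairs), batch_size)]
    random.Random(seed).shuffle(batches)             # shuffle batch order
    return batches
```

Shuffling at the batch level keeps each input aligned with its shifted target while still randomizing the order in which the network sees the data.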
The one or more processors may be configured to generate a set of input sequences and a set of target sequences based on the set of encoded data, wherein each input sequence, from the set of input sequences, and each target sequence, from the set of target sequences, has a length equal to the sequence length. The one or more processors may be configured to generate a training data set based on combining the set of input sequences and the set of target sequences, and shuffling data included in the combined set of input sequences and set of target sequences to obtain the training data set. The one or more processors may be configured to train a deep neural network using the training data set and based on one or more hyperparameters to obtain a set of embedding tensors associated with an embedding layer of the deep neural network, wherein the deep neural network includes the embedding layer, a recurrent neural network layer, and a dense neural network layer. The one or more processors may be configured to perform an artificial intelligence operation using the set of embedding tensors to obtain information associated with the set of log data.
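As an illustration of what the trained embedding layer provides (not the patented implementation), the set of embedding tensors can be thought of as a table mapping each vocabulary token to a dense vector, so that numeric-encoded log text becomes a sequence of dense high-dimensional vectors. The function names and random initialization below are assumptions for the sketch; in the described system the table's values would come from training the deep neural network.

```python
import random

# Hypothetical embedding-lookup sketch: one dense vector per vocabulary
# token. Here the table is randomly initialized for illustration; training
# the network is what would give these vectors meaning.
def init_embeddings(vocab_size, embedding_dim, seed=0):
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
            for _ in range(vocab_size)]

def embed(encoded, table):
    """Look up the embedding vector for each token of an encoded sequence."""
    return [table[token] for token in encoded]
```

Downstream artificial intelligence operations (categorization, similarity analysis) would then operate on these dense vectors rather than on the raw alphanumeric log text.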