CN-121997148-A - Data set construction method and system based on big data

Abstract

The application relates to the field of data processing, and in particular to a data set construction method and system based on big data. The method comprises: constructing an initial data set containing normal samples and fault samples; decomposing the time series in the initial data set to obtain a plurality of frequency band sequences and extracting their time-frequency domain features; calculating, based on the time-frequency domain features, the frequency band sensitivity of the fault samples relative to the normal samples in each frequency band; constructing an association matrix and calculating frequency band comprehensive weights by combining the frequency band sensitivity with the association matrix; constructing a loss function of a generator using the frequency band comprehensive weights and training a conditional generative adversarial network based on that loss function; generating synthetic frequency band sequences with the trained conditional generative adversarial network; performing inverse-transform reconstruction on the synthetic frequency band sequences to obtain synthetic samples; and mixing the synthetic samples with the fault samples in the initial data set to construct an enhanced data set. Training on the enhanced data set improves the prediction accuracy of an LSTM model.

Inventors

WU YANG
HU ZHI
LIU JIE
LAN WEIJIE

Assignees

嘉杰科技有限公司

Dates

Publication Date: 20260508
Application Date: 20260409

Claims (10)

1. A data set construction method based on big data, characterized by comprising the following steps: acquiring historical monitoring data and constructing an initial data set containing normal samples and fault samples; decomposing the time series in the initial data set to obtain a plurality of frequency band sequences, and extracting time-frequency domain features of each frequency band sequence; calculating, based on the time-frequency domain features, the frequency band sensitivity of the fault samples relative to the normal samples in each frequency band; constructing an association matrix, and calculating frequency band comprehensive weights by combining the frequency band sensitivity with the association matrix; constructing a loss function of a generator using the frequency band comprehensive weights, and training a conditional generative adversarial network based on the loss function, wherein the loss function of the generator comprises an adversarial loss and a weight constraint loss, the weight constraint loss is a weighted error between the generated sequence and the frequency band sequence corresponding to a fault sample in the initial data set, and the weighting coefficients are the frequency band comprehensive weights; and generating synthetic frequency band sequences using the trained conditional generative adversarial network, performing inverse-transform reconstruction on the synthetic frequency band sequences to obtain synthetic samples, and mixing the synthetic samples with the fault samples in the initial data set to construct an enhanced data set.
2. The data set construction method based on big data according to claim 1, wherein the frequency band sensitivity is calculated as follows: calculating the mean and standard deviation of each feature value in the time-frequency domain features over the normal sample set; standardizing the feature vector of each fault sample using the mean and the standard deviation to obtain a standardized feature vector; and calculating the Euclidean distance from each standardized feature vector to the center of the standardized normal samples, averaging the Euclidean distances over all fault samples, and taking the resulting average as the frequency band sensitivity.
3. The data set construction method based on big data according to claim 1, wherein the decomposition is performed by wavelet transform, the time series in the initial data set being decomposed by wavelet transform to obtain the plurality of frequency band sequences.
4. The data set construction method based on big data according to claim 3, wherein the weight constraint loss is expressed as: $$L_{w} = \sum_{k} w_{k} \left\| \hat{c}_{k}^{\,g} - \hat{c}_{k}^{\,r} \right\|_{2}^{2}$$ wherein $L_{w}$ represents the weight constraint loss, $w_{k}$ is the comprehensive weight of frequency band $k$, $\hat{c}_{k}^{\,g}$ is the normalized wavelet coefficient sequence of frequency band $k$ output by the generator, $\hat{c}_{k}^{\,r}$ is the normalized wavelet coefficient sequence of frequency band $k$ corresponding to the fault samples in the initial data set, and $\left\| \cdot \right\|_{2}^{2}$ represents the square of the L2 norm.
5. The data set construction method based on big data according to claim 3, wherein extracting the time-frequency domain features of each frequency band sequence comprises calculating the frequency band energy, energy trend, spectral kurtosis, peak factor and frequency band entropy, wherein the frequency band energy is the sum of squares of the wavelet coefficient amplitudes, and the energy trend is the slope of a linear fit to the wavelet coefficient amplitude sequence.
6. The data set construction method based on big data according to claim 3, wherein constructing the association matrix comprises the following steps: extracting the wavelet coefficient amplitude sequence of each fault sample in each frequency band; for any two frequency bands, calculating, within each fault sample, the sample-level correlation coefficient of the corresponding wavelet coefficient amplitude sequences, and averaging the sample-level correlation coefficients over all fault samples to obtain a global correlation coefficient between the two frequency bands; and constructing a symmetric association matrix from the global correlation coefficients of all frequency band pairs, wherein each diagonal element of the association matrix is 1 and each off-diagonal element is the global correlation coefficient of the corresponding frequency band pair.
7. The data set construction method based on big data according to claim 1, wherein constructing the enhanced data set comprises: performing inverse continuous wavelet transform reconstruction on the synthetic frequency band sequences to obtain synthetic samples; mixing the synthetic samples with the fault samples in the initial data set at a preset ratio to construct an enhanced fault training set; and retaining the normal sample subset of the initial data set, wherein the enhanced data set is the enhanced fault training set.
8. The data set construction method based on big data according to claim 1, wherein, in the conditional generative adversarial network, the generator input comprises a random noise vector, a frequency band encoding vector and a fault label, and the discriminator input comprises a wavelet coefficient sequence, a frequency band encoding vector and a fault label, the frequency band encoding vector being a one-hot code that uniquely identifies each frequency band.
9. The data set construction method based on big data according to claim 1, wherein constructing the initial data set containing normal samples and fault samples comprises: handling missing values in the historical monitoring data by linear interpolation or sample rejection; replacing amplitude outliers using a sliding-window median filter; smoothing high-frequency acquisition noise using a first-order low-pass filter; and performing Z-score normalization based on the mean and standard deviation of the historical normal sample set.
10. A big data based dataset construction system, comprising a processor and a memory, the memory storing computer program instructions which, when executed by the processor, implement the big data based dataset construction method according to any of claims 1-9.
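
The calculations recited in claims 2, 5 and 6 can be illustrated with a minimal NumPy sketch. It is not the applicant's implementation: the function names, array layouts and the small `eps` guard are hypothetical, only the two features whose definitions appear in claim 5 (band energy and energy trend) are shown, and the Pearson coefficient is assumed for the "correlation coefficient" of claim 6, which the claims do not name.

```python
import numpy as np

def band_energy(coeffs):
    """Band energy: sum of squares of the wavelet coefficient amplitudes."""
    return float(np.sum(np.abs(coeffs) ** 2))

def energy_trend(coeffs):
    """Energy trend: slope of a linear fit to the amplitude sequence."""
    amp = np.abs(np.asarray(coeffs, dtype=float))
    slope, _intercept = np.polyfit(np.arange(amp.size), amp, 1)
    return float(slope)

def band_sensitivity(normal_feats, fault_feats, eps=1e-12):
    """Sensitivity of one band.

    normal_feats: (n_normal, n_features) features of normal samples.
    fault_feats:  (n_fault, n_features) features of fault samples.
    eps is a hypothetical guard against zero standard deviation.
    """
    mu = normal_feats.mean(axis=0)
    sigma = normal_feats.std(axis=0) + eps
    z_fault = (fault_feats - mu) / sigma
    # After standardization the centre of the normal samples is the origin,
    # so the distance to the centre is just the norm of each row.
    return float(np.linalg.norm(z_fault, axis=1).mean())

def association_matrix(amplitudes):
    """Global band-to-band association matrix.

    amplitudes: (n_fault, n_bands, n_points) amplitude sequences.
    Assumes Pearson correlation and non-constant sequences.
    """
    n_fault, n_bands, _ = amplitudes.shape
    acc = np.zeros((n_bands, n_bands))
    for sample in amplitudes:
        acc += np.corrcoef(sample)  # rows are bands: sample-level coefficients
    assoc = acc / n_fault           # average over all fault samples
    np.fill_diagonal(assoc, 1.0)
    return assoc
```

With one feature vector per band, `band_sensitivity` would be called once per frequency band, giving one sensitivity value per band alongside the single `association_matrix` computed over all fault samples.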

Description

Data set construction method and system based on big data

Technical Field

The application relates to the field of data processing, and in particular to a data set construction method and system based on big data.

Background

In monitoring scenarios for server clusters, virtual machines or container cloud platforms, the system runs normally for far longer than it spends in a fault state, so fault samples are scarce. As a result, deep-learning fault prediction models such as long short-term memory (LSTM) networks are insufficiently trained and predict poorly.

In the prior art, data enhancement methods (e.g., SMOTE) are often used to increase the number of fault samples. However, such methods mostly perform linear interpolation or random noise addition in the time domain or in feature space, and ignore both the physical distribution of time-series monitoring data in the frequency domain and the cooperative energy correlation between different frequency bands. The generated fault samples increase in number but lack real physical fault characteristics and do not conform to actual fault propagation patterns, so the trained LSTM model generalizes weakly and its prediction performance degrades.

Disclosure of Invention

In order to solve the above technical problems, the application provides a data set construction method and system based on big data.
In a first aspect, the present application provides a data set construction method based on big data, which adopts the following technical scheme:

A data set construction method based on big data comprises the steps of: acquiring historical monitoring data and constructing an initial data set containing normal samples and fault samples; decomposing the time series in the initial data set to obtain a plurality of frequency band sequences, and extracting time-frequency domain features of each frequency band sequence; calculating, based on the time-frequency domain features, the frequency band sensitivity of the fault samples relative to the normal samples in each frequency band; constructing an association matrix, and calculating frequency band comprehensive weights by combining the frequency band sensitivity with the association matrix; constructing a loss function of a generator using the frequency band comprehensive weights, and training a conditional generative adversarial network based on the loss function, wherein the loss function of the generator comprises an adversarial loss and a weight constraint loss, the weight constraint loss is a weighted error between the generated sequence and the frequency band sequence corresponding to a fault sample in the initial data set, and the weighting coefficients are the frequency band comprehensive weights; and generating synthetic frequency band sequences using the trained conditional generative adversarial network, performing inverse-transform reconstruction on the synthetic frequency band sequences to obtain synthetic samples, and mixing the synthetic samples with the fault samples in the initial data set to construct an enhanced data set.
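
The weight constraint loss described above can be sketched in NumPy as a weighted sum of per-band squared errors. This is an illustration under stated assumptions rather than the application's implementation: the function names and array layout are hypothetical, the per-band error is taken as a squared L2 norm summed over bands, and the balancing factor `lam` that combines the two loss terms is an assumption, since the text only states that the generator loss comprises both terms.

```python
import numpy as np

def weight_constraint_loss(gen_coeffs, real_coeffs, band_weights):
    """Weighted error between generated and real band sequences.

    gen_coeffs, real_coeffs: (n_bands, n_points) normalized coefficient
    sequences, one row per frequency band.
    band_weights: (n_bands,) frequency band comprehensive weights.
    """
    gen = np.asarray(gen_coeffs, dtype=float)
    real = np.asarray(real_coeffs, dtype=float)
    w = np.asarray(band_weights, dtype=float)
    sq_err = np.sum((gen - real) ** 2, axis=1)  # squared L2 norm per band
    return float(np.dot(w, sq_err))             # weight each band, then sum

def generator_loss(adv_loss, constraint_loss, lam=1.0):
    """Hypothetical combination: adversarial term plus lam times the constraint."""
    return adv_loss + lam * constraint_loss
```

Bands with a larger comprehensive weight contribute more to the loss, so the generator is penalized more heavily for errors in the bands judged most relevant to faults.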
The frequency band sensitivity is calculated by: calculating the mean and standard deviation of each feature value in the time-frequency domain features over the normal sample set; standardizing the feature vector of each fault sample using the mean and the standard deviation to obtain a standardized feature vector; and calculating the Euclidean distance from each standardized feature vector to the center of the standardized normal samples, averaging the Euclidean distances over all fault samples, and taking the resulting average as the frequency band sensitivity.

Optionally, the decomposition is performed by wavelet transform, the time series in the initial data set being decomposed by wavelet transform to obtain the plurality of frequency band sequences.

Optionally, the weight constraint loss is expressed as: $$L_{w} = \sum_{k} w_{k} \left\| \hat{c}_{k}^{\,g} - \hat{c}_{k}^{\,r} \right\|_{2}^{2}$$ wherein $L_{w}$ represents the weight constraint loss, $w_{k}$ is the comprehensive weight of frequency band $k$, $\hat{c}_{k}^{\,g}$ is the normalized wavelet coefficient sequence of frequency band $k$ output by the generator, $\hat{c}_{k}^{\,r}$ is the normalized wavelet coefficient sequence of frequency band $k$ corresponding to the fault samples in the initial data set, and $\left\| \cdot \right\|_{2}^{2}$ represents the square of the L2 norm.

Optionally, extracting the time-frequency domain features of each frequency band sequence comprises calculating the frequency band energy, energy trend, spectral kurtosis, peak factor and frequency band entropy, wherein the frequency band energy is the sum of squares of the wavelet coefficient amplitudes, and the energy trend is the slope of a linear fit to the wavelet coefficient amplitude sequence.

Optionally, constructing the association matrix comprises the steps of extracting wavelet coefficient amplitude sequ