CN-122020529-A - Time sequence data synthesis method, device and storage medium


Abstract

The application provides a time series data synthesis method, a time series data synthesis device, and a storage medium. It relates to the technical field of data processing and can improve the accuracy of time series data synthesis results. The method comprises: determining a plurality of patch sequences based on a preset window and input data, wherein the input data comprises a plurality of data sequences and a category label of each of the data sequences, and the patch sequences are obtained by intercepting each data sequence through the preset window; determining state parameters of each data sequence under each category label based on each patch sequence and a class-conditional semi-Markov rule, wherein the state parameters comprise at least one of a start state distribution, a transition probability matrix, and a state duration distribution; and determining a synthetic data sequence of the input data based on the state parameters and the input data.

Inventors

LI JINBO
JIA AO
LIANG PENG
ZHOU XIAOLONG
JING LEI
LI XIANG
FAN BIN
DI QINGYUE
ZHAO ZHICHENG
LIU XING

Assignees

China United Network Communications Group Co., Ltd. (中国联合网络通信集团有限公司)

Dates

Publication Date: 20260512
Application Date: 20260123

Claims (10)

1. A method of time series data synthesis, the method comprising: determining a plurality of patch sequences based on a preset window and input data, wherein the input data comprises a plurality of data sequences and a category label of each data sequence in the plurality of data sequences, and the plurality of patch sequences are obtained by intercepting each data sequence through the preset window; determining state parameters of each data sequence under each category label based on each patch sequence and a class-conditional semi-Markov rule, wherein the state parameters comprise at least one of a start state distribution, a transition probability matrix, and a state duration distribution; and determining a synthetic data sequence of the input data based on the state parameters and the input data.
2. The method of claim 1, wherein determining the state parameters of each data sequence under each category label based on each patch sequence and the class-conditional semi-Markov rule comprises: inputting each patch sequence into an embedding extraction model, and determining the embedding features of each patch sequence; clustering the embedding features of each patch sequence to determine a state label of each patch sequence, wherein the state label indicates the state of the data in the corresponding patch sequence; and determining the state parameters of each data sequence under each category label based on the state label of each patch sequence and the class-conditional semi-Markov rule.
3. The method of claim 2, wherein clustering the embedding features of each patch sequence to determine the state label of each patch sequence comprises: clustering the embedding features of each patch sequence, and determining the membership degree between each patch sequence and each cluster center of at least two cluster centers; and determining, for each patch sequence, the cluster center that has the largest membership degree with that patch sequence among the at least two cluster centers as the state label of that patch sequence.
4. The method of claim 2, wherein determining the state parameters of each data sequence under each category label based on the state label of each patch sequence and the class-conditional semi-Markov rule comprises: determining a state label sequence under a first category label based on the state label of each patch sequence under the first category label and the position of each patch sequence in the corresponding data sequence, wherein the first category label is one category label of the plurality of category labels, and the state label sequence is a sequence formed by the state labels; inputting the first patch sequence of each data sequence under the first category label and the state label of that first patch sequence into a start state distribution equation, and determining a start state distribution under the first category label; and/or determining a transition probability matrix under the first category label based on adjacent state labels in the state label sequence under the first category label; and/or determining the state duration distribution under the first category label based on the run length of each state label in the state label sequence, the run length indicating the number of consecutive patch sequences that carry the same state label.
5. The method of claim 1, wherein determining the synthetic data sequence of the input data based on the state parameters and the input data comprises: determining a start state label of the synthetic data sequence based on the start state distribution and the input data, and generating a first patch sequence of the synthetic data sequence based on the start state label; determining the number of consecutive patch sequences for the start state label based on the state duration distribution and the input data, and generating a plurality of patch sequences based on that number and the start state label; determining a state label of the next patch sequence after the plurality of patch sequences based on the transition probability matrix and the input data, and determining that next patch sequence based on its state label, until the number of generated patch sequences reaches a preset number; and merging the preset number of patch sequences to determine the synthetic data sequence of the input data.
6. The method according to claim 1, wherein the method further comprises: acquiring initial input data; normalizing the initial input data based on the mean value and standard deviation of each data sequence in the initial input data, and determining normalized initial input data; and determining the normalized initial input data as the input data.
7. The method of claim 6, wherein the normalized initial input data satisfies the following formula: $$\hat{x}_{n,t} = \frac{x_{n,t} - \mu_n}{\sigma_n + \epsilon}$$ wherein $\hat{x}_{n,t}$ is the $t$-th data in the $n$-th data sequence of the normalized initial input data, $x_{n,t}$ is the $t$-th data in the $n$-th data sequence of the initial input data, $\mu_n$ is the mean value of the $n$-th data sequence of the initial input data, $\sigma_n$ is the standard deviation of the $n$-th data sequence of the initial input data, $t$ and $n$ are positive integers, and $\epsilon$ is a constant other than 0.
8. A time series data synthesis device, characterized in that the time series data synthesis device comprises a determining unit; the determining unit is configured to determine a plurality of patch sequences based on a preset window and input data, wherein the input data comprises a plurality of data sequences and a category label of each data sequence in the plurality of data sequences, and the plurality of patch sequences are obtained by intercepting each data sequence through the preset window; the determining unit is further configured to determine, based on each patch sequence and a class-conditional semi-Markov rule, state parameters of each data sequence under each category label, wherein the state parameters comprise at least one of a start state distribution, a transition probability matrix, and a state duration distribution; and the determining unit is further configured to determine a synthetic data sequence of the input data based on the state parameters and the input data.
9. A time series data synthesis apparatus comprising a processor and a communication interface, the communication interface being coupled to the processor, the processor being operable to execute a computer program or instructions to implement the method of any one of claims 1 to 7.
10. A computer readable storage medium having instructions stored therein, characterized in that, when the instructions are executed by a computer, the computer performs the method of any of the preceding claims 1-7.
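
The parameter estimation of claim 4 and the sampling loop of claim 5 can be illustrated with a minimal Python sketch. Several points are assumptions rather than details from the claims: the function names (`to_runs`, `estimate_parameters`, `sample`, `generate_labels`) are hypothetical, all three distributions are plain relative frequencies, transitions are counted only at run boundaries (so self-transitions are absorbed into the duration distribution), and the last run of each sequence is counted as if it were complete. The sketch produces only the state label sequence; how a patch is generated from a state label is not specified in this section and is left out.

```python
import random
from collections import Counter, defaultdict


def to_runs(labels):
    """Collapse a state label sequence into [state, run_length] pairs."""
    runs = []
    for state in labels:
        if runs and runs[-1][0] == state:
            runs[-1][1] += 1
        else:
            runs.append([state, 1])
    return runs


def _normalize(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}


def estimate_parameters(label_sequences):
    """Estimate semi-Markov parameters for ONE category label.

    label_sequences holds one list of state labels per data sequence.
    Returns (start, trans, durations) as dicts of relative frequencies.
    """
    start_counts = Counter()
    trans_counts = defaultdict(Counter)
    dur_counts = defaultdict(Counter)
    for labels in label_sequences:
        if not labels:
            continue
        start_counts[labels[0]] += 1           # state of the first patch
        runs = to_runs(labels)
        for state, length in runs:
            dur_counts[state][length] += 1     # run length of each state
        for (src, _), (dst, _) in zip(runs, runs[1:]):
            trans_counts[src][dst] += 1        # assumption: run boundaries only
    start = _normalize(start_counts)
    trans = {s: _normalize(c) for s, c in trans_counts.items()}
    durations = {s: _normalize(c) for s, c in dur_counts.items()}
    return start, trans, durations


def sample(dist, rng):
    """Draw one key from a {key: probability} dict."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def generate_labels(start, trans, durations, num_patches, rng):
    """Sample a state label sequence with exactly num_patches entries."""
    labels = []
    state = sample(start, rng)
    while len(labels) < num_patches:
        run_length = sample(durations[state], rng)
        labels.extend([state] * run_length)
        if state in trans:                     # otherwise stay in this state
            state = sample(trans[state], rng)
    return labels[:num_patches]


if __name__ == "__main__":
    observed = [["A", "A", "B", "A"], ["A", "B", "B", "B"]]
    start, trans, durations = estimate_parameters(observed)
    print(generate_labels(start, trans, durations, 10, random.Random(0)))
```

To keep the estimation class-conditional, `estimate_parameters` would be called once per category label, each time with only the state label sequences that belong to that label.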

Description

Time sequence data synthesis method, device and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular to a time series data synthesis method, a time series data synthesis device, and a storage medium.
Background
With the rapid development of artificial intelligence, more and more business scenarios rely on large amounts of data to support model training, testing, and verification. Such data comes mainly from real-scene acquisition or from artificial synthesis. Data acquired from real scenes often suffers from small data volume, uneven distribution, or the risk of leaking private data, so providing synthetic data for model training, testing, and verification through a sound data synthesis method is of significant value.
At present, existing time series data synthesis methods mainly rely on statistical modeling and mathematical modeling. These methods require the input data to be strongly stationary, linearly related, or drawn from a specific distribution. When the input data exhibits obvious nonlinearity, structural mutation, long-range dependence, or multi-scale dynamic characteristics, the synthesis results of the existing methods are inaccurate.
Disclosure of Invention
The application provides a time series data synthesis method, a time series data synthesis device, and a storage medium, which can improve the accuracy of time series data synthesis results.
To achieve the above purpose, the application adopts the following technical scheme: The application provides a time series data synthesis method, which comprises: determining a plurality of patch sequences based on a preset window and input data, wherein the input data comprises a plurality of data sequences and a category label of each data sequence in the plurality of data sequences, and the plurality of patch sequences are obtained by intercepting each data sequence through the preset window; determining state parameters of each data sequence under each category label based on each patch sequence and a class-conditional semi-Markov rule, wherein the state parameters comprise at least one of a start state distribution, a transition probability matrix, and a state duration distribution; and determining a synthetic data sequence of the input data based on the state parameters and the input data. This technical scheme has at least the following advantages. The input data, which has a large data volume, is segmented into patch sequences, and the global characteristics of the input data are determined by analyzing the characteristics of the patch sequences. The feature distribution of the input data can therefore be determined accurately without requiring the input data to follow a specific distribution, so that accurate time series data is synthesized. In addition, determining the feature distribution of the data sequences separately under each category label reduces interference between data of different category labels and further improves the accuracy of the time series data synthesis result.
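The windowing step described above, in which each data sequence is intercepted through the preset window and the patches are kept separate per category label, can be sketched as follows. This is a minimal sketch under assumptions not stated in the source: the stride defaults to the window length (non-overlapping patches), a trailing remainder shorter than the window is dropped, and the names `extract_patches` and `patches_by_category` are hypothetical.

```python
def extract_patches(sequence, window, stride=None):
    """Cut one data sequence into fixed-length patches.

    Assumptions: stride defaults to the window length, and a trailing
    remainder shorter than the window is discarded.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    stride = window if stride is None else stride
    if stride <= 0:
        raise ValueError("stride must be positive")
    last_start = len(sequence) - window
    return [sequence[i:i + window] for i in range(0, last_start + 1, stride)]


def patches_by_category(sequences, category_labels, window):
    """Group the patch lists of all data sequences by category label."""
    grouped = {}
    for sequence, label in zip(sequences, category_labels):
        grouped.setdefault(label, []).append(extract_patches(sequence, window))
    return grouped


if __name__ == "__main__":
    data = [[1, 2, 3, 4], [5, 6, 7, 8]]
    labels = ["x", "y"]
    print(patches_by_category(data, labels, window=2))
```

Keeping one patch list per data sequence, rather than pooling all patches of a category, preserves the position of each patch in its sequence, which the later state label sequence depends on.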
In one possible implementation, determining the state parameters of each data sequence under each category label based on each patch sequence and the class-conditional semi-Markov rule comprises: inputting each patch sequence into an embedding extraction model and determining the embedding features of each patch sequence; clustering the embedding features of each patch sequence to determine a state label of each patch sequence, wherein the state label indicates the state of the data in the corresponding patch sequence; and determining the state parameters of each data sequence under each category label based on the state label of each patch sequence and the class-conditional semi-Markov rule.
In one possible implementation, clustering the embedding features of each patch sequence to determine the state label of each patch sequence comprises: clustering the embedding features of each patch sequence and determining the membership degree between each patch sequence and each cluster center of at least two cluster centers; and determining, for each patch sequence, the cluster center that has the largest membership degree with that patch sequence among the at least two cluster centers as the state label of that patch sequence.
In one possible implementation, determining the state parameters of each data sequence under each category label based on the state label of each patch sequence and the class-conditional semi-Markov rule comprises: determining a state label sequence under a first category label based on the state label of each patch sequence under the first category label and the position of each patch sequence in the corresponding data sequence, the first category label being one of a plurality o