US-12619775-B2 - Method, device and computer program product for data simulation based on generated data pattern
Abstract
According to example embodiments of the present disclosure, a method, device and computer program product for data simulation are proposed. The method for data simulation includes: obtaining first data pattern information that is associated with a first set of operations executed on real data in a data protection system; generating, based on the first data pattern information, second data pattern information that is associated with a second set of operations executable by the data protection system; and generating, based on the second data pattern information, simulation data different from the real data, for the data protection system to execute the second set of operations on the simulation data. The present solution can thereby efficiently and reliably simulate a data pattern of real data, and thus generate simulation data with a data pattern similar to that of the real data.
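The three steps named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `OperationRecord` schema, the choice of write size and duplicate share as the tracked parameters, the random perturbation that stands in for pattern generation, and the way bytes are produced. None of these names or choices come from the patent itself.

```python
import random
from dataclasses import dataclass


@dataclass
class OperationRecord:
    """One observed operation (hypothetical schema, not from the patent)."""
    start_time: float   # when the operation was executed
    kind: str           # e.g. "write" or "dedup"
    write_size: int     # bytes written by the operation
    dedup_ratio: float  # assumed share of segments that repeat earlier data


def _random_bytes(rng, n):
    """Return n pseudo-random bytes drawn from rng."""
    return bytes(rng.getrandbits(8) for _ in range(n))


def first_pattern(records):
    """Step 1: order the operations by time and keep their parameter values."""
    ordered = sorted(records, key=lambda r: r.start_time)
    return [(r.write_size, r.dedup_ratio) for r in ordered]


def second_pattern(pattern, rng, jitter=0.1):
    """Step 2 stand-in: perturb each value; a trained model would go here."""
    result = []
    for size, ratio in pattern:
        new_size = max(1, int(size * (1 + rng.uniform(-jitter, jitter))))
        new_ratio = min(1.0, max(0.0, ratio + rng.uniform(-jitter, jitter)))
        result.append((new_size, new_ratio))
    return result


def simulation_data(pattern, rng, segment=64):
    """Step 3: emit fresh bytes whose duplicate share follows the pattern."""
    buffers = []
    for size, ratio in pattern:
        n_segments = max(1, size // segment)
        repeated = _random_bytes(rng, segment)  # segment reused as a duplicate
        chunks = []
        for _ in range(n_segments):
            if rng.random() < ratio:
                chunks.append(repeated)
            else:
                chunks.append(_random_bytes(rng, segment))
        buffers.append(b"".join(chunks))
    return buffers
```

The generated buffers contain none of the original data; only the recorded parameter values influence them, which is the property the abstract describes as "simulation data different from the real data".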
Inventors
- Aaron Chao Lin
- Simon Yuting Zhang
Assignees
- EMC IP Holding Company LLC
Dates
- Publication Date: 2026-05-05
- Application Date: 2020-05-05
- Priority Date: 2019-12-13
Claims (18)
- 1. A method of using a generative adversarial network (GAN) to improve simulation data of a data protection system, comprising: collecting a plurality of sets of historical data pattern information associated with real training data in the data protection system, each of the plurality of sets of historical data pattern information comprising a respective set of operations executed on a respective portion of the real training data in the data protection system, each of the respective sets of operations comprising a respective set of values for operation parameters collected during performance of the respective set of operations on the respective portion of the real training data, and each respective set of values includes a respective data pattern of the respective portion of the real training data; applying a first label to each set of historical data pattern information the first label indicating the plurality of sets of historical data pattern information are real; providing the historical data pattern information to the GAN for training of the GAN, wherein the GAN includes a generator comprising a first neural network and a discriminator comprising a second neural network, wherein training the GAN includes: generating, via the generator, a plurality of sets of simulated data pattern information based on random noise, each of the plurality of sets of simulated data pattern information generated with a first label indicating the sets of simulated data pattern information are simulated, providing a mix of the sets of historical data pattern information and the sets of simulated data pattern information to the discriminator, producing a result of discrimination based on classification, via the discriminator, of each set of the mix of the sets of historical data pattern information and the sets of simulated data pattern information to the discriminator, and creating a trained generator of the GAN and a trained discriminator of the GAN based on the result of discrimination, wherein the trained discriminator is unable to discriminate simulated data pattern information generated by the trained generator from the historical data pattern information; collecting, by a processor, first data pattern information that is associated with a first set of operations executed on real data in the data protection system, the first data pattern information comprising values for operation parameters collected during performance of the first set of operations in the data protection system, and the first data pattern information including a data pattern of the real data; generating, by the processor via the trained generator of the GAN, second data pattern information that simulates the first data pattern information and reflects the data pattern of the real data, the second data pattern information associated with a second set of operations executable by the data protection system utilizing the GAN, the second set of operations being different than the first set of operations, and wherein the generating, by the processor, based on the first data pattern information, the second data pattern information comprising: applying the first data pattern information to the trained generator of the GAN; to generate the second data pattern information as a simulation of the first data pattern information; and converting, by the processor, the second data pattern information into simulation data different from the real data, the simulation data reflecting the data pattern of the real data that the first set of operations were executed on in the data protection system, and the simulation data for the data protection system to execute the second set of operations on the simulation data.
- 2. The method of claim 1, wherein the first set of operations and the second set of operations include at least one of the following, respectively: a deduplication operation; a write operation; or a synthesis operation.
- 3. The method of claim 1, wherein obtaining the first data pattern information comprises: obtaining a value of an operation parameter applied in the first set of operations; and generating, based on the value of the operation parameter, the first data pattern information.
- 4. The method of claim 3, wherein obtaining the value of the operation parameter comprises: sorting each operation in the first set of operations based on execution time; and obtaining the value of the operation parameter for each operation in the sorted first set of operations.
- 5. The method of claim 3, wherein obtaining the value of the operation parameter comprising obtaining a value of at least one of the following: a pre-deduplication size, a post-deduplication size, a pre-compression size, a post-compression size, a number of segments, network bytes, a number of write requests, a write size, a number of write regions, write region statuses, a write offset, write bytes per second, a number of synthesis requests, a synthesis size, a number of synthesis regions, synthesis region statuses, a synthesis offset, and synthesis bytes per second.
- 6. The method of claim 1, wherein generating the second data pattern comprises: executing, based on a specified classification criterion, classification on the first data pattern information; and generating, based on a classification result of the first data pattern information, the second data information from the first data pattern information.
- 7. An electronic device, comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions, which when executed by the at least one processing unit, causing the at least one processing unit to perform operations, the operations comprising: collecting a plurality of sets of historical data pattern information associated with real training data in the data protection system, each of the plurality of sets of historical data pattern information comprising a respective set of operations executed on a respective portion of the real training data in the data protection system, each of the respective sets of operations comprising a respective set of values for operation parameters collected during performance of the respective set of operations on the respective portion of the real training data, and each respective set of values includes a respective data pattern of the respective portion of the real training data; applying a first label to each set of historical data pattern information the first label indicating the plurality of sets of historical data pattern information are real; providing the historical data pattern information to the GAN for training of the GAN, wherein the GAN includes a generator comprising a first neural network and a discriminator comprising a second neural network, wherein training the GAN includes: generating, via the generator, a plurality of sets of simulated data pattern information based on random noise, each of the plurality of sets of simulated data pattern information generated with a first label indicating the sets of simulated data pattern information are simulated, providing a mix of the sets of historical data pattern information and the sets of simulated data pattern information to the discriminator, producing a result of discrimination based on classification, via the discriminator, of each set of the mix of the sets of historical data pattern information and the sets of simulated data pattern information to the discriminator, and creating a trained generator of the GAN and a trained discriminator of the GAN based on the result of discrimination, wherein the trained discriminator is unable to discriminate simulated data pattern information generated by the trained generator from the historical data pattern information; collecting first data pattern information that is associated with a first set of operations executed on real data in the data protection system, the first data pattern information comprising values for operation parameters collected during performance of the first set of operations in the data protection system, and the first data pattern information including a data pattern of the real data; generating, via the trained generator of the GAN, second data pattern information that simulates the first data pattern information and reflects the data pattern of the real data, the second data pattern information associated with a second set of operations executable by the data protection system utilizing the GAN, the second set of operations being different than the first set of operations, and wherein the generating based on the first data pattern information, the second data pattern information comprising: applying the first data pattern information to the trained generator of the GAN; to generate the second data pattern information as a simulation of the first data pattern information; converting the second data pattern information into simulation data different from the real data, the simulation data reflecting the data pattern of the real data that the first set of operations were executed on in the data protection system, and the simulation data for the data protection system to execute the second set of operations on the simulation data.
- 8. The device of claim 7, wherein the first set of operations and the second set of operations include at least one of the following, respectively: a deduplication operation; a write operation; or a synthesis operation.
- 9. The device of claim 7, wherein obtaining the first data pattern information comprises: obtaining a value of an operation parameter applied in the first set of operations; and generating, based on the value of the operation parameter, the first data pattern information.
- 10. The device of claim 9, wherein obtaining the value of the operation parameter comprises: sorting each operation in the first set of operations based on execution time; and obtaining the value of the operation parameter for each operation in the sorted first set of operations.
- 11. The device of claim 9, wherein obtaining the value of the operation parameter comprising obtaining a value of at least one of the following: a pre-deduplication size, a post-deduplication size, a pre-compression size, a post-compression size, a number of segments, network bytes, a number of write requests, a write size, a number of write regions, write region statuses, a write offset, write bytes per second, a number of synthesis requests, a synthesis size, a number of synthesis regions, synthesis region statuses, a synthesis offset, and synthesis bytes per second.
- 12. The device of claim 7, wherein generating the second data pattern comprises: executing, based on a specified classification criterion, classification on the first data pattern information; and generating, based on a classification result of the first data pattern information, the second data information from the first data pattern information.
- 13. A computer program product tangibly stored on a non-transitory computer-readable medium and including machine-executable instructions, which when executed by a machine, cause the machine to: collecting a plurality of sets of historical data pattern information associated with real training data in the data protection system, each of the plurality of sets of historical data pattern information comprising a respective set of operations executed on a respective portion of the real training data in the data protection system, each of the respective sets of operations comprising a respective set of values for operation parameters collected during performance of the respective set of operations on the respective portion of the real training data, and each respective set of values includes a respective data pattern of the respective portion of the real training data; applying a first label to each set of historical data pattern information the first label indicating the plurality of sets of historical data pattern information are real; providing the historical data pattern information to the GAN for training of the GAN, wherein the GAN includes a generator comprising a first neural network and a discriminator comprising a second neural network, wherein training the GAN includes: generating, via the generator, a plurality of sets of simulated data pattern information based on random noise, each of the plurality of sets of simulated data pattern information generated with a first label indicating the sets of simulated data pattern information are simulated, providing a mix of the sets of historical data pattern information and the sets of simulated data pattern information to the discriminator, producing a result of discrimination based on classification, via the discriminator, of each set of the mix of the sets of historical data pattern information and the sets of simulated data pattern information to the discriminator, and creating a trained generator of the GAN and a trained discriminator of the GAN based on the result of discrimination, wherein the trained discriminator is unable to discriminate simulated data pattern information generated by the trained generator from the historical data pattern information; collecting first data pattern information that is associated with a first set of operations executed on real data in the data protection system, the first data pattern information comprising values for operation parameters collected during performance of the first set of operations in the data protection system, and the first data pattern information including a data pattern of the real data; generating, via the trained generator of the GAN, second data pattern information that simulates the first data pattern information and reflects the data pattern of the real data, the second data pattern information associated with a second set of operations executable by the data protection system utilizing the GAN, the second set of operations being different than the first set of operations, and wherein the generating based on the first data pattern information, the second data pattern information comprising: applying the first data pattern information to the trained generator of the GAN; to generate the second data pattern information as a simulation of the first data pattern information; converting the second data pattern information into simulation data different from the real data, the simulation data reflecting the data pattern of the real data that the first set of operations were executed on in the data protection system, and the simulation data for the data protection system to execute the second set of operations on the simulation data.
- 14. The computer program product of claim 13, wherein the first set of operations and the second set of operations include at least one of the following, respectively: a deduplication operation; a write operation; or a synthesis operation.
- 15. The computer program product of claim 13, wherein obtaining the first data pattern information comprises: obtaining a value of an operation parameter applied in the first set of operations; and generating, based on the value of the operation parameter, the first data pattern information.
- 16. The computer program product of claim 15, wherein obtaining the value of the operation parameter comprising obtaining a value of at least one of the following: a pre-deduplication size, a post-deduplication size, a pre-compression size, a post-compression size, a number of segments, network bytes, a number of write requests, a write size, a number of write regions, write region statuses, a write offset, write bytes per second, a number of synthesis requests, a synthesis size, a number of synthesis regions, synthesis region statuses, a synthesis offset, and synthesis bytes per second.
- 17. The computer program product of claim 13, wherein generating the second data pattern comprises: executing, based on a specified classification criterion, classification on the first data pattern information; and generating, based on a classification result of the first data pattern information, the second data information from the first data pattern information.
- 18. The method of claim 1, further comprising: classifying the first data pattern information based on a user industry or a current data protection process to produce a classification result; selecting the updated generator from a plurality of generators based on the classification result, wherein the plurality of generators corresponds to a plurality of trainings of the GAN.
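The adversarial training recited in claim 1 (label real sets, generate simulated sets from random noise, feed the mix to a discriminator, update both networks from the result) can be sketched in plain Python. This is a deliberately tiny illustration under stated assumptions, not the patented implementation: each "set of data pattern information" is reduced to a single number, the generator and discriminator are one-parameter-pair models rather than full neural networks, the labels are encoded as 1.0 (real) and 0.0 (simulated), and the names `train_gan` and `sample` are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically safe logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def train_gan(real_values, steps=500, batch=16, lr=0.05, seed=0):
    """Alternate discriminator and generator updates on scalar patterns."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator: g(z) = a * z + b
    w, c = 0.0, 0.0  # discriminator: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        # Real samples carry label 1.0, simulated samples carry label 0.0.
        real = [(rng.choice(real_values), 1.0) for _ in range(batch)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [(a * z + b, 0.0) for z in noise]
        mixed = real + fake
        rng.shuffle(mixed)

        # Discriminator step: gradient of binary cross-entropy on the mix.
        grad_w = grad_c = 0.0
        for x, label in mixed:
            err = sigmoid(w * x + c) - label
            grad_w += err * x
            grad_c += err
        w -= lr * grad_w / len(mixed)
        c -= lr * grad_c / len(mixed)

        # Generator step: gradient of the non-saturating loss -log D(g(z)).
        grad_a = grad_b = 0.0
        for z in noise:
            x = a * z + b
            grad_x = (sigmoid(w * x + c) - 1.0) * w
            grad_a += grad_x * z
            grad_b += grad_x
        a -= lr * grad_a / batch
        b -= lr * grad_b / batch
    return (a, b), (w, c)


def sample(generator, n, rng):
    """Draw n simulated pattern values from a trained generator."""
    a, b = generator
    return [a * rng.gauss(0.0, 1.0) + b for _ in range(n)]
```

A call such as `train_gan([0.6, 0.7, 0.8])` returns the generator and discriminator parameters; `sample` then produces new values from the generator. The sketch drives the generator with random noise only, whereas the claim also applies the first data pattern information to the trained generator, and it makes no attempt to verify that the discriminator can no longer tell the two sources apart.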
Description
CROSS-REFERENCE TO RELATED APPLICATION
This patent application claims priority, under 35 U.S.C. § 119, of Chinese Patent Application No. 201911286175.2, filed Dec. 13, 2019, which is incorporated by reference in its entirety.
FIELD
Embodiments of the present disclosure generally relate to computer technologies, and more specifically, to a method, device and computer program product for data simulation.
BACKGROUND
A data pattern reflects the operations performed on data in a data protection system and is of great importance to that system. For example, in stress testing, the Quality Assurance team needs a data pattern similar to the user scenario to verify the performance of the data protection system. In addition, the support/sales team needs the data pattern to compare the data protection system with competitors' data protection systems and to demonstrate the advantages of its own system. Therefore, the data pattern needs to be obtained efficiently and reliably.
SUMMARY
Embodiments of the present disclosure provide a method, device and computer program product for data simulation.
In a first aspect, a method for data simulation is proposed. The method comprises: obtaining first data pattern information that is associated with a first set of operations executed on real data in a data protection system; generating, based on the first data pattern information, second data pattern information that is associated with a second set of operations executable by the data protection system; and generating, based on the second data pattern information, simulation data different from the real data, for the data protection system to execute the second set of operations on the simulation data.
In a second aspect of the present disclosure, an electronic device is proposed. The device comprises at least one processing unit and at least one memory.
The at least one memory is coupled to the at least one processing unit and stores instructions executed by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the device to execute acts comprising: obtaining first data pattern information that is associated with a first set of operations executed on real data in a data protection system; generating, based on the first data pattern information, second data pattern information that is associated with a second set of operations executable by the data protection system; and generating, based on the second data pattern information, simulation data different from the real data, for the data protection system to execute the second set of operations on the simulation data.
In a third aspect of the present disclosure, a computer program product is proposed. The computer program product is tangibly stored on a non-transitory computer-readable medium and includes machine executable instructions which, when executed, cause a machine to execute steps of the method as described in accordance with the first aspect of the present disclosure.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the present disclosure, nor is it intended to be used to limit the scope of the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other objectives, features, and advantages of the present disclosure will become more apparent, through the following detailed description of the example embodiments of the present disclosure with reference to the accompanying drawings in which the same reference symbols generally refer to the same elements.
FIG. 1 illustrates a schematic diagram of an example of a simulation environment according to some embodiments of the present disclosure;
FIG. 2 illustrates a flowchart of a method for simulation according to some embodiments of the present disclosure;
FIG. 3 illustrates a schematic diagram of sorting a set of operations based on execution time according to some embodiments of the present disclosure;
FIG. 4 illustrates a schematic diagram of a visualized representation of a write operation according to some embodiments of the present disclosure;
FIG. 5 illustrates a schematic diagram of a visualized representation of a synthesis operation according to some embodiments of the present disclosure;
FIG. 6 illustrates a schematic diagram of an example of training a generative adversarial network according to some embodiments of the present disclosure;
FIG. 7 illustrates a schematic diagram of an example of using a generator of a generative adversarial network according to some embodiments of the present disclosure; and
FIG. 8 illustrates a schematic block diagram of an example device that can be used to implement embodiments of the present disclosure.
Throughout the drawings, the same or similar reference symbols refer to the same or similar elements.
DETAILED DE