CN-122019384-A - Data construction method, data construction device, electronic equipment and storage medium
Abstract
The application provides a data construction method, a data construction device, electronic equipment, and a storage medium. The data construction method comprises: acquiring a plurality of first test data from a production environment; fitting a Gaussian mixture model based on the plurality of first test data to obtain a fitted Gaussian mixture model; generating keyword vectors conforming to the data distribution of the production environment through the fitted Gaussian mixture model; and generating second test data based on the keyword vectors.
Inventors
- HAN YUGUI
- Yan Dongrong
Assignees
- 中移(苏州)软件技术有限公司
- 中国移动通信集团有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-28
Claims (10)
- 1. A method of data construction, comprising: acquiring a plurality of first test data from a production environment; fitting a Gaussian mixture model based on the plurality of first test data to obtain a fitted Gaussian mixture model; and generating a keyword vector conforming to the data distribution of the production environment through the fitted Gaussian mixture model, and generating second test data based on the keyword vector.
- 2. The method of claim 1, wherein fitting the Gaussian mixture model based on the plurality of first test data to obtain the fitted Gaussian mixture model comprises: encoding the plurality of first test data based on attribute information of the plurality of first test data to obtain a plurality of first keyword vectors; and fitting the Gaussian mixture model based on the plurality of first keyword vectors to obtain the fitted Gaussian mixture model.
- 3. The method of claim 2, wherein fitting the Gaussian mixture model based on the plurality of first keyword vectors to obtain the fitted Gaussian mixture model comprises: fitting the Gaussian mixture model through an expectation-maximization algorithm based on the plurality of first keyword vectors to obtain the fitted Gaussian mixture model.
- 4. The method of claim 3, wherein generating the keyword vector conforming to the data distribution of the production environment through the fitted Gaussian mixture model comprises: performing data sampling based on the weight of each Gaussian component in the Gaussian mixture model to obtain the keyword vector.
- 5. The method of claim 2, wherein generating the second test data based on the keyword vector comprises: generating the second test data through a large language model based on the keyword vector.
- 6. The method of claim 5, wherein the large language model is a pre-trained and fine-tuned large language model, wherein the pre-training process comprises: performing autoregressive training on the large language model based on unlabeled corpus data to obtain a pre-trained large language model; and the fine-tuning process comprises: performing sequence processing on the plurality of first test data to obtain a plurality of sequence data; and fine-tuning the pre-trained large language model based on the plurality of sequence data and the plurality of first keyword vectors to obtain a fine-tuned large language model.
- 7. The method of any one of claims 1 to 6, wherein after acquiring the plurality of first test data from the production environment, the method further comprises: preprocessing the plurality of first test data, wherein the preprocessing comprises one or more of removing irrelevant data, formatting, and length screening.
- 8. A data construction apparatus, comprising: a data processing module configured to acquire a plurality of first test data from a production environment; a distribution fitting module configured to fit a Gaussian mixture model based on the plurality of first test data to obtain a fitted Gaussian mixture model; and a data generation module configured to generate a keyword vector conforming to the data distribution of the production environment through the fitted Gaussian mixture model, and to generate second test data based on the keyword vector.
- 9. An electronic device comprising a processor and a memory for storing a computer program, the processor being adapted to invoke and run the computer program stored in the memory for performing the data construction method according to any of claims 5 to 7.
- 10. A storage medium storing a computer program for causing a computer to execute the data construction method according to any one of claims 1 to 7.
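The preprocessing of claim 7 and the attribute-based encoding of claim 2 can be illustrated with a minimal sketch. The claims do not name concrete attributes or an encoding scheme, so the record fields (`method`, `path`, `body`), the length bounds, and the three-number encoding below are all assumptions chosen for illustration.

```python
# Hypothetical attribute names; the claims do not list concrete fields.
KEEP_FIELDS = ("method", "path", "body")
METHODS = ("GET", "POST", "PUT", "DELETE")


def preprocess(records, min_len=1, max_len=200):
    """Remove irrelevant data, normalize formatting, and screen by length."""
    cleaned = []
    for rec in records:
        # Removing irrelevant data: keep only the whitelisted attributes.
        item = {k: str(rec.get(k, "")).strip() for k in KEEP_FIELDS}
        # Formatting: upper-case method, single spaces inside the body.
        item["method"] = item["method"].upper()
        item["body"] = " ".join(item["body"].split())
        # Length screening: drop bodies that are too short or too long,
        # and drop records whose method is not recognized.
        if item["method"] in METHODS and min_len <= len(item["body"]) <= max_len:
            cleaned.append(item)
    return cleaned


def encode(record):
    """Map one record's attributes to a numeric keyword vector (assumed scheme)."""
    method_index = float(METHODS.index(record["method"]))
    path_depth = float(len([s for s in record["path"].split("/") if s]))
    body_words = float(len(record["body"].split()))
    return [method_index, path_depth, body_words]


raw = [
    {"method": " post ", "path": "/api/orders", "body": "  create   order 42 ", "trace_id": "x"},
    {"method": "get", "path": "/health", "body": ""},
    {"method": "trace", "path": "/", "body": "hi"},
]
first_keyword_vectors = [encode(r) for r in preprocess(raw)]
```

Encoding a categorical attribute as a single index is a simplification; a one-hot or learned embedding would likely suit a Gaussian model better.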
Description
Data construction method, data construction device, electronic equipment and storage medium
Technical Field
The embodiments of the present application relate to the technical field of data processing, and in particular to a data construction method, a data construction device, electronic equipment, and a storage medium.
Background
In software development and testing, data construction is a critical task that plays an important role in ensuring system stability and functional correctness. As system complexity grows, testing increasingly requires data that is diverse and that matches real scenarios, and traditional manual construction or simple copying can hardly meet the needs of efficient testing.
Disclosure of Invention
The embodiments of the present application provide a data construction method, a data construction device, electronic equipment, and a storage medium.
The data construction method provided by the embodiments of the present application comprises:
acquiring a plurality of first test data from a production environment;
fitting a Gaussian mixture model based on the plurality of first test data to obtain a fitted Gaussian mixture model; and
generating a keyword vector conforming to the data distribution of the production environment through the fitted Gaussian mixture model, and generating second test data based on the keyword vector.
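The fitting and sampling steps can be sketched in plain Python as follows. This is an illustrative sketch rather than the applicant's implementation: the diagonal covariance, the quantile-style initialization, and the function names `fit_gmm` and `sample_gmm` are assumptions. Fitting uses expectation maximization, and sampling first picks a Gaussian component in proportion to its weight and then draws a keyword vector from that component.

```python
import math
import random


def _log_density(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def fit_gmm(vectors, k, iters=30, min_var=1e-3):
    """Fit a k-component diagonal GMM by expectation maximization."""
    n, d = len(vectors), len(vectors[0])
    # Assumed initialization: spread the initial means over the data sorted
    # by the first coordinate (k-means initialization is a common alternative).
    ordered = sorted(vectors)
    means = [list(ordered[(2 * j + 1) * n // (2 * k)]) for j in range(k)]
    variances = [[1.0] * d for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of every component for every vector,
        # computed in log space for numerical stability.
        resp = []
        for x in vectors:
            logs = [math.log(weights[j]) + _log_density(x, means[j], variances[j])
                    for j in range(k)]
            top = max(logs)
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against empty components
            weights[j] = nj / n
            means[j] = [sum(r[j] * x[i] for r, x in zip(resp, vectors)) / nj
                        for i in range(d)]
            variances[j] = [
                max(min_var, sum(r[j] * (x[i] - means[j][i]) ** 2
                                 for r, x in zip(resp, vectors)) / nj)
                for i in range(d)]
    return weights, means, variances


def sample_gmm(model, count, seed=0):
    """Draw keyword vectors: choose a component by weight, then sample from it."""
    weights, means, variances = model
    rng = random.Random(seed)
    samples = []
    for _ in range(count):
        j = rng.choices(range(len(weights)), weights=weights)[0]
        samples.append([rng.gauss(m, math.sqrt(v))
                        for m, v in zip(means[j], variances[j])])
    return samples
```

A call such as `sample_gmm(fit_gmm(first_keyword_vectors, k=2), count=100)` would then yield 100 synthetic keyword vectors; the number of components `k` is a tuning choice that the text does not fix.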
The data construction device provided by the embodiments of the present application comprises:
a data processing module configured to acquire a plurality of first test data from a production environment;
a distribution fitting module configured to fit a Gaussian mixture model based on the plurality of first test data to obtain a fitted Gaussian mixture model; and
a data generation module configured to generate a keyword vector conforming to the data distribution of the production environment through the fitted Gaussian mixture model, and to generate second test data based on the keyword vector.
The electronic equipment provided by the embodiments of the present application comprises a processor and a memory, wherein the memory is used for storing a computer program, and the processor is used for calling and running the computer program stored in the memory to execute the data construction method provided by any embodiment of the present application.
The storage medium provided by the embodiments of the present application is used for storing a computer program, and the computer program causes a computer to execute the data construction method provided by any embodiment of the present application.
In the data construction method, the data construction device, the electronic equipment, and the storage medium provided by the embodiments of the present application, a plurality of pieces of test data are first collected from a production environment as real data samples; a Gaussian mixture model is then fitted to these samples to build a data distribution model consistent with the production environment; and finally keyword vectors are generated through the model, from which second test data are generated.
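The three-module structure of the device can be pictured as a small composition of callables. This is only a structural sketch: the class name `DataConstructionDevice`, the field names, and the stand-in lambdas are hypothetical, and the real modules would read production data, fit the Gaussian mixture model, and generate data from sampled keyword vectors.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class DataConstructionDevice:
    """Wires the three modules together; all names here are illustrative."""
    data_processing: Callable[[], Sequence[str]]             # acquires first test data
    distribution_fitting: Callable[[Sequence[str]], Any]     # fits the mixture model
    data_generation: Callable[[Any, int], List[str]]         # emits second test data

    def construct(self, count: int) -> List[str]:
        first_test_data = self.data_processing()
        model = self.distribution_fitting(first_test_data)
        return self.data_generation(model, count)


# Stand-in modules so the wiring can be exercised without real data or models.
device = DataConstructionDevice(
    data_processing=lambda: ["order created", "order cancelled"],
    distribution_fitting=lambda data: {"sample_count": len(data)},
    data_generation=lambda model, count: [f"case-{i}" for i in range(count)],
)
second_test_data = device.construct(3)
```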
Therefore, the Gaussian mixture model is used to simulate the complex data distribution of the production environment, so that the generated test data is statistically closer to actual scenarios, which improves the representativeness of the test data and in turn the reliability of testing.
Drawings
Fig. 1 is a schematic implementation flow chart of a data construction method according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a data construction device according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a chip provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of protection of the application.
It should be noted that, in the embodiments of the present application, the term "and/or" merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean that A exists alone, that A and B both exist, or that B exists alone. In addition, in the embodiment of the present application, the character "/" generally indicates that the front and rear associati