CN-121561269-B - Small sample expansion method based on data distribution and related equipment
Abstract
The invention discloses a small sample expansion method based on data distribution and related equipment, and relates to the technical field of data processing. The method comprises: obtaining an original small sample data set for virtual simulation; randomly selecting a preset original sample from the original small sample data set; acquiring the coordinates of the k neighboring original samples of the preset original sample; randomly selecting two neighboring original samples from the k neighboring original samples; generating the coordinates of a first temporary sample, a second temporary sample and a third temporary sample according to the coordinates of the first neighboring original sample, the second neighboring original sample and the preset original sample, the coordinates of the three temporary samples forming a second triangle; taking the center coordinates of the second triangle as the coordinates of a newly generated synthetic sample; and obtaining the synthetic sample according to those coordinates. Compared with the existing SMOTE technique, the invention mitigates sample marginalization.
Inventors
- ZHAO KAIBIN
- GUO GUANGLEI
- LIU GANG
- LI XIN
Assignees
- 北京亦庄可重复使用火箭技术创新中心有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-11-18
Claims (6)
- 1. A small sample expansion method based on data distribution, comprising:
  S1, acquiring an original small sample data set for virtual simulation, and representing each original sample in the original small sample data set by corresponding coordinates;
  S2, randomly selecting one original sample from the original small sample data set as a preset original sample;
  S3, acquiring, according to the coordinates of the preset original sample, the coordinates of the k neighboring original samples corresponding to the preset original sample;
  S4, randomly selecting two of the k neighboring original samples as a first neighboring original sample and a second neighboring original sample, wherein the coordinates of the first neighboring original sample, the coordinates of the second neighboring original sample and the coordinates of the preset original sample form a first triangle;
  S5, generating the coordinates of a first temporary sample, a second temporary sample and a third temporary sample according to the coordinates of the first neighboring original sample, the second neighboring original sample and the preset original sample, wherein the coordinates of the three temporary samples form a second triangle, and the second triangle lies entirely within the first triangle;
  S6, taking the center coordinates of the second triangle as the coordinates of a newly generated synthetic sample, and obtaining the synthetic sample according to the coordinates of the synthetic sample;
  wherein acquiring the coordinates of the k neighboring original samples according to the coordinates of the preset original sample comprises: calculating the Euclidean distance from the coordinates of each original sample in the original small sample data set, other than the preset original sample, to the coordinates of the preset original sample; and either sorting all original samples other than the preset original sample in ascending order of Euclidean distance to obtain a first sequence and selecting the coordinates of the first k original samples in the first sequence as the coordinates of the k neighboring original samples of the preset original sample, or sorting all original samples other than the preset original sample in descending order of Euclidean distance to obtain a second sequence and selecting the coordinates of the last k original samples in the second sequence as the coordinates of the k neighboring original samples of the preset original sample;
  and wherein generating the coordinates of the first, second and third temporary samples comprises: generating, by random linear interpolation, the coordinates of a first transition sample between the coordinates of the first neighboring original sample and the coordinates of the second neighboring original sample, and the coordinates of the first temporary sample between the coordinates of the first transition sample and the coordinates of the preset original sample; generating, by random linear interpolation, the coordinates of a second transition sample between the coordinates of the first neighboring original sample and the coordinates of the preset original sample, and the coordinates of the second temporary sample between the coordinates of the second transition sample and the coordinates of the second neighboring original sample; and generating, by random linear interpolation, the coordinates of a third transition sample between the coordinates of the second neighboring original sample and the coordinates of the preset original sample, and the coordinates of the third temporary sample between the coordinates of the third transition sample and the coordinates of the first neighboring original sample.
- 2. The small sample expansion method based on data distribution of claim 1, further comprising: repeatedly executing S2 to S6 until the number of generated synthetic samples reaches N, wherein N is a positive integer.
- 3. A small sample expansion system based on data distribution, characterized by comprising a data set acquisition module, an original sample selection module, a coordinate acquisition module, a neighboring original sample selection module, a coordinate generation module and a sample synthesis module, wherein:
  the data set acquisition module is used for acquiring an original small sample data set for virtual simulation, and representing each original sample in the original small sample data set by corresponding coordinates;
  the original sample selection module is used for randomly selecting one original sample from the original small sample data set as a preset original sample;
  the coordinate acquisition module is used for acquiring, according to the coordinates of the preset original sample, the coordinates of the k neighboring original samples corresponding to the preset original sample;
  the neighboring original sample selection module is used for randomly selecting two of the k neighboring original samples as a first neighboring original sample and a second neighboring original sample, wherein the coordinates of the first neighboring original sample, the coordinates of the second neighboring original sample and the coordinates of the preset original sample form a first triangle;
  the coordinate generation module is used for generating the coordinates of a first temporary sample, a second temporary sample and a third temporary sample according to the coordinates of the first neighboring original sample, the second neighboring original sample and the preset original sample, wherein the coordinates of the three temporary samples form a second triangle, and the second triangle lies entirely within the first triangle;
  the sample synthesis module is used for taking the center coordinates of the second triangle as the coordinates of a newly generated synthetic sample, and obtaining the synthetic sample according to the coordinates of the synthetic sample;
  the coordinate acquisition module is specifically configured to: calculate the Euclidean distance from the coordinates of each original sample in the original small sample data set, other than the preset original sample, to the coordinates of the preset original sample; and either sort all original samples other than the preset original sample in ascending order of Euclidean distance to obtain a first sequence and select the coordinates of the first k original samples in the first sequence as the coordinates of the k neighboring original samples of the preset original sample, or sort all original samples other than the preset original sample in descending order of Euclidean distance to obtain a second sequence and select the coordinates of the last k original samples in the second sequence as the coordinates of the k neighboring original samples of the preset original sample;
  the coordinate generation module is specifically configured to: generate, by random linear interpolation, the coordinates of a first transition sample between the coordinates of the first neighboring original sample and the coordinates of the second neighboring original sample, and the coordinates of the first temporary sample between the coordinates of the first transition sample and the coordinates of the preset original sample; generate, by random linear interpolation, the coordinates of a second transition sample between the coordinates of the first neighboring original sample and the coordinates of the preset original sample, and the coordinates of the second temporary sample between the coordinates of the second transition sample and the coordinates of the second neighboring original sample; and generate, by random linear interpolation, the coordinates of a third transition sample between the coordinates of the second neighboring original sample and the coordinates of the preset original sample, and the coordinates of the third temporary sample between the coordinates of the third transition sample and the coordinates of the first neighboring original sample.
- 4. The small sample expansion system based on data distribution of claim 3, further comprising a repeated calling module, wherein the repeated calling module is configured to repeatedly call the original sample selection module, the coordinate acquisition module, the neighboring original sample selection module, the coordinate generation module, and the sample synthesis module until the number of generated synthesized samples reaches N, where N is a positive integer.
- 5. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a data distribution based small sample expansion method according to any of claims 1 to 2 when the computer program is executed.
- 6. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements a small sample expansion method based on data distribution according to any of claims 1 to 2.
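As an illustration of the procedure recited in claims 1 and 2, the following is a minimal Python sketch using only the standard library. It is not the patentee's implementation. The function names (`k_nearest`, `lerp`, `synthesize_one`, `expand`) are hypothetical, each sample is assumed to be a list of floats, and the "center" of the second triangle is assumed to mean the centroid (the mean of the three temporary points). Because every temporary point is produced by interpolating between points of the first triangle, it is a convex combination of the three vertices, so the centroid also lies inside the first triangle.

```python
import math
import random


def k_nearest(data, idx, k):
    """Indices of the k samples closest (Euclidean) to data[idx], excluding idx."""
    others = [i for i in range(len(data)) if i != idx]
    others.sort(key=lambda i: math.dist(data[i], data[idx]))  # ascending distance
    return others[:k]


def lerp(u, v, rng):
    """Random linear interpolation: a point on the segment from u to v."""
    t = rng.random()  # uniform in [0, 1)
    return [ui + t * (vi - ui) for ui, vi in zip(u, v)]


def synthesize_one(data, k, rng):
    """One pass of steps S2 to S6; returns the coordinates of one synthetic sample."""
    idx = rng.randrange(len(data))                  # S2: preset original sample
    p = data[idx]
    i, j = rng.sample(k_nearest(data, idx, k), 2)   # S3 + S4: two of the k neighbors
    a, b = data[i], data[j]
    # S5: each temporary point is reached via a transition point on one edge
    t1 = lerp(lerp(a, b, rng), p, rng)
    t2 = lerp(lerp(a, p, rng), b, rng)
    t3 = lerp(lerp(b, p, rng), a, rng)
    # S6: centroid of the second triangle (assumed meaning of "center")
    return [(x + y + z) / 3.0 for x, y, z in zip(t1, t2, t3)]


def expand(data, n_new, k=5, seed=None):
    """Repeat S2 to S6 until n_new synthetic samples exist (claim 2)."""
    if not 2 <= k <= len(data) - 1:
        raise ValueError("k must satisfy 2 <= k <= len(data) - 1")
    rng = random.Random(seed)
    return [synthesize_one(data, k, rng) for _ in range(n_new)]
```

A call such as `expand(samples, n_new=100, k=5, seed=42)` would return 100 synthetic coordinate lists; the `seed` argument is only there to make runs repeatable and is not part of the claims.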
Description
Small sample expansion method based on data distribution and related equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular to a small sample expansion method based on data distribution and related equipment.
Background
With socioeconomic development and technological progress, virtual simulation technology is being applied ever more widely. Virtual simulation offers low cost, a safe process, good repeatability, convenience and high efficiency, and can significantly shorten the process of predicting product performance. At the same time, the reliability evaluation of simulation systems is receiving more and more attention. To obtain a more accurate evaluation result, a large amount of simulation data and measured data must be compared and analyzed. Although such data can be obtained by running real simulations and taking actual measurements, this approach is unavoidably expensive and time-consuming, and cannot meet practical needs. Limited small sample data can hardly present the complete information of the process clearly and accurately, and the predictive performance of any model built on it inevitably suffers. In the prior art, SMOTE oversampling is a classical small sample expansion method, in which a new synthetic sample is generated between two original samples by linear interpolation.
The basic principle of the algorithm is as follows. Each sample $x_i$ selected at random from the original samples serves as the root sample for synthesizing a new sample. Next, one sample $x_j$ is randomly selected from the $k$ nearest neighbors of $x_i$ as the auxiliary sample, and linear interpolation is performed between $x_i$ and $x_j$ according to the formula $x_{new} = x_i + \lambda (x_j - x_i)$, where $\lambda$ is a random number in the interval $[0, 1]$. This is repeated $N$ times, finally generating $N$ synthetic samples, as shown in figure 1.
However, although the existing SMOTE algorithm mitigates the overfitting caused by randomly replicating samples and is widely used in many fields, it also has problems. SMOTE generates new synthetic samples only along the line between the selected sample and any one of its neighbors. It cannot generate data close to the original discrete samples, and the generated samples remain sparse wherever the original samples are sparse. At the same time, data marginalization easily occurs because the distribution of the data in the combination is fixed. Therefore, a small sample expansion method that overcomes these defects is urgently needed, so that synthetic samples with a more reasonable distribution and richer features can be generated within the geometric space formed by the original samples, effectively supporting practical engineering applications such as the credibility evaluation of simulation systems.
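For comparison, the classical SMOTE interpolation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the function name `smote`, the variable names `x_i`, `x_j` and `lam`, and the choice to pick the root sample at random on each iteration are assumptions made for this example.

```python
import math
import random


def smote(data, n_new, k=5, seed=None):
    """Generate n_new synthetic samples by SMOTE-style linear interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        idx = rng.randrange(len(data))        # root sample x_i
        x_i = data[idx]
        # indices of the other samples, nearest (Euclidean) first
        others = sorted(
            (i for i in range(len(data)) if i != idx),
            key=lambda i: math.dist(data[i], x_i),
        )
        x_j = data[rng.choice(others[:k])]    # auxiliary sample among the k nearest
        lam = rng.random()                    # random factor in [0, 1)
        # the new point always lies on the segment between x_i and x_j
        synthetic.append([a + lam * (b - a) for a, b in zip(x_i, x_j)])
    return synthetic
```

Every point returned by this sketch lies on a straight segment joining two original samples, which is the restriction the background section criticizes.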
Disclosure of Invention
The invention aims to solve the above technical problem of the prior art, and provides a small sample expansion method based on data distribution and related equipment.
1) In a first aspect, the present invention provides a small sample expansion method based on data distribution, with the following specific technical scheme:
S1, acquiring an original small sample data set for virtual simulation, and representing each original sample in the original small sample data set by corresponding coordinates;
S2, randomly selecting one original sample from the original small sample data set as a preset original sample;
S3, acquiring, according to the coordinates of the preset original sample, the coordinates of the k neighboring original samples corresponding to the preset original sample;
S4, randomly selecting two of the k neighboring original samples as a first neighboring original sample and a second neighboring original sample, wherein the coordinates of the first neighboring original sample, the coordinates of the second neighboring original sample and the coordinates of the preset original sample form a first triangle;
S5, generating the coordinates of a first temporary sample, a second temporary sample and a third temporary sample according to the coordinates of the first neighboring original sample, the second neighboring original sample and the preset original sample, wherein the coordinates of the three temporary samples form a second triangle, and the second triangle lies entirely within the first triangle;
S6, taking the center coordinates of the second triangle as the coordinates of the newly generated synthesized sample, and obtaining the synthesized sample according to the coordinates of the synthes