CN-121996894-A - Data processing method and device


Abstract

An embodiment of the application discloses a data processing method and device, relating to the field of computer technology, for sampling data according to a target distribution. The method includes: determining a probability distribution of data in an initial data set over a plurality of intervals of the initial data set; constructing a proposal distribution function based on that probability distribution; determining a target offset based on the proposal distribution function and a target distribution function; and generating a target data set based on the target offset and the interval range of the initial data set. Because the length distribution used in large-model evaluation is discrete and low-dimensional, the scheme adopts rejection sampling: the sampled data in the initial data set are first partitioned into intervals and the probability distribution of the data over those intervals is counted; the target offset is then found by traversing the finite set of intervals; and the target offset is added to the probability density of the sample distribution, thereby realizing data sampling of the target distribution.

Inventors

Hong Yijun
LI XIAOSONG
CHEN GONG
LIN HUAN
ZANG HUI
ZHANG ZIYANG

Assignees

Huawei Technologies Co., Ltd. (华为技术有限公司)

Dates

Publication Date: 20260508
Application Date: 20241105

Claims (11)

1. A data processing method, comprising: determining a probability distribution of data in an initial data set over a plurality of intervals of the initial data set; constructing a proposal distribution function according to the probability distribution; determining a target offset according to the proposal distribution function and a target distribution function; and generating a target data set according to the target offset and the interval range of the initial data set.
2. The method of claim 1, wherein the initial data set interval corresponding to any sample in the target data set satisfies: bias + q(a) ≤ p(a), where bias is the target offset, q(x) is the proposal distribution function, p(x) is the target distribution function, and a is the initial data set interval corresponding to any sample in the target data set.
3. The method of claim 1 or 2, wherein the target offset is the maximum difference between the proposal distribution function and the target distribution function.
4. The method of any one of claims 1 to 3, wherein the target distribution function is a Gaussian probability distribution function, a Zipf distribution function, a binomial distribution function, a Poisson distribution function, a geometric distribution function, a hypergeometric distribution function, a uniform distribution function, an exponential distribution function, or a normal distribution function.
5. A data processing apparatus, comprising a transceiver unit and a processing unit, wherein: the processing unit is configured to determine a probability distribution of data in an initial data set over a plurality of intervals of the initial data set; the processing unit is further configured to construct a proposal distribution function according to the probability distribution; the processing unit is further configured to determine a target offset according to the proposal distribution function and a target distribution function; and the processing unit is further configured to generate a target data set according to the target offset and the interval range of the initial data set.
6. The apparatus of claim 5, wherein the initial data set interval corresponding to any sample in the target data set satisfies: bias + q(a) ≤ p(a), where bias is the target offset, q(x) is the proposal distribution function, p(x) is the target distribution function, and a is the initial data set interval corresponding to any sample in the target data set.
7. The apparatus of claim 5 or 6, wherein the target offset is the maximum difference between the proposal distribution function and the target distribution function.
8. The apparatus of any one of claims 5 to 7, wherein the target distribution function is a Gaussian probability distribution function, a Zipf distribution function, a binomial distribution function, a Poisson distribution function, a geometric distribution function, a hypergeometric distribution function, a uniform distribution function, an exponential distribution function, or a normal distribution function.
9. A data processing apparatus comprising at least one processor and a memory, wherein the at least one processor executes programs or instructions stored in the memory to cause the data processing apparatus to implement the method of any one of the preceding claims 1 to 4.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when run on a computer or a processor, causes the computer or the processor to implement the method of any one of the preceding claims 1 to 4.
11. A computer program product comprising instructions which, when run on a computer or processor, cause the computer or processor to carry out the method of any one of claims 1 to 4.

Description

Data processing method and device

Technical Field

The embodiments of the application relate to the field of computer technology, and in particular to a data processing method and device.

Background

Large models, such as large language models (LLMs), are deep-learning models with tens or even hundreds of millions of parameters, including very-large-scale neural network models (typically more than a billion parameters). These models are widely used in natural language processing (NLP) and are gradually expanding to other fields such as computer vision, audio processing, and multimodal data processing. At present, evaluation of large models focuses mainly on accuracy. To objectively evaluate the inference performance of a large model, a data set covering input and output distributions of different lengths needs to be constructed, so that performance related to the model's generation capability can be evaluated.

Disclosure of Invention

The embodiments of the application provide a data processing method and device for sampling data according to a target distribution. To achieve this, the embodiments of the application adopt the following technical scheme.

In a first aspect, an embodiment of the present application provides a data processing method. The method includes: determining a probability distribution of data in an initial data set over a plurality of intervals of the initial data set; constructing a proposal distribution function according to that probability distribution; determining a target offset according to the proposal distribution function and a target distribution function; and generating a target data set according to the target offset and the interval range of the initial data set.
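As a rough illustration of the first two steps (counting the data over intervals and using the resulting frequencies as the proposal distribution), the following Python sketch may help. The function name `interval_probabilities`, the half-open interval convention, and the sample lengths and interval edges are assumptions made for illustration; none of them comes from the application.

```python
from bisect import bisect_right


def interval_probabilities(lengths, edges):
    """Fraction of `lengths` in each half-open interval [edges[i], edges[i+1]).

    Hypothetical helper: the application does not name such a function.
    """
    counts = [0] * (len(edges) - 1)
    for x in lengths:
        # Ignore values outside the overall interval range.
        if edges[0] <= x < edges[-1]:
            counts[bisect_right(edges, x) - 1] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no data falls inside the given intervals")
    return [c / total for c in counts]


# Illustrative sequence lengths and interval edges (assumed values).
lengths = [12, 40, 45, 130, 200, 260, 300, 310, 700, 900]
edges = [0, 128, 256, 512, 1024]

# q[i] plays the role of the proposal distribution on interval i.
q = interval_probabilities(lengths, edges)
print(q)
```

With these assumed inputs the four intervals receive 3, 2, 3 and 2 of the ten values, so `q` is `[0.3, 0.2, 0.3, 0.2]`.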
According to the scheme provided by the embodiment of the application, because the length distribution used in large-model evaluation is discrete and low-dimensional, rejection sampling is adopted. First, the sampled data in the initial data set are partitioned into intervals, and the probability distribution of the sampled data over those intervals is counted. Then, the maximum difference between the probability density of the sampled data and the target distribution function (that is, the target offset) is found by traversing the finite set of intervals. This maximum difference is added to the probability density of the sample distribution, replacing the multiplication of that density by the value K in conventional rejection sampling. This ensures that the probability density of the sample is larger than the current distribution density and avoids the difficulty of finding a suitable K for rejection sampling, so that the target data set is generated at lower computational cost and data sampling of the target distribution is realized.
In one possible implementation, the initial data set interval corresponding to any sample in the target data set satisfies bias + q(a) ≤ p(a), where bias is the target offset, q(x) is the proposal distribution function, p(x) is the target distribution function, and a is the initial data set interval corresponding to any sample in the target data set. It will be appreciated that rejection sampling is relatively simple to implement: the method provided by the embodiments of the present application adds the condition bias + q(a) ≤ p(a) during sampling to decide whether to accept or reject a sample, which requires no complex mathematical derivation or advanced algorithm.
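One possible reading of the additive-offset idea can be sketched in Python as follows. This is an interpretation, not the application's implementation. The text gives the condition as bias + q(a) ≤ p(a) and does not say how candidates are drawn or how a random threshold enters. The sketch assumes that bias is the largest per-interval value of p − q, that q + bias serves as the envelope, that candidate intervals are drawn in proportion to that envelope, and that a uniform draw u decides acceptance via u · (q(a) + bias) ≤ p(a). The names `target_offset` and `sample_intervals` and the example values of `q` and `p` are hypothetical.

```python
import random


def target_offset(q, p):
    """Largest per-interval gap p[i] - q[i] (assumed sign convention)."""
    return max(pi - qi for qi, pi in zip(q, p))


def sample_intervals(q, p, n_candidates, seed=0):
    """Rejection-sample interval indices using the envelope q[i] + bias."""
    rng = random.Random(seed)
    bias = target_offset(q, p)
    # By construction envelope[i] >= p[i] for every interval.
    envelope = [qi + bias for qi in q]
    # Assumption: candidates are drawn in proportion to the envelope.
    candidates = rng.choices(range(len(q)), weights=envelope, k=n_candidates)
    accepted = []
    for i in candidates:
        # Assumption: a uniform draw scales the envelope before comparing to p.
        if rng.random() * envelope[i] <= p[i]:
            accepted.append(i)
    return bias, accepted


# Illustrative per-interval proposal and target probabilities (assumed values).
q = [0.3, 0.2, 0.3, 0.2]
p = [0.1, 0.2, 0.3, 0.4]

bias, accepted = sample_intervals(q, p, n_candidates=20000)
freq = [accepted.count(i) / len(accepted) for i in range(len(p))]
print(bias, freq)
```

Under these assumptions the accepted interval indices follow the target probabilities `p` up to sampling noise, and no multiplicative constant K has to be found. Each accepted index could then be mapped back to a record of the initial data set that falls in that interval; that step is left out here because this section does not describe it.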
In one possible implementation, the target offset is the maximum difference between the proposal distribution function and the target distribution function. In the method provided by the embodiment of the application, this maximum difference is added to the probability density of the sample distribution, replacing the multiplication of that density by the value K in conventional rejection sampling. This ensures that the probability density of the sample is larger than the current distribution density and avoids the difficulty of finding a suitable K for rejection sampling.
In one possible implementation, the target distribution function is a Gaussian probability distribution function, a Zipf distribution function, a binomial distribution function, a Poisson distribution function, a geometric distribution function, a hypergeometric distribution function, a uniform distribution function, an exponential distribution function, or a normal distribution function. It will be appreciated that the basic idea of rejection sampling is to generate samples