CN-121981180-A - Data processing method, device and computer equipment for predictor sparse computation in neural network
Abstract
The invention provides a data processing method, apparatus and computer device for predictor sparse computation in a neural network, relating to the field of artificial intelligence. The method is implemented by acquiring token confidence data and a target selection number, wherein the token confidence data comprises a plurality of data elements; modeling a soft selection weight array based on the token confidence data; calculating the value of a threshold parameter to be determined based on the model of the soft selection weight array; calculating the value of each weight element in the soft selection weight array based on the determined value of the threshold parameter; and performing a weighted summation of the token features corresponding to the token confidence data using the soft selection weight array, so as to carry out the sparse computation. In this way, the TopK selection in the predictor becomes differentiable, gradients can be computed without bias, the predictor supports standard back-propagation training within the neural network, and the gap between predictor training and inference is eliminated.
Inventors
- SHI JINYUAN
- ZENG HAO
Assignees
- 墨芯人工智能科技(深圳)有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260325
Claims (11)
- 1. A data processing method for predictor sparse computation in a neural network, the method comprising: obtaining token confidence data and a target selection number, wherein the token confidence data comprises a plurality of data elements, the number of the plurality of data elements is greater than the target selection number, the token confidence data is output by the predictor and represents the importance of token features extracted from images, speech or text, and the sparse computation comprises selecting the target selection number of largest data elements from the token confidence data for computation; modeling a soft selection weight array based on the token confidence data, wherein the soft selection weight array is a mask for selecting the target selection number of largest data elements from the token confidence data, the model comprises that each weight element in the soft selection weight array is obtained by adding a threshold parameter to be determined to the corresponding data element in the token confidence data and applying a monotonically increasing, continuously differentiable function, and the sum of all weight elements in the soft selection weight array equals the target selection number; calculating the value of the threshold parameter to be determined based on the model of the soft selection weight array; calculating the value of each weight element in the soft selection weight array based on the determined value of the threshold parameter; and performing a weighted summation of the token features corresponding to the token confidence data using the soft selection weight array, so as to carry out the sparse computation.
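The soft selection of claim 1 can be sketched as follows. This is an illustrative NumPy implementation, not the patented embodiment; the function names (`soft_topk_weights`, `solve_threshold`) are my own. Each weight is modeled as sigmoid(x_i + t), the threshold t is solved so the weights sum to the target selection number k (the sum is monotonically increasing in t, so bisection applies), and the weights are then used for a weighted summation of the token features.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for extreme thresholds during bisection.
    z = np.clip(z, -60.0, 60.0)
    return 1.0 / (1.0 + np.exp(-z))

def solve_threshold(x, k, lo=-1e3, hi=1e3, iters=100):
    """Bisection: find t such that sum(sigmoid(x + t)) == k.
    The sum is monotonically increasing in t, so bisection converges."""
    for _ in range(iters):
        t = 0.5 * (lo + hi)
        if sigmoid(x + t).sum() < k:
            lo = t
        else:
            hi = t
    return 0.5 * (lo + hi)

def soft_topk_weights(x, k):
    """Soft selection weight array: w_i = sigmoid(x_i + t), with sum(w) = k."""
    t = solve_threshold(x, k)
    return sigmoid(x + t), t

# Example: confidences for 5 tokens, softly select the 2 most important.
x = np.array([2.0, -1.0, 0.5, 3.0, -2.0])
w, t = soft_topk_weights(x, k=2)
features = np.random.randn(5, 4)          # one feature vector per token
sparse_out = w[:, None] * features        # weighted summation for sparse computation
```

Because sigmoid is strictly increasing, the largest confidences receive the largest weights, while every weight stays differentiable with respect to x.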
- 2. The method according to claim 1, further comprising: performing a back-propagation calculation based on the upstream gradient returned from the output of the neural network and the value of the threshold parameter, to obtain an input gradient of the token confidence data; and updating parameters of the predictor based on the input gradient of the token confidence data.
- 3. The method of claim 2, wherein the monotonically increasing, continuously differentiable function is the Sigmoid function.
- 4. The method according to claim 3, wherein performing a back-propagation calculation based on the upstream gradient returned from the output of the neural network and the value of the threshold parameter to obtain the input gradient of the token confidence data comprises: calculating a first vector, wherein each element of the first vector is v_i = σ'(x_i + t), where σ' is the derivative of the Sigmoid function, t is the threshold parameter, and x_i is the i-th data element of the token confidence data; summing all elements of the first vector to obtain a first scalar; multiplying the first vector and the upstream gradient element by element to obtain an intermediate vector; summing all elements of the intermediate vector to obtain a second scalar; and subtracting from the intermediate vector the product of the first vector and the quotient of the second scalar divided by the first scalar, to obtain the input gradient of the token confidence data.
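The backward pass of claim 4 can be sketched as below; a hedged NumPy illustration (the function name `soft_topk_backward` is my own). Given the upstream gradient g and the solved threshold t, it forms the first vector v_i = σ'(x_i + t), the intermediate vector v ⊙ g, and returns v ⊙ g − (Σ(v ⊙ g) / Σ v) · v, i.e. the intermediate vector minus the first vector scaled by the quotient of the second scalar over the first scalar.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

def soft_topk_backward(x, t, upstream_grad):
    """Input gradient of the confidence data per claim 4.

    The correction term enforces the constraint sum(w) = k,
    so the returned gradient always sums to zero."""
    s = sigmoid(x + t)
    v = s * (1.0 - s)                 # first vector: derivative of sigmoid
    s1 = v.sum()                      # first scalar
    m = v * upstream_grad             # intermediate vector
    s2 = m.sum()                      # second scalar
    return m - (s2 / s1) * v          # input gradient

# Threshold value here is an approximate stand-in for the forward-pass solution.
x = np.array([2.0, -1.0, 0.5, 3.0, -2.0])
t = -1.22
g = np.array([0.3, -0.1, 0.7, 0.2, 0.0])
dx = soft_topk_backward(x, t, g)
```

The zero-sum property is what makes the gradient unbiased with respect to the sum constraint: any increase pushed into one confidence is balanced elsewhere.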
- 5. The method according to any one of claims 1 to 4, wherein calculating the value of the threshold parameter to be determined based on the model of the soft selection weight array comprises: searching for the threshold parameter to be determined using a bisection method, so as to calculate its value.
- 6. The method according to any one of claims 1 to 4, further comprising: in the neural network inference stage, performing a weighted summation of the token confidence data with the hard mask output by a hard TopK operation, so as to carry out the sparse computation.
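At inference time (claim 6) the soft weights are replaced by a hard 0/1 mask from an ordinary TopK operation; a minimal NumPy sketch, where the helper name `hard_topk_mask` is hypothetical:

```python
import numpy as np

def hard_topk_mask(x, k):
    """Hard TopK: 1.0 for the k largest confidences, 0.0 elsewhere."""
    mask = np.zeros_like(x)
    # argpartition places the indices of the k largest values first.
    mask[np.argpartition(-x, k - 1)[:k]] = 1.0
    return mask

x = np.array([2.0, -1.0, 0.5, 3.0, -2.0])
mask = hard_topk_mask(x, k=2)          # selects indices 0 and 3
features = np.random.randn(5, 4)
sparse_out = mask[:, None] * features  # only 2 of 5 token features survive
```

In practice the zeroed rows would simply be skipped, which is where the sparse acceleration comes from; the soft weights are needed only during training.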
- 7. The method of any one of claims 1 to 4, wherein the neural network further comprises a main network, and the predictor can be jointly trained end to end with the main network.
- 8. A data processing apparatus for predictor sparse computation in a neural network, the apparatus comprising: an acquisition module configured to acquire token confidence data and a target selection number, wherein the token confidence data comprises a plurality of data elements, the number of the plurality of data elements is greater than the target selection number, the token confidence data is output by the predictor and represents the importance of token features extracted from an image, speech or text, and the sparse computation comprises selecting the target selection number of largest data elements from the token confidence data for computation; a modeling module configured to model a soft selection weight array based on the token confidence data, wherein the soft selection weight array is a mask for selecting the target selection number of largest data elements from the token confidence data, the model comprises that each weight element in the soft selection weight array is obtained by adding a threshold parameter to be determined to the corresponding data element in the token confidence data and applying a monotonically increasing, continuously differentiable function, and the sum of all weight elements in the soft selection weight array equals the target selection number; a solving module configured to calculate the value of the threshold parameter to be determined based on the model of the soft selection weight array; a weight calculation module configured to calculate the value of each weight element in the soft selection weight array based on the determined value of the threshold parameter; and a sparse calculation module configured to perform a weighted summation of the token features corresponding to the token confidence data using the soft selection weight array, so as to carry out the sparse computation.
- 9. A computer device, the computer device comprising: At least one processor; a memory having a computer program stored thereon, wherein the computer program, when executed by the at least one processor, causes the at least one processor to perform the method of any of claims 1-7.
- 10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, causes the processor to perform the method of any of claims 1-7.
- 11. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, causes the processor to perform the method of any of claims 1-7.
Description
Data processing method, device and computer equipment for predictor sparse computation in neural network
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the fields of predictors and sparse processing, and more particularly to a data processing method, apparatus, computer device, computer readable storage medium and computer program product for predictor sparse computation in neural networks.
Background
In the field of sparse acceleration of large language models, a predictor (Predictor) based activation sparsity scheme requires selecting the several most important neurons for computation.
Disclosure of Invention
The present disclosure provides a data processing method, apparatus, computer device, computer readable storage medium and computer program product for predictor sparse computation in neural networks. According to one aspect of the disclosure, a data processing method for predictor sparse computation in a neural network is provided, the method comprising: obtaining token confidence data and a target selection number, wherein the token confidence data comprises a plurality of data elements, the number of the plurality of data elements is greater than the target selection number, the token confidence data is output by a predictor and represents the importance of token features extracted from images, speech or text, and the sparse computation comprises selecting the target selection number of largest data elements from the token confidence data for computation; modeling a soft selection weight array based on the token confidence data, wherein the soft selection weight array is a mask for selecting the target selection number of largest data elements from the token confidence data, the model comprises that each weight element in the soft selection weight array is obtained by adding a threshold parameter to be determined to the corresponding data element in the token confidence data and applying a monotonically increasing, continuously differentiable function, and the sum of all weight elements in the soft selection weight array equals the target selection number; calculating the value of the threshold parameter to be determined based on the model of the soft selection weight array; calculating the value of each weight element in the soft selection weight array based on the determined value of the threshold parameter; and performing a weighted summation of the token features corresponding to the token confidence data using the soft selection weight array, so as to carry out the sparse computation. In some embodiments, the method further comprises performing a back-propagation calculation based on the upstream gradient returned from the output of the neural network and the value of the threshold parameter to obtain an input gradient of the token confidence data, and updating the parameters of the predictor based on the input gradient of the token confidence data. In some embodiments, the monotonically increasing, continuously differentiable function is the Sigmoid function.
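The gradient formula used in these embodiments follows from implicitly differentiating the constraint that fixes the threshold; a sketch of the derivation in my own notation, with w_j = σ(x_j + t), v_j = σ'(x_j + t), k the target selection number, and g the upstream gradient:

```latex
% Constraint defining t, differentiated w.r.t. x_i (implicit function theorem):
\sum_j \sigma(x_j + t) = k
\;\Rightarrow\;
v_i + \Big(\sum_j v_j\Big)\frac{\partial t}{\partial x_i} = 0
\;\Rightarrow\;
\frac{\partial t}{\partial x_i} = -\frac{v_i}{\sum_j v_j}.
% Weights and chain rule:
\frac{\partial w_j}{\partial x_i} = v_j\Big(\delta_{ij} + \frac{\partial t}{\partial x_i}\Big),
\qquad
\frac{\partial L}{\partial x_i}
  = \sum_j g_j \frac{\partial w_j}{\partial x_i}
  = g_i v_i - \frac{\sum_j g_j v_j}{\sum_j v_j}\, v_i.
```

The final expression is exactly the intermediate vector minus the first vector scaled by the quotient of the second scalar over the first scalar, as recited in claim 4.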
In some embodiments, performing a back-propagation calculation based on the upstream gradient returned from the output of the neural network and the value of the threshold parameter to obtain the input gradient of the token confidence data includes: calculating a first vector, wherein each element of the first vector is v_i = σ'(x_i + t), where σ' is the derivative of the Sigmoid function, t is the threshold parameter, and x_i is the i-th data element of the token confidence data; summing all elements of the first vector to obtain a first scalar; multiplying the first vector and the upstream gradient element by element to obtain an intermediate vector; summing all elements of the intermediate vector to obtain a second scalar; and subtracting from the intermediate vector the product of the first vector and the quotient of the second scalar divided by the first scalar, to obtain the input gradient of the token confidence data. In some embodiments, calculating the value of the threshold parameter to be determined based on the model of the soft selection weight array includes searching for the threshold parameter to be determined using a bisection method, so as to calculate its value. In some embodiments, in the neural network inference stage, the hard mask output by a hard TopK operation is used for a weighted summation with the token confidence data, so as to carry out the sparse computation. In some embodiments, the neural network further comprises a main network, and the predictor can be jointly trained end to end with the main network. According to another aspect of the disclosure, a data processing apparatus for predictor sparse computation in a neural network is provided, the apparatus comprising an acquisition module configured to acquire token confidence data and a targ