EP-4740116-A1 - BALANCING TRAINING DATA FOR TRAINING NEURAL NETWORKS
Abstract
A computer-implemented method is provided for processing a training database for training a neural network to perform a computational task, the training database comprising training items, to obtain a weight value for each training item. The method comprises: for each of one or more attributes, determining a corresponding item attribute vector for each training item which is a vector indicative of a likelihood of the training item exhibiting the attribute; and for each training item determining a corresponding weight value by: defining a loss function of the weight values and the item attribute vectors; and updating the weight values to reduce the loss function. A corresponding computer system and computer program product are also provided.
Inventors
- ALABDULMOHSIN, Ibrahim
- WANG, XIAO
- STEINER, ANDREAS PETER
- GOYAL, PRIYA
- D'AMOUR, Alexander Nicholas
- ZHAI, Xiaohua
Assignees
- GDM Holding LLC
Dates
- Publication Date
- 20260513
- Application Date
- 20240926
Claims (20)
- 1. A computer-implemented method of processing a training database for training a neural network to perform a computational task, the training database comprising training items, to obtain a weight value for each training item, the method comprising: for each of one or more attributes, determining a corresponding item attribute vector for each training item which is a vector indicative of a likelihood of the training item exhibiting the attribute; and for each training item determining a corresponding weight value by: defining a loss function of the weight values and the item attribute vectors; and updating the weight values to reduce the loss function.
- 2. The method of claim 1, wherein each training item comprises at least one audio-visual element.
- 3. The method of claim 1 in which the loss function includes, for each attribute, a corresponding attribute term based on a corresponding first sum over the training items, weighted by the corresponding weight values, of a first function of the item attribute vector for the corresponding training item and attribute.
- 4. The method of claim 3, in which the first function is the item attribute vector minus a desired attribute vector corresponding to the attribute.
- 5. The method of claim 3 or claim 4, in which each attribute term is non-zero only if the corresponding first sum is above a first threshold.
- 6. The method of claim 5, in which each attribute term is the greater of (i) zero and (ii) the corresponding first sum minus the first threshold.
- 7. The method of any preceding claim, which further comprises: for each of one or more characteristics, determining a corresponding item characteristic vector for each training item, which is a vector indicative of a likelihood of the training item exhibiting the characteristic, the loss function including, for each attribute and each characteristic, a corresponding attribute characteristic term based on a corresponding second sum over the training items, weighted by the corresponding weight values, of a second function of: the item attribute vector for the corresponding training item and attribute, and the item characteristic vector for the corresponding training item and characteristic.
- 8. The method of claim 7 when dependent on claim 4, in which the second function is a product of (i) the item attribute vector minus the desired attribute vector corresponding to the attribute, and (ii) the item characteristic vector.
- 9. The method of claim 7 or claim 8 in which each attribute characteristic term is non-zero only if the corresponding second sum is above a second threshold.
- 10. The method of claim 9, in which each attribute characteristic term is the greater of (i) zero and (ii) the corresponding second sum minus the second threshold.
- 11. The method of any of claims 2 to 6, or of claims 7 to 10 when dependent on claim 2, in which each training item further comprises a textual descriptor which comprises one or more tokens selected from a vocabulary of tokens, and the item attribute vector for each training item depends upon both the corresponding at least one audio-visual element and the corresponding textual descriptor.
- 12. The method of claim 11 in which the item attribute vector, for each training item and attribute, is a concatenation of a first vector indicating whether the corresponding at least one audio-visual element exhibits the attribute, and a second vector indicating whether the corresponding textual descriptor exhibits the attribute.
- 13. The method of claim 12, in which the audio-visual items are images, and which includes deriving the first vector for each training item and attribute from the output of an object detection neural network model upon receiving the corresponding at least one audio-visual item.
- 14. The method of claim 12 or claim 13, which includes deriving the second vector for each training item and attribute, by determining if the corresponding textual descriptor meets a corresponding first criterion.
- 15. The method of any of claims 11 to 14, when dependent on claim 7, in which the item characteristic vector for each training item and each characteristic depends upon both the corresponding at least one audio-visual element and the corresponding textual descriptor.
- 16. The method of claim 15, in which the item characteristic vector, for each training item and characteristic, is a concatenation of a third vector indicating whether the corresponding at least one audio-visual element exhibits the characteristic, and a fourth vector indicating whether the corresponding textual descriptor exhibits the characteristic.
- 17. The method of claim 16, in which the audio-visual items are images, and which includes deriving the third vector for each training item and characteristic from the output of an object detection neural network model upon receiving the corresponding at least one audio-visual item.
- 18. The method of claim 16 or claim 17, which includes deriving the fourth vector for each training item and characteristic by determining if the corresponding textual descriptor meets a corresponding second criterion.
- 19. The method of any preceding claim, further including defining the attributes by selecting one or more attributes and using a trained language model to generate one or more additional attributes based on the selected attributes.
- 20. The method of any preceding claim in which the loss function further includes a penalty term which is a sum over the weight values of a measure of divergence of the weight value from a subsampling rate.
Description
BALANCING TRAINING DATA FOR TRAINING NEURAL NETWORKS

CROSS-REFERENCE TO RELATED APPLICATION

[1] This application claims priority to U.S. Provisional Application No. 63/586,392, filed on September 28, 2023. The disclosure of the prior application is considered part of and is incorporated by reference in the disclosure of this application.

BACKGROUND

[2] This specification relates to systems for pre-processing a training database of training items which is to be used for neural network training. The training items may comprise audio-visual items, such as images (still images or videos) or audio items. Additionally or alternatively, the training items may comprise other data such as transactional records or textual records.

[3] Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks are deep neural networks that include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

SUMMARY

[4] The training databases used in many neural network training processes are drawn from the internet or other public repositories of, e.g., audio-visual elements, and the distribution of audio-visual elements in such databases is often different from a desired distribution. This may be because it differs from the distribution of audio-visual elements which the trained neural network is intended to process or generate (e.g. there may be more images of cats than would be encountered when performing some audio-visual processing tasks). Thus, the present techniques allow a training database to be “rebalanced”, so that the performance of the neural network is improved (e.g. images having the attribute “cat” are removed and/or given less weight in the training). Similar rebalancing may be achieved for non-audio-visual data associated with attributes, such as transactional records or textual records.

[5] In particular, this specification describes technologies which enable a training database of training items to be processed, in order subsequently to be used to train a neural network to perform a computing task, which may for example be an audio-visual computing task. The training items may include at least one corresponding audio-visual item; that is, at least one image (which may be a still image or a moving (video) image, e.g. captured by a camera) and/or at least one audio item (a sound item lasting a time duration, e.g. a captured recording of one or more voices speaking). Additionally or alternatively, the training items may include non-audio-visual data such as transactional records or textual records.

[6] In general terms, the present technique proposes that, for each of one or more attributes (e.g. attributes which are believed to be over-represented in the training database), an item attribute vector is determined for each training item, indicating a likelihood of the training item exhibiting the attribute (e.g. the item attribute vector may optionally be a single value, e.g. a binary value indicating that the likelihood is above a threshold, or a real value varying with the likelihood; or it can include multiple values, e.g. a binary vector). A weight value is defined for each training item, and a loss function is defined based on the item attribute vectors and the weight values. The loss function is reduced (e.g. iteratively minimized) with respect to the weight values.
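One way to picture the determination of item attribute vectors is to threshold per-item detector scores into binary indicators, as in the single-value binary case mentioned above. The following is a minimal sketch, not the specification's method: the score array, the threshold value, and the function name `item_attribute_vectors` are all illustrative assumptions.

```python
import numpy as np

def item_attribute_vectors(detector_scores, threshold=0.5):
    """Turn per-item, per-attribute scores in [0, 1] (shape
    [num_items, num_attributes], e.g. from an object detection
    model) into binary item attribute vectors: 1.0 where the
    likelihood of exhibiting the attribute exceeds the threshold."""
    scores = np.asarray(detector_scores, dtype=float)
    return (scores >= threshold).astype(float)

# Example: 4 training items scored against 2 attributes (say, "cat", "dog").
scores = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.7, 0.6],
                   [0.1, 0.1]])
vectors = item_attribute_vectors(scores)
```

In this binary variant each entry simply flags whether the item is believed to exhibit the attribute; a real-valued variant would pass the scores through unchanged.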
[7] In one expression, there is provided a computer-implemented method of processing a training database for training a neural network to perform a computational task, the training database comprising training items, to obtain a weight value for each training item, the method comprising: a. for each of one or more attributes, determining a corresponding item attribute vector for each training item which is a vector indicative of a likelihood of the training item exhibiting the attribute; and b. for each training item determining a corresponding weight value by: i. defining a loss function of the weight values (e.g. using initial values for the weight values, which may be chosen to be all the same, or at random) and the item attribute vectors; and ii. (e.g. repeatedly) updating the weight values to reduce the loss function.

[8] According to a further aspect of the disclosure, there is provided a system comprising one or more computers; and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the above method.

[9] According to a further aspect of the disclosure, there is provided a computer program product containing instructions that, when executed by one or more computers, cause the one or more computers to perform the above method.
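The iterative weight update can be sketched as gradient descent on one possible instantiation of the loss: hinge-style attribute terms of the kind recited in claims 3-6 (the first sum, non-zero only above a first threshold) plus a penalty term of the kind recited in claim 20 (divergence of each weight from a subsampling rate). This is only an illustrative sketch under simplifying assumptions (scalar per-attribute entries, squared-difference penalty, fixed step size); the parameter names `tau`, `lam`, `lr`, and `rebalance_weights` are hypothetical, not from the specification.

```python
import numpy as np

def rebalance_weights(v, d, r=0.5, tau=0.0, lam=0.1, steps=500, lr=0.05):
    """v: item attribute vectors, shape [num_items, num_attributes];
    d: desired attribute vector; r: subsampling rate; tau: first
    threshold. Minimizes
        sum_a max(0, sum_i w_i (v_ia - d_a) - tau)
            + lam * sum_i (w_i - r)**2
    over the weight values w by subgradient descent."""
    n, _ = v.shape
    w = np.full(n, r)                    # initial weights all the same
    diff = v - d                         # first function (claim 4)
    for _ in range(steps):
        s = w @ diff                     # first sums, one per attribute
        active = (s > tau).astype(float) # attribute term non-zero above threshold
        grad = diff @ active + 2.0 * lam * (w - r)
        w = np.clip(w - lr * grad, 0.0, 1.0)  # keep weights in [0, 1]
    return w

# Attribute 0 is over-represented: three of four items exhibit it.
v = np.array([[1., 0.], [1., 0.], [1., 0.], [0., 1.]])
d = np.array([0.5, 0.5])
w = rebalance_weights(v, d)
```

On this toy input the items exhibiting the over-represented attribute end up with smaller weight values than the remaining item, which is the rebalancing effect described above.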