CN-122021736-A - Supervised contrastive learning with multiple positive examples
Abstract
The present disclosure provides an improved training method that enables supervised contrastive learning to be performed simultaneously across multiple positive and negative training examples. In particular, example aspects of the present disclosure are directed to improved, supervised versions of the batch contrastive loss, which has been shown to be highly effective for learning powerful representations in self-supervised settings. The proposed technique thus adapts contrastive learning to the fully supervised setting and also enables learning to occur simultaneously across multiple positive examples.
Inventors
- D. Krishnan
- P. Khosla
- P. Teterwak
- A. Y. Sarna
- A. J. Maschinot
- C. Liu
- P. J. Isola
- Y. Tian
- C. Wang
Assignees
- Google LLC
Dates
- Publication Date: 2026-05-12
- Application Date: 2021-04-12
- Priority Date: 2020-04-21
Claims (20)
- 1. A computer-implemented method for performing supervised contrastive learning of an embedded representation, the method comprising: obtaining, by a computing system comprising one or more computing devices, one or more inputs comprising a plurality of positive training examples associated with a first class of a plurality of classes and one or more negative training examples associated with one or more other classes of the plurality of classes, the plurality of positive training examples comprising an anchor example, the one or more other classes being different from the first class; processing, by the computing system, the plurality of positive training examples using a neural network to obtain a plurality of positive embedded representations, respectively, and processing the one or more negative training examples to obtain one or more negative embedded representations, respectively; and modifying, by the computing system, one or more values of one or more parameters of the neural network based at least in part on a contrastive loss function, the contrastive loss function being based on similarity measures for a plurality of training example pairs, the plurality of training example pairs including at least one positive pair between the anchor example and one or more other examples of the plurality of positive training examples and at least one negative pair between the anchor example and an example of the one or more negative training examples; wherein modifying the one or more values of the one or more parameters of the neural network based at least in part on the contrastive loss function causes the neural network to increase the similarity of the at least one positive pair and decrease the similarity of the at least one negative pair. (For illustration only, a minimal sketch of one such contrastive loss appears after the claims.)
- 2. The computer-implemented method of claim 1, wherein the one or more inputs comprise text.
- 3. The computer-implemented method of claim 1, wherein the one or more inputs comprise audio.
- 4. The computer-implemented method of claim 1, wherein the one or more inputs comprise an image.
- 5. The computer-implemented method of claim 4, wherein the anchor example comprises an anchor image, the plurality of positive training examples comprises a plurality of positive images, and the one or more negative training examples comprise one or more negative images.
- 6. The computer-implemented method of claim 5, wherein the anchor image and at least one of the plurality of positive images depict different subjects belonging to the same first class of the plurality of classes.
- 7. The computer-implemented method of claim 5, wherein the anchor image comprises an x-ray image.
- 8. The computer-implemented method of claim 5, wherein the anchor image comprises a set of LiDAR data.
- 9. The computer-implemented method of claim 5, wherein the anchor image comprises video data.
- 10. The computer-implemented method of claim 5, further comprising augmenting the anchor image to generate at least one of the plurality of positive images.
- 11. The computer-implemented method of claim 1, further comprising, after modifying the one or more values of the one or more parameters of the neural network based at least in part on the contrastive loss function: providing an additional input to the neural network; receiving an additional embedded representation of the additional input as an output of the neural network; and generating a prediction for the additional input based at least in part on the additional embedded representation.
- 12. The computer-implemented method of claim 11, wherein the prediction comprises a classification prediction, a detection prediction, an identification prediction, a regression prediction, a segmentation prediction, or a similarity search prediction.
- 13. The computer-implemented method of claim 1, wherein the contrastive loss function comprises a summation over each positive pair of the at least one positive pair.
- 14. The computer-implemented method of claim 1, further comprising: processing the anchor example with an encoder to obtain an anchor embedded representation of the anchor example; and processing the anchor embedded representation with a projection head to obtain an anchor projected representation of the anchor example.
- 15. The computer-implemented method of claim 1, wherein the contrastive loss function is inversely related to a similarity measure between the anchor example and a positive training example.
- 16. The computer-implemented method of claim 1, wherein the contrastive loss function is positively correlated with a similarity measure between the anchor example and a negative training example.
- 17. A computing system, comprising: one or more processors; and one or more computer-readable storage media storing: a neural network that has been trained by performing operations comprising: obtaining one or more inputs comprising a plurality of positive training examples associated with a first class of a plurality of classes and one or more negative training examples associated with one or more other classes of the plurality of classes, the plurality of positive training examples comprising an anchor example, the one or more other classes being different from the first class; processing the plurality of positive training examples using the neural network to obtain a plurality of positive embedded representations, respectively, and processing the one or more negative training examples to obtain one or more negative embedded representations, respectively; and modifying one or more values of one or more parameters of the neural network based at least in part on a contrastive loss function, the contrastive loss function being based on similarity measures for a plurality of training example pairs, the plurality of training example pairs including at least one positive pair between the anchor example and one or more other examples of the plurality of positive training examples and at least one negative pair between the anchor example and an example of the one or more negative training examples; wherein modifying the one or more values of the one or more parameters of the neural network based at least in part on the contrastive loss function causes the neural network to increase the similarity of the at least one positive pair and decrease the similarity of the at least one negative pair; and instructions that, when executed by the one or more processors, cause the computing system to perform operations comprising: receiving an input; and processing the input using the neural network to generate an embedded representation of the input.
- 18. The computing system of claim 17, the operations further comprising: generating a prediction for the input based at least in part on the embedded representation of the input.
- 19. The computing system of claim 17, the input comprising at least one of text data, audio data, or image data.
- 20. One or more computer-readable storage media storing a neural network that has been trained by performing operations comprising: obtaining one or more inputs comprising a plurality of positive training examples associated with a first class of a plurality of classes and one or more negative training examples associated with one or more other classes of the plurality of classes, the plurality of positive training examples comprising an anchor example, the one or more other classes being different from the first class; processing the plurality of positive training examples using the neural network to obtain a plurality of positive embedded representations, respectively, and processing the one or more negative training examples to obtain one or more negative embedded representations, respectively; and modifying one or more values of one or more parameters of the neural network based at least in part on a contrastive loss function, the contrastive loss function being based on similarity measures for a plurality of training example pairs, the plurality of training example pairs including at least one positive pair between the anchor example and one or more other examples of the plurality of positive training examples and at least one negative pair between the anchor example and an example of the one or more negative training examples; wherein modifying the one or more values of the one or more parameters of the neural network based at least in part on the contrastive loss function causes the neural network to increase the similarity of the at least one positive pair and decrease the similarity of the at least one negative pair.
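The claims do not recite a specific formula for the contrastive loss, so the following is only a minimal sketch in the spirit of claim 1, patterned after the supervised batch contrastive losses discussed in the abstract. The PyTorch framework, the function name `supervised_contrastive_loss`, and the temperature hyperparameter are illustrative assumptions, not parts of the claimed method.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(projections: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """projections: (N, D) projected representations; labels: (N,) integer class ids."""
    z = F.normalize(projections, dim=1)               # unit-norm rows, so dot product = cosine similarity
    sim = z @ z.t() / temperature                     # (N, N) scaled pairwise similarity measures
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))   # an anchor never pairs with itself
    # log-softmax of each anchor's row over every other example (positives and negatives together)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # positive pairs share the anchor's label; all remaining pairs in the row act as negatives
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)     # guard anchors that have no positive in the batch
    # average log-probability over each anchor's positives, then average over all anchors
    sum_pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    return (-sum_pos_log_prob / pos_counts).mean()
```

Because every same-label pair in the batch counts as a positive pair, a gradient step that decreases this loss pulls all same-class projections together while pushing different-class projections apart, which is the behavior recited in the final clause of claim 1.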
Description
Supervised contrastive learning with multiple positive examples

This application is a divisional of the Chinese patent application with application number 202180007180.4, international filing date of April 12, 2021, and title "Supervised contrastive learning with multiple positive examples."

Cross Reference to Related Applications
The present application claims priority to and the benefit of U.S. provisional patent application No. 63/013,153, filed on April 21, 2020. U.S. provisional patent application No. 63/013,153 is incorporated herein by reference in its entirety.

Technical Field
The present disclosure relates generally to systems and methods for contrastive learning of visual representations. More specifically, the present disclosure relates to systems and methods for performing supervised contrastive learning across multiple positive examples.

Background
Cross-entropy loss is probably the most widely used loss function in supervised learning. It is naturally defined as the KL divergence between two discrete distributions: the empirical label distribution (a discrete distribution of one-hot vectors) and the empirical distribution of the logits. Many works have explored the drawbacks of this loss, such as its lack of robustness to noisy labels and the possibility of poor margins, which can reduce generalization performance. In practice, however, most proposed alternatives have not performed better on large-scale datasets such as ImageNet, as evidenced by the continued use of cross entropy to achieve state-of-the-art results. Many of the improvements proposed for conventional cross entropy in fact involve relaxations of the loss definition, in particular of the requirement that the reference distribution be axis-aligned. Models trained with these modifications show improved generalization, robustness, and calibration. However, the proposed improvements do not completely eliminate the drawbacks of the cross-entropy loss.

Disclosure of Invention
Aspects and advantages of embodiments of the disclosure will be set forth in part in the description which follows, or may be learned from the description, or may be learned by practice of the embodiments. One example aspect of the present disclosure is directed to a computing system that performs supervised contrastive learning of visual representations. The computing system includes one or more processors and one or more non-transitory computer-readable media collectively storing: a base encoder neural network configured to process an input image to generate an embedded representation of the input image; a projection head neural network configured to process the embedded representation of the input image to generate a projected representation of the input image; and instructions that, when executed by the one or more processors, cause the computing system to perform operations. The operations include obtaining an anchor image associated with a first class of a plurality of classes, a plurality of positive images associated with the first class, and one or more negative images associated with one or more other classes of the plurality of classes, the one or more other classes being different from the first class.
The operations include processing the anchor image with the base encoder neural network to obtain an anchor embedded representation of the anchor image, processing the plurality of positive images to obtain a plurality of positive embedded representations, respectively, and processing the one or more negative images to obtain one or more negative embedded representations, respectively. The operations include processing the anchor embedded representation with the projection head neural network to obtain an anchor projected representation of the anchor image, processing the plurality of positive embedded representations to obtain a plurality of positive projected representations, respectively, and processing the one or more negative embedded representations to obtain one or more negative projected representations, respectively. The operations include evaluating a loss function that evaluates a similarity measure between the anchor projected representation and each of the plurality of positive projected representations and each of the one or more negative projected representations. The operations include modifying one or more values of one or more parameters of at least the base encoder neural network based at least in part on the loss function.

Other aspects of the disclosure are directed to various systems, apparatuses, non-transitory computer-readable media, user interfaces, and electronic devices. These and other features, aspects, and advantages of various embodiments of the present disclosure will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure.
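The operations above compose into a single training step: encode, project, evaluate the loss over the resulting projected representations, and update the parameters by gradient descent. The sketch below is again illustrative rather than the patented implementation: the stand-in linear encoder, the two-layer projection head, the input size, and the SGD settings are all hypothetical choices, and `supervised_contrastive_loss` refers to the sketch given after the claims.

```python
import torch
from torch import nn

# Stand-in dimensions and modules; a real system would use, e.g., a deep convolutional encoder.
image_pixels = 3 * 32 * 32          # illustrative flattened input size for small RGB images
embedding_dim, projection_dim = 2048, 128
base_encoder = nn.Sequential(nn.Flatten(), nn.Linear(image_pixels, embedding_dim))
projection_head = nn.Sequential(    # small MLP head over the embedded representation
    nn.Linear(embedding_dim, embedding_dim), nn.ReLU(),
    nn.Linear(embedding_dim, projection_dim),
)
optimizer = torch.optim.SGD(
    list(base_encoder.parameters()) + list(projection_head.parameters()), lr=0.05)

def training_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    embeddings = base_encoder(images)            # anchor/positive/negative embedded representations
    projections = projection_head(embeddings)    # anchor/positive/negative projected representations
    loss = supervised_contrastive_loss(projections, labels)
    optimizer.zero_grad()
    loss.backward()                              # gradients flow to both encoder and projection head
    optimizer.step()                             # modify parameter values based on the loss
    return loss.item()

# Example: one step on a batch of 16 CIFAR-sized images whose labels define the positive/negative pairs.
print(training_step(torch.randn(16, 3, 32, 32), torch.randint(0, 10, (16,))))
```

Updating both networks during training and then keeping only the base encoder at inference time is a common design in batch contrastive methods; the description above likewise requires only that at least the base encoder's parameters be modified.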