EP-4369298-B1 - HDR-BASED AUGMENTATION FOR CONTRASTIVE SELF-SUPERVISED LEARNING
Inventors
- KLEIN, Tassilo
- NABI, Moin
Dates
- Publication Date: 2026-05-13
- Application Date: 2023-08-31
Claims (12)
- A computer-implemented method comprising: receiving a first image (210A) and a second image (230A) as inputs for contrastive self-supervised learning; applying a high dynamic range augmentation (202) to the first image (210A) to generate a first pair of views (212A, 212B); applying the high dynamic range augmentation (204) to the second image (230A) to generate a second pair of views (232A, 232B); applying a first convolutional neural network (220A, 220B) to the first pair of views (212A, 212B) to output a first pair of encoded representations (222A, 222B); applying a second convolutional neural network (234A, 234B) to the second pair of views (232A, 232B) to output a second pair of encoded representations (238A, 238B); projecting the first pair of encoded representations (222A, 222B) to form first projected representations (294A, 294B); projecting the second pair of encoded representations (238A, 238B) to form second projected representations (294C, 294D); and training a machine learning model (200) using the high dynamic range augmentations (202, 204) and an objective function that provides contrastive self-supervised learning.
- The method of claim 1, wherein the first image (210A) and the second image (230A) are each selected from an image library of unlabeled images.
- The method of claim 1 or 2, wherein the first image (210A) and the second image (230A) are dissimilar images that depict different content.
- The method of any one of the preceding claims, wherein the high dynamic range augmentation (202) used to generate the first pair of views (212A, 212B) comprises a synthetic high dynamic range generation of the first pair of views (212A, 212B), and wherein the high dynamic range augmentation (204) used to generate the second pair of views (232A, 232B) comprises the synthetic high dynamic range generation of the second pair of views (232A, 232B).
- The method of claim 4, further comprising: selecting the high dynamic range augmentation (202, 204) from a group of augmentations available for use to augment the first image (210A) and the second image (230A).
- The method of any one of the preceding claims, wherein a first encoder projects the first pair of encoded representations (222A, 222B) to form the first projected representations (294A, 294B).
- The method of any one of the preceding claims, wherein a second encoder projects the second pair of encoded representations (238A, 238B) to form the second projected representations (294C, 294D).
- The method of any one of the preceding claims, wherein a first neural network (292A, 292B) comprising a first multi-layer perceptron projects the first pair of encoded representations (222A, 222B) to form the first projected representations (294A, 294B).
- The method of any one of the preceding claims, wherein a second neural network (292C, 292D) comprising a second multi-layer perceptron projects the second pair of encoded representations (238A, 238B) to form the second projected representations (294C, 294D).
- The method of any one of the preceding claims, further comprising: deploying the trained machine learning model (200) to perform an image classification task during an inference phase of the machine learning model (200).
- A system (500), comprising: at least one data processor (510); and at least one memory (520) storing instructions which, when executed by the at least one data processor (510), result in operations according to a method of any one of the preceding claims.
- A non-transitory computer-readable storage medium (530) including instructions which, when executed by at least one data processor (510), result in operations according to a method of any one of claims 1 to 10.
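The pipeline recited in claim 1 (augment each image into a pair of views, encode each pair with a convolutional network, project the encodings, and train against a contrastive objective) can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in: the function names, the exposure/gamma parameters of `hdr_augment`, and the toy linear "encoders" are illustrative only and are not the patented implementation, which the claims do not limit to particular weights or augmentation parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def hdr_augment(image, rng):
    """Hypothetical HDR-style augmentation: simulate two exposures of the
    same image via random exposure scaling and gamma tone-mapping,
    yielding a correlated pair of views (cf. views 212A/212B)."""
    def one_view():
        exposure = rng.uniform(0.5, 2.0)   # illustrative exposure factor
        gamma = rng.uniform(0.8, 1.2)      # illustrative tone-mapping curve
        return np.clip(image * exposure, 0.0, 1.0) ** gamma
    return one_view(), one_view()

def encode(view, w_enc):
    """Stand-in for a convolutional encoder (220A/220B, 234A/234B):
    flatten + linear map + ReLU, purely for illustration."""
    return np.maximum(w_enc @ view.ravel(), 0.0)

def project(h, w_proj):
    """Stand-in for the projection step: linear map, then unit-normalize
    so similarities are cosine similarities."""
    z = w_proj @ h
    return z / (np.linalg.norm(z) + 1e-12)

# Two unlabeled input images (claim 2), as 8x8 grayscale toys.
img_a = rng.random((8, 8))
img_b = rng.random((8, 8))

# Claim 1 recites a first and a second CNN; modeled here as two weight sets.
w_enc1 = rng.standard_normal((16, 64)) * 0.1
w_enc2 = rng.standard_normal((16, 64)) * 0.1
w_proj1 = rng.standard_normal((8, 16)) * 0.1
w_proj2 = rng.standard_normal((8, 16)) * 0.1

# Claim 1 steps: augment -> encode -> project, per image.
views_a = hdr_augment(img_a, rng)   # 212A, 212B
views_b = hdr_augment(img_b, rng)   # 232A, 232B
z_a = [project(encode(v, w_enc1), w_proj1) for v in views_a]  # 294A, 294B
z_b = [project(encode(v, w_enc2), w_proj2) for v in views_b]  # 294C, 294D
```

The projected representations would then feed the contrastive objective of claim 1, which pulls each positive pair (the two views of one image) together and pushes views of different images apart.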
Description
FIELD

The present disclosure generally relates to machine learning.

BACKGROUND

Deep learning machine learning (ML) models may require human supervision during the training of the ML model. The robustness of the ML model may depend on various aspects, such as the types of images, the quantity of images, the variability of the images in the training set, and the like. As such, an image training set lacking in these aspects may deteriorate the performance of the ML model. For example, an ML model (which is used to detect or recognize objects in an image) may be trained with a "poor" image training set, in which case the ML model may have poor performance, such as in scenarios with fine-grained boundaries between object categories.

Document Chen Ting et al: "A Simple Framework for Contrastive Learning of Visual Representations", GitHub page for the corresponding paper in ICML 2020, 9 November 2021 (2021-11-09), XP093120822, discloses a framework for contrastive learning of visual representations, wherein contrastive self-supervised learning algorithms are simplified without requiring specialized architectures or a memory bank. Document Chen Ting et al: "A Simple Framework for Contrastive Learning of Visual Representations", 1 July 2020 (2020-07-01), pages 1-20, XP093037179, DOI: 10.48550/arXiv.2002.05709, discloses a similar framework for contrastive learning of visual representations. Document US 2021/0286997 A1 relates to detecting objects from a high-resolution image. Part images are generated based on a preceding object detection result and object tracking result with respect to a high-resolution image, and augmented images are generated by applying data augmentation to the part images. Based on artificial intelligence (AI), an object can thus be detected and tracked by using the generated augmented images, and re-inference can be performed based on the detection and tracking result.
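The contrastive framework of the cited Chen et al. documents optimizes a normalized temperature-scaled cross-entropy (NT-Xent) objective over pairs of augmented views. A minimal numpy sketch of that objective follows; the temperature value and the batch layout (rows i and i+N holding the two views of image i) are illustrative assumptions, not prescribed by the present disclosure.

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent loss over 2N L2-normalized projections, where rows i and
    i+N are the two augmented views of the same image (positive pair).
    Illustrative numpy version of the cited contrastive objective."""
    n2 = z.shape[0]                      # 2N projections
    n = n2 // 2
    sim = (z @ z.T) / temperature        # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)       # exclude self-similarity
    # Index of each row's positive partner: i <-> i+N.
    pos = np.concatenate([np.arange(n, n2), np.arange(0, n)])
    # Cross-entropy of the positive against the 2N-2 other candidates.
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(n2), pos] - logsumexp)
    return loss.mean()

# Toy batch: N=4 images, 2 views each -> 8 unit-normalized projections.
rng = np.random.default_rng(1)
z = rng.standard_normal((8, 4))
z /= np.linalg.norm(z, axis=1, keepdims=True)
loss = nt_xent_loss(z)
```

Minimizing this loss pulls each positive pair together in the projection space while pushing apart views originating from different images.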
SUMMARY

Methods, systems, and articles of manufacture, including computer program products, are provided for high dynamic range (HDR) augmentation. According to an aspect, a system includes at least one data processor and at least one memory storing instructions which, when executed by the at least one data processor, result in operations including: receiving a first image and a second image as inputs for contrastive self-supervised learning; applying a high dynamic range augmentation to the first image to generate a first pair of views; applying the high dynamic range augmentation to the second image to generate a second pair of views; applying a first convolutional neural network to the first pair of views to output a first pair of encoded representations; applying a second convolutional neural network to the second pair of views to output a second pair of encoded representations; projecting the first pair of encoded representations to form first projected representations; projecting the second pair of encoded representations to form second projected representations; and training a machine learning model using the high dynamic range augmentations and an objective function that provides contrastive self-supervised learning.

In some variations, one or more of the features disclosed herein, including the following features, can optionally be included in any feasible combination. The first image and the second image may each be selected from an image library of unlabeled images. The first image and the second image may be dissimilar images that depict different content. The high dynamic range augmentation used to generate the first pair of views may comprise a synthetic high dynamic range generation of the first pair of views. The high dynamic range augmentation used to generate the second pair of views may comprise the synthetic high dynamic range generation of the second pair of views.
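The summary's projection step, which this disclosure elsewhere describes as performable by a neural network comprising a multi-layer perceptron, can be sketched as follows. The hidden width, ReLU activation, and weight scales are illustrative assumptions; the disclosure does not fix a particular projection-head architecture.

```python
import numpy as np

def mlp_projection_head(h, w1, b1, w2, b2):
    """Illustrative two-layer MLP projection head: linear -> ReLU -> linear,
    mapping an encoded representation h to a projected representation z."""
    hidden = np.maximum(w1 @ h + b1, 0.0)
    return w2 @ hidden + b2

rng = np.random.default_rng(2)
dim_in, dim_hidden, dim_out = 32, 32, 16      # illustrative sizes
w1 = rng.standard_normal((dim_hidden, dim_in)) * 0.1
b1 = np.zeros(dim_hidden)
w2 = rng.standard_normal((dim_out, dim_hidden)) * 0.1
b2 = np.zeros(dim_out)

h = rng.standard_normal(dim_in)               # an encoded representation
z = mlp_projection_head(h, w1, b1, w2, b2)    # its projected representation
```

The contrastive objective is then computed on the projected representations z rather than on the encoder outputs directly.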
The high dynamic range augmentation may be selected from a group of augmentations available for use to augment the first image and the second image. A first encoder may project the first pair of encoded representations to form the first projected representations. A second encoder may project the second pair of encoded representations to form the second projected representations. A first neural network including a first multi-layer perceptron may project the first pair of encoded representations to form the first projected representations. A second neural network including a second multi-layer perceptron may project the second pair of encoded representations to form the second projected representations. The trained machine learning model may be deployed to perform an image classification task during an inference phase of the machine learning model.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive. Further features and/or variations may be provided in addition to those set forth herein. For example, the implementations described herein may be