EP-4742098-A1 - METHODS, SYSTEMS, AND COMPUTER PROGRAM PRODUCTS FOR DETERMINING MACHINE LEARNING MODELS FOR COMPUTER VISION

Abstract

A computer-implemented method of determining a machine learning model, the method comprising: providing a first training dataset; pre-training a first machine learning model on the first training dataset by self-supervised learning, SSL, wherein the pre-training by SSL comprises determining, in particular minimizing, a first loss function; and training the pre-trained first machine learning model on the first training dataset by knowledge distillation to generate at least one first representation of a data point comprised in the first training dataset, wherein the training by knowledge distillation comprises: determining, in particular minimizing, the first loss function and a second loss function indicative of a similarity of the first representation and a second representation generated by a second machine learning model.

Inventors

  • MAFLA DELGADO, Andrés
  • BITEN, Ali Furkan
  • KOUSKOURIDAS, Rigas

Assignees

  • Helsing GmbH

Dates

Publication Date
2026-05-13
Application Date
2024-11-07

Claims (15)

  1. A computer-implemented method of determining a machine learning model, the method comprising: providing a first training dataset (212, 400); pre-training a first machine learning model (214) on the first training dataset (212, 400) by self-supervised learning, SSL, wherein the pre-training by SSL comprises determining, in particular minimizing, a first loss function (218); and training the pre-trained first machine learning model on the first training dataset (212, 400) by knowledge distillation to generate at least one first representation (308) of a data point comprised in the first training dataset (212, 400), wherein the training by knowledge distillation comprises: determining, in particular minimizing, the first loss function (218) and a second loss function (302) indicative of a similarity of the first representation (308) and a second representation generated by a second machine learning model (204).
  2. The method of claim 1, wherein training by knowledge distillation comprises: computing a combined loss function (310) based on the first loss function (218) and the second loss function (302); and training the pre-trained first machine learning model based on the combined loss function (310), in particular by minimizing the combined loss function (310).
  3. The method of any of the preceding claims, wherein the first loss function (218) is based on a contrastive loss (220), self-distillation loss (222), and/or masking loss (224).
  4. The method of any of the preceding claims, wherein the trained first machine learning model (214) is applicable for computer vision, in particular as a backbone for an image classifier, an object detector, and/or a key point matching network.
  5. The method of any of the preceding claims, wherein pre-training the first machine learning model (214) comprises training using DINO.
  6. The method of any of the preceding claims, wherein the second machine learning model (204) has been trained on a second training dataset (202, 400) different from the first training dataset (212, 400).
  7. The method of claim 6, wherein the second training dataset (202, 400) has a larger cardinality than the first training dataset (212, 400), and wherein the first training dataset (212, 400) is preferably a subset of the second training dataset (202, 400).
  8. The method of any of the preceding claims, wherein the second machine learning model (204) has been trained by DINOv2.
  9. The method of any of the preceding claims, wherein the second loss function (302) is based on a distance metric (304) and/or a cosine similarity (306).
  10. The method of any of the preceding claims, wherein the first machine learning model (214) and/or the second machine learning model (204) is applicable as a backbone for an image classifier, an object detector, and/or a key point matching network.
  11. The method of any of the preceding claims, wherein the first training dataset (212, 400) and/or second training dataset (202, 400) comprises image data and/or wherein the data point comprises an image.
  12. The method of claim 11, wherein the image data comprises at least one subset, wherein each subset comprises an initial image (404), a local portion (406) of the initial image (404), and/or a global portion (408) of the initial image (404).
  13. A system comprising one or more processors and one or more storage devices, wherein the system is configured to perform the computer-implemented method of any one of claims 1-12.
  14. A computer program product for loading into a memory of a computer, comprising instructions that, when executed by a processor of the computer, cause the computer to execute the computer-implemented method of any of claims 1-12.
  15. A computer program product comprising a trained machine learning module obtainable by the computer-implemented method of any of claims 1-12.
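
The claims leave the concrete form of the loss functions open. Purely as an illustration of the second loss function of claim 9 (based on a distance metric and/or a cosine similarity) and of the combined loss function of claim 2, the following minimal PyTorch sketch may help; the function names and the weighting factor lambda_distill are hypothetical and introduced here only for illustration:

    import torch
    import torch.nn.functional as F

    def second_loss(student_repr: torch.Tensor,
                    teacher_repr: torch.Tensor,
                    use_cosine: bool = True) -> torch.Tensor:
        # Second loss function (claim 9): indicative of the similarity of the
        # first (student) and second (teacher) representations.
        if use_cosine:
            # 1 - cosine similarity vanishes when the representations align.
            return (1.0 - F.cosine_similarity(student_repr, teacher_repr,
                                              dim=-1)).mean()
        # Alternative distance metric: mean squared (L2) distance.
        return F.mse_loss(student_repr, teacher_repr)

    def combined_loss(first_loss: torch.Tensor,
                      distill_loss: torch.Tensor,
                      lambda_distill: float = 1.0) -> torch.Tensor:
        # Combined loss function (claim 2): here a weighted sum of the first
        # (SSL) loss and the second (distillation) loss; the weighting is an
        # assumption, not part of the claims.
        return first_loss + lambda_distill * distill_loss

Minimizing such a combined loss during training corresponds to determining, in particular minimizing, both loss functions as recited in claim 1.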

Description

Technical Field

The present disclosure relates to systems, methods, and computer program products for determining machine learning models that are applicable for computer vision. The present disclosure is applicable in the field of processing sensor data, in particular processing image data.

Background

An important task in computer vision is generating a representation of an image that captures its global and local features. Such a representation may be processed to perform further tasks, for example classifying images, recognizing an object in an image, or determining key points in an image. Such a task is solved more accurately if the representation captures more features of the image. It is known to train machine learning algorithms by supervised learning to process images and to generate representations of the images. However, this approach requires labelled data, which typically requires manual labelling. Another approach is self-supervised learning. However, representations obtained by self-supervised learning capture only a very limited part of the features of the image. Moreover, training efficiency is limited: a large number of training epochs is required to obtain a useful model, and even then a large training dataset is needed. There is a need for systems and methods that overcome these shortcomings.

Summary

Disclosed and claimed herein are systems, methods, and devices for determining machine learning models for computer vision.

A first aspect of the present disclosure relates to a computer-implemented method of determining a machine learning model. The method comprises the following steps:

  • providing a first training dataset;
  • pre-training a first machine learning model on the first training dataset by self-supervised learning, SSL, wherein the pre-training by SSL comprises determining, in particular minimizing, a first loss function; and
  • training the pre-trained first machine learning model on the first training dataset by knowledge distillation to generate at least one first representation of a data point comprised in the first training dataset.

Herein, the training by knowledge distillation comprises determining, in particular minimizing, the first loss function and a second loss function indicative of a similarity of the first representation and a second representation generated by a second machine learning model. Determining the first and second loss functions may be performed independently. Alternatively or additionally, it may be performed simultaneously (e.g., in parallel) or sequentially.

Knowledge distillation may be defined as a process in which a smaller, more efficient student model is trained to replicate the behaviour of a larger, more complex teacher model. This can result in a transfer of knowledge from the teacher model to the student model. In other words, knowledge distillation may lead to an alignment of the first representation (generated by the first machine learning model) with the second representation (generated by the second machine learning model).

The above method has been shown to increase the performance of the first machine learning model and to make training converge faster. It is preferred to minimize both the first loss function and the second loss function at the same time, which may involve calculating both loss functions during each training step. This allows adjusting the weights during training such that both losses are reduced simultaneously, as the sketch below illustrates.
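
A minimal sketch of such a training step, assuming PyTorch; the names student, teacher, loader, ssl_loss_fn, and distill_loss_fn, the weighting factor lambda_distill, and the choice of the AdamW optimizer are illustrative assumptions rather than part of the disclosure (a real DINO-style SSL objective would operate on several augmented views of each image, a detail elided here):

    import torch

    def train_with_distillation(student: torch.nn.Module,
                                teacher: torch.nn.Module,
                                loader,
                                ssl_loss_fn,
                                distill_loss_fn,
                                epochs: int = 10,
                                lambda_distill: float = 1.0) -> None:
        # The second (teacher) machine learning model is frozen: its
        # parameters are never updated during execution of the method.
        teacher.eval()
        for p in teacher.parameters():
            p.requires_grad_(False)

        opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
        for _ in range(epochs):
            for batch in loader:
                student_repr = student(batch)        # first representation
                with torch.no_grad():
                    teacher_repr = teacher(batch)    # second representation
                # Both loss functions are computed in the same training step,
                # so one backward pass adjusts the weights such that both
                # losses are reduced simultaneously.
                loss = (ssl_loss_fn(student_repr)
                        + lambda_distill * distill_loss_fn(student_repr,
                                                           teacher_repr))
                opt.zero_grad()
                loss.backward()
                opt.step()

Because the combined loss is differentiated in a single backward pass, each optimizer step moves the weights in a direction that reduces both objectives at once.
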
In other words, the model is trained against two objective functions (representing the data points and the alignment with the second machine learning model) simultaneously. The term "simultaneously" as used herein may mean "concurrently", "in parallel", or "successively (in close temporal relation)". Thus, the training time and the required computational resources are reduced. To obtain these advantages, the second machine learning model is frozen, i.e., no training and/or adjustment of parameters is performed during execution of the method.

Moreover, the method has been shown to reduce the reliance on large annotated datasets and is therefore particularly beneficial in scenarios where labelled data is scarce or expensive to obtain. The representations generated by the trained first machine learning model are more accurate and capture more details.

Any training step may train the first machine learning model to generate one or more representations. In particular, the pre-training may train the first machine learning model to generate one or more initial representations of a data point comprised in the first training dataset. Training by knowledge distillation may train the first machine learning model to refine the representations, which may conceptually be similar or identical to the initial representations but generally have different values, as the first machine learning model is more aligned with the second machine learning model after training. In an e