EP-4742185-A1 - A METHOD FOR GENERATING A DISCRIMINATIVE LARGE VISION-LANGUAGE MODEL
Abstract
Broadly speaking, embodiments of the present techniques provide a method for transforming a generative large vision language model, LVLM, into a discriminative large vision language model, LVLM.
Inventors
- MANIADIS, Ioannis
- OUALI, Yassine
- BULAT, Adrian
- TZIMIROPOULOS, Georgios
- ZAGANIDIS, Anestis
- ALONSO, Brais Martinez
- XENOS, Alexandros
Assignees
- Samsung Electronics Co., Ltd.
Dates
- Publication Date
- 20260513
- Application Date
- 20251021
Claims (15)
- A computer-implemented method for generating a discriminative large vision language model, LVLM, for performing image-based tasks, the method comprising: obtaining a pre-trained generative LVLM for generating images and/or text, wherein the pre-trained generative LVLM comprises: a large language model, LLM, and a vision encoder; obtaining a training dataset comprising input images, wherein each input image has at least one text caption and wherein each text caption is either a short caption or a long caption; and training the pre-trained generative LVLM to perform discriminative image-based tasks, by: for each input image of the training dataset having a short caption: generating, using the LVLM, an image embedding for the input image and a text embedding for the short caption; and minimising a contrastive training loss using the generated image embedding and text embedding; and for each input image of the training dataset having a long caption: generating, using the LLM, a predicted long caption for the input image; and minimising an autoregressive training loss between the long caption for the input image and the predicted long caption; wherein the contrastive training loss and autoregressive training loss are minimised jointly.
- The method as claimed in claim 1 wherein generating an image embedding for the input image comprises: inputting, into the LLM, the input image together with a first prompt, wherein the first prompt causes the LLM to generate a first short summary of the input image; generating, using the LLM, the first short summary of the input image; and generating, using the LLM, an image embedding for the first short summary.
- The method as claimed in claim 2 wherein inputting the input image together with a first prompt comprises: inputting the input image together with a first prompt to generate a first single word summary of the input image.
- The method as claimed in claim 2 or 3 wherein inputting the input image comprises inputting an initial image embedding for the image into the LLM.
- The method as claimed in claim 4 further comprising generating the initial image embedding by: generating, using the vision encoder, a vision embedding for the input image, wherein the vision embedding encodes vision features extracted from the input image by the vision encoder and wherein the vision embedding is in a vision embedding space; and converting, using a projector module of the LVLM, the vision embedding from the vision embedding space into the initial image embedding in a textual embedding space.
- The method as claimed in any preceding claim wherein when the text caption is a short caption, generating a text embedding for the short caption comprises: inputting, into the LLM, the short caption together with a second prompt to generate a second short summary of the short caption; generating, using the LLM, the second short summary of the short caption; and generating, using the LLM, a text embedding for the second short summary.
- The method as claimed in claim 6 wherein inputting the short caption together with a second prompt comprises: inputting the short caption together with a second prompt to generate a second single word summary of the short caption.
- The method as claimed in any preceding claim wherein minimising a contrastive training loss comprises: calculating a similarity between the generated image embedding and text embedding for each image having a short caption; and minimising the contrastive training loss by adjusting parameters of the LLM to increase the calculated similarity between the generated image embeddings and text embeddings.
- The method as claimed in any preceding claim wherein generating, using the LLM, a predicted long caption for the input image comprises: inputting, into the LLM, the input image together with a third prompt to generate a predicted long caption of the input image.
- The method as claimed in claim 9 wherein generating, using the LLM, a predicted long caption for the input image comprises: selecting an initial token for the predicted long caption based on the input image and the third prompt, wherein the predicted long caption comprises a sequence of a plurality of tokens; and predicting each subsequent token in the sequence for the predicted long caption based on the previous token in the sequence.
- The method as claimed in claim 10 wherein minimising an autoregressive training loss comprises: minimising an autoregressive training loss by adjusting parameters of the LLM to increase a probability that each token in the predicted long caption matches each token in the long caption for the input image.
- The method as claimed in any preceding claim wherein: generating, using the LLM, an image embedding for the input image and a text embedding for the short caption comprises using a low-rank adapter for the LLM; minimising a contrastive training loss using the generated image embedding and text embedding comprises adjusting parameters of the low-rank adapter; generating, using the LLM, a predicted long caption for the input image comprises using the low-rank adapter for the LLM; and minimising an autoregressive training loss between the long caption for the input image and the predicted long caption comprises adjusting parameters of the low-rank adapter.
- A computer-implemented method for using a trained discriminative large vision language model, LVLM, on a user device, wherein the LVLM has been trained using the method of any of claims 1 to 12, the method comprising: obtaining a user query in relation to an image; generating, using the trained LVLM to process the user query, a response to the user query; and outputting the generated response.
- The method as claimed in claim 13 wherein: obtaining a user query in relation to an image comprises obtaining an image and a user text query to describe the image; generating a response to the user query comprises generating, using the trained LVLM to process the user query, a long caption; and outputting the generated response comprises outputting the generated long caption to a display of the user device.
- The method as claimed in claim 13 wherein: obtaining a user query in relation to an image comprises obtaining a user text query to retrieve at least one image from a plurality of images stored on the user device; generating a response to the user query comprises: generating, using the trained LVLM to process the user query, a text embedding for the user text query, and searching the plurality of images to identify an image having an image embedding that is most similar to the generated text embedding; and outputting the generated response comprises outputting the identified image to a display of the user device.
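For illustration only (this sketch is not part of the application as filed), the joint training objective of claim 1 might look as follows in Python/PyTorch. The `lvlm` object, its `embed_from_prompt` and `caption_logits` helpers, the prompt strings, the temperature and the loss weight `lam` are all hypothetical placeholders introduced here for illustration; the claims do not prescribe any particular implementation of them.

```python
import torch
import torch.nn.functional as F

# Hypothetical prompts (illustrative only; the claims do not fix their wording).
IMG_PROMPT = "Summarise the image in one word:"
TXT_PROMPT = "Summarise the caption in one word:"

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of matched image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def training_step(lvlm, short_batch, long_batch, lam=1.0):
    """One joint step: contrastive loss on short captions, autoregressive loss on long ones.

    `lvlm.embed_from_prompt` and `lvlm.caption_logits` are assumed helpers that
    (a) return an LLM hidden state for a prompted one-word summary as an embedding, and
    (b) return next-token logits and shifted targets for a target caption, respectively.
    """
    # Short-caption branch (cf. claims 2 to 8): embed image and caption via summary prompts.
    img_emb = lvlm.embed_from_prompt(images=short_batch["images"], prompt=IMG_PROMPT)
    txt_emb = lvlm.embed_from_prompt(texts=short_batch["captions"], prompt=TXT_PROMPT)
    l_con = contrastive_loss(img_emb, txt_emb)

    # Long-caption branch (cf. claims 9 to 11): standard next-token prediction.
    logits, targets = lvlm.caption_logits(images=long_batch["images"],
                                          captions=long_batch["captions"])
    l_ar = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

    # Claims 1 and 12: the two losses are minimised jointly (here as a weighted sum),
    # typically by updating only low-rank adapter parameters of the LLM.
    return l_con + lam * l_ar
```

A weighted sum is only one simple way of "minimising jointly" as recited in claim 1; the weighting and the choice of trainable parameters are design choices outside what the sketch can establish.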
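Similarly, the on-device image retrieval of claim 15 might, purely as a non-limiting sketch, be implemented along the following lines. The `gallery_embeddings` tensor, the `image_paths` list and the reused `embed_from_prompt` helper are again hypothetical.

```python
import torch
import torch.nn.functional as F

def retrieve_image(lvlm, query, gallery_embeddings, image_paths):
    """Text-to-image retrieval with the trained discriminative LVLM.

    `gallery_embeddings` is assumed to be a pre-computed (N, D) tensor of image
    embeddings for the images stored on the device; `image_paths` lists the
    corresponding N image files.
    """
    txt_emb = lvlm.embed_from_prompt(texts=[query],
                                     prompt="Summarise the caption in one word:")
    sims = F.normalize(txt_emb, dim=-1) @ F.normalize(gallery_embeddings, dim=-1).t()
    best = torch.argmax(sims, dim=-1).item()   # index of the most similar stored image
    return image_paths[best]
```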
Description
Field
The present application generally relates to a method for generating a discriminative large vision-language model. In particular, the present application provides a method for transforming a generative large vision language model, LVLM, into a discriminative large vision language model, LVLM.
Background
Contrastively-trained Vision Language Models (VLMs) (e.g. CLIP) have become the predominant direction for vision-language representation learning, exhibiting remarkable zero-shot abilities. However, the great success of these models in many vision-language and vision tasks, even in a zero-shot manner, "sweeps under the rug" some of their important limitations. Specifically, such models struggle to exhibit advanced language understanding capabilities, suffer from a limited understanding of compositionality, and manifest a bag of words behaviour. (A bag of words behaviour is where a model uses the frequency of words to represent text, but in doing so loses the semantic meaning and context of the text.) For example, despite their bag of words behaviour, VLMs have shown remarkable zero-shot retrieval accuracy on the Flickr and COCO datasets; yet they perform poorly on a simple word-order permutation task on the same datasets. Unfortunately, these issues persist even when the model and dataset size increase.
Concomitantly, inspired by the success of LLMs in acting as generalist assistants, a series of works combine pre-trained vision encoders and LLMs to construct Large Vision-Language Models (LVLMs) capable of performing interactive multi-modal conversations. Among others, these models have been shown to exhibit strong reasoning and vision-language understanding capabilities, offering fine-grained and detailed responses. However, they are trained with a next-token prediction loss in an autoregressive manner, which appears less suitable for direct use in discriminative image-text tasks (e.g. image-text retrieval).
A generative LVLM is one which can generate new images and/or new text from an input prompt (which may be an image or text). For example, a generative LVLM may be able to generate a new image given an input image, or may be able to generate a description or caption for the input image. A generative LVLM could generate new photos of animals that look like real animals. A discriminative LVLM is one which can discriminate between different types of input instances. For example, a discriminative LVLM could distinguish between a dog and a cat in input images.
The present applicant has therefore identified the need for improved LVLMs.
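As a purely illustrative aside (not part of the application), the "bag of words" behaviour described above can be made concrete with a few lines of Python: a representation that keeps only word frequencies cannot distinguish two captions that differ only in word order, even though their meanings differ.

```python
from collections import Counter

def bag_of_words(caption: str) -> Counter:
    """Represent a caption purely by word frequencies, discarding word order."""
    return Counter(caption.lower().split())

a = "a dog chasing a cat"
b = "a cat chasing a dog"   # same words, opposite meaning
print(bag_of_words(a) == bag_of_words(b))  # True: the two representations are identical
```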
Summary
In a first approach of the present techniques, there is provided a computer-implemented method for generating a discriminative large vision language model, LVLM, for performing image-based tasks, the method comprising: obtaining a pre-trained generative LVLM for generating images and/or text, wherein the pre-trained generative LVLM comprises: a large language model, LLM, and a vision encoder; obtaining a training dataset comprising input images, wherein each input image has at least one text caption and wherein each text caption is either a short caption or a long caption; and training the pre-trained generative LVLM to perform discriminative image-based tasks, by: for each input image of the training dataset having a short caption: generating, using the LVLM, an image embedding for the input image and a text embedding for the short caption; and minimising a contrastive training loss using the generated image embedding and text embedding; and for each input image of the training dataset having a long caption: generating, using the LLM, a predicted long caption for the input image; and minimising an autoregressive training loss between the long caption for the input image and the predicted long caption; wherein the contrastive training loss and autoregressive training loss are minimised jointly.
Advantageously, the present techniques overcome the problems with existing vision-language models and provide an LVLM which has enhanced language understanding and can be used both for generative image-based tasks (e.g. image generation from text prompts) and for discriminative image-based tasks (e.g. image retrieval or image description/captioning). The present techniques solve these problems by introducing two training losses: one based on short captions and one based on long captions. As explained in more detail below, the short captions are used to ensure that the LVLM is able to correctly understand images, by ensuring that a text embedding produced by the LLM for the short caption is similar to, or substantially matches, an image embedding produced by the LLM (aided by the vision encoder). Thus, the LLM is trained to understand both text and image features. The long captions are used to ensure that the LVLM does not exhibit bag of words behaviour and has a better semantic understanding of images. By optimising both losses jointly, the resulting trained LVLM is able to capture and summarise