EP-4736029-A1 - IMAGE QUERY PROCESSING USING LARGE LANGUAGE MODELS


Abstract

Implementations utilize an LLM to respond to queries comprising image data, such as multimodal queries that include both text and image data. A natural language processing system is extended such that when an image is provided, the natural language processing system invokes one or more auxiliary image processing models (e.g., visual query) and/or image search engines. The results, of invoking such model(s) and/or search engine(s), are collected into structured data signals related to the image. These signals form part of the conversation context and are used to extend the text prompt that is sent to the LLM. This allows the LLM to take the context into account when being used to process the user query, thereby enabling generation of an LLM reply that addresses relevant feature(s) of the image.
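The collection of results into structured signals and the extension of the text prompt can be illustrated with a minimal sketch. The abstract does not specify a data layout or prompt wording, so the class `ImageSignals`, its field names, and the functions `signals_to_context` and `extend_prompt` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ImageSignals:
    """Structured signals about one image (all field names are hypothetical)."""
    caption: str = ""
    ocr_text: str = ""
    objects: List[str] = field(default_factory=list)
    search_extracts: List[str] = field(default_factory=list)


def signals_to_context(signals: ImageSignals) -> str:
    """Render the structured signals as plain text that an LLM can read."""
    parts = []
    if signals.caption:
        parts.append(f"Image caption: {signals.caption}")
    if signals.objects:
        parts.append("Detected objects: " + ", ".join(signals.objects))
    if signals.ocr_text:
        parts.append(f"Text in image: {signals.ocr_text}")
    for extract in signals.search_extracts:
        parts.append(f"Related web text: {extract}")
    return "\n".join(parts)


def extend_prompt(user_text: str, signals: ImageSignals) -> str:
    """Prepend the image context to the user's text query."""
    context = signals_to_context(signals)
    if not context:
        # No signals were produced, so the text query is passed through as-is.
        return user_text
    return f"{context}\n\nUser query: {user_text}"
```

In this sketch the signals are kept as structured data until the last step, so the same record could be stored in the conversation context and rendered again on later turns.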

Inventors

SIEGENTHALER, Olivier
WEISZ, Ágoston
BLUNTSCHLI, BORIS
BANICA, DAN
ÖZGÜN, KAAN EGE
MOGOREANU, Daniel
SLADEK, Filip

Assignees

Google LLC

Dates

Publication Date: 20260506
Application Date: 20240813

Claims (20)

1. A method implemented by one or more processors, the method comprising: receiving an input query associated with a client device, the input query comprising an input image and an input text query, wherein the input text query refers to the input image and comprises one or more implicit queries; generating, using an explication model and based on the input text query, one or more explicit text queries that explicate one or more of the implicit queries in the input text query; processing, using a multi-modal image processing model, the input image and the one or more explicit text queries to generate one or more natural language descriptors, the one or more natural language descriptors descriptive of one or more properties of the input image, wherein the one or more natural language descriptors are responsive to the one or more explicit text queries; generating, based on the one or more natural language descriptors and the input text query and/or the one or more explicit text queries, an input prompt for a large language model, LLM; generating, from the input prompt and using the LLM, a response to the input query; and causing the response to the input query to be rendered at the client device.
2. The method of claim 1, wherein the multi-modal image processing model is a visual query answering model.
3. The method of any of claims 1 or 2, wherein generating the input prompt for the LLM comprises: completing one or more pre-defined strings using the one or more natural language descriptors.
4. The method of any preceding claim, wherein the method further comprises: processing, using one or more unimodal image processing models, the input image to generate one or more query independent properties of the input image, wherein generating the input prompt for the LLM is further based on the one or more query independent properties of the input image.
5. The method of claim 4, wherein generating the one or more explicit text queries further comprises: processing, using the explication model, the one or more query independent properties of the input image.
6. The method of any of claims 4 or 5, wherein generating the input prompt for the LLM comprises: completing one or more pre-defined strings using the one or more query independent properties of the input image.
7. The method of any of claims 4 to 6, wherein the one or more unimodal image processing models comprises: an object detection model; an entity recognition model; a captioning model; an optical character recognition model; and/or an image segmentation model.
8. The method of any preceding claim, wherein the input prompt for the LLM comprises contextual information indicative of contents of the image, wherein the contextual information is based on the one or more natural language descriptors.
9. The method of any preceding claim, further comprising: generating, based on the input image, a search request for a search engine; transmitting, to the search engine, the search request; receiving, from the search engine and in response to the search request, a search response, wherein generating the input prompt for the LLM is further based on the search response.
10. The method of claim 9, wherein the search request is based on the one or more natural language descriptors and/or the one or more explicit text queries.
11. The method of any of claims 9 or 10, wherein the search response comprises one or more text extracts associated with one or more images returned by the search engine in response to the search request.
12. The method of any of claims 9 to 11, wherein: the search request is an image search request requesting similar images to the input image; and the search response comprises text from one or more resources in which at least one of the images responsive to the image search request are incorporated.
13. The method of claim 12, wherein: the search response comprises the one or more resources in which the images responsive to the image search request are incorporated; and the method further comprises extracting the text from the one or more resources in which at least one of the images responsive to the image search request are incorporated.
14. The method of any of claims 12 or 13, wherein the text from the one or more resources in which at least one of the images responsive to the image search request are incorporated comprises: text of one or more webpages in which at least one of the images responsive to the image search request are incorporated; text of one or more captions of at least one of the images responsive to the image search request; one or more tags of at least one of the images responsive to the image search request; and/or one or more sets of metadata of at least one of the images responsive to the image search request.
15. The method of any preceding claim, further comprising: receiving a conversation history comprising a summary of previous user interactions with the client device, wherein generating the one or more explicit text queries is further based on the conversation history.
16. The method of any preceding claim, wherein the explication model comprises the LLM or a further LLM.
17. A method implemented by one or more processors, the method comprising: receiving an input query associated with a client device, the input query comprising an input image; generating, based on the input image, an image search request for a search engine; transmitting, to the search engine, the image search request; receiving, from the search engine and in response to the image search request, a search response comprising one or more web resources containing at least one of one or more images responsive to the image search request; extracting one or more text extracts from the one or more web resources; generating, based on the one or more text extracts, an input prompt for a large language model, LLM; generating, from the input prompt and using the LLM, a response to the input query; and causing the response to the input query to be rendered at a client device.
18. The method of claim 17, wherein the one or more text extracts from the one or more web resources in which one or more of the images responsive to the image search request are incorporated comprises: text of one or more webpages in which at least one of the images responsive to the image search request are incorporated; text of one or more captions of at least one of the images responsive to the image search request; one or more tags of at least one of the images responsive to the image search request; and/or one or more sets of metadata of at least one of the images responsive to the image search request.
19. The method of any of claims 17 or 18, wherein: the input query further comprises an input text query; and the search request and/or the input prompt is further based on the input text query.
20. The method of any of claims 17 to 19, wherein the method further comprises: processing, using one or more unimodal image processing models, the input image to generate one or more query independent properties of the input image, wherein generating the image search request for a search engine and/or generating the input prompt for the LLM is further based on the one or more query independent properties of the input image.
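The sequence of steps recited in claim 1, together with the string completion of claim 3, can be sketched roughly as follows. This is an illustration under stated assumptions rather than the claimed implementation: the function `respond_to_multimodal_query`, the wording of `PROMPT_TEMPLATE`, and the three injected callables (`explicate`, `answer_visual_query`, `llm`) are hypothetical stand-ins for the explication model, the multi-modal image processing model, and the LLM.

```python
from typing import Callable, List

# Hypothetical pre-defined string that is completed with the descriptors.
PROMPT_TEMPLATE = (
    "The user shared an image. Facts about the image:\n"
    "{descriptors}\n"
    "User question: {question}"
)


def respond_to_multimodal_query(
    image: bytes,
    text_query: str,
    explicate: Callable[[str], List[str]],
    answer_visual_query: Callable[[bytes, str], str],
    llm: Callable[[str], str],
) -> str:
    """Chain the steps of claim 1; every model is an injected stand-in."""
    # 1. Turn the implicit query into one or more explicit text queries.
    explicit_queries = explicate(text_query)
    # 2. Ask the multi-modal model each explicit query about the image,
    #    yielding one natural language descriptor per query.
    descriptors = [answer_visual_query(image, q) for q in explicit_queries]
    # 3. Complete the pre-defined string to form the LLM input prompt.
    prompt = PROMPT_TEMPLATE.format(
        descriptors="\n".join(f"- {d}" for d in descriptors),
        question=text_query,
    )
    # 4. Generate the response from the prompt; rendering it at the
    #    client device is left to the caller.
    return llm(prompt)
```

Passing the models in as callables keeps the sketch independent of any particular model API; claim 16 would correspond to supplying the same LLM for both `explicate` and `llm`.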

Description

IMAGE QUERY PROCESSING USING LARGE LANGUAGE MODELS

Background

[0001] Various generative models have been proposed that can be used to process natural language (NL) content and/or other input(s), to generate output that reflects generative content that is responsive to the input(s). For example, large language models (LLM(s)) have been developed that can be used to process NL content and/or other input(s), to generate LLM output that reflects NL content and/or other content that is responsive to the input(s). For instance, an LLM can be used to process NL content of "how to change DNS settings on Acme router", to generate LLM output that reflects several responsive NL sentences such as: "First, type the router's IP address in a browser, the default IP address is 192.168.1.1. Then enter username and password, the defaults are admin and admin. Finally, select the advanced settings tab and find the DNS settings section". However, current utilizations of generative models suffer from one or more drawbacks.

[0002] As one example, LLMs can be utilized as part of a text-based dialogue application, generating responses to textual inputs/queries provided by a user of the application. However, LLMs are limited to accepting a set of tokens provided in sequence (e.g., text) as input. A user of the application may have queries relating to one or more images. These queries cannot be addressed by the LLM directly due to the limit of input types accepted by the LLM.

Summary

[0003] Implementations disclosed herein are directed to at least utilizing an LLM to respond to queries comprising image data, such as multimodal queries that include both text and image data. A natural language processing system is extended such that when an image is provided as part of the conversation with a chatbot, the natural language processing system invokes one or more auxiliary image processing models (e.g., visual query) and/or image search engines. The results of invoking these models/searches are collected into structured data signals related to the image. These signals form part of the conversation context and are used to extend the text prompt that is sent to the LLM. This allows the LLM to take the context into account when it is utilized in processing the user query, thereby enabling generation of an LLM reply that addresses relevant feature(s) of the image. Accordingly, the LLM is utilized to generate a response that takes the image into account.

[0004] In these, and other, manners, an LLM can act as a flexible image classification and/or image querying model without necessitating specialized multi-modal training or architectural adaptations. Furthermore, text-based dialogue applications can be extended to integrate images that the user provides, providing the application with the ability to analyze the image, reason about it, and answer specific questions about images along the flow of a conversation.

[0005] In some implementations, an LLM can include at least hundreds of millions of parameters. In some of those implementations, the LLM includes at least billions of parameters, such as one hundred billion or more parameters. In some additional or alternative implementations, an LLM is a sequence-to-sequence model, is Transformer-based, and/or can include an encoder and/or a decoder. One non-limiting example of an LLM is GOOGLE'S Pathways Language Model (PaLM). Another non-limiting example of an LLM is GOOGLE'S Language Model for Dialogue Applications (LaMDA). However, it should be noted that the LLMs described herein are one example of generative machine learning models and are not intended to be limiting.

[0006] The preceding is presented as an overview of only some implementations disclosed herein. These and other implementations are disclosed in additional detail herein.

Brief Description of the Drawings

[0007] FIG. 1 depicts a block diagram of an example environment that demonstrates various aspects of the present disclosure, and in which some implementations disclosed herein can be implemented.

[0008] FIG. 2 depicts an overview of an example method for responding to a multimodal query.

[0009] FIG. 3 illustrates an overview of an example method for responding to an image query.

[0010] FIG. 4 depicts a flowchart that illustrates an example method of responding to a multimodal query.

[0011] FIG. 5 depicts a flowchart that illustrates an example method of responding to an image-based query.

[0012] FIG. 6 depicts an example architecture of a computing device, in accordance with various implementations.

Detailed Description

[0013] Turning now to FIG. 1, a block diagram of an example environment 100 that demonstrates various aspects of the present disclosure, and in which implementations disclosed herein can be implemented, is depicted. The example environment 100 includes a client device 110, a natural language (NL) based response system 120, and one or more further applications 160 (i.e. applications external to an LLM o