
US-12626495-B2 - Multimodal embeddings


Abstract

Implementations relate to generating and using multimodal embeddings. In various implementations, first modality data may be obtained and encoded into first modality embedding(s) using a trained first modality encoder that is stored in memory of edge-based client device(s). Second modality data may be obtained and encoded into second modality embedding(s) using a trained second modality encoder that is also stored in the memory of the edge-based client device(s). The first and second modality embeddings may be processed using an edge-based multimodal LLM that is also stored locally in memory of the edge-based client device(s) to generate a multimodal contextual embedding, which may be provided to a remote server that hosts a central LLM, e.g., in conjunction with a natural language input provided by the user. Information generated using the central LLM, responsive to the natural language input, may be received from the remote server.
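As a rough illustration of the flow the abstract describes, the following Python sketch mimics the edge-side steps with stand-ins. The encoder construction, the fusion step, the 256-dimension width, and the payload field names are all hypothetical choices for illustration; the patent does not specify them.

    import numpy as np

    EMBED_DIM = 256  # hypothetical width of all embeddings in this sketch

    def make_encoder(input_dim: int, seed: int):
        """Stand-in for a trained on-device modality encoder: a fixed random
        linear projection. A real encoder would be a trained network stored
        in the edge-based client device's memory."""
        rng = np.random.default_rng(seed)
        weights = rng.standard_normal((input_dim, EMBED_DIM)) / np.sqrt(input_dim)
        return lambda x: x.reshape(-1) @ weights

    def edge_multimodal_llm(first_emb: np.ndarray, second_emb: np.ndarray) -> np.ndarray:
        """Stand-in for the edge-based multimodal LLM that fuses per-modality
        embeddings into one multimodal contextual embedding. Here it is just
        an average; the patent contemplates a scaled-down multimodal LLM
        running locally."""
        return (first_emb + second_emb) / 2.0

    # Hypothetical modality data: two flattened grayscale frames, e.g., a
    # front-camera image and a screenshot or rear-camera image.
    first_modality = np.random.rand(64 * 64)
    second_modality = np.random.rand(64 * 64)

    first_encoder = make_encoder(64 * 64, seed=0)
    second_encoder = make_encoder(64 * 64, seed=1)

    contextual_embedding = edge_multimodal_llm(
        first_encoder(first_modality), second_encoder(second_modality))

    # Payload sent to the remote server hosting the central LLM.
    payload = {
        "multimodal_contextual_embedding": contextual_embedding.tolist(),
        "natural_language_input": "What am I looking at?",
    }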

Inventors

  • Tuan Nguyen
  • Sai Aditya Chitturu
  • Sana Mithani
  • Sergei Volnov
  • Yunfan Ye
  • Alexey Galata
  • William A. Truong
  • Tzu-Chan Chuang
  • Liang-Yu Chen
  • Qiong Huang
  • Krunal Shah

Assignees

  • GOOGLE LLC

Dates

Publication Date
2026-05-12
Application Date
2023-09-05

Claims (6)

  1. A method implemented using one or more processors of one or more edge-based client devices and comprising:
     contemporaneously with receipt of a natural language input from a user, obtaining a first digital image captured in an environment using a front-facing camera of a given edge-based client device;
     encoding the first digital image into one or more first modality embeddings using a trained first modality encoder that is stored in memory of the given edge-based client device;
     contemporaneously with receipt of the natural language input, obtaining a second digital image captured in the environment, wherein the second digital image comprises a screenshot captured by the given edge-based client device or is captured using a rear-facing camera of the given edge-based client device;
     encoding the second digital image into one or more second modality embeddings using a trained second modality encoder that is stored in memory of one or more of the edge-based client devices;
     processing one or more of the first modality embeddings and one or more of the second modality embeddings using an edge-based multimodal large language model (LLM) that is stored locally in memory of the given edge-based client device to generate a multimodal contextual embedding;
     providing, to a remote server that hosts a central LLM, data indicative of the multimodal contextual embedding and the natural language input provided by the user; and
     receiving, from the remote server, information generated using the central LLM that is responsive to the natural language input provided by the user.
  2. The method of claim 1, wherein the second digital image comprises the screenshot captured by the given edge-based client device.
  3. The method of claim 1, wherein the first digital image captures a facial expression of the user, and one or more of the first modality embeddings numerically represents the captured facial expression.
  4. An edge-based system comprising one or more edge processors and memory storing instructions that, in response to execution by the one or more edge processors, cause the one or more edge processors to:
     contemporaneously with receipt of a natural language input from a user, obtain a first digital image captured in an environment using a front-facing camera of a given edge-based client device;
     encode the first digital image into one or more first modality embeddings using a trained first modality encoder that is stored in memory of the given edge-based client device;
     contemporaneously with receipt of the natural language input, obtain a second digital image captured in the environment, wherein the second digital image comprises a screenshot captured by the given edge-based client device or is captured using a rear-facing camera of the given edge-based client device;
     encode the second digital image into one or more second modality embeddings using a trained second modality encoder that is stored in memory of the given edge-based client device;
     process one or more of the first modality embeddings and one or more of the second modality embeddings using an edge-based multimodal large language model (LLM) that is stored locally in memory of one or more of the edge-based client devices to generate a multimodal contextual embedding;
     provide, to a remote server that hosts a central LLM, data indicative of the multimodal contextual embedding and the natural language input provided by the user; and
     receive, from the remote server, information generated using the central LLM that is responsive to the natural language input provided by the user.
  5. The system of claim 4, wherein the first digital image captures a facial expression of the user, and one or more of the first modality embeddings numerically represents the captured facial expression.
  6. At least one non-transitory computer-readable medium comprising instructions that, in response to execution by one or more edge processors, cause the one or more edge processors to:
     contemporaneously with receipt of a natural language input from a user, obtain a first digital image captured in an environment using a front-facing camera of a given edge-based client device;
     encode the first digital image into one or more first modality embeddings using a trained first modality encoder that is stored in memory of the given edge-based client device;
     contemporaneously with receipt of the natural language input, obtain a second digital image captured in the environment, wherein the second digital image comprises a screenshot captured by the given edge-based client device or is captured using a rear-facing camera of the given edge-based client device;
     encode the second digital image into one or more second modality embeddings using a trained second modality encoder that is stored in memory of the given edge-based client device;
     process one or more of the first modality embeddings and one or more of the second modality embeddings using an edge-based multimodal large language model (LLM) that is stored locally in memory of one or more of the edge-based client devices to generate a multimodal contextual embedding;
     provide, to a remote server that hosts a central LLM, data indicative of the multimodal contextual embedding and the natural language input provided by the user; and
     receive, from the remote server, information generated using the central LLM that is responsive to the natural language input provided by the user.
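The claims require sending "data indicative of" the multimodal contextual embedding rather than any particular format. As one plausible reading, the sketch below quantizes the float vector to int8 and base64-encodes it for a JSON request body; the field names, the quantization scheme, and the helper functions are assumptions for illustration, not the patent's wire format.

    import base64
    import json
    import numpy as np

    def to_wire(embedding: np.ndarray) -> str:
        """Produce 'data indicative of' the multimodal contextual embedding:
        an int8 quantization of the float vector, base64-encoded. The scale
        factor lets the server dequantize."""
        scale = float(np.max(np.abs(embedding))) or 1.0
        quantized = np.round(embedding / scale * 127).astype(np.int8)
        return json.dumps({
            "scale": scale,
            "dim": int(embedding.size),
            "data": base64.b64encode(quantized.tobytes()).decode("ascii"),
        })

    def from_wire(blob: str) -> np.ndarray:
        """Server-side inverse: recover an approximation of the embedding."""
        msg = json.loads(blob)
        quantized = np.frombuffer(base64.b64decode(msg["data"]), dtype=np.int8)
        return quantized.astype(np.float32) / 127.0 * msg["scale"]

    emb = np.random.default_rng(0).standard_normal(256).astype(np.float32)
    roundtrip = from_wire(to_wire(emb))
    # Quantization error is bounded by scale / 127 per component.
    assert np.max(np.abs(emb - roundtrip)) <= float(np.max(np.abs(emb))) / 127

One motivation for sending a compact representation rather than raw images is that the embedding is both smaller on the wire and less directly revealing of the captured scene.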

Description

BACKGROUND

Large language models (LLMs) are particular types of machine learning models—sometimes referred to as “generative models”—that can perform various natural language processing (NLP) tasks, such as language generation, machine translation, and question-answering. These LLMs are typically trained on enormous amounts of diverse data including data from, but not limited to, webpages, electronic books, software code, electronic news articles, and machine translation data. Accordingly, these LLMs leverage the underlying data on which they were trained in performing these various NLP tasks. For instance, in performing a language generation task, these LLMs can process a natural language (NL) based input that is received from a client device, and generate a NL based output that is responsive to the NL based input and that is to be rendered at the client device.

Visual language models (VLMs) are a type of multimodal machine learning model that can be used to perform tasks based on multiple modalities of data, particularly visual data (e.g., digital images) in combination with NL. VLMs may be trained to facilitate performance of a variety of different tasks, such as visual question answering, text-guided image manipulation, and image captioning, to name a few. With visual question answering, for instance, input image(s) and NL question(s) about the image(s) may be assembled into a prompt that is then processed using a VLM to generate an output sequence indicative of answer(s) to the question(s).

SUMMARY

While visual cues have been used to invoke or “awaken” automated assistants, sometimes in combination with hot words or phrases, visual data has not typically been incorporated into ongoing conversations with automated assistants after invocation. Accordingly, implementations are described herein for using LLMs, VLMs, and/or multimodal LLMs to facilitate multimodal engagement and continued conversation with an automated assistant (also referred to as a “virtual assistant” or “chatbot”). More particularly, but not exclusively, techniques are described herein for processing multiple modalities of features, e.g., generated by multiple on-device encoders, using a local (e.g., scaled down) multimodal LLM that may be deployed on the same device and/or at the “edge,” e.g., on another device that is nearby. The multimodal LLM may then be used to generate, at the edge, a semantically rich multimodal embedding that represents a user's context. This multimodal embedding can be provided to a server for processing, e.g., along with data indicative of the user's natural language input, using a server-side LLM.
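To make the server-side handoff concrete, here is a minimal sketch of one way a server-side LLM might consume the multimodal contextual embedding alongside the natural language input: projecting it into a few "soft prompt" vectors prepended to the token embeddings of the text. The patent does not specify this mechanism; the adapter matrix, dimensions, and function names below are assumptions.

    import numpy as np

    MODEL_DIM = 256    # hypothetical hidden size of the central LLM
    SOFT_TOKENS = 4    # pseudo-token vectors derived from the embedding
    CONTEXT_DIM = 256  # width of the incoming multimodal contextual embedding

    def assemble_prompt(contextual_embedding, token_ids, vocab_table, adapter):
        """Project the multimodal contextual embedding into SOFT_TOKENS soft
        prompt vectors and prepend them to the token embeddings of the
        user's natural language input, so the central LLM attends to the
        user's multimodal context alongside the text."""
        soft = (contextual_embedding @ adapter).reshape(SOFT_TOKENS, MODEL_DIM)
        text = vocab_table[token_ids]  # ordinary token-embedding lookup
        return np.concatenate([soft, text], axis=0)

    rng = np.random.default_rng(0)
    vocab_table = rng.standard_normal((1000, MODEL_DIM))                   # toy vocabulary
    adapter = rng.standard_normal((CONTEXT_DIM, SOFT_TOKENS * MODEL_DIM))  # learned in practice
    contextual_embedding = rng.standard_normal(CONTEXT_DIM)  # received from the edge
    token_ids = np.array([17, 42, 7])                        # toy ids for the NL input

    prompt = assemble_prompt(contextual_embedding, token_ids, vocab_table, adapter)
    print(prompt.shape)  # (SOFT_TOKENS + 3, MODEL_DIM) -> rows fed to the central LLM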
In some implementations, a method may be implemented using one or more processors of one or more edge-based client devices and may include: obtaining first modality data captured in an environment using a first modality sensor; encoding the first modality data into one or more first modality embeddings using a trained first modality encoder that is stored in memory of one or more of the edge-based client devices; obtaining second modality data captured in the environment using a second modality sensor, wherein the second modality is different from the first modality; encoding the second modality data into one or more second modality embeddings using a trained second modality encoder that is stored in memory of one or more of the edge-based client devices; processing one or more of the first modality embeddings and one or more of the second modality embeddings using an edge-based multimodal large language model (LLM) that is stored locally in memory of one or more of the edge-based client devices to generate a multimodal contextual embedding; providing, to a remote server that hosts a central LLM, data indicative of the multimodal contextual embedding and a natural language input provided by the user; and receiving, from the remote server, information generated using the central LLM that is responsive to the natural language input provided by the user.

In various implementations, the first modality data may include one or more digital images captured by one or more digital cameras. In various implementations, the one or more digital images may include one or more screenshots captured by one or more of the edge-based client devices. In various implementations, the one or more digital images may include a first digital image acquired by a front-facing camera of a given edge-based client device of the edge-based client devices. In various implementations, the one or more digital images may include a second digital image acquired by a rear-facing camera of the given edge-based client device. In various implementations, the first digital image captures a facial expression of the user, and one or more of the first modality embeddings numerically represents the captured facial expression. In various implementations, the seco
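The implementation variations above differ mainly in which device surfaces supply the images. A minimal sketch of that capture step follows, assuming hypothetical device APIs (capture, screen_is_active) that the patent does not name:

    import time
    import numpy as np

    def capture(source: str) -> np.ndarray:
        """Placeholder for device capture APIs (front camera, rear camera,
        or screenshot); returns a flattened grayscale frame."""
        return np.random.rand(64 * 64)

    def screen_is_active() -> bool:
        """Hypothetical device query deciding between the two alternatives
        for the second image."""
        return True

    def gather_contemporaneous_context(natural_language_input: str) -> dict:
        """When natural language input is received, contemporaneously grab
        the first image from the front-facing camera (which may capture the
        user's facial expression) and the second image from either a
        screenshot or the rear-facing camera."""
        received_at = time.time()
        first_image = capture("front_camera")
        second_image = (capture("screenshot") if screen_is_active()
                        else capture("rear_camera"))
        return {"received_at": received_at,
                "natural_language_input": natural_language_input,
                "first_image": first_image,
                "second_image": second_image}

    context = gather_contemporaneous_context("What am I looking at?")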