US-12619654-B2 - Language model for processing a multi-mode query input


Abstract

A query processing system is described which receives a query input comprising an input token string and also at least one data item having a second, different modality, and generates a corresponding output token string.

Inventors

  • Jean-Baptiste Alayrac
  • Yana Elizabeth Hasson
  • Katherine Elizabeth Millican
  • Roman Ring
  • Jeffrey Donahue
  • Karel Lenc
  • Karen Simonyan
  • Malcolm Kevin Campbell Reynolds
  • Pauline LUC
  • Arthur Mensch
  • Iain Barr
  • Antoine Miech

Assignees

  • GDM HOLDING LLC

Dates

Publication Date
2026-05-05
Application Date
2023-04-28

Claims (20)

  1. A computer-implemented method of training a query processing system, the query processing system being for generating an output token string based on a query input comprising an input token string and one or more data items, the input token string and output token string being strings of tokens selected from a token vocabulary, and the data items being of a modality other than tokens selected from the token vocabulary, the method employing a token processing model comprising a stack of token processing layers, the stack of token processing layers being configured to receive input token strings and to generate corresponding output token strings, and a database of training examples, each training example comprising at least one data item and at least one token string; the method comprising: forming a data-item-token processing model by interleaving token processing layers from the token processing model with gated cross-attention layers, the data-item-token processing model being configured to generate an output token string upon receiving a prompt input which is a token string; forming the query processing system, the query processing system comprising: (a) a modality network configured to receive the data items of the query input, to generate one or more compressed representations of each data item; and (b) the data-item-token processing model, the data-item-token processing model being configured to receive a prompt input comprising the input token string of the query input, and each gated cross-attention layer being arranged to receive at least one of the compressed representations; and using the training database, training: the modality network, and the plurality of gated cross-attention layers.
  2. The computer-implemented method of claim 1 in which the training trains the query processing system, upon an encoder of the modality network receiving the at least one data item of any of the training examples, and the data-item-token processing model receiving a prompt input comprising a first portion of the token string of the training example, to generate an output of the query processing system which is positively statistically correlated with a subsequent portion of the token string of the training example.
  3. The computer-implemented method of claim 1 in which the modality network comprises: an encoder configured to encode a data item received by the encoder to generate an encoded data item, and a compressed representation generation system arranged to receive the encoded data item and generate an output, the output of the modality network being based on the output of the compressed representation generation system.
  4. The computer-implemented method of claim 3, in which the encoder has been trained to encode a data item received by the encoder to generate an encoded data item, and the training of the modality network and the plurality of gated cross-attention layers comprises training the compressed representation generation system without further training the encoder.
  5. The computer-implemented method of claim 3, in which the compressed representation generation system comprises a stack of one or more resampler layers, each resampler layer being adapted to perform an attention operation which employs a key vector, a value vector and a query vector, a subset of the key vector, value vector and query vector being based on the encoded data item, and the remainder of the key vector, value vector and query vector being based on either an output of the preceding one of the resampler layers or, in the case of the first resampler layer of the stack, a set of input latent values, the output of the modality network being based on an output of the last resampler layer of the stack of resampler layers.
  6. The computer-implemented method of claim 5 in which the key vector and value vector of each resampler layer are based on the encoded data item and a latent input which is either the output of the preceding one of the resampler layers or, in the case of the first resampler layer of the stack, the set of input latent values, and the query vector is based on the latent input.
  7. The computer-implemented method of claim 5 in which each resampler layer further comprises a perceptron arranged to receive the output of the attention operation, and to generate an output, the output of the modality network being based on the output of the perceptron of the last resampler layer of the stack.
  8. The computer-implemented method of claim 1, in which the prompt input further comprises one or more corresponding marker items for each data item in the query input, the one or more marker items being indicative of the presence of the data item in the query input.
  9. The computer-implemented method of claim 8 in which a position of each marker item in the prompt input is indicative of a position of the corresponding data item in the query input.
  10. The computer-implemented method of claim 1 in which each gated cross-attention layer generates its output as a component-wise sum of: a first input which is the output of the preceding processing layer in the stack of processing layers or, in the case that the gated cross-attention layer is the first processing layer of the stack of processing layers, the prompt input, and an interaction term based on the output of the compressed representation generation system received by the gated cross-attention layer, and at least part of the first input to the gated cross-attention layer.
  11. The computer-implemented method of claim 10, in which the interaction term has a magnitude which depends positively upon the value of a gating parameter, the training comprising incrementally increasing the gating parameter.
  12. The computer-implemented method of claim 10 which includes, in the case of a query input comprising a plurality of portions, each portion comprising one of the data items, for each portion: the modality network generating at least one respective compressed representation of the corresponding data item, and at least one of the gated cross-attention layers generating the interaction term based only on the compressed representation of the corresponding data item and without employing data generated based on tokens of the input token string other than within the portion.
  13. The computer-implemented method of claim 1 in which each gated cross-attention layer comprises a cross-attention layer, which employs a key vector, a value vector and a query vector, a subset of the key vector, value vector and query vector being based on the at least one compressed representation received by the gated cross-attention layer, and the remainder of the key vector, value vector and query vector being based on the output of the preceding processing layer in the stack of processing layers or, in the case that the gated cross-attention layer is the first processing layer of the stack of processing layers, based on the prompt input.
  14. The computer-implemented method of claim 13 in which the key vector and value vector of each gated cross-attention layer are obtained based on the at least one compressed representation received by the gated cross-attention layer, and the query vector of each gated cross-attention layer is based on the output of the preceding processing layer in the stack of processing layers or, in the case that the gated cross-attention layer is the first processing layer of the stack of processing layers, based on the prompt input.
  15. The computer-implemented method of claim 13 in which the gated cross-attention layer further comprises a perceptron which receives the output of the cross-attention layer, the output of the gated cross-attention layer being based on an output of the perceptron.
  16. A computer-implemented method of generating an output token string based on a query input comprising an input token string and one or more data items, the input token string and output token string being strings of tokens selected from a token vocabulary, and the data items being of a modality other than tokens selected from the token vocabulary, the method comprising: (a) generating one or more compressed representations of each data item by processing the data item using a modality network which comprises: an encoder configured to encode the data item to generate an encoded data item, and a compressed representation generation system arranged to receive the encoded data item and generate an output, wherein the compressed representation generation system comprises a stack of one or more resampler layers, each resampler layer being configured to perform an attention operation which employs a key vector, a value vector and a query vector, the key vector, value vector and query vector each being based on at least one of the encoded data item and a latent input which is either an output of the preceding one of the resampler layers or, in the case of the first resampler layer of the stack, a set of input latent values, at least one of the key vector, value vector and query vector being based on both the encoded data item and the latent input, the output of the modality network being based on an output of the last resampler layer of the stack of resampler layers; (b) generating a prompt input comprising the input token string of the query input; and (c) inputting the prompt input and the compressed representation of each data item to a data-item-token processing model configured to generate the output token string based on the prompt input and the compressed representation of each data item.
  17. The computer-implemented method of claim 16 in which the output token string is the response to a query about the content of a subject data item which is one of the data items in the query input, the query being defined based on the input token string.
  18. The computer-implemented method of claim 17 in which the query input comprises, in addition to the subject data item, one or more task example portions which each include a respective data item and a respective section of the input token string, and for each task example portion the respective section of the input token string is the response to the query when the query is about the content of the respective data item.
  19. The computer-implemented method of claim 17, wherein the query is a question and the response to the query is an answer to the question.
  20. The computer-implemented method of claim 16 in which the token vocabulary comprises the symbols of a natural language writing system.
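
The resampler layer of claims 5-7 can be sketched as follows. This is a minimal numpy illustration rather than the patented implementation: the claims specify only that keys and values are based on the encoded data item together with the latent input, that queries are based on the latent input, and that a perceptron follows the attention operation; the residual connections, ReLU perceptron, weight shapes, and initialization scale below are illustrative assumptions.

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def resampler_layer(encoded, latents, Wq, Wk, Wv, W1, W2):
    # Keys/values come from the encoded data item concatenated with the
    # latent input; queries come from the latent input alone (claim 6).
    kv_in = np.concatenate([encoded, latents], axis=0)
    q = latents @ Wq
    k = kv_in @ Wk
    v = kv_in @ Wv
    attended = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    # A small perceptron on the attention output (claim 7); the residual
    # connections are an assumption, not part of the claim language.
    h = latents + attended
    return h + np.maximum(h @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
d, n_latents = 8, 4
encoded = rng.normal(size=(50, d))   # e.g. 50 patch features from an image encoder
latents = rng.normal(size=(n_latents, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d, 2 * d)) * 0.1
W2 = rng.normal(size=(2 * d, d)) * 0.1
rep = resampler_layer(encoded, latents, Wq, Wk, Wv, W1, W2)
```

Whatever the size of the encoded data item (here 50 feature vectors), the output has the fixed shape of the latent input, which is what makes it a compressed representation suitable for cross-attention in the token stack.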

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to U.S. Provisional Application No. 63/336,192, filed on Apr. 28, 2022. The disclosure of the prior application is considered part of and is incorporated by reference in the disclosure of this application.

BACKGROUND

This specification relates to a neural network configured to process a multi-mode query input (e.g. a mixture of text and sound/image(s)), to generate an output which is a response to the query input.

Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

Language models employing neural networks are known which, upon receiving an input token string which is a sequence of tokens selected from a token vocabulary (e.g. a piece of text composed of letters selected from an alphabet (e.g. a natural language alphabet), a piece of text composed of subwords or word pieces selected from a corresponding vocabulary, a piece of text composed of phonemes selected from a corresponding vocabulary, and so on), generate an output token string (another sequence of tokens selected from the token vocabulary) which is a sensible response to the input token string, e.g. a plausible continuation of the input token string, or an answer to a question posed by the input token string.

SUMMARY

The present disclosure describes a system (a "query processing system"), implemented as computer programs in one or more computers in one or more locations, which receives a query input comprising an input token string and also at least one data item having a second, different modality.
For example, the data item(s) may be image(s). Each data item may be a still image—e.g. the data item may be pixel values (e.g. red-green-blue (RGB) values) for each pixel of a pixel array. Alternatively, one or more of the data items may be video images—e.g. the data may be, for each of multiple frames, pixel values for each pixel of a respective array. The image(s) may be captured by imaging the real world, by a still or video camera. In another possibility, the data item(s) may be sound signal(s). A sound signal is audio data representing values of an audio waveform at each of a plurality of times, e.g. the sound captured by a microphone during a period of time. In a further possibility, the data items may be video images with an accompanying respective soundtrack.

The query processing system is operative to generate an output token string based on the query input. The input token string and output token string are each sequences of tokens, selected from a (single) token vocabulary (e.g. both strings may be strings of letters from the Roman alphabet, strings of subwords or word pieces from a word piece vocabulary, and so on). The token vocabulary may be the token vocabulary of a natural language (e.g. Roman letters in the case of English, or Roman letters plus Roman letters with accents in the case of French).

In general terms, the disclosure suggests that the query processing system extracts the data item(s) from the query input, and inputs them to a modality network which is configured to generate from them one or more corresponding compressed representations of each data item. The query processing system uses the input token string to generate a prompt input for a data-item-token processing model. The data-item-token processing model comprises a stack of processing layers including token processing layers, and gated cross-attention layers interleaved with the token processing layers. The prompt input may be supplied to the first processing layer of the stack.
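The interleaved stack described above can be sketched in numpy. This is a hedged illustration under stated assumptions: the tanh gate on the interaction term, the ReLU stand-in for a pretrained token processing layer, and all shapes are illustrative; the disclosure itself only requires a gated interaction term added component-wise to the layer input, with the gate's magnitude controlled by a gating parameter.

```python
import numpy as np

def softmax(s, axis=-1):
    e = np.exp(s - s.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x, rep, Wq, Wk, Wv):
    # Queries from the token stream; keys/values from the compressed
    # representation of a data item.
    q, k, v = x @ Wq, rep @ Wk, rep @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def gated_cross_attention(x, rep, p):
    # Output = layer input + gated interaction term. tanh(alpha) makes the
    # term's magnitude depend positively on the gating parameter alpha;
    # with alpha = 0 at initialization the pretrained token stream passes
    # through unchanged.
    return x + np.tanh(p["alpha"]) * cross_attention(x, rep, p["Wq"], p["Wk"], p["Wv"])

def frozen_token_layer(x, W):
    # Stand-in for a pretrained token processing layer (assumption: a real
    # model would use a full transformer block here).
    return x + np.maximum(x @ W, 0.0)

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))      # prompt-input activations for 5 tokens
rep = rng.normal(size=(4, d))    # compressed representation of one data item
layers = []
for _ in range(3):               # interleave gated cross-attention with token layers
    p = {k: rng.normal(size=(d, d)) * 0.1 for k in ("Wq", "Wk", "Wv")}
    p["alpha"] = 0.0
    layers.append(("xattn", p))
    layers.append(("token", rng.normal(size=(d, d)) * 0.1))

h = x
for kind, p in layers:
    h = gated_cross_attention(h, rep, p) if kind == "xattn" else frozen_token_layer(h, p)
```

With every gating parameter at zero the stack reproduces the token-only model exactly, which is the property that lets the cross-attention layers be trained incrementally without disturbing the pretrained token processing layers.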
The gated cross-attention layers each receive at least one of the compressed representations, and perform a gating operation based on the received compressed representations. The output token string is the output of the data-item-token processing model.

The output token string is a sensible response to the query input. Specifically, the input token string may at least partly define a question about the content of at least one of the data items in the query input (the "subject data item"), and the output string may be a (correct) response to the question. Thus, the question defines a data item processing task to be carried out on the subject data item.

The query processing system may be regarded as a generalization of a classifier neural network. A classifier neural network typically receives data items, and determines which of a pre-determined plurality of classes the data item belongs to (e.g. if the data item is an image showing an object, the classifier may output data indicating which