CN-121986346-A - Efficient decoding using large and small generative artificial intelligence models

CN 121986346 A

Abstract

Certain aspects of the present disclosure provide techniques and apparatus for generating a response to an input query using a generative artificial intelligence model. The method generally includes receiving an input query for processing. An embedded representation of the received input query is generated using a first generative artificial intelligence model. The embedded representation generally includes an embedding of the received input query in a first dimension. The embedded representation is projected into a projected representation of the received input query. In general, the projected representation includes a representation in a second dimension. A second generative artificial intelligence model and projected representation are used to generate a response to the received input query and to output the generated response.

Inventors

  • B. Bergner
  • A. Skliar
  • B. Ehteshami Bejnordi
  • Y. M. Asano
  • T. P. F. Blankevoort
  • J. B. Soriaga

Assignees

  • Qualcomm Incorporated

Dates

Publication Date
2026-05-05
Application Date
2024-07-25
Priority Date
2023-10-13

Claims (20)

  1. A processing system comprising: at least one memory having executable instructions stored thereon; and one or more processors configured to execute the executable instructions to cause the processing system to: receive an input query for processing; generate, using a first generative artificial intelligence model, an embedded representation of the received input query in a first dimension; project the embedded representation of the received input query into a projected representation of the received input query, wherein the projected representation comprises a representation in a second dimension; generate a response to the received input query using a second generative artificial intelligence model and the projected representation; and output the generated response.
  2. The processing system of claim 1, wherein the first generative artificial intelligence model comprises a model having a number of parameters that is greater than a number of parameters included in the second generative artificial intelligence model.
  3. The processing system of claim 1, wherein the first and second generative artificial intelligence models comprise models trained together on a same target task such that the first and second generative artificial intelligence models are trained on a same number of tokens.
  4. The processing system of claim 1, wherein to generate the response using the second generative artificial intelligence model and the projected representation, the one or more processors are configured to cause the processing system to autoregressively generate a first set of tokens including a threshold number of tokens.
  5. The processing system of claim 4, wherein the one or more processors are further configured to cause the processing system to: generate, using the first generative artificial intelligence model, an updated embedded representation comprising an embedding of the received input query and the generated first set of tokens in the first dimension; project the updated embedded representation into a projected updated representation of the received input query and the generated first set of tokens in the second dimension; and generate, using the second generative artificial intelligence model and the projected updated embedded representation, a second set of tokens including the threshold number of tokens.
  6. The processing system of claim 1, wherein to generate the response using the second generative artificial intelligence model and the projected representation, the one or more processors are configured to cause the processing system to: generate one or more first tokens based on the projected representation; concatenate the projected representation with information related to the generated one or more first tokens; and generate a second token based on the concatenated projected representation and the information related to the generated one or more first tokens.
  7. The processing system of claim 6, wherein to concatenate the projected representation and the information related to the generated one or more first tokens, the one or more processors are configured to cause the processing system to concatenate the projected representation and an embedded representation of the one or more first tokens.
  8. The processing system of claim 1, wherein to generate the response using the second generative artificial intelligence model and the projected representation, the one or more processors are configured to cause the processing system to: generate one or more first tokens based on the projected representation; project a combination of the embedded representation and the one or more first tokens into a projected representation of the received input query and the one or more first tokens in the second dimension; and generate a second token based on the projected representation of the received input query and the one or more first tokens.
  9. The processing system of claim 1, wherein: the first generative artificial intelligence model includes a large language model trained to generate the embedded representation in the first dimension; and the second generative artificial intelligence model includes a small language model trained to generate a response based on input in the second dimension.
  10. The processing system of claim 1, wherein the second dimension is smaller than the first dimension.
  11. A processor-implemented method comprising: receiving an input query for processing; generating, using a first generative artificial intelligence model, an embedded representation of the received input query in a first dimension; projecting the embedded representation of the received input query into a projected representation of the received input query, wherein the projected representation comprises a representation in a second dimension; generating a response to the received input query using a second generative artificial intelligence model and the projected representation; and outputting the generated response.
  12. The method of claim 11, wherein the first generative artificial intelligence model comprises a model having a number of parameters that is greater than a number of parameters included in the second generative artificial intelligence model.
  13. The method of claim 11, wherein the first and second generative artificial intelligence models comprise models trained together on a same target task such that the first and second generative artificial intelligence models are trained on a same number of tokens.
  14. The method of claim 11, wherein generating the response using the second generative artificial intelligence model and the projected representation comprises autoregressively generating a first set of tokens including a threshold number of tokens.
  15. The method of claim 14, further comprising: generating, using the first generative artificial intelligence model, an updated embedded representation comprising an embedding of the received input query and the generated first set of tokens in the first dimension; projecting the updated embedded representation into a projected updated representation of the received input query and the generated first set of tokens in the second dimension; and generating, using the second generative artificial intelligence model and the projected updated embedded representation, a second set of tokens including the threshold number of tokens.
  16. The method of claim 11, wherein generating the response using the second generative artificial intelligence model and the projected representation comprises: generating one or more first tokens based on the projected representation; concatenating the projected representation with information related to the generated one or more first tokens; and generating a second token based on the concatenated projected representation and the information related to the generated one or more first tokens.
  17. The method of claim 16, wherein concatenating the projected representation and the information related to the generated one or more first tokens includes concatenating the projected representation and an embedded representation of the one or more first tokens.
  18. The method of claim 11, wherein generating the response using the second generative artificial intelligence model and the projected representation comprises: generating one or more first tokens based on the projected representation; projecting a combination of the embedded representation and the one or more first tokens into a projected representation of the received input query and the one or more first tokens in the second dimension; and generating a second token based on the projected representation of the received input query and the one or more first tokens.
  19. The method of claim 11, wherein: the first generative artificial intelligence model includes a large language model trained to generate the embedded representation in the first dimension; and the second generative artificial intelligence model includes a small language model trained to generate a response based on input in the second dimension.
  20. The method of claim 11, wherein the second dimension is smaller than the first dimension.
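Claims 4 and 5 (and their method counterparts, claims 14 and 15) describe a chunked decoding schedule: the small model autoregressively emits a threshold number of tokens, after which the large model re-encodes the query plus everything generated so far and the projection is refreshed. A minimal sketch of that schedule, using hypothetical `llm_encode`, `project_fn`, and `slm_step` stand-ins for the two models and the projection (these names are illustrative, not from the disclosure):

```python
def chunked_decode(query, llm_encode, project_fn, slm_step, chunk, total):
    """Alternate large-model encoding with small-model decoding.

    The small model (slm_step) emits `chunk` tokens (the threshold
    number); then the large model (llm_encode, a hypothetical stand-in)
    re-encodes the query plus all tokens generated so far, and the
    projection is refreshed before the next chunk.
    """
    generated = []
    while len(generated) < total:
        # One pass of the large model over the query and prior output,
        # projected into the small model's dimension.
        projected = project_fn(llm_encode(query + generated))
        # The small model decodes up to `chunk` tokens autoregressively.
        for _ in range(min(chunk, total - len(generated))):
            generated.append(slm_step(projected, generated))
    return generated
```

With a chunk size of `k` and `n` response tokens, the large model runs only `ceil(n / k)` times instead of `n` times, which is the source of the claimed efficiency.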

Description

Efficient decoding using large and small generative artificial intelligence models

Cross Reference to Related Applications

The present application claims priority from U.S. patent application Ser. No. 18/486,653, filed on October 13, 2023, which is hereby incorporated by reference.

Background

Aspects of the present disclosure relate to generative artificial intelligence models. Generative artificial intelligence models may be used in a variety of environments to generate responses to input queries. For example, a generative artificial intelligence model may be used in chatbot applications, in which a Large Language Model (LLM) is used to generate answers, or at least responses, to an input query. Other examples of generative artificial intelligence models that may be used include stable diffusion, where the model generates images from an input textual description of the content of the desired image, and decision transformers, where future actions are predicted based on a sequence of previous actions within a given environment.

In general, the use of generative artificial intelligence models to generate responses to queries can be computationally expensive. For example, in a chatbot deployment that uses a large language model to generate a response to a query formatted as a textual query, a response may be generated by performing a pass of the large language model for each of the tokens (e.g., words or portions of words) generated as part of the response. The output of each pass may be a probability distribution over a set of tokens, from which the next token may be selected by sampling or based on maximum likelihood.
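The per-pass token-selection step described above can be sketched without any model framework: each pass yields logits over the vocabulary, softmax turns them into a probability distribution, and the next token is chosen either greedily (maximum likelihood) or by sampling. This is a minimal illustration of that step only; the model pass itself is elided, and the function name is a hypothetical one, not from the disclosure:

```python
import math
import random

def sample_next_token(logits, greedy=False, rng=None):
    """Pick the next token id from raw per-vocabulary logits.

    Greedy decoding returns the maximum-likelihood token; otherwise the
    token is sampled from the softmax distribution over the vocabulary.
    """
    # Numerically stable softmax over the logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    if greedy:
        return max(range(len(probs)), key=probs.__getitem__)
    rng = rng or random.Random()
    # Draw one vocabulary index according to the probability weights.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

In a full decoder, this selection runs once per generated token, which is why the cost analysis below multiplies the per-pass cost by the response length.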
Because each token is generated using a pass through the large language model, the computational cost of responding to a query may be modeled as the product of the number of tokens included in the response and the computational resource cost of performing a pass through the large language model (e.g., in terms of processing power, memory bandwidth, and/or other computational resources used), which generally increases as the number of parameters within the large language model increases.

Disclosure of Invention

Certain aspects of the present disclosure provide a processor-implemented method for generating a response to an input query using a generative artificial intelligence model. The method generally includes receiving an input query for processing. An embedded representation of the received input query is generated using a first generative artificial intelligence model. The embedded representation generally includes an embedding of the received input query in a first dimension. The embedded representation is projected into a projected representation of the received input query. In general, the projected representation includes a representation in a second dimension, and the second dimension is smaller than the first dimension. A response to the received input query is generated using the second generative artificial intelligence model and the projected representation, and the generated response is output.
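The method just summarized (one large-model pass to embed the query in dimension d1, a projection down to dimension d2, then cheap autoregressive decoding by the small model) can be sketched end to end. All names here (`project`, `generate_response`, `llm_encode`, `slm_step`) are hypothetical stand-ins under the assumption that the projection is a simple linear map, which the disclosure does not specify:

```python
def project(embedding, weight):
    """Linear projection from the large model's dimension d1 to the
    small model's dimension d2.

    `weight` is a d2 x d1 matrix given as a list of rows; the result is
    a d2-dimensional vector.
    """
    return [sum(w * x for w, x in zip(row, embedding)) for row in weight]

def generate_response(query_tokens, llm_encode, weight, slm_step, max_tokens):
    """One large-model pass, then autoregressive small-model decoding."""
    # Single pass of the large model: embed the whole query in dimension d1.
    embedded = llm_encode(query_tokens)
    # Project into the small model's dimension d2.
    projected = project(embedded, weight)
    # The small model generates every response token autoregressively,
    # conditioned on the projected query representation.
    response = []
    for _ in range(max_tokens):
        response.append(slm_step(projected, response))
    return response
```

The large model thus runs once per query rather than once per response token; only the small model runs in the per-token loop.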
Other aspects provide a processing system configured to perform the foregoing methods and those described herein; a non-transitory computer-readable medium comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the foregoing methods and those described herein; a computer program product embodied on a computer-readable storage medium comprising code for performing the foregoing methods and those described further herein; and a processing system comprising components for performing the foregoing methods and those described further herein. The following description and the annexed drawings set forth in detail certain illustrative features of the one or more aspects.

Drawings

The drawings depict only certain aspects of the present disclosure and are not therefore to be considered limiting of its scope. FIG. 1 illustrates an example pipeline for generating a response to an input using a large-scale generative artificial intelligence model and a small-scale generative artificial intelligence model, in accordance with aspects of the present disclosure. FIG. 2 illustrates an example of iteratively generating tokens using a large-scale generative artificial intelligence model and a small-scale generative artificial intelligence model, in accordance with aspects of the present disclosure. FIG. 3 illustrates an example pipeline for generating a response to an input using a large language model and a small language model, in accordance with aspects of the present disclosure. FIG. 4 illustrates an example pipeline for generating a response to an input using a large language model and a small language model, in accordance with aspects of the present disclosure. FIG. 5 illustrates example operations for generating a response to an input query using a large-scale generative artificial intel