EP-4740137-A1 - EFFICIENT AUTOREGRESSIVE INFERENCE OF LARGE LANGUAGE MODELS (LLMS) WITH STATIC COMPILATION

EP 4740137 A1

Abstract

A processor-implemented method for adapting autoregressive inference of large language models (LLMs) for static compilation includes defining a maximally sized static data structure and a variable data mask. The variable data mask indicates a valid section of the maximally sized data structure. A set of valid data is determined by applying the variable data mask to the maximally sized static data structure. A computation is performed using only the set of valid data.
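The scheme in the abstract can be sketched in a few lines: allocate a buffer at a fixed maximum size so the compiler sees a static shape, mark the valid section with a mask, and compute only over the masked elements. The sketch below is a hypothetical illustration, not code from the patent; the buffer size and data values are arbitrary.

```python
import numpy as np

MAX_SEQ_LEN = 16  # maximally sized static data structure (hypothetical size)

# Statically shaped buffer: the compiler sees a fixed shape regardless of
# how many elements are actually valid at runtime.
buffer = np.zeros(MAX_SEQ_LEN, dtype=np.float32)

# Suppose only the first 5 slots hold real data this step.
valid_len = 5
buffer[:valid_len] = np.arange(1, valid_len + 1, dtype=np.float32)

# Variable data mask: indicates the valid section of the static buffer.
mask = np.arange(MAX_SEQ_LEN) < valid_len

# Determine the set of valid data by applying the mask to the buffer.
valid_data = buffer[mask]

# Perform the computation using only the set of valid data.
result = valid_data.sum()
print(result)  # 15.0
```

The key point is that `buffer` and `mask` always have the same static shape (`MAX_SEQ_LEN`), while the valid set they describe varies in size from step to step.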

Inventors

  • VERRILLI, Colin Beaton
  • VAIDHYANATHAN, Natarajan
  • BERRY, Geoffrey Carlton
  • CHATURVEDI, Rishi
  • CHATHA, Karamvir
  • RAMANI, Srinivasan
  • GUPTA, Anuj
  • GATTUPALLI, Venkata Subba Dheeraj

Assignees

  • QUALCOMM INCORPORATED

Dates

Publication Date
2026-05-13
Application Date
2023-12-06

Claims (19)

  1. An apparatus comprising: at least one memory; and at least one processor coupled to the at least one memory, the at least one processor configured to: define a maximally sized static data structure and a variable data mask, the variable data mask indicating a valid section of the maximally sized static data structure; determine a set of valid data by applying the variable data mask to the maximally sized static data structure; and perform a computation using only the set of valid data.
  2. The apparatus of claim 1, in which the set of valid data has a variable size.
  3. The apparatus of claim 1, in which the set of valid data is generated by a large language model.
  4. The apparatus of claim 3, in which the set of valid data comprises an internal state associated with the large language model.
  5. The apparatus of claim 4, in which the internal state is retained on an inference accelerator device.
  6. The apparatus of claim 5, in which the at least one processor is further configured to suppress transferring at least one of an input or an output during a subsequent invocation of the large language model.
  7. The apparatus of claim 6, in which a compiler associates the internal state with a suppressed input.
  8. The apparatus of claim 3, in which the large language model is implemented on an inference accelerator.
  9. The apparatus of claim 1, in which the at least one processor is further configured to generate, in advance of runtime, code to perform minimal computations using the set of valid data.
  10. A processor-implemented method performed by at least one processor, the processor-implemented method comprising: defining a maximally sized static data structure and a variable data mask, the variable data mask indicating a valid section of the maximally sized static data structure; determining a set of valid data by applying the variable data mask to the maximally sized static data structure; and performing a computation using only the set of valid data.
  11. The processor-implemented method of claim 10, in which the set of valid data has a variable size.
  12. The processor-implemented method of claim 10, in which the set of valid data is generated by a large language model.
  13. The processor-implemented method of claim 12, in which the set of valid data comprises an internal state associated with the large language model.
  14. The processor-implemented method of claim 13, in which the internal state is retained on an inference accelerator device.
  15. The processor-implemented method of claim 14, further comprising suppressing transfer of at least one of an input or an output during a subsequent invocation of the large language model.
  16. The processor-implemented method of claim 15, in which a compiler associates the internal state with a suppressed input.
  17. The processor-implemented method of claim 12, in which the large language model is implemented on an inference accelerator.
  18. The processor-implemented method of claim 10, further comprising generating, in advance of runtime, code to perform minimal computations using the set of valid data.
  19. An apparatus comprising: means for defining a maximally sized static data structure and a variable data mask, the variable data mask indicating a valid section of the maximally sized static data structure; means for determining a set of valid data by applying the variable data mask to the maximally sized static data structure; and means for performing a computation using only the set of valid data.
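A concrete way a statically shaped kernel with a variable mask can reproduce a variable-size computation — the general idea underlying the claims — is a masked softmax of the kind used in attention: invalid positions receive zero weight, so downstream computation effectively uses only the valid set. The sketch below is purely illustrative and not taken from the patent; the function name `masked_softmax` and all sizes are hypothetical.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over a statically shaped score vector, restricted to the
    valid section indicated by the mask (invalid entries get zero weight)."""
    neg_inf = np.finfo(scores.dtype).min
    masked = np.where(mask, scores, neg_inf)   # mask out invalid positions
    exp = np.exp(masked - masked.max())        # stable softmax; invalid -> 0
    return exp / exp.sum()

MAX_LEN = 8                                    # static shape seen by a compiler
scores = np.zeros(MAX_LEN, dtype=np.float32)
scores[:3] = [2.0, 1.0, 0.5]                   # only 3 positions are valid
mask = np.arange(MAX_LEN) < 3                  # variable data mask

weights = masked_softmax(scores, mask)
# weights[:3] sums to ~1.0; weights over invalid positions are ~0, so the
# result matches a softmax computed over only the 3 valid scores.
```

Because `scores` and `mask` keep the same static shape every invocation, such a kernel can be compiled once ahead of time while still serving sequences of any valid length up to `MAX_LEN`.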

Description

EFFICIENT AUTOREGRESSIVE INFERENCE OF LARGE LANGUAGE MODELS (LLMs) WITH STATIC COMPILATION

CROSS-REFERENCE TO RELATED APPLICATION

[0001] The present application claims the benefit of India Patent Application No. 202341045999, filed on July 8, 2023, and titled "EFFICIENT AUTOREGRESSIVE INFERENCE OF LARGE LANGUAGE MODELS (LLMs) WITH STATIC COMPILATION," the disclosure of which is expressly incorporated by reference in its entirety.

FIELD OF THE DISCLOSURE

[0002] Aspects of the present disclosure generally relate to artificial neural networks, and more specifically to efficient autoregressive inference of large language models (LLMs) with static compilation.

BACKGROUND

[0003] Artificial neural networks may comprise interconnected groups of artificial neurons (e.g., neuron models). An artificial neural network may be a computational device or may be represented as a method to be performed by a computational device. Convolutional neural networks (CNNs) are a type of feed-forward artificial neural network. Convolutional neural networks may include collections of neurons that each have a receptive field and that collectively tile an input space. Convolutional neural networks, such as deep convolutional neural networks (DCNs), have numerous applications. In particular, these neural network architectures are used in various technologies, such as image recognition, speech recognition, acoustic scene classification, keyword spotting, autonomous driving, and other classification tasks.

[0004] Large language models (LLMs) have grown in popularity due to their usefulness on a wide variety of natural language processing tasks. LLMs may receive a prompt from a user and, in turn, may generate a response or completion. However, LLMs may be inefficient in generating the completions.

[0005] One approach to address the inefficiency is the use of a compiler. Static compilation of neural networks may generate efficient executables, in part because compilers work with fixed-size tensors and static computational graphs. However, autoregressive inference on LLMs produces tensors that may vary in size, which makes the compilation of LLMs challenging.

SUMMARY

[0006] The present disclosure is set forth in the independent claims. Some aspects of the disclosure are described in the dependent claims.

[0007] In some aspects of the present disclosure, a processor-implemented method performed by one or more processors includes defining a maximally sized static data structure and a variable data mask. The variable data mask indicates a valid section of the maximally sized static data structure. The processor-implemented method also includes determining a set of valid data by applying the variable data mask to the maximally sized static data structure. The processor-implemented method further includes performing a computation using only the set of valid data.

[0008] Various aspects of the present disclosure are directed to an apparatus including means for defining a maximally sized static data structure and a variable data mask. The variable data mask indicates a valid section of the maximally sized static data structure. The apparatus also includes means for determining a set of valid data by applying the variable data mask to the maximally sized static data structure. The apparatus further includes means for performing a computation using only the set of valid data.

[0009] In some aspects of the present disclosure, a non-transitory computer-readable medium with program code recorded thereon is disclosed. The program code is executed by a processor and includes program code to define a maximally sized static data structure and a variable data mask. The variable data mask indicates a valid section of the maximally sized static data structure. The program code also includes program code to determine a set of valid data by applying the variable data mask to the maximally sized static data structure. The program code further includes program code to perform a computation using only the set of valid data.

[0010] Various aspects of the present disclosure are directed to an apparatus having at least one memory and one or more processors coupled to the at least one memory. The processor(s) is configured to define a maximally sized static data structure and a variable data mask. The variable data mask indicates a valid section of the maximally sized static data structure. The processor(s) is also configured to determine a set of valid data by applying the variable data mask to the maximally sized static data structure. The processor(s) is further configured to perform a computation using only the set of valid data.

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] The features, nature, and advantages of the present disclosure will become more apparent from the detailed description set forth below when taken in conjunction with the drawings, in which like reference characters identify correspondingly throughout.

[0012] FIGURE 1 illustrates an example