CN-121981278-A - Self-adaptive large language model reasoning device, method, medium, program and electronic terminal

CN121981278A

Abstract

The invention provides an adaptive large language model inference apparatus, method, medium, program, and electronic terminal. By providing two or more stages of first computing sub-modules with different architecture characteristics, first-token generation computation is routed along different paths for input sequences in different length ranges. This effectively improves large language model inference speed, and reduces the time inference requires, in scenarios where the input sequence length fluctuates widely.

Inventors

  • Request for anonymity
  • Request for anonymity

Assignees

  • 上海光羽芯辰科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-04-08

Claims (11)

  1. An adaptive large language model inference apparatus, comprising a computing module and a routing module, wherein: the computing module comprises a first computing module and a second computing module; the first computing module comprises two or more stages of first computing sub-modules, and first computing sub-modules at different stages have different architecture characteristics; the first computing module is configured to perform first-token generation computation on a received input sequence to obtain and output a state cache and a first token; the second computing module is configured to acquire the state cache and the first token and perform autoregressive generation computation to obtain subsequent tokens; the characterization parameters of the architecture characteristics comprise bandwidth level and compute level, and first computing sub-modules with different architecture characteristics take different amounts of time to compute the same input sequence; and the routing module is configured to receive the input sequence, determine the length range to which the length of the input sequence belongs, and dispatch the input sequence to the corresponding computing module based on the length range.
  2. The adaptive large language model inference apparatus of claim 1, wherein the first computing sub-module at the corresponding stage in the first computing module and the second computing module exchange data by sharing a storage area or by providing a swap memory.
  3. The adaptive large language model inference apparatus of claim 2, wherein the first computing module comprises two stages of first computing sub-modules, namely sub-module A and sub-module B; sub-module B exchanges data with the second computing module by sharing a storage area or by providing a swap memory; sub-module B is configured to perform first-token generation computation based on a GEMM kernel; the second computing module is configured to perform autoregressive generation computation based on a GEMV kernel; and sub-module B is configured to perform first-token generation computation for shorter input sequences.
  4. The adaptive large language model inference apparatus of claim 1, wherein each length range of the input sequence is determined from a comparison of the time taken by different first computing sub-modules to perform first-token generation computation on input sequences containing different numbers of tokens, and wherein, when the time taken by a first computing sub-module at a given stage to perform first-token generation computation on input sequences within a given token-count range is smaller than that of the other first computing sub-modules, that token-count range is determined to be the length range corresponding to that first computing sub-module.
  5. The adaptive large language model inference apparatus of claim 1, wherein the number of said first computing sub-modules at each stage is one or more, and the number of said second computing sub-modules is one or more.
  6. The adaptive large language model inference apparatus of claim 1, wherein the second computing module is further configured to perform first-token generation computation on a received input sequence to obtain and output a state cache and a first token, and wherein the routing module dispatches the input sequence to the first computing sub-module at the corresponding stage or to the second computing module based on the length range.
  7. The adaptive large language model inference apparatus of claim 6, wherein the routing module is configured to, upon receiving a plurality of requests, dispatch the input sequence of a request to the first computing sub-module at the corresponding stage or to the second computing module, based on the length of that input sequence, when that first computing sub-module or second computing module is in an idle state.
  8. An adaptive large language model inference method, applied to a routing module, the method comprising: receiving an input sequence and determining the length range to which the length of the input sequence belongs; and dispatching the input sequence to the corresponding computing module based on the length range; wherein the computing module comprises a first computing module and a second computing module, the first computing module comprises two or more stages of first computing sub-modules, first computing sub-modules at different stages have different architecture characteristics, the first computing module is configured to perform first-token generation computation on a received input sequence to obtain and output a state cache and a first token, the second computing module is configured to acquire the state cache and the first token and perform autoregressive generation computation to obtain subsequent tokens, and the characterization parameters of the architecture characteristics comprise bandwidth level and compute level, first computing sub-modules with different architecture characteristics taking different amounts of time to compute the same input sequence.
  9. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of claim 8.
  10. A computer program product comprising computer program code which, when run on a computer, causes the computer to carry out the method of claim 8.
  11. An electronic terminal comprising a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the method of claim 8.
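The dispatch rule of claims 1 and 8 (route each input sequence to the compute sub-module whose length range it falls into) can be sketched as follows. The class name, sub-module labels, and threshold values are illustrative assumptions, not taken from the patent.

```python
from bisect import bisect_right

class LengthRouter:
    """Sketch of the routing module: dispatch by input-sequence length range.

    boundaries[i] is the exclusive upper bound of the i-th length range, so
    boundaries [256, 2048] yield the ranges [0, 256), [256, 2048), [2048, inf).
    targets holds one sub-module label per range; all labels here are
    hypothetical placeholders for the patent's compute sub-modules.
    """

    def __init__(self, boundaries, targets):
        if len(targets) != len(boundaries) + 1:
            raise ValueError("need exactly one target per length range")
        self.boundaries = list(boundaries)
        self.targets = list(targets)

    def route(self, sequence):
        # Find which length range len(sequence) falls into, then return
        # the sub-module responsible for that range.
        idx = bisect_right(self.boundaries, len(sequence))
        return self.targets[idx]

# Example: short prompts go to the decode-oriented module (claim 6 allows the
# second computing module to handle first-token generation too), medium prompts
# to sub-module B, long prompts to sub-module A. Thresholds are made up.
router = LengthRouter([256, 2048],
                      ["second_module", "submodule_B", "submodule_A"])
```

A router following claim 7 would additionally track whether each target is idle before dispatching a request; that bookkeeping is omitted here.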

Description

Self-adaptive large language model reasoning device, method, medium, program and electronic terminal

Technical Field

The present disclosure relates to the field of large language model inference, and in particular to an adaptive large language model inference apparatus, method, medium, program, and electronic terminal.

Background

When performing inference with a large language model, high bandwidth and high compute power are often not available in the same hardware, so one of the following two hardware configurations is generally adopted at present. In existing scheme 1, first-token generation computation and autoregressive generation computation are simply split across hardware: hardware with a high compute level handles first-token generation, and hardware with a high bandwidth level handles autoregressive generation. In existing scheme 2, first-token generation and autoregressive generation are both performed on unified hardware, whose compute level or bandwidth level is high. With high-compute hardware, autoregressive computation takes a long time; with high-bandwidth hardware, first-token generation takes a long time. In complex service scenarios where the input sequence length fluctuates severely, such as agents, multi-turn dialogue, and retrieval-augmented generation, existing scheme 1 cannot select a faster computing module for input sequences of different lengths, so inference takes a long time.
Moreover, when the first-token generation task and the autoregressive generation task are simply handled by different hardware, transmitting the state cache and first token produced during first-token generation over a network, from the first-token hardware to the autoregressive hardware, also takes a long time. This considerably extends the overall inference time. Inference with existing scheme 2 takes even longer than with existing scheme 1.

Disclosure of Invention

In view of the above drawback of the prior art, namely that inference takes longer in complex service scenarios where the input sequence fluctuates severely, the present disclosure aims to provide an adaptive large language model inference apparatus, method, medium, program, and electronic terminal to solve this problem.

A first aspect of the disclosure provides an adaptive large language model inference apparatus comprising a computing module and a routing module. The computing module comprises a first computing module and a second computing module. The first computing module comprises two or more stages of first computing sub-modules, and first computing sub-modules at different stages have different architecture characteristics. The first computing module is configured to perform first-token generation computation on a received input sequence to obtain and output a state cache and a first token. The second computing module is configured to acquire the state cache and the first token and perform autoregressive generation computation to obtain subsequent tokens. The characterization parameters of the architecture characteristics comprise bandwidth level and compute level, and first computing sub-modules with different architecture characteristics take different amounts of time to compute the same input sequence. The routing module is configured to receive the input sequence, determine the length range to which its length belongs, and dispatch it to the corresponding computing module based on that length range.

In some embodiments of the present disclosure, the first computing sub-module at the corresponding stage in the first computing module and the second computing module exchange data by sharing a storage area or by providing a swap memory. In some embodiments of the disclosure, the first computing module comprises two stages of first computing sub-modules, namely sub-module A and sub-module B; sub-module B exchanges data with the second computing module by sharing a storage area or by providing a swap memory; sub-module B is configured to perform first-token generation computation based on a GEMM kernel; the second computing module is configured to perform autoregressive generation computation based on a GEMV kernel; and sub-module B is configured to perform first-token generation computation for shorter input sequences.
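Claim 4 derives each length range from timing comparisons: probe every first computing sub-module at several token counts and assign each count to the fastest one. A minimal sketch of that rule follows; the function name, sub-module labels, and all timing figures are hypothetical, not benchmark data from the patent.

```python
def calibrate_length_ranges(timings):
    """Assign each probed token count to the fastest sub-module (claim 4).

    timings maps a sub-module label to {token_count: seconds}; every
    sub-module must be probed at the same token counts. Returns a dict
    {token_count: fastest_label}. All inputs here are hypothetical.
    """
    counts = sorted(next(iter(timings.values())))
    return {n: min(timings, key=lambda label: timings[label][n])
            for n in counts}

# Hypothetical prefill timings: sub-module B (high bandwidth) wins on short
# sequences, sub-module A (high compute) wins on longer ones, consistent
# with claim 3's assignment of shorter sequences to sub-module B.
timings = {
    "submodule_A": {128: 0.020, 1024: 0.050, 8192: 0.30},
    "submodule_B": {128: 0.005, 1024: 0.080, 8192: 0.90},
}
```

Contiguous runs of token counts won by the same sub-module then form that sub-module's length range, and the boundaries between runs are the thresholds the routing module consults at dispatch time.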