CN-122019119-A - Edge environment-oriented large language model low-response delay reasoning method and device
Abstract
The invention discloses a low-response-delay reasoning method and device for large language models in edge environments. Before the large language model reasoning system is deployed, the method models its computation time, communication time and memory occupation to obtain corresponding modeling results; constructs a parallel space representing a mixture of intra-layer asynchronous parallelism and inter-layer asynchronous parallelism; performs an integer programming modeling operation based on the parallel space to obtain integer programming modeling results; determines a deployment scheme of the reasoning system for a target reasoning task based on those results and deploys the system accordingly; and, for the target reasoning task, cooperatively executes reasoning across edge devices based on fine-grained communication optimization. The invention can thereby improve pre-fill throughput and reduce pre-fill delay at the same time, reducing the reasoning response delay of large language models in edge environments.
Inventors
- XIONG YI
- ZHU ZONGWEI
- ZHANG RUI
- ZU YULONG
Assignees
- 中国科学技术大学苏州高等研究院 (Suzhou Institute for Advanced Research, University of Science and Technology of China)
Dates
- Publication Date
- 20260512
- Application Date
- 20250916
Claims (10)
- 1. An edge environment-oriented large language model low-response-delay reasoning method, characterized by comprising the following steps: before the large language model reasoning system is deployed, modeling the computation time of the system to obtain a computation-time modeling result, and modeling the communication time and memory occupation of the system to obtain a communication-time modeling result and a memory-occupation modeling result; constructing a parallel space for representing a mixture of intra-layer asynchronous parallelism and inter-layer asynchronous parallelism; performing an integer programming modeling operation based on the parallel space to obtain integer programming modeling results; determining a deployment scheme of the reasoning system for a target reasoning task based on the integer programming modeling results, and deploying the reasoning system based on the deployment scheme; and, in the execution stage of the reasoning system, cooperatively executing the reasoning work of the target reasoning task among the edge devices based on a fine-grained communication optimization operation.
- 2. The edge environment-oriented large language model low-response-delay reasoning method of claim 1, wherein modeling the computation time of the large language model reasoning system to obtain a computation-time modeling result comprises: modeling the attention block with a multivariate quadratic regression model over FLOPs, sequence length and KV-cache size to obtain an attention-block modeling result; performing univariate linear regression on the linear layers over FLOPs to obtain a linear-layer modeling result; and determining the computation-time modeling result from the attention-block and linear-layer modeling results. Modeling the communication time and memory occupation of the reasoning system comprises: modeling the communication time based on a piecewise function with a regularization term to obtain the communication-time modeling result; and modeling the memory occupation based on the sliced weights, activation tensors and KV cache of the reasoning system to determine the memory-occupation modeling result.
- 3. The edge environment-oriented large language model low-response-delay reasoning method of claim 2, wherein constructing the parallel space for representing a mixture of intra-layer asynchronous parallelism and inter-layer asynchronous parallelism comprises: based on the target reasoning task, dividing the edge devices by factorization into a plurality of V × H arrangement combinations (grid shapes), where V and H denote the intra-layer and inter-layer parallelism degrees respectively; and constructing, from all grid shapes, the parallel space in which intra-layer asynchronous parallelism and inter-layer asynchronous parallelism are mixed. Performing the integer programming modeling operation based on the parallel space comprises: for each grid shape in the parallel space, performing integer programming modeling over the division of model layers and the subsequence lengths allocated to each edge device, based on the pre-fill throughput function, the pre-fill delay function and preset constraint conditions, to obtain the integer programming modeling results.
- 4. The edge environment-oriented large language model low-response-delay reasoning method of claim 3, wherein the pre-fill throughput function is determined by: determining a longest-execution-time function based on the first time required by each edge device to transmit its KV cache to the next edge device for intra-layer parallelism, the second time required by the edge devices to transmit activation data, and the computation time of the edge devices; and determining the pre-fill throughput function from the longest-execution-time function. The pre-fill delay function is determined by: determining a total computation-and-transmission delay function for each layer stage based on the longest computation time within a single layer stage, the longest communication time within a single layer stage, and the activation-data transmission time between adjacent layer stages; and determining the pre-fill delay function from the total computation-and-transmission delay function of each layer stage.
- 5. The edge environment-oriented large language model low-response-delay reasoning method of claim 4, wherein determining the deployment scheme of the large language model reasoning system for the target reasoning task based on the integer programming modeling results comprises: based on the computation-time, communication-time and memory-occupation modeling results corresponding to the target reasoning task, solving the integer programming model for each grid shape in the parallel space to obtain the optimal solution under each grid shape; determining, from the optimal solutions under all grid shapes, the optimal grid shape in the parallel space together with the model layer count and subsequence length of each edge device under that grid shape; and determining the deployment scheme of the reasoning system for the target reasoning task based on the optimal grid shape and the model layer count and subsequence length of each edge device under it.
- 6. The edge environment-oriented large language model low-response-delay reasoning method of claim 4 or 5, further comprising: while solving the integer programming model for each grid shape in the parallel space to obtain the optimal solution under each grid shape, evaluating whether the maximum pre-fill throughput corresponding to the current grid shape can satisfy the sequence arrival-rate condition of the target reasoning task; if it cannot, and the communication time corresponding to the grid shape also fails the preset communication-time limit, terminating the integer programming calculation for all remaining grid shapes in advance, and determining the optimal grid shape, together with the model layer count and subsequence length of each edge device, from the grid shapes for which the calculation has already been performed.
- 7. The edge environment-oriented large language model low-response-delay reasoning method of claim 1, wherein, when the large language model reasoning system performs matrix multiplication in the execution stage, the matrix is divided into a plurality of fragments and each fragment is allocated to a different thread block for parallel computation, so that each thread block performs a fixed number of calculations; and when subsequence lengths are divided from the sequence length, each subsequence length is set to a multiple of 32.
- 8. The edge environment-oriented large language model low-response-delay reasoning method of claim 1, wherein the forward propagation process of intra-layer asynchronous parallelism is as follows: dividing an input sequence into a plurality of subsequences, each with a corresponding number; allocating each subsequence to an independent process, each process independently performing a linear transformation on its local input to generate the corresponding query and KV cache; for each subsequence, the process with the lower number transmits its locally generated KV cache to the process corresponding to the higher-numbered subsequence, avoiding global synchronization; and each process first computes attention scores based on its locally generated KV cache, then completes the remaining attention-score computation as KV caches arrive from processes with lower subsequence numbers.
- 9. An edge environment-oriented large language model low-response-delay reasoning device, characterized in that the device comprises: a first module, configured to model, before the large language model reasoning system is deployed, the computation time, communication time and memory occupation of the system to obtain the corresponding modeling results; a second module, configured to construct a parallel space for representing a mixture of intra-layer asynchronous parallelism and inter-layer asynchronous parallelism, perform an integer programming modeling operation based on the parallel space, determine a deployment scheme of the reasoning system for a target reasoning task based on the integer programming modeling results, and deploy the reasoning system based on the deployment scheme; and a third module, configured to cooperatively execute, in the execution stage of the reasoning system, the reasoning work of the target reasoning task among the edge devices based on a fine-grained communication optimization operation.
- 10. An edge environment-oriented large language model low-response-delay reasoning system, comprising: a memory storing executable program code; and a processor coupled to the memory; wherein the processor invokes the executable program code stored in the memory to perform the steps of the edge environment-oriented large language model low-response-delay reasoning method of any one of claims 1 to 8.
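The profiling-based time models of claim 2 can be sketched as a pair of least-squares fits: a multivariate quadratic regression over FLOPs, sequence length and KV-cache size for the attention block, and a univariate linear regression over FLOPs for the linear layers. This is a minimal illustration in Python/NumPy; the exact feature set and fitting procedure are assumptions, not the patent's formulation.

```python
import numpy as np

def fit_attention_model(flops, seq_len, kv_cache, times):
    """Least-squares fit of a quadratic model over the three profiled
    features (constant, linear and squared terms; cross terms omitted
    for brevity - an assumption of this sketch)."""
    f, s, k = map(np.asarray, (flops, seq_len, kv_cache))
    X = np.column_stack([np.ones_like(f), f, s, k, f**2, s**2, k**2])
    coef, *_ = np.linalg.lstsq(X, np.asarray(times), rcond=None)
    return coef

def fit_linear_layer_model(flops, times):
    """Univariate linear regression: time ~= a * FLOPs + b."""
    a, b = np.polyfit(np.asarray(flops), np.asarray(times), 1)
    return a, b
```

At deployment-planning time, the fitted coefficients predict per-device computation time for any candidate layer/subsequence assignment without re-profiling.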
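Claim 3's factorization of the edge devices into V × H arrangement combinations amounts to enumerating the factor pairs of the device count. A minimal sketch (the function name is illustrative):

```python
def grid_shapes(num_devices: int):
    """Enumerate every (V, H) grid shape with V * H == num_devices,
    where V is the intra-layer and H the inter-layer parallelism degree."""
    return [(v, num_devices // v)
            for v in range(1, num_devices + 1)
            if num_devices % v == 0]
```

For example, 8 edge devices yield the four candidate shapes (1, 8), (2, 4), (4, 2) and (8, 1), which together span the parallel space searched by the integer programming step.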
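The pre-fill throughput and delay functions of claim 4 follow standard pipeline reasoning: steady-state throughput is bounded by the slowest stage's total time, while one-shot delay sums each stage's longest computation and communication times plus the activation transfers between adjacent stages. A hedged sketch; the patent's exact functional forms may differ:

```python
def prefill_throughput(stage_times):
    """Pipeline throughput bound: limited by the slowest stage, where each
    stage time already includes compute, KV-cache and activation transfers."""
    return 1.0 / max(stage_times)

def prefill_delay(stage_compute_times, stage_comm_times, activation_transfer_times):
    """One-shot latency: per layer stage, the longest computation time plus
    the longest communication time, plus activation hand-offs between
    adjacent stages."""
    per_stage = [max(c) + max(m)
                 for c, m in zip(stage_compute_times, stage_comm_times)]
    return sum(per_stage) + sum(activation_transfer_times)
```

Both functions take modeled times (from the regression models above) rather than measurements, so they can be evaluated inside the integer program for every candidate assignment.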
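Claim 7's requirement that subsequence lengths be multiples of 32 (so every thread block receives a full fragment of work) can be expressed as simple rounding helpers. The policy of folding any remainder into the last subsequence is an assumption of this sketch, not stated in the claim:

```python
def pad_subsequence_length(length: int, tile: int = 32) -> int:
    """Round a subsequence length up to the next multiple of the tile width
    (32 per claim 7), so thread blocks perform a fixed number of calculations."""
    return ((length + tile - 1) // tile) * tile

def split_sequence(seq_len: int, parts: int, tile: int = 32):
    """Split seq_len tokens into `parts` subsequences whose lengths are
    multiples of `tile`; the remainder is absorbed by the last subsequence."""
    base = (seq_len // (parts * tile)) * tile
    return [base] * (parts - 1) + [seq_len - base * (parts - 1)]
```

For instance, 1024 tokens over 4 devices split evenly into four 256-token subsequences, each a multiple of 32.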
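Claim 8's intra-layer asynchronous forward pass can be illustrated with a single-process simulation of the dataflow: each numbered block first attends over its locally generated KV, then over KV "received" from lower-numbered blocks. The real scheme overlaps these steps across processes and combines partial scores incrementally; this sketch only reproduces the final result of the block ordering:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def async_block_attention(q_blocks, k_blocks, v_blocks):
    """Causal block attention: block i sees its own KV plus the KV of all
    lower-numbered blocks, mirroring the claim-8 transmission pattern."""
    outputs = []
    for i, q in enumerate(q_blocks):
        # Step 1: locally generated KV is available immediately.
        ks, vs = [k_blocks[i]], [v_blocks[i]]
        # Step 2: KV caches from lower-numbered processes arrive later.
        for j in range(i):
            ks.append(k_blocks[j])
            vs.append(v_blocks[j])
        K, V = np.concatenate(ks), np.concatenate(vs)
        scores = softmax(q @ K.T / np.sqrt(q.shape[-1]))
        outputs.append(scores @ V)
    return np.concatenate(outputs)
```

Because softmax is invariant to the ordering of keys, computing local scores first and merging remote KV later yields the same output as synchronous attention, which is what lets the scheme skip global synchronization.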
Description
Edge environment-oriented large language model low-response delay reasoning method and device

Technical Field

The invention relates to the technical field of large language models, in particular to a low-response-delay reasoning method and device for large language models in edge environments.

Background

With the remarkable performance of large language models (LLMs) in intent understanding and reasoning, many industries are undergoing profound intelligent transformation. However, cloud-based API services do not address the key issues of data privacy, customization and continued availability that concern enterprises. This has driven the trend of deploying LLMs in edge environments, especially in healthcare, smart home systems and autonomous driving. At the same time, recent advances in lightweight LLMs have demonstrated performance comparable to mainstream large models, further enhancing deployment feasibility. Despite these advances, the vast computational resources required for LLM reasoning continue to challenge resource-constrained edge devices, resulting in performance bottlenecks that impair the user experience. For example, users often expect LLM-based applications to respond almost instantaneously, and Time-to-First-Token (TTFT) is a key performance indicator and an important criterion for measuring that experience. If the TTFT is too long, the user may lose patience and eventually abandon the interaction. To systematically investigate TTFT bottlenecks, end-to-end evaluations were performed on the NVIDIA Jetson Xavier NX platform using the representative model Qwen-1.8B-Chat; see FIG. 1. The LLM reasoning process involves two distinct phases: the pre-fill stage processes the input tokens to generate the first output token, and the decode stage iteratively generates new tokens.
This dual-stage workflow introduces two key Service Level Objective (SLO) metrics: Time-to-First-Token (TTFT), reflecting system responsiveness and covering both request queuing delay and pre-fill computation delay, and Time-Per-Output-Token (TPOT), representing the efficiency of the decoding stage, which typically must be faster than human reading speed (about 0.2 seconds per word). LLM reasoning on real edge devices is analyzed against these two metrics: delay measurements for Qwen-1.8B-Chat are taken on the NVIDIA Jetson Xavier NX. For ease of illustration, FIG. 1 sets a particular SLO value, marked with a red dashed line; each delay data point is the average of 10 inference requests. From the analysis it is concluded that: (1) TTFT is the major bottleneck, in that TTFT tends to exceed the predefined SLO in a super-linear fashion as the prompt length or request arrival rate increases, reaching up to 7 times the SLO threshold. This makes TTFT the key bottleneck in edge LLM reasoning. In contrast, TPOT remains stable and within acceptable limits. This pattern is consistent with empirical observations in real systems; for example, TTFT accounts for 94.4% to 98.8% of the total end-to-end reasoning delay in user-interface automation tasks. Therefore, optimizing TTFT is the primary focus of this work. (2) The pre-fill delay itself needs optimization: even at very low request arrival rates (i.e., with little queuing delay), once the request length exceeds about 1024 tokens the pre-fill delay already exceeds the TTFT SLO requirement. This results from the fact that the computational overhead grows super-linearly with the prompt length. Since user input in real-world applications often exceeds this length, optimizing the pre-fill delay is critical for edge LLM reasoning. (3) Pre-fill throughput optimization is equally critical: as the request load increases, the pre-fill throughput becomes insufficient to handle the incoming requests.
This deficiency causes the queuing delay to rise significantly, which in turn causes the TTFT to increase dramatically. The problem is further exacerbated by a backlog amplification effect, in that queuing delays accumulate as the number of incoming requests grows. For example, under the configuration shown in FIG. 1, with a request arrival rate of 1 req/s and a prompt length of 1024 tokens, increasing the number of in-flight requests from 10 to 100 raises the average TTFT from 18 seconds to 198 seconds, a factor of about 11. The main failure modes of TTFT violations are as follows: (1) Pre-fill delay: as the input sequence length increases, the pre-fill delay grows super-linearly. Even at a very low request load (0.01 req/s), i.e. disregarding queuing delay, once the input reaches 4096 tokens the pre-fill delay already exceeds the TTFT target by a factor of 7. (2) Pre-fill throughput is inadequate: queuing delay increases rapidly when the processing rate of the pre-fill phase (i.e., pre-fill throughput) fails to meet the request arrival rate.
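The backlog amplification described above can be reproduced with a toy single-server queue: when pre-fill throughput does not cover the request arrival rate, queuing delay, and hence TTFT, grows with the number of in-flight requests. The model below is an illustrative assumption, not the patent's measurement setup:

```python
def simulate_ttft(arrival_rate, prefill_time, num_requests):
    """Single-server FIFO queue: requests arrive at a fixed rate and are
    pre-filled one at a time; TTFT = queuing delay + pre-fill delay."""
    ttfts = []
    server_free_at = 0.0
    for i in range(num_requests):
        arrival = i / arrival_rate        # deterministic arrival times
        start = max(arrival, server_free_at)
        server_free_at = start + prefill_time
        ttfts.append(server_free_at - arrival)
    return ttfts
```

With a 2-second pre-fill and 1 req/s arrivals, each additional in-flight request adds a second to TTFT, so the average TTFT scales with backlog size, which is the qualitative behavior behind the 18 s to 198 s jump reported above.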