CN-122021900-A - Efficient inference method and framework for large language models on resource-constrained end-side devices
Abstract
The application discloses an efficient inference method and framework for large language models (LLMs) on resource-constrained end-side devices, belonging to the technical field of Web operating systems. The framework addresses the adaptation of tasks to models through the cooperative operation of a task scheduling layer, an inference engine layer, and a resource management layer. The inference engine layer raises inference throughput while lowering the peak graphics processing unit (GPU) memory footprint through draft model offloading, a dual-batch interleaved pipeline, and asynchronous loading. The resource management layer monitors CPU, GPU, and memory resources in real time, supplying the data that drives the inference engine layer's dynamic adjustment strategies and avoiding resource overload or waste. The framework achieves high GPU-memory efficiency on the evaluated datasets and provides practical framework support for LLM deployment in the domestic end-side software and hardware ecosystem.
Inventors
- LIU XIAODONG
- XU HAO
- JI BIN
- YU JIE
- ZHANG QINGXIAO
- LI XIAOPENG
- GAO LONG
- PENG LONG
- LI ZHUO
- ZHANG YI
Assignees
- 中国人民解放军国防科技大学 (National University of Defense Technology of the Chinese People's Liberation Army)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-01-29
Claims (9)
- 1. An efficient large language model (LLM) inference framework for resource-constrained end-side devices, the framework comprising: a task scheduling layer, configured with an inference request receiving unit and an inference task decomposition unit, for receiving end-side LLM inference requests and decomposing each inference task into subtasks suited to dual-model cooperative processing; an inference engine layer, provided with a dual-model collaborative inference unit, a dual-batch interleaved pipeline unit, a draft model offloading unit, and an asynchronous loading unit, wherein the dual-model collaborative inference unit executes draft model prediction and target model verification, the dual-batch interleaved pipeline unit processes two sequences in an interleaved manner to improve inference efficiency, the draft model offloading unit offloads the MLP components and KV cache of the draft model to CPU memory, and the asynchronous loading unit prefetches model components during computation; and a resource management layer, comprising a computing resource monitoring unit and a memory resource monitoring unit, for monitoring the CPU utilization, GPU memory occupancy, and host memory occupancy of the end-side device in real time and exchanging data with the inference engine layer to dynamically adjust inference resource allocation.
- 2. The efficient LLM inference framework for resource-constrained end-side devices according to claim 1, wherein, in the draft model offloading unit of the inference engine layer, the draft model is a Mistral-type non-MoE architecture model; during offloading, the LLM attention computation task is delegated to the CPU, the current layer's Query activation is transmitted from the GPU to the CPU, and the result is returned to the GPU after the computation completes.
- 3. The efficient LLM inference framework for resource-constrained end-side devices of claim 1, wherein the inference engine layer is further provided with a fine-grained component loading unit for prefetching the gating network and expert components in parallel via multithreading for each MoE layer of the target MoE-architecture model; each MoE layer of the target model comprises 8 expert components, so the multithreading launches 9 threads to load the 1 gating network and the 8 expert components respectively, an expert component being an independent model functional module that executes a specific inference computation task.
- 4. The efficient LLM inference framework for resource-constrained end-side devices of claim 1, wherein the inference engine layer is further provided with a Load-to-GPU Pinning unit for allocating a physically contiguous page-locked memory region in host memory and permanently residing the weights of the MLP and MoE components in that region; the page-locked region is 10.83 GB in size and supports zero-copy transfer for direct DMA access, realized through the Tensor.to function with its device parameter set and non_blocking=True; the Load-to-GPU Pinning unit performs memory locking as follows: a 10.83 GB physically contiguous region is partitioned in host memory as a page-locked buffer and the MLP and MoE component weights are stored there, so that during inference the DMA engine directly accesses the page-locked buffer for zero-copy transfer of the component weights, eliminating page-table lookup and double-copy overhead.
- 5. The efficient LLM inference framework for resource-constrained end-side devices of claim 1, wherein the computing resource monitoring unit and the memory resource monitoring unit of the resource management layer monitor in real time, and the monitored data capture the resource usage of the end-side device, including per-core CPU utilization, peak GPU memory occupancy, and host memory occupancy.
- 6. An efficient large language model inference method for a resource-constrained end-side device, applicable to the efficient large language model inference framework for resource-constrained end-side devices as claimed in any one of claims 1 to 5, characterized in that the method comprises the following steps: receiving an end-side LLM inference request through the task scheduling layer of the framework, the inference task decomposition unit decomposing the inference task into subtasks suited to dual-model cooperative processing; the dual-batch interleaved pipeline unit of the inference engine layer controlling the dual-model collaborative inference unit to execute draft model prediction and target model verification in a sequence-interleaved manner, while the draft model offloading unit offloads the MLP components and KV cache of the draft model to CPU memory and the asynchronous loading unit prefetches the model components required for inference; and the resource management layer monitoring the CPU utilization, GPU memory occupancy, and host memory occupancy of the end-side device in real time and feeding the monitoring data back to the inference engine layer to dynamically adjust the draft model offloading, asynchronous loading, and component loading strategies.
- 7. The efficient large language model inference method for a resource-constrained end-side device according to claim 6, wherein dual-batch interleaving is adopted when the dual-model collaborative inference unit performs draft model prediction and target model verification; the dual-batch interleaving process comprises: for two sequences to be inferred, the draft model first processes the prediction subtask of sequence 1 while the target model synchronously verifies the prediction result of sequence 2, after which the prediction of sequence 2 and the verification of sequence 1 are processed alternately.
- 8. The efficient large language model inference method for a resource-constrained end-side device according to claim 6, wherein the draft model offloading unit is configured to asynchronously prefetch the weights of the current layer's MLP component into GPU memory while the CPU performs the attention computation, thereby overlapping computation with weight transmission, wherein the attention computation transmits only the current layer's Query activation, whose data volume O follows a computation rule in the batch size b, the sequence length s, and the hidden layer dimension d.
- 9. The efficient large language model inference method for a resource-constrained end-side device of claim 6, wherein, when the inference engine layer performs fine-grained component loading, the method further comprises: for each MoE layer of the target MoE-architecture model, loading in parallel through 9 independent threads, wherein 1 thread loads the gating network component and 8 threads respectively load the 8 expert components, the corresponding computation starting after each component finishes loading.
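As an illustration of the dual-batch interleaving of claim 7, the alternating schedule can be sketched in plain Python. The function and stage names here are hypothetical stand-ins; the patent does not specify the actual draft-prediction and target-verification kernels:

```python
# Sketch of the dual-batch interleaved pipeline (claim 7): while the draft
# model predicts tokens for one sequence, the target model verifies the
# other sequence's previous draft tokens, so the two stages overlap
# instead of running back to back on a single sequence.

def interleaved_schedule(steps):
    """Return the (stage, sequence) execution order for `steps` iterations."""
    schedule = []
    for step in range(steps):
        # Even steps: draft works on sequence 1, target verifies sequence 2;
        # odd steps: the roles of the two sequences swap.
        active, other = (1, 2) if step % 2 == 0 else (2, 1)
        schedule.append(("draft_predict", active))
        schedule.append(("target_verify", other))
    return schedule
```

For two iterations this yields draft(1)/verify(2) followed by draft(2)/verify(1), matching the alternation the claim describes.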
Description
Efficient inference method and framework for large language models on resource-constrained end-side devices
Technical Field
The embodiments of the application relate to the technical field of Web operating systems, and in particular to an efficient inference method and framework for large language models on resource-constrained end-side devices.
Background
With the advent of the AI PC era, demand for local inference of large language models (Large Language Model, LLM) on end-side devices (smartphones, embedded development boards, etc.) has grown: local inference avoids the privacy risks of cloud transmission and reduces inference latency to the millisecond level, making it a core direction for efficient intelligent interaction. End-side devices, however, generally face compound limitations in computing power, storage, and hardware-architecture adaptation. The resource-limitation problem is especially acute: computing power is only 1/10 to 1/100 that of a cloud server, memory is mostly 2-16 GB and storage 32-256 GB, while the raw size of mainstream LLMs exceeds 20 GB with high temporary memory occupancy; at the same time, devices span ARM, x86, RISC-V, and other architectures with large differences in instruction-set support, which markedly raises the difficulty of LLM inference adaptation. In existing schemes, model lightweighting lacks architectural targeting, resource scheduling uses a fixed-allocation mode with delayed response, inference execution and result reuse are insufficient, and hardware-compatibility adaptation lacks verification and reprocessing mechanisms, so the problems of difficult adaptation, low utilization, poor efficiency, and weak compatibility remain unsolved. There is therefore a need for efficient LLM inference schemes adapted to end-side characteristics.
Disclosure of Invention
The embodiments of the application provide an efficient large language model inference method and framework for resource-constrained end-side devices. The technical scheme is as follows. In one aspect, an efficient large language model inference framework for a resource-constrained end-side device is provided, the framework comprising: a task scheduling layer, configured with an inference request receiving unit and an inference task decomposition unit, for receiving end-side LLM inference requests and decomposing each inference task into subtasks suited to dual-model cooperative processing; an inference engine layer, provided with a dual-model collaborative inference unit, a dual-batch interleaved pipeline unit, a draft model offloading unit, and an asynchronous loading unit, wherein the dual-model collaborative inference unit executes draft model prediction and target model verification, the dual-batch interleaved pipeline unit processes two sequences in an interleaved manner to improve inference efficiency, the draft model offloading unit offloads the multi-layer perceptron (Multi-Layer Perceptron, MLP) components and key-value cache (Key-Value Cache, KV cache) of the draft model to CPU memory, and the asynchronous loading unit prefetches model components during computation; and a resource management layer, comprising a computing resource monitoring unit and a memory resource monitoring unit, for monitoring the CPU utilization, GPU memory occupancy, and host memory occupancy of the end-side device in real time and exchanging data with the inference engine layer to dynamically adjust inference resource allocation.
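The asynchronous loading unit's overlap of attention computation with weight prefetching, and the Query transfer volume of claim 8, can be sketched as follows. The function names, the per-layer thread structure, and the fp16 element size are illustrative assumptions, not the patent's implementation:

```python
import threading
from queue import Queue

def query_transfer_bytes(batch, seq_len, hidden_dim, dtype_bytes=2):
    """Per-layer GPU-to-CPU traffic for the Query activation (claim 8):
    O = b * s * d elements, here assuming fp16 (2 bytes per element)."""
    return batch * seq_len * hidden_dim * dtype_bytes

def prefetch_layer(layer_id, loaded):
    # Stand-in for copying the next layer's MLP weights host -> GPU.
    loaded.put(layer_id)

def run_layers(num_layers):
    """Overlap 'attention compute' for layer i with the prefetch of the
    weights needed by layer i + 1, one background thread per layer."""
    loaded = Queue()
    computed = []
    for i in range(num_layers):
        t = threading.Thread(target=prefetch_layer, args=(i + 1, loaded))
        t.start()            # asynchronous weight prefetch starts
        computed.append(i)   # attention computation for layer i (on the CPU)
        t.join()             # prefetch done before layer i + 1 begins
    return computed, loaded.qsize()
```

For a decode step with batch 1, one new token, and hidden dimension 4096, only about 8 KB of Query activation crosses the GPU-CPU boundary per layer under the fp16 assumption, which is why delegating attention to the CPU is cheap in transfer terms.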
Optionally, in the draft model offloading unit of the inference engine layer, the draft model is a Mistral-type non-mixture-of-experts (Mixture of Experts, MoE) architecture model; during offloading, the LLM attention computation task is delegated to the CPU, the current layer's activation value (Query) is transmitted from the GPU to the CPU, and the result is transmitted back to the GPU after the computation completes. Optionally, the inference engine layer is further provided with a fine-grained component loading unit for prefetching the gating network and expert components in parallel via multithreading for each MoE layer of the target MoE-architecture model; each MoE layer of the target model comprises 8 expert components, so the multithreading launches 9 threads to load the 1 gating network and the 8 expert components respectively, an expert component being an independent model functional module that executes a specific inference computation task. Optionally, the inference engine layer is further provided with a Load-to-GPU Pinning unit, which is used for opening up a physically contiguous page-locked memory region in host memory.
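The fine-grained MoE component loading described above (1 gating network plus 8 experts loaded by 9 threads) can be sketched with Python's standard thread pool; `load_component` is a hypothetical stand-in for the actual weight-reading routine, which the patent does not spell out:

```python
from concurrent.futures import ThreadPoolExecutor

def load_component(name):
    # Stand-in for reading one component's weights from host memory
    # (e.g. from the page-locked buffer) into the GPU.
    return name

def load_moe_layer(num_experts=8):
    """Load 1 gating network plus `num_experts` expert components in
    parallel, one thread per component: 9 threads for the 8-expert case."""
    names = ["gate"] + [f"expert_{i}" for i in range(num_experts)]
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        # map preserves input order, so results line up with `names`
        # even though the loads complete concurrently.
        return list(pool.map(load_component, names))
```

Per claim 9, each component's computation can start as soon as that component finishes loading; a real implementation would dispatch the compute from a completion callback rather than waiting for the whole map.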