
CN-122019197-A - Large-model mobile-terminal inference method supporting MoE slicing and video memory virtualization

CN122019197A

Abstract

The invention discloses a large-model mobile-terminal inference method supporting MoE slicing and video memory virtualization, belonging to the field of artificial intelligence computation. The method is realized through the cooperation of a dynamic MoE slice scheduler and a unified video memory-memory virtualization manager. The scheduler evaluates hardware and model states along multiple dimensions, dynamically determines which experts to activate and how to load them, and cuts secondary experts when memory is insufficient; the manager maintains a hot/warm/cold hierarchical storage pool and performs data prefetching and replacement, while expert slices on different compute cores communicate with topology awareness over the on-chip high-speed bus. The invention realizes adaptive scheduling of video memory, degrades smoothly without accuracy loss, reduces inference latency and power consumption, improves resource utilization, flexibly adjusts the storage structure and the objects of slicing, adapts to various mobile hardware, and is suitable for large-model inference scenarios on mobile terminals.
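As a rough illustration of the loading-mode decision summarized above (complete loading, slice loading, or loading more experts when the device is charging and idle), the following Python sketch uses hypothetical thresholds and function names that are not given in the patent:

```python
def choose_loading_mode(free_vram_mb: float, expert_mb: float,
                        cpu_load: float, on_charger: bool, idle: bool) -> str:
    """Pick a loading mode for one expert layer.

    Hypothetical decision rule; the patent names the inputs (video memory,
    CPU load, power state) but gives no concrete thresholds.
    """
    if on_charger and idle:
        return "full"       # charged and idle: load more experts for full-pattern inference
    if free_vram_mb >= expert_mb and cpu_load < 0.8:
        return "complete"   # the whole expert layer fits comfortably: complete loading
    return "slice"          # otherwise fall back to slice loading

print(choose_loading_mode(500, 120, 0.3, on_charger=False, idle=False))  # complete
print(choose_loading_mode(100, 120, 0.3, on_charger=False, idle=False))  # slice
```

The point of the sketch is only that the decision is multi-dimensional: memory headroom alone does not determine the mode.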

Inventors

  • GUO SHA
  • CEN JIE

Assignees

  • 深圳行胜数字技术有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-04-14

Claims (10)

  1. A large-model mobile-terminal inference method supporting MoE slicing and video memory virtualization, characterized by comprising the following steps: executing an inference process through a dynamic MoE slice scheduler and a unified video memory-memory virtualization manager; before inference, the dynamic MoE slice scheduler evaluates the current video memory, the model weight loading state, and the input characteristics in real time, and dynamically determines the experts activated in forward propagation and the loading mode of each expert layer, wherein the loading mode comprises complete loading or slice loading; the unified video memory-memory virtualization manager maintains a hierarchical storage pool comprising a hot data area, a warm data area, and a cold data area, wherein the hot data area is GPU video memory and stores the currently active slices, the warm data area is high-speed unified memory, and the cold data area is system memory/storage and stores all model weights; and executing prefetching and swapping operations on data in the hierarchical storage pool according to the decisions of the dynamic MoE slice scheduler.
  2. The method of claim 1, wherein when the activated expert slices are distributed among different compute cores within the system-on-chip, the different compute cores are scheduled to exchange data over the on-chip high-speed bus.
  3. The method of claim 1, wherein activation of important experts is preferentially guaranteed by an importance scoring mechanism, the importance score being based on historical activation frequency or gradient information.
  4. The method of claim 1, wherein the dynamic MoE slice scheduler, after making a slice loading plan, determines whether the memory required by the plan is less than or equal to the currently available memory, and if not, cuts secondary experts to optimize the plan until the memory required by the plan meets the current memory conditions.
  5. The method of claim 1, wherein the decision basis for dynamic slicing further comprises CPU load, power, and heat dissipation, forming a multi-dimensional decision model.
  6. The method of claim 1, wherein the objects of slice loading further comprise Transformer blocks and FFN layers.
  7. The method of claim 1, wherein the hierarchical storage pool is a three-level GPU video memory-unified memory-flash storage structure.
  8. The method of claim 1, wherein the hierarchical storage pool is a two-level video memory-memory storage structure.
  9. The method of claim 1, wherein more experts are loaded to perform full-pattern inference when the mobile device is charging and idle.
  10. The method of claim 1, wherein the large model is an MoE model with 320 billion parameters.
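Claims 3 and 4 together describe a planning loop: score experts by importance (historical activation frequency or gradient information), then cut the lowest-scoring secondary experts until the plan fits the currently available memory. A minimal Python sketch, with hypothetical names and a frequency-only score:

```python
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    size_mb: float   # memory footprint of the expert's weights
    hit_count: int   # historical activation frequency (claim 3)

def importance(e: Expert) -> float:
    # Importance score based on historical activation frequency;
    # claim 3 also allows gradient-based scores.
    return e.hit_count

def plan_experts(candidates: list[Expert], free_mb: float) -> list[Expert]:
    """Cut secondary (lowest-importance) experts until the plan fits (claim 4)."""
    plan = sorted(candidates, key=importance, reverse=True)
    while plan and sum(e.size_mb for e in plan) > free_mb:
        plan.pop()  # drop the currently least important expert
    return plan

experts = [Expert("e0", 120, 40), Expert("e1", 120, 5), Expert("e2", 120, 22)]
print([e.name for e in plan_experts(experts, free_mb=260)])  # ['e0', 'e2']
```

Here three 120 MB experts cannot fit in 260 MB, so the lowest-frequency expert `e1` is cut and the remaining two are loaded.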

Description

Large-model mobile-terminal inference method supporting MoE slicing and video memory virtualization

Technical Field

The invention relates to the technical field of artificial intelligence computation, and in particular to a large-model mobile-terminal inference method supporting MoE slicing and video memory virtualization.

Background

Deploying a Large Language Model (LLM) on a mobile device is a frontier challenge of edge AI. The Mixture-of-Experts (MoE) model is a key architecture for reducing inference cost, because only part of the expert network is activated for computation even though the total parameter count is huge. To deploy such models in resource-constrained environments, the industry has explored several technologies: model compression, such as low-bit quantization of expert weights; computation offloading, which distributes part of the computation to an edge server or the cloud; and caching and prefetching, which exploit the device's storage hierarchy (for example, DRAM as a cache for flash) to predict and preload expert weights that may be needed. Recent research, such as EdgeMoE, has proposed MoE inference engines based on storage-hierarchy partitioning.

Although the MoE model itself has sparse activation characteristics, real-time inference on mobile devices is still subject to severe hardware constraints, and existing schemes have obvious shortcomings. First, there is a fundamental contradiction between dynamic loading and fixed resources: the expert activation pattern of an MoE model is dynamic and input-dependent, while the limited, fixed memory capacity of a mobile device cannot accommodate all expert weights. Existing cache policies (such as LRU) rely on the temporal locality of expert access, but studies have shown that expert selection in advanced MoE models lacks such locality, resulting in very low cache hit rates; the frequent I/O swaps that follow cause severe delays.
Second, coarse-grained resource management leads to inefficiency, because existing methods often compress or offload the model as a whole. Simple model-partitioning strategies have difficulty dealing with dynamic expert activation patterns, and computation-offloading schemes are strongly affected by network delay and bandwidth fluctuation, so real-time performance is hard to guarantee. Such coarse-grained management cannot fine-tune the model's computational granularity at run time based on the memory actually available. Third, existing optimization strategies are laggy and passive: whether caching based on historical access or offloading based on simple rules, they respond reactively. They cannot, before inference starts, proactively plan the optimal computation-graph execution and data-scheduling strategy according to the exact resource conditions of the current system (such as video memory, memory bandwidth, and the load on the SoC's compute cores) and the model's characteristics, so resource utilization and inference speed fall short of the optimum.

Disclosure of Invention

The invention aims to overcome the limitation that the physical video memory of mobile equipment places on the deployment of large MoE models, and to realize smooth, video-memory-adaptive inference. Through dynamic MoE slicing and video memory-memory hierarchical scheduling, combined with topology-aware communication optimization within the chip, the method enables the model to dynamically adjust the granularity of activated data according to the real-time video memory, reducing end-to-end inference latency while preserving model accuracy.
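The hot/warm/cold hierarchical scheduling referred to above can be pictured as a small tiered pool with scheduler-driven prefetch (cold to warm) and promotion with swap-out (warm to hot). The class below is an illustrative sketch only; the tier capacities, slice names, and FIFO-style eviction are assumptions, not details from the patent:

```python
from collections import OrderedDict

class TieredPool:
    """Hot (GPU video memory) / warm (unified memory) / cold (storage) pool sketch.

    Cold storage is implicit: anything not tracked here is assumed to
    remain in system memory/storage, which holds all model weights.
    """
    def __init__(self, hot_cap_mb: int, warm_cap_mb: int):
        self.hot = OrderedDict()   # slice_id -> size_mb; currently active slices
        self.warm = OrderedDict()  # slice_id -> size_mb; prefetched slices
        self.hot_cap, self.warm_cap = hot_cap_mb, warm_cap_mb

    def prefetch(self, slice_id: str, size_mb: int) -> None:
        # Scheduler-driven prefetch: cold -> warm, evicting oldest warm slices.
        while self.warm and sum(self.warm.values()) + size_mb > self.warm_cap:
            self.warm.popitem(last=False)   # dropped slice falls back to cold
        self.warm[slice_id] = size_mb

    def activate(self, slice_id: str, size_mb: int) -> None:
        # Promote warm -> hot for the forward pass, swapping old hot slices out.
        self.warm.pop(slice_id, None)
        while self.hot and sum(self.hot.values()) + size_mb > self.hot_cap:
            sid, ssize = self.hot.popitem(last=False)
            self.warm[sid] = ssize          # demote oldest hot slice to warm
        self.hot[slice_id] = size_mb

pool = TieredPool(hot_cap_mb=256, warm_cap_mb=512)
pool.prefetch("expert0.slice0", 128)
pool.activate("expert0.slice0", 128)   # slice promoted into the hot tier
```

A production manager would replace the FIFO eviction with the scheduler's importance-driven decisions; the sketch only shows the data movement between tiers.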
To solve the above problems, the present invention provides a large-model mobile-terminal inference method supporting MoE slicing and video memory virtualization, comprising: executing an inference process through a dynamic MoE slice scheduler and a unified video memory-memory virtualization manager; before inference, the dynamic MoE slice scheduler evaluates the current video memory, the model weight loading state, and the input characteristics in real time, and dynamically determines the experts activated in forward propagation and the loading mode of each expert layer, wherein the loading mode comprises complete loading or slice loading; the unified video memory-memory virtualization manager maintains a hierarchical storage pool comprising a hot data area, a warm data area, and a cold data area, wherein the hot data area is GPU video memory and stores the currently active slices, the warm data area is high-speed unified memory, and the cold data area is system memory/storage and stores all model weights; and executing prefetching and swapping operations on data in the hierarchical storage pool according to the decisions of the dynamic MoE slice scheduler. Preferably, when the activated expert slices are distributed among different compute cores within the system-on-chip, the different compute cores are scheduled to exchange data over the on-chip high-speed bus.