CN-121997983-A - Acceleration method of neural network model, PNM device and GPU device
Abstract
The disclosure provides an acceleration method for a neural network model, a PNM device, and a GPU device. The neural network model adopts a MoE architecture under which expert models are divided into hot expert models and cold expert models according to the number of times they are activated, the hot expert models being stored in an external device. The acceleration method is executed by the PNM device and comprises: performing a computation with a selected cold expert model among the cold expert models stored in the PNM device to obtain a cold expert model output, and transmitting the cold expert model output from the PNM device to the external device, the external device comprising a GPU device. The acceleration method according to at least one example embodiment of the present disclosure can reduce the amount of data transmitted to the GPU device, thereby alleviating the data transmission bottleneck.
Inventors
- WANG SHUYANG
- ZHANG YUQI
Assignees
- Samsung (China) Semiconductor Co., Ltd.
- Samsung Electronics Co., Ltd.
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2025-12-24
Claims (19)
- 1. An acceleration method of a neural network model, characterized in that the neural network model is a neural network model employing a MoE architecture under which an expert model is divided into a hot expert model and a cold expert model according to the number of times it is activated, the hot expert model being stored in an external device, the acceleration method being performed by a PNM device, the acceleration method comprising: performing a computation with a selected cold expert model among the cold expert models stored by the PNM device to obtain a cold expert model output; and transmitting the cold expert model output from the PNM device to the external device, the external device comprising a GPU device.
- 2. The acceleration method of claim 1, wherein transmitting the cold expert model output comprises transmitting the cold expert model output to the GPU device without transmitting model parameters of the cold expert model.
- 3. The acceleration method of claim 1, wherein the PNM device comprises a plurality of PNM devices, there is a plurality of cold expert models, and the plurality of cold expert models are stored separately on the plurality of PNM devices.
- 4. The acceleration method of claim 3, wherein expert models among the plurality of cold expert models that have a co-occurrence relationship with each other are stored on different PNM devices of the plurality of PNM devices.
- 5. The acceleration method of claim 1, further comprising: identifying at least one cold expert model stored in the PNM device as a hot expert model during computation of the neural network model; transmitting the at least one cold expert model to the GPU device; deleting the at least one cold expert model from the PNM device; and receiving, from the GPU device, at least one expert model dynamically determined to be a cold expert model during computation of the neural network model, wherein the at least one expert model is deleted from the GPU device after being provided to the PNM device.
- 6. The acceleration method of claim 1, wherein performing a computation with a selected cold expert model among the cold expert models stored by the PNM device to obtain a cold expert model output comprises: identifying dense portions in the input data of the cold expert model; and identifying important neurons in the cold expert model whose neuron importance measure exceeds a preset threshold, to obtain the cold expert model output, wherein the cold expert model participating in the computation is determined from cold expert selection information received from the external device, the cold expert selection information being provided by a gating network in the MoE architecture, the gating network being deployed in the GPU device.
- 7. The acceleration method of claim 1, characterized in that the PNM device is a CMM-DC device, and the external device further comprises a host device.
- 8. A PNM device, characterized in that it comprises: a storage module configured to store a cold expert model of a neural network model based on a MoE architecture, wherein under the MoE architecture an expert model is divided into a hot expert model and a cold expert model according to the number of times it is activated, the hot expert model being stored in an external device; a computing module configured to perform a computation using a selected one of the cold expert models to obtain a cold expert model output; and an input-output module configured to transmit the cold expert model output to the external device, the external device comprising a GPU device.
- 9. An acceleration method of a neural network model, wherein the neural network model is a neural network model adopting a MoE architecture under which an expert model is divided into a hot expert model and a cold expert model according to the number of times it is activated, the acceleration method comprising: selecting a hot expert model through a gating network stored in a GPU device; performing a computation with the hot expert model to obtain a hot expert model output; and calculating an output of each transformer layer of the neural network model based on the hot expert model output, a cold expert model output received from an external device, the external device comprising a PNM device, and weights determined by the gating network.
- 10. The acceleration method of claim 9, further comprising: transmitting cold expert selection information to the PNM device, the cold expert selection information being provided by the gating network.
- 11. The acceleration method of claim 9, further comprising: identifying at least one hot expert model stored in the GPU device as a cold expert model during computation of the neural network model; transmitting the at least one hot expert model to the PNM device; and receiving, from the PNM device, at least one expert model dynamically determined to be a hot expert model.
- 12. The acceleration method of claim 9, wherein the PNM device comprises a plurality of PNM devices and there is a plurality of cold expert models stored separately on the plurality of PNM devices, the acceleration method further comprising: identifying co-occurrence relationships reflecting that cold expert models of the plurality of cold expert models are activated together; obtaining a graph comprising a plurality of nodes and edges, the nodes of the graph representing expert models and the edges of the graph representing co-occurrence relationships; calculating, for each node of the plurality of nodes, a sum of the weights of all edges connected to the node; sorting the nodes by the calculated sums; assigning different colors according to the sorted order; and assigning different PNM devices to the plurality of cold expert models based on the different colors.
- 13. The acceleration method of claim 12, wherein assigning different colors according to the sorted order comprises: for a first node of the plurality of nodes, collecting a set of colors already used by neighboring nodes of the first node; and assigning to the first node an unused color that is not included in the set of colors.
- 14. The acceleration method of claim 13, further comprising: in response to no unused color being available, traversing all colors, calculating an impact factor for each color, and assigning to the first node the color with the smallest impact factor.
- 15. A GPU device, comprising: a storage module configured to store a hot expert model of a neural network model adopting a MoE architecture, wherein under the MoE architecture an expert model is divided into a hot expert model and a cold expert model according to the number of times it is activated; a gating network configured to select a hot expert model participating in a computation; and a calculation module configured to perform a computation with the selected hot expert model to obtain a hot expert model output, and to calculate an output of each transformer layer of the neural network model based on the hot expert model output, a cold expert model output received from an external device, the external device comprising a PNM device, and weights determined by the gating network for the cold expert model output and the hot expert model output.
- 16. A method of managing a neural network model, wherein the neural network model is a neural network model adopting a MoE architecture, the method being performed by a GPU device or a host device, the method comprising: determining usage of the expert models under the MoE architecture during training of the neural network model; and pre-dividing the expert models into cold expert models and hot expert models according to the usage, wherein the cold expert models are to be pre-stored in a PNM device and the hot expert models are to be pre-stored in the GPU device.
- 17. The method of managing a neural network model of claim 16, further comprising: counting activations of the expert models during inference computation of the neural network model; dynamically determining whether each of the expert models is a hot expert model or a cold expert model according to the activations; and dynamically adjusting the location of each expert model between the GPU device and the PNM device.
- 18. A computer system comprising a PNM device according to claim 8 and a GPU device according to claim 15.
- 19. A host device, comprising: a non-transitory computer-readable recording medium configured to store instructions; and a processor configured to execute the instructions to perform: identifying co-occurrence relationships reflecting that cold expert models of a plurality of cold expert models are activated together; obtaining a graph comprising a plurality of nodes and edges, the nodes of the graph representing expert models and the edges of the graph representing co-occurrence relationships; calculating, for each node of the plurality of nodes, a sum of the weights of all edges connected to the node; sorting the nodes by the calculated sums; assigning different colors according to the sorted order; and assigning different PNM devices to the plurality of cold expert models based on the different colors.
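The sparsity-aware computation recited in claim 6 can be illustrated with a minimal sketch: the cold expert restricts its feed-forward computation to the dense portions of its input and to neurons whose importance measure exceeds a preset threshold, so the PNM device returns only the small output tensor. The function names, dimensions, ReLU activation, and the assumption that a per-neuron importance measure is supplied are all illustrative, not the patent's definitive implementation.

```python
import numpy as np

def cold_expert_forward(x, w_in, w_out, importance, threshold=0.5, density_eps=1e-3):
    """Sketch of claim 6: compute a cold expert output on the PNM device
    using only dense input rows and important neurons.

    x          : (tokens, d_model) input routed to this cold expert
    w_in       : (d_model, d_ff)   first feed-forward weight matrix
    w_out      : (d_ff, d_model)   second feed-forward weight matrix
    importance : (d_ff,)           per-neuron importance measure (assumed given)
    """
    # Identify dense portions of the input: rows with non-negligible magnitude.
    dense_rows = np.abs(x).max(axis=1) > density_eps

    # Identify important neurons whose importance measure exceeds the threshold.
    keep = importance > threshold

    # Compute only over the dense rows and the kept neurons.
    h = np.maximum(x[dense_rows] @ w_in[:, keep], 0.0)  # ReLU on the sliced FFN
    y = np.zeros_like(x)
    y[dense_rows] = h @ w_out[keep, :]

    # Only this (tokens x d_model) output, not the expert's parameters,
    # is transmitted back to the GPU device (claim 2).
    return y
```

Claims 9 and 15 form each transformer layer's output as a gate-weighted combination of expert outputs, where hot expert outputs are computed on the GPU device and cold expert outputs arrive from the PNM device. A minimal sketch follows; top-k routing and softmax renormalization over the selected experts are common MoE conventions assumed here, since the claims say only that the gating network determines the weights.

```python
import numpy as np

def moe_layer_output(gate_logits, expert_outputs, top_k=2):
    """Sketch of claims 9/15: a transformer layer's output is the
    gate-weighted sum of the selected expert outputs, regardless of whether
    each output was computed locally on the GPU (hot expert) or received
    over the link from a PNM device (cold expert).

    gate_logits    : (tokens, n_experts) gating-network scores
    expert_outputs : (n_experts, tokens, d_model) expert outputs
    """
    tokens, n_experts = gate_logits.shape
    selected = np.argsort(gate_logits, axis=1)[:, -top_k:]  # top-k experts per token

    out = np.zeros_like(expert_outputs[0])
    for t in range(tokens):
        logits = gate_logits[t, selected[t]]
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()                 # softmax over the selected experts
        for w, e in zip(weights, selected[t]):
            out[t] += w * expert_outputs[e, t]   # weighted combination
    return out
```

The placement procedure of claims 12 to 14 (and of the host device of claim 19) is a weighted greedy graph coloring: nodes are cold expert models, edge weights count co-occurrences, nodes are visited in descending order of total incident edge weight, and each color maps to a PNM device. A minimal sketch follows; the claims do not define the "impact factor", so it is assumed here to be the summed co-occurrence weight between the node and its same-colored neighbors.

```python
from collections import defaultdict

def color_cold_experts(edges, n_devices):
    """Sketch of claims 12-14/19: assign cold expert models to PNM devices.

    edges     : {(expert_a, expert_b): co_occurrence_weight}
    n_devices : number of PNM devices, i.e. the number of available colors
    """
    # Build adjacency and the sum of all edge weights incident to each node.
    adj = defaultdict(dict)
    weight_sum = defaultdict(float)
    for (a, b), w in edges.items():
        adj[a][b] = adj[b][a] = w
        weight_sum[a] += w
        weight_sum[b] += w

    # Sort nodes by total incident weight, heaviest first (claim 12).
    order = sorted(weight_sum, key=weight_sum.get, reverse=True)

    color = {}
    for node in order:
        # Collect the colors already used by neighboring nodes (claim 13).
        used = {color[n] for n in adj[node] if n in color}
        free = [c for c in range(n_devices) if c not in used]
        if free:
            color[node] = free[0]  # an unused color not in the set
        else:
            # No unused color: traverse all colors and pick the one with the
            # smallest impact factor (claim 14); assumed here to be the summed
            # co-occurrence weight with neighbors already holding that color.
            impact = {c: sum(w for n, w in adj[node].items()
                             if color.get(n) == c) for c in range(n_devices)}
            color[node] = min(impact, key=impact.get)
    return color  # expert -> PNM device index
```

Because co-occurring experts receive different colors whenever possible, experts that tend to be activated together land on different PNM devices (claim 4), which allows their computations to proceed in parallel.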
Description
Acceleration method of neural network model, PNM device and GPU device

Technical Field

The present disclosure relates to the field of artificial intelligence acceleration, and more particularly, to a neural network model acceleration method, a PNM device, and a GPU device.

Background

In recent years, the parameters (including weights and biases) of large-scale neural networks (e.g., large language models (LLMs)) have grown to the billions. However, these densely activated models require a large amount of computation. Large-scale neural networks such as LLMs may employ a Mixture of Experts (MoE) architecture in which expert models are stored in the memory of a Graphics Processing Unit (GPU). Alternatively, the expert models under the MoE architecture may be offloaded to a storage device on the host side (or host device), such as Dynamic Random Access Memory (DRAM) or disk, to provide more storage space for the model.

However, when the expert models under the MoE architecture are stored in the video memory of the GPU, their parameters occupy a large part of the storage space of the entire neural network model, and it is difficult for a single GPU to provide sufficient storage space for the neural network model. If, instead, the expert models are stored in a storage device on the host side, more storage space is available, but during inference the GPU can perform the next computation only after the parameters of the activated expert models have been loaded from the host-side DRAM to the GPU. Transferring this large number of parameters from the DRAM to the GPU significantly increases the access overhead. This scheme therefore has two defects: first, because the selected expert models have too many parameters, a large amount of data must be transmitted, creating a data transmission bottleneck; second, the GPU sits idle while waiting for the expert model data, wasting GPU compute and lowering the inference efficiency of the large-scale neural network. The above information is provided merely as background information and is not meant to constitute prior art to the present disclosure.

Disclosure of Invention

It is an object of the present disclosure to provide an acceleration method of a neural network model that can reduce the cost per unit of storage space for expert models. It is an object of the present disclosure to provide an acceleration method of a neural network model that can reduce the amount of data transmitted to a GPU device, thereby alleviating the data transmission bottleneck. It is an object of the present disclosure to provide an acceleration method of a neural network model that can reduce the overall inference time of the neural network model.
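To gauge the scale of the bottleneck described above, compare the bytes that cross the link per activated expert: host-side offloading must transfer the expert's parameters, whereas the approach of this disclosure transfers only the expert's output. A back-of-the-envelope sketch in Python; the dimensions, token count, and FP16 precision are hypothetical, not taken from the patent.

```python
# Hypothetical MoE expert: a two-matrix feed-forward block, FP16 parameters.
d_model, d_ff = 4096, 14336   # illustrative LLM-scale dimensions
tokens = 256                  # tokens routed to this expert in one step
bytes_fp16 = 2

# Host-offloading scheme: the activated expert's parameters cross the link.
param_bytes = 2 * d_model * d_ff * bytes_fp16    # w_in + w_out

# PNM scheme (this disclosure): only the expert's output crosses the link.
output_bytes = tokens * d_model * bytes_fp16

print(f"parameters: {param_bytes / 2**20:.0f} MiB per expert")    # 224 MiB
print(f"output:     {output_bytes / 2**20:.2f} MiB per step")     # 2.00 MiB
print(f"reduction:  ~{param_bytes / output_bytes:.0f}x less data transferred")  # ~112x
```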
According to a first aspect of the present disclosure, there is provided an acceleration method of a neural network model, the neural network model employing a MoE architecture under which an expert model is divided into a hot expert model and a cold expert model according to the number of times it is activated, the hot expert model being stored in an external device, the acceleration method being performed by a PNM device, the acceleration method including: performing a computation using a selected cold expert model among the cold expert models stored by the PNM device to obtain a cold expert model output; and transmitting the cold expert model output from the PNM device to the external device, the external device including a GPU device. Alternatively, the step of transmitting the cold expert model output may include transmitting the cold expert model output to the GPU device without transmitting model parameters of the cold expert model. Alternatively, the PNM device may comprise a plurality of PNM devices, there may be a plurality of cold expert models, and the plurality of cold expert models may be stored separately on the plurality of PNM devices. Alternatively, expert models among the plurality of cold expert models that have a co-occurrence relationship with each other may be stored on different PNM devices of the plurality of PNM devices. Optionally, the acceleration method may further include: identifying at least one cold expert model stored in the PNM device as a hot expert model during computation of the neural network model; transmitting the at least one cold expert model to the GPU device; deleting the at least one cold expert model from the PNM device; and receiving, from the GPU device, at least one expert model dynamically determined to be a cold expert model during computation of the neural network model, wherein the at least one expert model is deleted from the GPU device after being provided to the PNM device. Optionally, the step of computing with a selected one of the cold expert models stored by the PNM