CN-121979654-A - Communication and data processing methods, apparatus, devices, media and products

Abstract

The disclosure provides a communication and data processing method, apparatus, device, medium and product, and relates to the technical field of artificial intelligence, in particular to the technical fields of cloud computing, large models, computing power and the like. The communication method is applied to a DPU and comprises: receiving a target expert index sent by a current GPU, wherein the target expert index is determined by the current GPU based on a token sequence, and hidden state data of all tokens in the token sequence are stored in a video memory of the current GPU; executing a communication calculation process based on the target expert index to obtain communication metadata, wherein the communication metadata comprises target storage information and information of a cross-node GPU; and sending the hidden state data of the cross-node tokens stored in the video memory to the cross-node GPU based on the communication metadata. The present disclosure may offload communication computation from the GPU to the DPU, thereby balancing resource utilization across the hardware and improving overall throughput.

Inventors

SHAN QIANG
WANG WENBO

Assignees

北京百度网讯科技有限公司

Dates

Publication Date: 20260505
Application Date: 20251209

Claims (16)

1. A communication method applied to a DPU, the method comprising: receiving a target expert index sent by a current GPU, wherein the target expert index is determined by the current GPU based on a token sequence, and hidden state data of all tokens in the token sequence are stored in a video memory of the current GPU; executing a communication calculation process based on the target expert index to obtain communication metadata, wherein the communication metadata comprises target storage information and information of a cross-node GPU, the target storage information is storage information of hidden state data of a cross-node token in the video memory, and the target expert of the cross-node token comprises a cross-node expert which is deployed on the cross-node GPU; and based on the communication metadata, sending the hidden state data of the cross-node token stored in the video memory to the cross-node GPU.
2. The method of claim 1, wherein the sending the hidden state data of the cross-node token stored in the video memory to the cross-node GPU based on the communication metadata comprises: creating a task unit, wherein the task unit comprises the target storage information and the information of the cross-node GPU; and in response to the task unit meeting a preset condition, sending a trigger instruction to a network card, so that the network card obtains the task unit based on the trigger instruction, obtains the hidden state data of the cross-node token from the video memory according to the task unit, and sends the hidden state data to the cross-node GPU.
3. The method of claim 2, wherein the current GPU, the DPU and the network card are integrated on the same chip; the target expert index is sent to the DPU by the current GPU through a first on-chip interconnection link; the trigger instruction is sent to the network card by the DPU through a second on-chip interconnection link; and the hidden state data of the target token is obtained from the video memory of the current GPU by the network card through a third on-chip interconnection link.
4. The method of claim 2, further comprising: writing the task unit into the video memory so that the network card reads the task unit from the video memory.
5. The method of claim 2, wherein the network card is an RNIC; and the hidden state data of the cross-node token is sent to the cross-node GPU by the RNIC based on RDMA communication.
6. The method according to any one of claims 1 to 5, wherein the DPU includes a plurality of processing units; and the performing a communication calculation process based on the target expert index includes: dividing the target expert index into a plurality of groups; and using the processing units to execute the communication calculation process on the respective groups of the target expert index in parallel.
7. A data processing method applied to a current GPU, the method comprising: receiving a token sequence, wherein the token sequence comprises hidden state data of a plurality of tokens; storing the hidden state data of all the tokens in the token sequence in a video memory; determining a target expert index corresponding to the token sequence; and sending the target expert index to a DPU, so that the DPU executes a communication calculation process based on the target expert index and sends the hidden state data of a cross-node token stored in the video memory to a cross-node GPU according to communication metadata; wherein the target expert of the cross-node token comprises a cross-node expert, and the cross-node expert is deployed on the cross-node GPU.
8. The method of claim 7, further comprising: determining a local token in the token sequence, wherein a target expert corresponding to the local token comprises a local expert, and the local expert is deployed on the current GPU; and carrying out expert calculation on the local token by adopting the local expert.
9. The method of claim 7, further comprising: determining a node token in the token sequence, wherein a target expert corresponding to the node token comprises a node expert, the node expert is deployed on a node GPU, and the node GPU and the current GPU are located in the same node; and sending the hidden state data of the node token to the node GPU through an intra-node communication bus, so that the node GPU adopts the node expert to carry out expert calculation on the node token.
10. A communication device for use in a DPU, the device comprising: a receiving module, used for receiving a target expert index sent by a current GPU, wherein the target expert index is determined by the current GPU based on a token sequence, and hidden state data of all tokens in the token sequence are stored in a video memory of the current GPU; a calculating module, used for executing a communication calculation process according to the target expert index to obtain communication metadata, wherein the communication metadata comprises target storage information and information of a cross-node GPU, the target storage information is storage information of hidden state data of a cross-node token in the video memory, and the target expert of the cross-node token comprises a cross-node expert which is deployed on the cross-node GPU; and a communication module, used for sending the hidden state data of the cross-node token stored in the video memory to the cross-node GPU according to the communication metadata.
11. A data processing apparatus for use with a current GPU, the apparatus comprising: a receiving module, used for receiving a token sequence, wherein the token sequence comprises hidden state data of a plurality of tokens; a storage module, used for storing the hidden state data of all the tokens in the token sequence in a video memory; a determining module, used for determining a target expert index corresponding to the token sequence; and a sending module, used for sending the target expert index to a DPU, so that the DPU executes a communication calculation process based on the target expert index and sends the hidden state data of a cross-node token stored in the video memory to a cross-node GPU according to communication metadata; wherein the target expert of the cross-node token comprises a cross-node expert, and the cross-node expert is deployed on the cross-node GPU.
12. An integrated chip, comprising: a current GPU, used for storing hidden state data of all the tokens in a received token sequence in a video memory, and determining a target expert index corresponding to the token sequence; a DPU, used for executing a communication calculation process according to the target expert index, creating a task unit, and sending a trigger instruction to a network card when the task unit meets a preset condition, wherein the task unit comprises information of a cross-node GPU and target storage information, the target storage information is storage information of hidden state data of a cross-node token in the video memory, the target expert of the cross-node token comprises a cross-node expert, and the cross-node expert is deployed on the cross-node GPU; and the network card, used for responding to the trigger instruction, acquiring the task unit, acquiring the hidden state data of the target token from the video memory based on the target storage information in the task unit, and transmitting the hidden state data of the target token to the cross-node GPU based on the information of the cross-node GPU in the task unit.
13. The chip of claim 12, wherein the target expert index is sent to the DPU by the current GPU through a first on-chip interconnection link; the trigger instruction is sent to the network card by the DPU through a second on-chip interconnection link; and the hidden state data of the target token is obtained from the video memory of the current GPU by the network card through a third on-chip interconnection link.
14. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
15. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
16. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-9.
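
For illustration only, the task-unit handoff recited in claims 2 and 12 might be sketched as below. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `TaskUnit` and `TaskSubmitter` are hypothetical, the "preset condition" is assumed to be a simple batch-size threshold (the claims do not say what the condition is), and the network card is stood in for by a Python callback rather than a real RNIC doorbell.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# One transfer: (byte offset in video memory, length in bytes, destination GPU).
Entry = Tuple[int, int, int]


@dataclass
class TaskUnit:
    """Hypothetical task unit: target storage info plus cross-node GPU info."""
    entries: List[Entry] = field(default_factory=list)


class TaskSubmitter:
    """DPU-side helper that batches transfers and triggers the network card."""

    def __init__(self, trigger: Callable[[TaskUnit], None], batch_size: int):
        self.trigger = trigger        # stand-in for the trigger instruction
        self.batch_size = batch_size  # ASSUMED preset condition: entry count
        self.pending = TaskUnit()

    def add(self, offset: int, length: int, dest_gpu: int) -> None:
        self.pending.entries.append((offset, length, dest_gpu))
        if len(self.pending.entries) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Hand the pending task unit to the network card, if it is non-empty."""
        if self.pending.entries:
            unit, self.pending = self.pending, TaskUnit()
            self.trigger(unit)
```

In this sketch the task unit is passed directly to the callback; under claim 4 it would instead be written into the video memory for the network card to read.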

Description

Communication and data processing methods, apparatus, devices, media and products

Technical Field

The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of cloud computing, large models, computing power and the like, and particularly relates to a communication and data processing method, apparatus, device, medium and product.

Background

To improve the performance of large language models (LLMs), a mixture of experts (MoE) network may be introduced into an LLM.

Disclosure of Invention

The present disclosure provides a communication and data processing method, apparatus, device, medium and product.

According to one aspect of the disclosure, a communication method is provided and applied to a DPU. The method comprises: receiving a target expert index sent by a current GPU, wherein the target expert index is determined by the current GPU based on a token sequence, and hidden state data of all tokens in the token sequence are stored in a video memory of the current GPU; executing a communication calculation process based on the target expert index to obtain communication metadata, wherein the communication metadata comprises target storage information and information of a cross-node GPU, the target storage information is storage information of hidden state data of the cross-node tokens in the video memory, the target expert of the cross-node tokens comprises a cross-node expert, and the cross-node expert is deployed on the cross-node GPU; and sending the hidden state data of the cross-node tokens stored in the video memory to the cross-node GPU based on the communication metadata.
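
As an illustration of the communication calculation process in this aspect, the following minimal sketch derives communication metadata from a target expert index. Everything beyond the terms used above is an assumption made for the example: the names `CommMetadata` and `compute_comm_metadata`, the expert-to-GPU and GPU-to-node lookup tables, and in particular the layout in which hidden states are stored contiguously in token order, so that the target storage information reduces to a byte offset and a length.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class CommMetadata:
    token_id: int  # position of the token in the token sequence
    offset: int    # target storage info: byte offset in video memory
    length: int    # target storage info: size of the hidden state in bytes
    dest_gpu: int  # cross-node GPU on which the target expert is deployed


def compute_comm_metadata(
    expert_index: Sequence[Sequence[int]],  # target experts per token
    expert_to_gpu: Dict[int, int],          # hypothetical placement table
    gpu_to_node: Dict[int, int],            # hypothetical topology table
    current_gpu: int,
    hidden_bytes: int,
    base_offset: int = 0,
) -> List[CommMetadata]:
    """Return one entry per (cross-node token, destination GPU) pair."""
    current_node = gpu_to_node[current_gpu]
    metadata: List[CommMetadata] = []
    for token_id, experts in enumerate(expert_index):
        seen_gpus = set()
        for expert in experts:
            gpu = expert_to_gpu[expert]
            # Skip experts on this node, and GPUs already scheduled for this
            # token, so each hidden state is sent at most once per GPU.
            if gpu_to_node[gpu] == current_node or gpu in seen_gpus:
                continue
            seen_gpus.add(gpu)
            # ASSUMPTION: contiguous storage in token order.
            offset = base_offset + token_id * hidden_bytes
            metadata.append(CommMetadata(token_id, offset, hidden_bytes, gpu))
    return metadata
```

With three GPUs, of which GPU 2 is on another node, a token routed only to experts on GPUs 0 and 1 yields no entry, while a token routed to two experts on GPU 2 yields a single entry.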
According to another aspect of the disclosure, a data processing method is provided and applied to a current GPU. The method comprises: receiving a token sequence, wherein the token sequence comprises hidden state data of a plurality of tokens; storing the hidden state data of all tokens in the token sequence in a video memory; determining a target expert index corresponding to the token sequence; and sending the target expert index to a DPU, so that the DPU executes a communication calculation process based on the target expert index and sends the hidden state data of a cross-node token stored in the video memory to a cross-node GPU according to communication metadata, wherein the target expert of the cross-node token comprises a cross-node expert, and the cross-node expert is deployed on the cross-node GPU.

According to another aspect of the disclosure, a communication device is provided and applied to a DPU. The device comprises a receiving module, a calculating module and a communication module. The receiving module is used for receiving a target expert index sent by a current GPU, wherein the target expert index is determined by the current GPU based on a token sequence, and hidden state data of all tokens in the token sequence are stored in a video memory of the current GPU. The calculating module is used for executing a communication calculation process according to the target expert index to obtain communication metadata, wherein the communication metadata comprises target storage information and information of a cross-node GPU, the target storage information is storage information of hidden state data of the cross-node token in the video memory, the target expert of the cross-node token comprises a cross-node expert, and the cross-node expert is deployed on the cross-node GPU. The communication module is used for sending the hidden state data of the cross-node token stored in the video memory to the cross-node GPU according to the communication metadata.
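
The GPU-side step of determining a target expert index, together with the local, node and cross-node token cases distinguished in claims 7 to 9, can be sketched as follows. This is a hedged example, not the disclosed method: the disclosure does not state how the index is derived, so top-k selection over router scores is assumed here as a common MoE choice, and the function names and lookup tables are hypothetical.

```python
from typing import Dict, List, Sequence


def top_k_experts(router_scores: Sequence[Sequence[float]], k: int) -> List[List[int]]:
    """ASSUMED routing rule: pick the k highest-scoring experts per token."""
    return [
        sorted(range(len(row)), key=lambda e: (-row[e], e))[:k]
        for row in router_scores
    ]


def classify_tokens(
    expert_index: Sequence[Sequence[int]],
    expert_to_gpu: Dict[int, int],  # hypothetical placement table
    gpu_to_node: Dict[int, int],    # hypothetical topology table
    current_gpu: int,
) -> Dict[str, List[int]]:
    """Group token ids by where their target experts are deployed.

    A token with several target experts may appear in more than one group.
    """
    current_node = gpu_to_node[current_gpu]
    groups: Dict[str, List[int]] = {"local": [], "node": [], "cross_node": []}
    for token_id, experts in enumerate(expert_index):
        kinds = set()
        for expert in experts:
            gpu = expert_to_gpu[expert]
            if gpu == current_gpu:
                kinds.add("local")       # expert calculation on the current GPU
            elif gpu_to_node[gpu] == current_node:
                kinds.add("node")        # sent over the intra-node bus
            else:
                kinds.add("cross_node")  # left to the DPU and network card
        for kind in kinds:
            groups[kind].append(token_id)
    return groups
```

Only the expert index, not the hidden state data, needs to reach the DPU in this flow; the hidden states stay in video memory until the network card reads them.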
According to another aspect of the disclosure, a data processing device is provided and applied to a current GPU. The device comprises a receiving module, a storage module, a determining module and a sending module. The receiving module is used for receiving a token sequence, wherein the token sequence comprises hidden state data of a plurality of tokens. The storage module is used for storing the hidden state data of all the tokens in the token sequence in a video memory. The determining module is used for determining a target expert index corresponding to the token sequence. The sending module is used for sending the target expert index to a DPU, so that the DPU executes a communication calculation process based on the target expert index and sends the hidden state data of a cross-node token stored in the video memory to a cross-node GPU according to communication metadata, wherein the target expert of the cross-node token comprises a cross-node expert which is deployed on the cross-node GPU.

According to another aspect of the disclosure, an integrated chip is provided, which comprises a current GPU, a DPU and a network card, wherein the current GPU is used for storing hidden state data of all the tokens in a received token sequence in a video memory and determining a target expert index corresponding to the token sequence, the DPU i