CN-121979666-A - CPU, decoding method based on Mixture-of-Experts network and related products
Abstract
The disclosure provides a CPU, a decoding method based on a Mixture-of-Experts (MoE) network, and related products, and relates to the technical field of artificial intelligence, in particular to the technical fields of cloud computing, large models, computing power, and the like. The CPU comprises a general-purpose computing function unit and an AI computing function unit, and works together with an AI chip. The general-purpose computing function unit receives a decoding request and sends it to the AI chip, so that the AI chip obtains attention processing results of historical tokens based on pre-filling result information contained in the decoding request and determines a target token among the historical tokens. The AI computing function unit receives the attention processing result of the target token, performs expert processing on it, and sends the expert processing result to the AI chip, so that the AI chip generates a current token based on the expert processing result. The present disclosure can offload Mixture-of-Experts processing from the GPU to the CPU, thereby balancing resource utilization across hardware and improving overall throughput.
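The offload pipeline summarized in the abstract can be sketched in Python. This is a hedged illustration only: the class names (`AIChip`, `HostCPU`), the message formats, and the choice of the most recent historical token as the target are all assumptions for exposition, not details from the patent.

```python
# Hypothetical sketch of the decode flow: the AI chip runs attention and
# token generation, while expert (MoE) processing is offloaded to the CPU.
# All names and message formats here are illustrative assumptions.

class AIChip:
    def attention(self, request):
        # Derive an attention result per historical token from the
        # pre-filling result information, and pick a target token
        # (here, simply the most recent one).
        results = {tok: f"attn({tok})" for tok in request["history"]}
        target = request["history"][-1]
        return results, target

    def generate(self, expert_result):
        # Generate the current token from the expert processing result.
        return f"token_from({expert_result})"

class HostCPU:
    def __init__(self, chip):
        self.chip = chip  # interconnect, e.g. PCIe or XLink in the patent

    def expert_process(self, attn_result):
        # Expert (MoE) processing performed by the CPU's AI compute unit.
        return f"expert({attn_result})"

    def decode(self, request):
        # 1. Forward the decoding request (with pre-filling info) to the chip.
        attn_results, target = self.chip.attention(request)
        # 2. Receive the target token's attention result and run the experts.
        expert_result = self.expert_process(attn_results[target])
        # 3. Return the expert result to the chip, which emits the current token.
        return self.chip.generate(expert_result)

cpu = HostCPU(AIChip())
current = cpu.decode({"history": ["t0", "t1", "t2"]})
print(current)
```

The round trip (chip → CPU → chip) mirrors the three messages the abstract describes: decoding request out, attention result back, expert result out.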
Inventors
- LIU YUEJI
- CHI ZHIGANG
- LIU XINGXING
- LI YU
- LIU JINGLIANG
Assignees
- Beijing Baidu Netcom Science and Technology Co., Ltd. (北京百度网讯科技有限公司)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2025-12-23
Claims (15)
- 1. A CPU, comprising: a general-purpose computing function unit configured to receive a decoding request and send the decoding request to an AI chip, wherein the decoding request comprises pre-filling result information of historical tokens, so that the AI chip obtains attention processing results of the historical tokens based on the pre-filling result information and determines a target token among the historical tokens; and an AI computing function unit configured to receive the attention processing result of the target token, perform expert processing on the attention processing result to obtain an expert processing result of the target token, and send the expert processing result to the AI chip, so that the AI chip generates a current token based on the expert processing result.
- 2. The CPU of claim 1, wherein the general-purpose computing function unit comprises: a general computing logic unit, a general computing memory, an IO control logic unit, and a general computing interconnection bus; and the general computing logic unit, the general computing memory, and the IO control logic unit are all connected to the general computing interconnection bus.
- 3. The CPU of claim 2, wherein the AI computing function unit comprises: an AI computing logic unit, an AI computing memory, and an AI computing interconnection bus; the AI computing logic unit and the AI computing memory are both connected to the AI computing interconnection bus, and the AI computing interconnection bus is connected to the general computing interconnection bus.
- 4. The CPU according to claim 3, wherein the attention processing result is sent by the AI chip to the AI computing logic unit through the IO control logic unit, the general computing interconnection bus, and the AI computing interconnection bus; a target expert network corresponding to the target token is pre-deployed in the AI computing memory; and the AI computing logic unit is specifically configured to perform expert processing on the attention processing result of the target token by using the target expert network, so as to obtain the expert processing result.
- 5. The CPU according to claim 2, wherein the decoding request is sent to the general computing logic unit through the IO control logic unit and the general computing interconnection bus; a scheduling rule is pre-recorded in the general computing memory; and the general computing logic unit is specifically configured to send the decoding request to the AI chip according to the scheduling rule.
- 6. The CPU of claim 1, wherein the CPU and the AI chip are interconnected through an XLink link, a PCIe link, or a network card link.
- 7. A decoding method based on a Mixture-of-Experts network, applied to a CPU, the method comprising: receiving a decoding request, wherein the decoding request comprises pre-filling result information of historical tokens; sending the decoding request to an AI chip, so that the AI chip obtains attention processing results of the historical tokens based on the pre-filling result information and determines a target token among the historical tokens; receiving the attention processing result of the target token, and performing expert processing on the attention processing result to obtain an expert processing result of the target token; and sending the expert processing result to the AI chip, so that the AI chip generates a current token based on the expert processing result.
- 8. The method of claim 7, wherein the decoding request is processed by a general-purpose computing function unit in the CPU; the expert processing is performed by an AI computing function unit in the CPU; and the general-purpose computing function unit and the AI computing function unit are independent of each other.
- 9. The method of claim 7, wherein the receiving the attention processing result of the target token and performing expert processing on the attention processing result to obtain an expert processing result of the target token comprises: receiving the attention processing result of the target token; and performing expert processing on the attention processing result by using a target expert network that is pre-deployed in the CPU and corresponds to the target token, so as to obtain the expert processing result of the target token.
- 10. A decoding method based on a Mixture-of-Experts network, applied to an AI chip, the method comprising: receiving a decoding request sent by a CPU, wherein the decoding request comprises pre-filling result information of historical tokens; obtaining attention processing results of the historical tokens based on the pre-filling result information, and determining a target token among the historical tokens; sending the attention processing result of the target token to the CPU, so that the CPU performs expert processing on the attention processing result to obtain an expert processing result of the target token; and receiving the expert processing result and generating a current token based on the expert processing result.
- 11. A decoding apparatus based on a Mixture-of-Experts network, applied to a CPU, the apparatus comprising: a receiving module configured to receive a decoding request, wherein the decoding request comprises pre-filling result information of historical tokens; a first sending module configured to send the decoding request to an AI chip, so that the AI chip obtains attention processing results of the historical tokens based on the pre-filling result information and determines a target token among the historical tokens; a processing module configured to receive the attention processing result of the target token and perform expert processing on the attention processing result to obtain an expert processing result of the target token; and a second sending module configured to send the expert processing result to the AI chip, so that the AI chip generates a current token based on the expert processing result.
- 12. A decoding apparatus based on a Mixture-of-Experts network, applied to an AI chip, the apparatus comprising: a receiving module configured to receive a decoding request sent by a CPU, wherein the decoding request comprises pre-filling result information of historical tokens; a processing module configured to obtain attention processing results of the historical tokens based on the pre-filling result information and determine a target token among the historical tokens; a sending module configured to send the attention processing result of the target token to the CPU, so that the CPU performs expert processing on the attention processing result to obtain an expert processing result of the target token; and a generation module configured to receive the expert processing result and generate a current token based on the expert processing result.
- 13. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
- 14. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-6.
- 15. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-6.
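Claims 4 and 9 describe applying a target expert network, pre-deployed in the CPU's AI computing memory, to the attention processing result of the target token. A minimal numerical sketch, assuming the expert is a two-layer feed-forward block with ReLU (a common form for MoE experts; the shapes and activation are illustrative assumptions, not specified by the patent):

```python
import numpy as np

# Hypothetical expert processing on the CPU side: a "pre-deployed" expert
# (its weights resident in the AI computing memory) transforms the target
# token's attention result. All sizes here are illustrative.

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16

# Weights of the target expert network, loaded ahead of decoding.
expert = {
    "w1": rng.standard_normal((d_model, d_ff)) * 0.1,
    "w2": rng.standard_normal((d_ff, d_model)) * 0.1,
}

def expert_process(attn_result, expert):
    # Two-layer feed-forward block with ReLU activation.
    hidden = np.maximum(attn_result @ expert["w1"], 0.0)
    return hidden @ expert["w2"]

attn_result = rng.standard_normal(d_model)
out = expert_process(attn_result, expert)
print(out.shape)  # (8,)
```

Because the weights are resident in CPU-side memory before decoding starts, no expert parameters need to cross the CPU-to-chip interconnect at decode time; only the small activation vectors do.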
Description
CPU, decoding method based on Mixture-of-Experts network and related products

Technical Field

The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of cloud computing, large models, computing power, and the like, and specifically relates to a CPU and a decoding method, apparatus, device, medium, and product based on a Mixture-of-Experts network.

Background

To improve the performance of large language models (LLMs), a Mixture-of-Experts (MoE) network may be introduced into an LLM.

Disclosure of Invention

The present disclosure provides a CPU and a Mixture-of-Experts-network-based decoding method, apparatus, device, medium, and product.

According to one aspect of the disclosure, there is provided a CPU, including a general-purpose computing function unit configured to receive a decoding request and send the decoding request to an AI chip, where the decoding request includes pre-filling result information of historical tokens, so that the AI chip obtains attention processing results of the historical tokens based on the pre-filling result information and determines a target token among the historical tokens; and an AI computing function unit configured to receive the attention processing result of the target token, perform expert processing on the attention processing result to obtain an expert processing result of the target token, and send the expert processing result to the AI chip, so that the AI chip generates a current token based on the expert processing result.
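The background refers to introducing Mixture-of-Experts networks into LLMs. As a hedged refresher on what such a layer does, a minimal top-k routing sketch (the gating softmax, expert count, and dimensions are all illustrative assumptions, not the patent's design):

```python
import numpy as np

# Minimal top-k MoE routing sketch: a gating network scores the experts for
# a token, the top-k experts process the token, and their outputs are
# combined weighted by the normalized gate scores.

rng = np.random.default_rng(2)
d_model, n_experts, top_k = 8, 4, 2

gate_w = rng.standard_normal((d_model, n_experts)) * 0.1
experts = [rng.standard_normal((d_model, d_model)) * 0.1
           for _ in range(n_experts)]

def moe_layer(x):
    scores = x @ gate_w                        # one score per expert
    top = np.argsort(scores)[-top_k:]          # indices of the top-k experts
    probs = np.exp(scores[top] - scores[top].max())
    probs /= probs.sum()                       # softmax over selected experts
    # Weighted combination of the selected experts' outputs.
    return sum(p * (x @ experts[i]) for p, i in zip(probs, top))

x = rng.standard_normal(d_model)
y = moe_layer(x)
print(y.shape)  # (8,)
```

Only k of the n experts run per token, which is what makes the expert weights large but the per-token compute sparse, and hence a candidate for offloading to the CPU as this disclosure proposes.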
According to another aspect of the disclosure, a decoding method based on a Mixture-of-Experts network is provided and applied to a CPU. The method comprises: receiving a decoding request, wherein the decoding request comprises pre-filling result information of historical tokens; sending the decoding request to an AI chip, so that the AI chip obtains attention processing results of the historical tokens based on the pre-filling result information and determines a target token among the historical tokens; receiving the attention processing result of the target token and performing expert processing on it to obtain an expert processing result of the target token; and sending the expert processing result to the AI chip, so that the AI chip generates a current token based on the expert processing result.

According to another aspect of the disclosure, a decoding method based on a Mixture-of-Experts network is provided and applied to an AI chip. The method comprises: receiving a decoding request sent by a CPU, wherein the decoding request comprises pre-filling result information of historical tokens; obtaining attention processing results of the historical tokens based on the pre-filling result information and determining a target token among the historical tokens; sending the attention processing result of the target token to the CPU, so that the CPU performs expert processing on the attention processing result to obtain an expert processing result of the target token; and receiving the expert processing result and generating a current token based on the expert processing result.
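The AI-chip-side method above ends by generating the current token from the expert processing result. One plausible realization of that final step, assuming a vocabulary projection followed by greedy selection (neither the output head nor the decoding strategy is specified by the patent):

```python
import numpy as np

# Hypothetical final step on the AI chip: project the expert processing
# result onto the vocabulary and greedily pick the current token.
# The head weights and greedy argmax are illustrative assumptions.

rng = np.random.default_rng(1)
d_model, vocab = 8, 32
lm_head = rng.standard_normal((d_model, vocab)) * 0.1

def generate_current_token(expert_result):
    logits = expert_result @ lm_head   # vocabulary logits
    return int(np.argmax(logits))      # greedy selection of the current token

expert_result = rng.standard_normal(d_model)
token_id = generate_current_token(expert_result)
print(token_id)
```

A sampling strategy (temperature, top-p) could replace the argmax without changing where the step runs: the projection stays on the AI chip, taking only the small expert-result vector from the CPU as input.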
According to another aspect of the disclosure, a decoding apparatus based on a Mixture-of-Experts network is provided and applied to a CPU. The apparatus comprises: a receiving module configured to receive a decoding request, wherein the decoding request comprises pre-filling result information of historical tokens; a first sending module configured to send the decoding request to an AI chip, so that the AI chip obtains attention processing results of the historical tokens based on the pre-filling result information and determines a target token among the historical tokens; a processing module configured to receive the attention processing result of the target token and perform expert processing on it to obtain an expert processing result of the target token; and a second sending module configured to send the expert processing result to the AI chip, so that the AI chip generates a current token based on the expert processing result.

According to another aspect of the disclosure, a decoding apparatus based on a Mixture-of-Experts network is provided and applied to an AI chip. The apparatus comprises: a receiving module configured to receive a decoding request sent by a CPU, wherein the decoding request comprises pre-filling result information of historical tokens; a processing module configured to obtain attention processing results of the historical tokens based on the pre-filling result information and determine a target token among the historical tokens; a sending module configured to send the attention processing result of the target token to the CPU, so that the CPU performs expert processing on the attention processing result to obtain an expert processing result of the target token; and a generation module configured to receive the expert processing result and generate a current token based on the expert processing result.