
CN-122021944-A - Model reasoning method, host computer, computer system, electronic device, and storage medium

CN122021944A

Abstract

Embodiments of the present disclosure provide a model reasoning method, a host, a computer system, an electronic device, and a storage medium. The method is performed by a first host in a computer system and comprises: in response to receiving a first token sent by a first computing device, determining at least one target expert sub-network corresponding to the first token from a plurality of expert sub-networks; distributing the first token to at least one target computing device on which the at least one target expert sub-network is deployed, so that the at least one target computing device performs computation on the first token using the target expert sub-network deployed thereon to generate an intermediate computation result, wherein the first computing device is managed by the first host; and generating an inference result corresponding to the first token based on the intermediate computation result generated by the at least one target computing device. The method can reduce the communication and GPU memory overhead of the computing devices, shorten end-to-end model inference latency, and improve system inference throughput.

Inventors

  • Request for anonymity
  • Request for anonymity
  • Request for anonymity
  • Request for anonymity
  • Request for anonymity

Assignees

  • 上海壁仞科技股份有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-04-13

Claims (16)

  1. A model reasoning method applied to a computer system comprising a plurality of computing devices, wherein a plurality of expert sub-networks are distributed across the plurality of computing devices, the method being performed by a first host in the computer system and comprising: in response to receiving a first token sent by a first computing device, determining at least one target expert sub-network corresponding to the first token from the plurality of expert sub-networks, and distributing the first token to at least one target computing device on which the at least one target expert sub-network is deployed, so that the at least one target computing device performs computation on the first token using the target expert sub-network deployed thereon to generate an intermediate computation result, wherein the first computing device is managed by the first host; and generating an inference result corresponding to the first token based on the intermediate computation result generated by the at least one target computing device.
  2. The model reasoning method of claim 1, wherein the determining at least one target expert sub-network corresponding to the first token from the plurality of expert sub-networks comprises: determining indexes and weights of at least one target expert sub-network corresponding to the first token from the plurality of expert sub-networks according to a routing rule.
  3. The model reasoning method of claim 1, wherein the first host is located at the same computing node as the first computing device, and the distributing the first token to at least one target computing device on which the at least one target expert sub-network is deployed comprises: in response to the at least one target computing device including a second computing device located at a different computing node from the first host, sending the first token to the target computing node where the second computing device is located, so that a second host in the target computing node for managing the second computing device obtains the first token for transmission to the second computing device; and in response to the at least one target computing device including a first computing device in the same computing node as the first host, adding the first token to a set of tokens to be processed for transmission to the first computing device.
  4. The model reasoning method of claim 3, wherein the sending the first token to the target computing node where the second computing device is located comprises: in response to there being a plurality of computing devices in the target computing node that require the first token, sending the first token only once to the target computing node.
  5. The model reasoning method of claim 3, wherein the distributing the first token to at least one target computing device on which the at least one target expert sub-network is deployed further comprises: sending the index and the weight of the at least one target expert sub-network corresponding to the first token to the target computing node, so that a second host in the target computing node for managing the second computing device obtains the index and the weight.
  6. The model reasoning method of claim 3, wherein the first host manages a plurality of the first computing devices, the method further comprising: in response to obtaining a second token from a remote computing node, adding the second token to the set of tokens to be processed; and dividing the set of tokens to be processed into a plurality of subsets according to the indexes of the target expert sub-networks corresponding to the tokens in the set, and respectively sending the subsets to the plurality of first computing devices managed by the first host.
  7. The model reasoning method of claim 6, wherein in the computer system, data on the computing device side is stored in a first data arrangement and data on the host side is stored in a second data arrangement; before determining at least one target expert sub-network corresponding to the first token from the plurality of expert sub-networks, the method further comprises: converting the data arrangement of the first token from the first data arrangement to the second data arrangement; and before respectively sending the plurality of subsets to the plurality of first computing devices managed by the first host, the method further comprises: converting the data arrangement of each token in the plurality of subsets from the second data arrangement to the first data arrangement.
  8. The model reasoning method of claim 1, wherein the first token sent by the first computing device is generated based on a result computed by the first computing device through an attention network.
  9. The model reasoning method of claim 1, wherein the generating the inference result corresponding to the first token based on the intermediate computation result generated by the at least one target computing device comprises: aggregating, based on the weight of the at least one target expert sub-network, the intermediate computation results generated by the at least one target computing device to obtain the inference result corresponding to the first token.
  10. The model reasoning method of claim 1, wherein the generating the inference result corresponding to the first token based on the intermediate computation result generated by the at least one target computing device is performed during processing of a token batch including the first token, the processing comprising: obtaining a plurality of intermediate computation results generated in the computing node where the first host is located, wherein each intermediate computation result carries token identification information; performing first aggregation processing on the plurality of intermediate computation results based on their token identification information to obtain at least one local aggregation result; respectively sending the at least one local aggregation result to a plurality of computing nodes in the computer system according to a token-to-computing-node mapping relationship; and obtaining, based on the obtained remote aggregation results and the local aggregation result, the inference results of the tokens corresponding to the computing devices managed by the first host, wherein the inference results of the tokens corresponding to the computing devices managed by the first host include the inference result corresponding to the first token.
  11. The model reasoning method of claim 10, wherein all of the tokens in the token batch are assigned global codes; the performing first aggregation processing on the plurality of intermediate computation results based on their token identification information to obtain at least one local aggregation result comprises: for each obtained intermediate computation result, based on the global code of the token corresponding to the intermediate computation result and the weight corresponding to the intermediate computation result, weighting and accumulating the intermediate computation result to the target position corresponding to the global code in a global batch tensor; the sending the at least one local aggregation result to a plurality of computing nodes in the computer system according to a token-to-computing-node mapping relationship comprises: based on the mapping relationship between the global codes and the plurality of computing nodes in the computer system, respectively sending a plurality of sub-batch tensors in the global batch tensor to the plurality of computing nodes for retrieval by the hosts in the plurality of computing nodes; and the obtaining the inference results of the tokens corresponding to the computing devices managed by the first host based on the obtained remote aggregation results and the local aggregation result comprises: in response to obtaining sub-batch tensors from remote computing nodes, performing second aggregation processing on the received sub-batch tensors and the local sub-batch tensor to obtain the inference results of the tokens corresponding to the computing devices managed by the first host.
  12. The model reasoning method of claim 10, wherein the obtaining a plurality of intermediate computation results generated in the computing node where the first host is located and performing first aggregation processing on the plurality of intermediate computation results based on their token identification information comprises: starting a plurality of worker threads to concurrently obtain intermediate computation results to be processed, and performing first aggregation processing on the intermediate computation results to be processed based on their token identification information.
  13. A first host residing in a computer system, the computer system comprising a plurality of computing devices across which a plurality of expert sub-networks are distributed, the first host comprising: a routing module configured to determine, in response to receiving a first token sent by a first computing device, at least one target expert sub-network corresponding to the first token from the plurality of expert sub-networks; a distribution module configured to distribute the first token to at least one target computing device on which the at least one target expert sub-network is deployed, so that the at least one target computing device performs computation on the first token using the target expert sub-network deployed thereon to generate an intermediate computation result, wherein the first computing device is managed by the first host; and a combination module configured to generate an inference result corresponding to the first token based on the intermediate computation result generated by the at least one target computing device.
  14. A computer system comprising a plurality of computing devices across which a plurality of expert sub-networks are distributed, a first host in the computer system being configured to: in response to receiving a first token sent by a first computing device, determine at least one target expert sub-network corresponding to the first token from the plurality of expert sub-networks, and distribute the first token to at least one target computing device on which the at least one target expert sub-network is deployed, so that the at least one target computing device performs computation on the first token using the target expert sub-network deployed thereon to generate an intermediate computation result, wherein the first computing device is managed by the first host; and generate an inference result corresponding to the first token based on the intermediate computation result generated by the at least one target computing device.
  15. An electronic device comprising: at least one processor; and at least one memory including one or more computer program modules, wherein the one or more computer program modules are stored in the at least one memory and configured to be executed by the at least one processor, the one or more computer program modules being for implementing the method of any of claims 1-12.
  16. A non-transitory computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by at least one processor, perform the method of any of claims 1-12.
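To make the two-stage aggregation of claims 10-12 concrete, the following is a minimal sketch in Python with NumPy. All function names, the dense global batch tensor, and the code-to-node mapping are hypothetical illustrations for the reader, not the patent's implementation:

```python
import numpy as np

def first_aggregation(global_batch, intermediate_results):
    """First aggregation: weight each expert's intermediate result and
    accumulate it at the slot addressed by its token's global code."""
    for global_code, weight, vec in intermediate_results:
        global_batch[global_code] += weight * vec
    return global_batch

def split_into_sub_batches(global_batch, code_to_node, num_nodes):
    """Slice the global batch tensor into per-node sub-batch tensors
    according to the global-code -> computing-node mapping."""
    return [global_batch[[c for c, n in sorted(code_to_node.items()) if n == node]]
            for node in range(num_nodes)]

def second_aggregation(local_sub_batch, remote_sub_batches):
    """Second aggregation: merge the locally produced sub-batch with the
    sub-batch tensors received from remote computing nodes."""
    return local_sub_batch + sum(remote_sub_batches)

# Toy batch: 4 tokens (global codes 0..3), hidden size 2; two experts
# contribute to token 0 with gate weights 0.6 and 0.4.
batch = np.zeros((4, 2))
results = [(0, 0.6, np.ones(2)), (0, 0.4, np.ones(2)), (2, 1.0, np.ones(2))]
batch = first_aggregation(batch, results)
subs = split_into_sub_batches(batch, {0: 0, 1: 0, 2: 1, 3: 1}, num_nodes=2)
```

Keeping the accumulation keyed by global code means results can arrive in any order (e.g. from the concurrent worker threads of claim 12) and still land in the right row of the batch tensor.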

Description

Model reasoning method, host computer, computer system, electronic device, and storage medium

Technical Field

Embodiments of the present disclosure relate to the field of artificial intelligence, and in particular to a model reasoning method, a host, a computer system, an electronic device, and a storage medium.

Background

With the rapid development of artificial intelligence technology, the parameter counts of large-scale models are growing exponentially. To further expand model parameter capacity while maintaining computational efficiency, a Mixture of Experts (MoE) model architecture may be applied. The inference process of an MoE model covers complex stages such as attention network computation, routing decisions, token distribution (Dispatch), expert sub-network computation, and result combination (Combine), placing high demands on the compute power and communication bandwidth of the computing devices. Therefore, how to achieve efficient, low-latency model inference has become one of the technical challenges that currently needs to be addressed.
Disclosure of Invention

At least one embodiment of the present disclosure provides a model reasoning method applied to a computer system comprising a plurality of computing devices, wherein a plurality of expert sub-networks are distributed and deployed on the plurality of computing devices. The method is performed by a first host in the computer system and comprises: in response to receiving a first token sent by a first computing device, determining at least one target expert sub-network corresponding to the first token from the plurality of expert sub-networks; distributing the first token to at least one target computing device on which the at least one target expert sub-network is deployed, so that the at least one target computing device performs computation on the first token using the target expert sub-network deployed thereon to generate an intermediate computation result, wherein the first computing device is managed by the first host; and generating the inference result corresponding to the first token based on the intermediate computation result generated by the at least one target computing device. In the model reasoning method provided in at least one embodiment of the present disclosure, the determining at least one target expert sub-network corresponding to the first token from the plurality of expert sub-networks includes determining an index and a weight of the at least one target expert sub-network corresponding to the first token from the plurality of expert sub-networks according to a routing rule.
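The routing step described above, selecting expert indexes and weights for a token according to a routing rule, is commonly realized as top-k gating. The following is a minimal sketch assuming a learned gate matrix and softmax-normalized weights; both are assumptions for illustration, since the patent does not fix a concrete routing rule:

```python
import numpy as np

def route_token(token_vec, gate_matrix, top_k=2):
    """Return the indexes and normalized weights of the top-k target
    expert sub-networks for one token (hypothetical routing rule)."""
    logits = gate_matrix @ token_vec             # one score per expert
    idx = np.argsort(logits)[-top_k:][::-1]      # indexes of target experts
    w = np.exp(logits[idx] - logits[idx].max())  # numerically stable softmax
    return idx, w / w.sum()

rng = np.random.default_rng(0)
idx, weights = route_token(rng.standard_normal(8),
                           rng.standard_normal((4, 8)))  # 4 experts, hidden=8
```

The returned pair corresponds to the "index and weight" the host later forwards to target computing nodes, so that each expert's intermediate result can be weighted during the combine step.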
In the model reasoning method provided by at least one embodiment of the present disclosure, the first host and the first computing device are located at the same computing node, and the distributing the first token to at least one target computing device on which the at least one target expert sub-network is deployed includes: in response to the at least one target computing device including a second computing device located at a different computing node from the first host, sending the first token to the target computing node where the second computing device is located, so that a second host in the target computing node for managing the second computing device obtains the first token for transmission to the second computing device; and in response to the at least one target computing device including the first computing device located at the same computing node as the first host, adding the first token to a set of tokens to be processed for transmission to the first computing device. In the model reasoning method provided by at least one embodiment of the present disclosure, the sending the first token to the target computing node where the second computing device is located includes sending the first token only once to the target computing node in response to a plurality of computing devices requiring the first token existing in the target computing node. In the model reasoning method provided in at least one embodiment of the present disclosure, the distributing the first token to at least one target computing device on which the at least one target expert sub-network is deployed further includes sending the index and the weight of the at least one target expert sub-network corresponding to the first token to the target computing node, so that a second host for managing the second computing device in the target computing node obtains the index and the weight.
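The node-level deduplication described above, sending a token at most once to a target computing node even when several devices there need it, could be sketched as follows. The names `device_to_node`, `pending_tokens`, and the `send` callback are hypothetical; this is an illustrative sketch, not the patent's implementation:

```python
def dispatch_token(token_id, target_devices, device_to_node, local_node,
                   pending_tokens, send):
    """Queue the token locally for devices on this node; send it at most
    once to each remote node that hosts a target computing device."""
    remote_nodes = set()
    for device in target_devices:
        node = device_to_node[device]
        if node == local_node:
            pending_tokens.add(token_id)   # local set of tokens to be processed
        else:
            remote_nodes.add(node)         # set() deduplicates per node
    for node in sorted(remote_nodes):
        send(node, token_id)               # one transmission per target node
    return remote_nodes

# Usage: token 7 targets one local device (node 0) and two devices on node 1;
# node 1 receives the token exactly once.
sent, pending = [], set()
dispatch_token(7, ["gpu0", "gpu2", "gpu3"],
               {"gpu0": 0, "gpu2": 1, "gpu3": 1},
               local_node=0, pending_tokens=pending,
               send=lambda node, tok: sent.append((node, tok)))
```

Collapsing per-device sends into per-node sends is what reduces the inter-node communication overhead the abstract refers to.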
In the model reasoning method provided by at least one embodiment of the present disclosure, the first host manages a plurality of the first computing devices, and the method further includes adding a second token to the set of tokens to be processed in respons