CN-122021926-A - Inference operation method and inference operation device

CN122021926A

Abstract

The present disclosure relates to an inference operation method and an inference operation device. The inference operation method comprises at least one of the following: for at least one computing component, outputting, by the computing component, a first activation parameter generated by the last stage of one-stage or cascade-connected multi-stage Transformer layers, wherein the computation inside the one-stage or cascade-connected multi-stage Transformer layers is performed by the computing component; for each Transformer layer, storing the weight parameters of the Transformer layer in a respective first storage component, wherein the respective first storage component is disposed locally to the respective computing component that performs the computation inside the Transformer layer; or, for each Transformer layer, storing the key parameters and value parameters of the Transformer layer in a respective second storage component, wherein the respective second storage component is disposed locally to the respective computing component that performs the computation inside the Transformer layer.
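As a rough illustration of the scheme described above (a minimal Python sketch, not the disclosed hardware: the `ComputeComponent` class, the array shapes, and the use of in-process lists to stand in for the first and second storage components are all assumptions), each computing component holds its layer's weight parameters and its key/value parameters locally, and only the last stage of a cascade emits its first activation parameter:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class ComputeComponent:
    """One computing component: it owns one Transformer layer's weight
    parameters (standing in for the first storage component) and its
    key/value parameters (standing in for the second storage component),
    both held locally."""
    w_q: np.ndarray                              # query weight matrix
    w_k: np.ndarray                              # key weight matrix
    w_v: np.ndarray                              # value weight matrix
    keys: list = field(default_factory=list)     # local key-parameter store
    values: list = field(default_factory=list)   # local value-parameter store

    def step(self, x: np.ndarray) -> np.ndarray:
        # Compute Q/K/V locally and append K and V to the local cache.
        q, k, v = x @ self.w_q, x @ self.w_k, x @ self.w_v
        self.keys.append(k)
        self.values.append(v)
        ks = np.stack(self.keys)                 # (t, d)
        vs = np.stack(self.values)               # (t, d)
        scores = ks @ q / np.sqrt(len(q))        # attend over cached keys
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        return probs @ vs                        # activation for the next stage


# A cascade of stages: intermediate activations stay inside the cascade,
# and only the last stage's first activation parameter is output.
rng = np.random.default_rng(0)
d = 4
layers = [ComputeComponent(*(rng.standard_normal((d, d)) for _ in range(3)))
          for _ in range(2)]
x = rng.standard_normal(d)
for layer in layers:
    x = layer.step(x)
print(x.shape)  # -> (4,)
```

Keeping the weights and the K/V cache next to the component that consumes them is what removes the cross-component traffic the disclosure is concerned with; the sketch only models the data placement, not the interconnect.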

Inventors

  • XIANG ZHIHONG

Assignees

  • 杭州研极微电子有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-02-13

Claims (20)

  1. An inference operation method for an attention mechanism-based model, wherein the attention mechanism-based model includes one or more Transformer layers, and the computation inside each Transformer layer is performed by a respective one of one or more computing components, the inference operation method comprising at least one of: for at least one computing component, outputting, by the computing component, a first activation parameter generated by the last stage of one-stage or cascade-connected multi-stage Transformer layers, wherein the computation inside the one-stage or cascade-connected multi-stage Transformer layers is performed by the computing component; for each Transformer layer, storing the weight parameters of the Transformer layer in a respective first storage component, wherein the respective first storage component is disposed locally to the respective computing component that performs the computation inside the Transformer layer; or, for each Transformer layer, storing the key parameters and value parameters of the Transformer layer in a respective second storage component, wherein the respective second storage component is disposed locally to the respective computing component that performs the computation inside the Transformer layer.
  2. The inference operation method of claim 1, wherein the at least one computing component comprises one or more first computing sub-components, each first computing sub-component being configured to perform the computation inside a respective one of the Transformer layers and to output the first activation parameter generated by the respective one of the Transformer layers.
  3. The inference operation method of claim 2, wherein the at least one first storage component comprises one or more first storage sub-components, each first storage sub-component being disposed locally to a respective one of the first computing sub-components and being configured to store the weight parameters of the Transformer layer corresponding to the respective one of the first computing sub-components.
  4. The inference operation method of claim 2, wherein the at least one second storage component comprises one or more second storage sub-components, each second storage sub-component being disposed locally to a respective one of the first computing sub-components and being configured to store the key parameters and value parameters of the Transformer layer corresponding to the respective one of the first computing sub-components.
  5. The inference operation method of claim 4, wherein at least one second storage sub-component comprises at least one of: a first storage unit configured to store the key parameters of the corresponding one of the Transformer layers, or a second storage unit configured to store the value parameters of the corresponding one of the Transformer layers.
  6. The inference operation method of claim 1, wherein the at least one computing component comprises at least one of: a second computing sub-component configured to perform the computation inside the attention layer in a corresponding one of the Transformer layers and to output a second activation parameter generated by the attention layer, or a third computing sub-component configured to perform the computation inside the feedforward neural network layer in a corresponding one of the Transformer layers and to output the first activation parameter generated by the feedforward neural network layer.
  7. The inference operation method of claim 6, wherein the weight parameters include a query weight matrix configured to calculate a query parameter, a key weight matrix configured to calculate a key parameter, a value weight matrix configured to calculate a value parameter, a first linear weight parameter for the attention layer, and a second linear weight parameter for the feedforward neural network layer, and storing the weight parameters of the Transformer layer in the respective first storage component comprises at least one of: storing the query weight matrix, the key weight matrix, the value weight matrix, and the first linear weight parameter in a third storage sub-component of the respective first storage component, wherein the third storage sub-component is disposed locally to the second computing sub-component corresponding to the attention layer, or storing the second linear weight parameter in a fourth storage sub-component of the respective first storage component, wherein the fourth storage sub-component is disposed locally to the third computing sub-component corresponding to the feedforward neural network layer.
  8. The inference operation method of claim 6, wherein each second storage component is disposed locally to the second computing sub-component that performs the computation inside the attention layer in a corresponding one of the Transformer layers, or the at least one second storage component comprises at least one of: a fifth storage sub-component configured to store the key parameters of a corresponding one of the Transformer layers, or a sixth storage sub-component configured to store the value parameters of a corresponding one of the Transformer layers.
  9. The inference operation method according to claim 1, further comprising: in response to monitoring that a first operational load of one computing component is greater than or equal to a preset load threshold, transferring at least a portion of the computation performed by the one computing component to another computing component having a second operational load, wherein the first operational load is greater than the second operational load.
  10. The inference operation method according to claim 1, wherein a first storage density of the first storage component is greater than a second storage density of the second storage component, and/or a first transmission bandwidth of the first storage component is less than a second transmission bandwidth of the second storage component.
  11. An inference operation device for an attention mechanism-based model, wherein the attention mechanism-based model includes one or more Transformer layers, the inference operation device comprising: one or more computing components, at least one computing component being configured to perform the computation inside respective one-stage or cascade-connected multi-stage Transformer layers and to output a first activation parameter generated by the last stage of the respective one-stage or cascade-connected multi-stage Transformer layers; one or more first storage components, each configured to store the weight parameters of a respective Transformer layer, and each disposed locally to, and communicatively connected with, the respective computing component that performs the computation inside the respective Transformer layer; and one or more second storage components, each configured to store the key parameters and value parameters of a respective Transformer layer, and each disposed locally to, and communicatively connected with, the respective computing component that performs the computation inside the respective Transformer layer.
  12. The inference operation device of claim 11, wherein the at least one computing component comprises one or more first computing sub-components, each first computing sub-component being configured to perform the computation inside a respective one of the Transformer layers and to output the first activation parameter generated by the respective one of the Transformer layers.
  13. The inference operation device of claim 12, wherein the at least one first storage component comprises one or more first storage sub-components, each first storage sub-component being disposed locally to a respective one of the first computing sub-components and being configured to store the weight parameters of the Transformer layer corresponding to the respective one of the first computing sub-components.
  14. The inference operation device of claim 12, wherein the at least one second storage component comprises one or more second storage sub-components, each second storage sub-component being disposed locally to a respective one of the first computing sub-components and being configured to store the key parameters and value parameters of the Transformer layer corresponding to the respective one of the first computing sub-components.
  15. The inference operation device of claim 14, wherein at least one second storage sub-component comprises at least one of: a first storage unit configured to store the key parameters of the corresponding one of the Transformer layers, or a second storage unit configured to store the value parameters of the corresponding one of the Transformer layers.
  16. The inference operation device of claim 11, wherein at least one computing component comprises at least one of: a second computing sub-component configured to perform the computation inside the attention layer in a corresponding one of the Transformer layers and to output a second activation parameter generated by the attention layer, or a third computing sub-component configured to perform the computation inside the feedforward neural network layer in a corresponding one of the Transformer layers and to output the first activation parameter generated by the feedforward neural network layer.
  17. The inference operation device of claim 16, wherein the weight parameters include a query weight matrix configured to calculate a query parameter, a key weight matrix configured to calculate a key parameter, a value weight matrix configured to calculate a value parameter, a first linear weight parameter for the attention layer, and a second linear weight parameter for the feedforward neural network layer, and the first storage component comprises at least one of: a third storage sub-component configured to store the query weight matrix, the key weight matrix, the value weight matrix, and the first linear weight parameter of a respective one of the Transformer layers, wherein the third storage sub-component is disposed locally to the second computing sub-component corresponding to the attention layer in the respective one of the Transformer layers, or a fourth storage sub-component configured to store the second linear weight parameter of a corresponding one of the Transformer layers, wherein the fourth storage sub-component is disposed locally to the third computing sub-component corresponding to the feedforward neural network layer in the corresponding one of the Transformer layers.
  18. The inference operation device of claim 16, wherein each second storage component is disposed locally to the second computing sub-component that performs the computation inside the attention layer in a corresponding one of the Transformer layers, or the at least one second storage component comprises at least one of: a fifth storage sub-component configured to store the key parameters of a corresponding one of the Transformer layers, or a sixth storage sub-component configured to store the value parameters of a corresponding one of the Transformer layers.
  19. The inference operation device according to claim 11, further comprising: a load balancing module configured to monitor the operational loads of the one or more computing components and, in response to monitoring that a first operational load of one computing component is greater than or equal to a preset load threshold, transfer at least a portion of the computation performed by the one computing component to another computing component having a second operational load, wherein the first operational load is greater than the second operational load.
  20. The inference operation device of claim 11, wherein a first storage density of the first storage component is greater than a second storage density of the second storage component, and/or a first transmission bandwidth of the first storage component is less than a second transmission bandwidth of the second storage component.
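Claims 9 and 19 describe a load-balancing rule: when one computing component's operational load reaches a preset threshold, part of its computation is transferred to a component with a smaller load. A minimal sketch of that rule follows; the function name, the load units, and the `transfer` granularity are illustrative assumptions, not the claimed mechanism.

```python
# Preset load threshold from the claims, expressed here as a fraction of
# capacity (an assumption; the claims do not fix the units).
LOAD_THRESHOLD = 0.8


def rebalance(loads: dict[str, float], transfer: float = 0.2) -> dict[str, float]:
    """For any component at or above the threshold (the first operational
    load), shift up to `transfer` worth of work onto the currently
    least-loaded component (the second operational load)."""
    loads = dict(loads)
    for name, load in sorted(loads.items(), key=lambda kv: -kv[1]):
        if load >= LOAD_THRESHOLD:
            target = min(loads, key=loads.get)   # least-loaded component
            if loads[target] < load:             # only move work downhill
                moved = min(transfer, load - loads[target])
                loads[name] -= moved
                loads[target] += moved
    return loads


before = {"pe0": 0.95, "pe1": 0.40, "pe2": 0.30}
after = rebalance(before)
print(after)  # pe0's load drops below the threshold; pe2 picks up the work
```

The total load is conserved; only its placement changes, which matches the claims' framing of transferring "at least a portion of the computation" rather than dropping it.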

Description

Inference operation method and inference operation device

Technical Field

The present disclosure relates to the field of artificial intelligence technology, and more particularly, to an inference operation method, an inference operation device, a computer-readable storage medium, and a computer program product for an attention mechanism-based model.

Background

In the field of artificial intelligence technology, with the rapid (e.g., exponential) growth in the parameter scale and computational power requirements of large models, the performance requirements placed on "network-storage-compute" collaborative architectures have approached the physical limits of existing hardware technologies. The rigid dependence on strongly interconnected architectures has become a core technical bottleneck restricting the scaling of large-model computing power, the control of cost, and the optimization of energy efficiency.

Disclosure of Invention

It is an object of the present disclosure to provide an inference operation method, an inference operation device, a computer-readable storage medium, and a computer program product for an attention mechanism-based model.
According to a first aspect of the present disclosure, there is provided an inference operation method for an attention mechanism-based model, wherein the attention mechanism-based model includes one or more Transformer layers, and the computation inside each Transformer layer is performed by a respective one of one or more computing components, the inference operation method comprising at least one of: for at least one computing component, outputting, by the computing component, a first activation parameter generated by the last stage of one-stage or cascade-connected multi-stage Transformer layers, wherein the computation inside the one-stage or cascade-connected multi-stage Transformer layers is performed by the computing component; for each Transformer layer, storing the weight parameters of the Transformer layer in a respective first storage component, wherein the respective first storage component is disposed locally to the respective computing component that performs the computation inside the Transformer layer; or, for each Transformer layer, storing the key parameters and value parameters of the Transformer layer in a respective second storage component, wherein the respective second storage component is disposed locally to the respective computing component that performs the computation inside the Transformer layer. In some embodiments, the at least one computing component includes one or more first computing sub-components, each configured to perform the computation inside a respective one of the Transformer layers and to output the first activation parameter generated by the respective one of the Transformer layers.
In some embodiments, the at least one first storage component includes one or more first storage sub-components, each disposed locally to a respective one of the first computing sub-components and configured to store the weight parameters of the Transformer layer corresponding to the respective one of the first computing sub-components. In some embodiments, the at least one second storage component includes one or more second storage sub-components, each disposed locally to a respective one of the first computing sub-components and configured to store the key parameters and value parameters of the Transformer layer corresponding to the respective one of the first computing sub-components. In some embodiments, at least one second storage sub-component comprises at least one of: a first storage unit configured to store the key parameters of the corresponding one of the Transformer layers, or a second storage unit configured to store the value parameters of the corresponding one of the Transformer layers. In some embodiments, the at least one computing component comprises at least one of: a second computing sub-component configured to perform the computation inside the attention layer in a corresponding one of the Transformer layers and to output a second activation parameter generated by the attention layer, or a third computing sub-component configured to perform the computation inside the feedforward neural network layer in a corresponding one of the Transformer layers and to output the first activation parameter generated by the feedforward neural network layer.
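The split described above, with a second computing sub-component handling the attention layer and a third computing sub-component handling the feedforward neural network layer, each reading weights from its own local store, might look as follows. This is a toy single-token illustration: the dictionary stores, names, and shapes are assumptions standing in for the third and fourth storage sub-components, not the disclosed hardware.

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_ff = 4, 8

# Third storage sub-component: disposed locally to the attention
# sub-component (query/key/value weights plus the first linear weight).
attn_store = {
    "w_q": rng.standard_normal((d, d)),
    "w_k": rng.standard_normal((d, d)),
    "w_v": rng.standard_normal((d, d)),
    "w_o": rng.standard_normal((d, d)),  # first linear weight parameter
}
# Fourth storage sub-component: disposed locally to the feedforward
# sub-component (the second linear weight parameters).
ffn_store = {
    "w1": rng.standard_normal((d, d_ff)),
    "w2": rng.standard_normal((d_ff, d)),
}


def attention_sub_component(x: np.ndarray) -> np.ndarray:
    """Second computing sub-component: reads only its local store and
    emits the second activation parameter."""
    q, k, v = (x @ attn_store[w] for w in ("w_q", "w_k", "w_v"))
    scores = np.array([q @ k / np.sqrt(d)])  # one cached token -> one score
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                     # softmax over a single key: [1.0]
    return (probs[0] * v) @ attn_store["w_o"]


def ffn_sub_component(x: np.ndarray) -> np.ndarray:
    """Third computing sub-component: emits the first activation parameter."""
    return np.maximum(x @ ffn_store["w1"], 0.0) @ ffn_store["w2"]


x = rng.standard_normal(d)
second_activation = attention_sub_component(x)   # leaves the attention layer
first_activation = ffn_sub_component(second_activation)  # leaves the layer
print(first_activation.shape)  # -> (4,)
```

Each sub-component touches only its own store, which is the point of placing the third and fourth storage sub-components locally: the attention and feedforward weights never cross between the two compute units.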
In some embodiments, the weight parameters include a query weight matrix configured to calculate a query parameter, a key weight matrix configured to calculate a key parameter, a value weight matrix configured to calculate a value parameter, a first linear weight parameter for the attention layer, and a second linear weight parameter for the feedforward neural network layer, and storing the weight parameters of the Transformer layer in the respective first storage component co