CN-121785806-B - Model training method, supernode system, electronic device, medium and program product
Abstract
The application provides a model training method, a supernode system, an electronic device, a medium, and a program product. In the method, a first supernode performs backpropagation according to a received error and a tensor it holds, obtaining the local model gradient corresponding to that tensor; the plurality of first supernodes send their local model gradients to the connected line cards; each line card slices the received local model gradient into a plurality of cells and sprays the cells across the network boards; the network boards forward the cells to the line cards connected to the second supernodes; those line cards reassemble the received cells into the local model gradients and send the reassembled local model gradients to each second supernode; and each second supernode updates its locally stored model parameters according to the received local model gradients and its own locally stored local model gradient. The application can improve the reliability of model training in a supernode system.
Inventors
- CHEN FU
- DONG MO
Assignees
- 上海东方算芯科技有限公司
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-03-04
Claims (13)
- 1. A model training method, characterized by being applied to a supernode system, wherein the supernode system comprises a plurality of supernodes, a plurality of line cards, and a plurality of network boards, the line cards and the network boards form a fully interconnected structure, and each supernode of the plurality of supernodes is connected to at least one line card; the method comprises: performing, by a first supernode, backpropagation according to a received error and a tensor held by the first supernode to obtain a local model gradient corresponding to the tensor held by the first supernode, wherein the tensor is obtained by partitioning parameters of a model to be trained, and the first supernode is any one of the plurality of supernodes; sending, by a plurality of first supernodes, the local model gradients to the connected line cards; slicing, by a line card, the received local model gradient into a plurality of cells, and spraying the cells across the network boards; forwarding, by a network board, the received cells to the line card connected to a second supernode; reassembling, by the line cards connected to the second supernodes, the received cells into the local model gradients, and sending the reassembled local model gradients to each second supernode, wherein the second supernodes are the supernodes of the plurality of supernodes other than the first supernodes; and updating, by the second supernode, locally stored model parameters according to the received local model gradients and a locally stored local model gradient.
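The data path of claim 1 can be illustrated with a minimal sketch: a first supernode's local gradient is sliced into fixed-size cells, the cells are sprayed round-robin across the network boards, the line card on the receiving side reassembles them, and the second supernode combines the received gradient with its own to update parameters. All helper names (`slice_into_cells`, `spray`, `reassemble`, `update_parameters`) and the cell size are illustrative assumptions, not the patent's implementation.

```python
CELL_SIZE = 4  # payload elements per cell (illustrative choice)

def slice_into_cells(gradient):
    """Line card: slice a flat gradient into (sequence_number, payload) cells."""
    return [(i, gradient[i:i + CELL_SIZE])
            for i in range(0, len(gradient), CELL_SIZE)]

def spray(cells, num_boards):
    """Distribute cells over the network boards round-robin (load balancing)."""
    boards = [[] for _ in range(num_boards)]
    for seq, payload in cells:
        boards[seq % num_boards].append((seq, payload))
    return boards

def reassemble(boards):
    """Receiving line card: reorder cells by sequence number and concatenate."""
    grad = []
    for _, payload in sorted(c for b in boards for c in b):
        grad.extend(payload)
    return grad

def update_parameters(params, local_grad, received_grad, lr=0.1):
    """Second supernode: apply its own and the received local gradients."""
    return [p - lr * (g1 + g2)
            for p, g1, g2 in zip(params, local_grad, received_grad)]

# End-to-end check: spraying over 3 network boards is lossless.
gradient = [float(i) for i in range(10)]
assert reassemble(spray(slice_into_cells(gradient), 3)) == gradient
```

Spraying cells across all boards rather than pinning a gradient to one link is what removes the point-to-point wiring between supernodes that the background section criticizes.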
- 2. The method of claim 1, wherein the plurality of supernodes further comprises at least one third supernode, the third supernode being connected to at least one arbitrary line card; the method further comprises: detecting a running state of the first supernode; loading, by the third supernode, the tensor held by the first supernode when the running state of the first supernode is detected to be abnormal, and computing the corresponding local model gradient based on the tensor; and sending, by the third supernode, the local model gradient to the connected line card.
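The failover behavior of claim 2 can be sketched as a standby supernode that, on an abnormal running state, loads the failed supernode's tensor shard and takes over gradient computation. The health signal, the checkpoint store, and the elementwise backprop stand-in are all hypothetical placeholders for whatever mechanism a real system would use.

```python
class StandbySupernode:
    """Third supernode of claim 2: idle until a first supernode fails."""

    def __init__(self, checkpoint_store):
        self.checkpoint_store = checkpoint_store  # shard id -> tensor shard
        self.shard = None                         # nothing loaded while healthy

    def on_state_change(self, shard_id, healthy):
        """Load the failed supernode's shard when its state is abnormal."""
        if not healthy:
            self.shard = self.checkpoint_store[shard_id]

    def compute_local_gradient(self, error):
        """Backprop stand-in: elementwise product of error and the shard."""
        assert self.shard is not None, "no shard loaded; primary still healthy"
        return [e * w for e, w in zip(error, self.shard)]

standby = StandbySupernode({"shard-0": [1.0, 2.0, 3.0]})
standby.on_state_change("shard-0", healthy=False)
assert standby.compute_local_gradient([0.5, 0.5, 0.5]) == [0.5, 1.0, 1.5]
```

Because the standby connects to the same line card/network board fabric, its gradients travel the same path as the failed supernode's would have, which is what shortens the training interruption.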
- 3. The method of claim 2, wherein the supernode system further comprises a recovered first supernode, the recovered first supernode being connected to at least one arbitrary line card; after the third supernode sends the local model gradient to the connected line card, the method further comprises: loading, by the recovered first supernode, the tensor held by the third supernode when the running state of the first supernode is detected to be normal, and computing the corresponding local model gradient based on the tensor; and sending the local model gradient to the connected line card through the recovered first supernode.
- 4. The method of claim 1, wherein before sending the local model gradients to the connected line cards through the plurality of first supernodes, the method further comprises: performing, by the plurality of first supernodes, inference according to a received prompt to obtain a response to the prompt; computing, by the first supernode, a policy gradient of the model according to a score of the response; and obtaining, by the first supernode, the local model gradient corresponding to the tensor held by the first supernode through backpropagation based on the policy gradient and the held tensor.
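The prompt-score-gradient step of claim 4 resembles a REINFORCE-style policy gradient: the score of the response scales the log-probability gradient, and backpropagation then restricts the result to this supernode's tensor shard. The functions below are a minimal sketch under that assumption; the patent does not specify the estimator, and the names are illustrative.

```python
def policy_gradient(log_prob_grad, reward, baseline=0.0):
    """REINFORCE-style estimator: grad log pi(response) * (reward - baseline)."""
    advantage = reward - baseline
    return [g * advantage for g in log_prob_grad]

def local_gradient(policy_grad, tensor_shard):
    """Backprop stand-in restricted to the tensor shard this supernode holds."""
    return [g * w for g, w in zip(policy_grad, tensor_shard)]

# A score equal to the baseline + 1 leaves the log-prob gradient unchanged.
grad = policy_gradient([0.1, -0.2], reward=2.0, baseline=1.0)
assert grad == [0.1, -0.2]
```

The resulting local gradient then enters the same slice-spray-reassemble path as in claim 1, so reinforcement-learning fine-tuning reuses the ordinary training fabric.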
- 5. The method of claim 1, wherein slicing, by the line card, the received local model gradient into a plurality of cells and spraying the cells across the network boards comprises: slicing, by the line card, the received local model gradient into a plurality of cells, and adding an identifier of each second supernode to each cell, wherein the identifier of the second supernode is used to control the network board, after receiving a cell, to forward the cell to the line card connected to that second supernode; and spraying, by the line card, the cells to which the identifiers of the second supernodes have been added across the network boards.
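Claim 5's tagging scheme can be sketched as follows: the line card stamps each cell with a destination second-supernode identifier, and the network board looks that identifier up in a routing table to pick the egress line card. The routing table and the cell replication per destination are assumptions made for illustration.

```python
def tag_cells(cells, supernode_ids):
    """Line card: replicate each cell once per destination second supernode."""
    return [(sid, seq, payload) for sid in supernode_ids
            for seq, payload in cells]

def route(tagged_cells, routing_table):
    """Network board: deliver each cell to the line card serving its tag."""
    delivered = {lc: [] for lc in set(routing_table.values())}
    for sid, seq, payload in tagged_cells:
        delivered[routing_table[sid]].append((seq, payload))
    return delivered

cells = [(0, [1.0]), (1, [2.0])]
out = route(tag_cells(cells, ["sn-1", "sn-2"]),
            {"sn-1": "lc-A", "sn-2": "lc-B"})
assert out["lc-A"] == cells and out["lc-B"] == cells
```

Carrying the destination in the cell itself keeps the network boards stateless with respect to the training job: any board can forward any cell, which is what makes the spraying in claim 1 safe.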
- 6. A supernode system for model training, comprising: a first cabinet comprising a plurality of supernodes; a second cabinet comprising a plurality of line cards; and a third cabinet comprising a plurality of network boards; wherein the line cards and the network boards form a fully interconnected structure, and each supernode of the plurality of supernodes is connected to at least one line card; the supernodes comprise: a first supernode, configured to perform backpropagation according to a received error and a held tensor to obtain a local model gradient corresponding to the tensor held by the first supernode, wherein the tensor is obtained by partitioning parameters of a model to be trained and the first supernode is any one of the plurality of supernodes, and to send the local model gradient to the connected line card; and a second supernode, configured to update locally stored model parameters according to the received local model gradient and a locally stored local model gradient; the line cards comprise: a first line card, configured to slice the received local model gradient into a plurality of cells and spray the cells across the network boards; and a second line card, configured to reassemble the received cells into the local model gradients and send the reassembled local model gradients to each second supernode; and the network board is configured to forward the received cells to the second line card connected to the second supernode.
- 7. The system of claim 6, wherein the first cabinet further comprises: a third supernode connected to at least one arbitrary line card; the third supernode being configured to load the tensor held by the first supernode and compute the corresponding local model gradient based on the tensor when the running state of the first supernode is detected to be abnormal, and to send the local model gradient to the connected line card.
- 8. The system of claim 7, wherein the first cabinet further comprises a recovered first supernode, the recovered first supernode being connected to at least one arbitrary line card; the recovered first supernode being configured to load the tensor held by the third supernode and compute the corresponding local model gradient based on the tensor when the running state of the first supernode is detected to be normal, and to send the local model gradient to the connected line card.
- 9. The system of claim 6, wherein the first supernode is further configured to perform inference according to a received prompt to obtain a response to the prompt, to compute a policy gradient of the model according to a score of the response, and to perform backpropagation based on the policy gradient and the tensor held by the first supernode to obtain the local model gradient corresponding to the tensor held by the first supernode.
- 10. The system of claim 6, wherein the first line card is further configured to slice the received local model gradient into a plurality of cells, to add an identifier of each second supernode to each cell, wherein the identifier of the second supernode is used to control the network board, after receiving a cell, to forward the cell to the second line card connected to that second supernode, and to spray the cells to which the identifiers of the second supernodes have been added across the network boards.
- 11. An electronic device, comprising: a memory for storing computer-executable instructions or a computer program; and a processor configured, when executing the computer-executable instructions or computer program stored in the memory, to control the supernodes, line cards, and network boards in a supernode system to implement the method of any one of claims 1 to 5.
- 12. A computer-readable storage medium storing computer-executable instructions or a computer program which, when executed by a processor, control the supernodes, line cards, and network boards in a supernode system to implement the method of any one of claims 1 to 5.
- 13. A computer program product comprising computer-executable instructions or a computer program which, when executed by a processor, control the supernodes, line cards, and network boards in a supernode system to implement the method of any one of claims 1 to 5.
Description
Model training method, supernode system, electronic device, medium and program product

Technical Field

The present application relates to the field of computer technologies, and in particular to a model training method, a supernode system, an electronic device, a medium, and a program product.

Background

With the development of artificial intelligence (AI) and high-performance computing (HPC), data centers require ever larger compute clusters. In large-scale compute clusters, frequent, high-throughput data transfers between graphics processing units (GPUs) are required. In the related art, a plurality of GPUs are integrated into a supernode system, and each supernode is connected point-to-point to a plurality of other supernodes in the same system through a plurality of independent physical links. The network architecture of such a supernode system is complex; when model training is performed in it and the running state of a supernode becomes abnormal, the training interruption is excessively long and reliability is insufficient.

Disclosure of Invention

The embodiments of the application provide a model training method, a supernode system, an electronic device, a medium, and a program product, which simplify the connection structure between supernodes and shorten the interruption when the running state of a supernode is abnormal, thereby improving the reliability of the supernode system.
The technical scheme of the embodiments of the application is realized as follows. The embodiments of the application provide a model training method applied to a supernode system, wherein the supernode system comprises a plurality of supernodes, a plurality of line cards, and a plurality of network boards, the line cards and the network boards form a fully interconnected structure, and each supernode of the plurality of supernodes is connected to at least one line card. The method comprises: performing, by a first supernode, backpropagation according to a received error and a held tensor to obtain a local model gradient corresponding to the tensor held by the first supernode, wherein the tensor is obtained by partitioning parameters of a model to be trained, and the first supernode is any one of the plurality of supernodes; sending the local model gradients to the connected line cards through a plurality of first supernodes; slicing, by the line card, the received local model gradient into a plurality of cells, and spraying the cells across the network boards; forwarding, by the network board, the received cells to the line card connected to the second supernode; reassembling, by the line cards connected to the second supernodes, the received cells into the local model gradients, and sending the reassembled local model gradients to each second supernode, wherein the second supernodes are the supernodes of the plurality of supernodes other than the first supernodes; and updating, by the second supernode, locally stored model parameters according to the received local model gradients and a locally stored local model gradient.
The embodiments of the application provide a supernode system for model training, comprising: a first cabinet comprising a plurality of supernodes; a second cabinet comprising a plurality of line cards; and a third cabinet comprising a plurality of network boards. The supernodes comprise: a first supernode, configured to perform backpropagation according to a received error and a held tensor to obtain a local model gradient corresponding to the tensor held by the first supernode, wherein the tensor is obtained by partitioning parameters of a model to be trained and the first supernode is any one of the plurality of supernodes, and to send the local model gradient to the connected line card; and a second supernode, configured to update locally stored model parameters according to the received local model gradient and a locally stored local model gradient. The line cards comprise: a first line card, configured to slice the received local model gradient into a plurality of cells and spray the cells across the network boards; and a second line card, configured to reassemble the received cells into the local model gradients and send the reassembled local model gradients to each second supernode. The network board is configured to forward the received cells to the second line card connected to the second supernode. An embodiment of the present application provides an electronic device configured to serve as a control node in a supernode system, the electronic device comprising: a memory for storing computer-executable instructions or a computer program; and a processor for controlling the supernodes, line cards, and network boards in the supernode system to realize the model training method when executing the computer-executable instructions or computer program stored in the memory.