KR-20260064520-A - METHOD AND SYSTEM FOR FAULT TOLERANCE IN A COMPUTING SYSTEM INCLUDING A PLURALITY OF PROCESSING UNITS

Abstract

The present disclosure relates to a fault-tolerance method, performed by at least one processor, in a computational system comprising a plurality of computational units. The method comprises the steps of: distributing and storing computation result data, derived through each of a plurality of computational units included in a group, in the memory of the plurality of computational units; and, when a failure occurs in at least one of the plurality of computational units, recovering the memory of the at least one computational unit based on data blocks stored in the memory of the remaining computational units excluding the at least one computational unit.

Inventors

조강원
박정호
정우근

Assignees

주식회사 모레

Dates

Publication Date: 2026-05-07
Application Date: 2025-10-13
Priority Date: 2024-10-30

Claims (12)

A fault-tolerance method in a computational system comprising a plurality of computational units, the method being executed by at least one processor and comprising: distributing and storing computation result data, derived through each of a plurality of computational units included in one group, in the memory of the plurality of computational units; and, when a failure occurs in at least one of the plurality of computational units, recovering the memory of the at least one computational unit based on data blocks stored in the memories of the remaining computational units excluding the at least one computational unit.
The method of claim 1, wherein the distributing and storing comprises: generating a plurality of block data by dividing the computation result data into block units; and redundantly storing the generated plurality of block data in the memory of each of the plurality of computational units, and wherein the recovering comprises: when a failure occurs in at least one computational unit, recovering the memory of the at least one computational unit using block data stored in the memory of the remaining computational units excluding the at least one computational unit.
The method of claim 1, wherein the distributing and storing comprises: when the group includes n computational units, generating n-1 block data by dividing the computation result data into block units; generating one parity data based on the generated n-1 block data; and distributing and storing the generated n-1 block data and the generated one parity data in the memory of each of the n computational units, and wherein the recovering comprises: when a failure occurs in a specific computational unit among the n computational units, recovering the memory of the specific computational unit using block data and parity data stored in the remaining computational units excluding the specific computational unit.
The method of claim 1, wherein the distributing and storing comprises: when the group includes n computational units, generating n-2 block data by dividing the computation result data into block units; generating two parity data based on the generated n-2 block data; and distributing and storing the generated n-2 block data and the generated two parity data in the memory of each of the n computational units, and wherein the recovering comprises: when a failure occurs in two of the n computational units, recovering the memory of the two computational units using the block data stored in the remaining computational units excluding the two computational units and the two parity data.
The method of claim 1, wherein the distributing and storing comprises: dividing the group to create a first unit group and a second unit group; generating a plurality of block data by dividing the computation result data into block units; and alternately distributing and storing the generated plurality of block data between the computational units included in the first unit group and the computational units included in the second unit group, and redundantly storing the same block data in each of the computational units included in the first unit group and in the computational units included in the second unit group, and wherein the recovering comprises: when a specific computational unit included in the first unit group or the second unit group fails, recovering the memory of the specific computational unit based on block data redundantly stored in the memory of a computational unit included in the same group and block data stored in a distributed manner in the memory of a computational unit included in the other group.
The method of claim 1, wherein the number of computational units included in the group is determined based on at least one of the number of failures and the failure rate, over a predetermined period, of the plurality of computational units included in the computational system.
The method of claim 1, wherein the distributing and storing comprises: determining a distributed storage method based on the number of the plurality of computational units; and distributing and storing the computation result data in the memory of the plurality of computational units according to the determined distributed storage method.
The method of claim 1, wherein the distributing and storing comprises: determining a distributed storage method based on the type and purpose of the data used in the computation; and distributing and storing the computation result data in the memory of the plurality of computational units according to the determined distributed storage method.
The method of claim 1, wherein the distributing and storing comprises: determining a distributed storage method based on the failure rate in the computational system during a predetermined period; and distributing and storing the computation result data in the memory of the plurality of computational units according to the determined distributed storage method.
The method of claim 1, wherein the recovering comprises: when a failure occurs in at least one computational unit, including in the one group at least one idle computational unit that belongs to the computational system but is not included in the one group; and recovering the memory of the at least one computational unit using the at least one idle computational unit.
A computer program stored on a computer-readable recording medium for executing, on a computer, the method according to any one of claims 1 to 10.
An information processing system comprising: a memory; and at least one processor connected to the memory and configured to execute at least one computer-readable program contained in the memory, wherein the at least one program includes instructions for: distributing and storing computation result data, derived through each of a plurality of computational units included in a group, in the memory of the plurality of computational units; and, when a failure occurs in at least one computational unit among the plurality of computational units, recovering the memory of the at least one computational unit based on data blocks stored in the memory of the remaining computational units excluding the at least one computational unit.
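
As a non-limiting illustration of the single-parity scheme recited above (n-1 data blocks plus one parity block spread over n computational units), the following Python sketch may be helpful. The claims do not state how the parity data is computed; bytewise XOR is assumed here because it allows any one missing block to be rebuilt from the others. The function names `xor_blocks`, `distribute`, and `recover`, the zero padding of the last block, and the use of a Python list to stand in for the memories of the units are all hypothetical choices made for this example.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Return the bytewise XOR of equal-length byte strings."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def distribute(result, n_units):
    """Split `result` into n_units - 1 data blocks plus one parity block.

    Entry i of the returned list stands for the memory of unit i
    (an assumption of this sketch, not a layout given in the source).
    """
    if n_units < 2:
        raise ValueError("need at least two units")
    n_data = n_units - 1
    block_len = -(-len(result) // n_data)  # ceiling division
    padded = result.ljust(block_len * n_data, b"\0")  # zero-pad the tail
    blocks = [padded[i * block_len:(i + 1) * block_len] for i in range(n_data)]
    return blocks + [xor_blocks(blocks)]  # assumed parity: XOR of all data blocks


def recover(unit_memory, failed):
    """Rebuild the block of unit `failed` from all surviving units."""
    survivors = [blk for i, blk in enumerate(unit_memory) if i != failed]
    return xor_blocks(survivors)


# Usage: four units, unit 2 fails and its memory content is lost.
memory = distribute(b"partial result of one training step", 4)
lost = memory[2]
memory[2] = None
assert recover(memory, 2) == lost
```

Because XOR is its own inverse, the same `recover` call works whether the failed unit held a data block or the parity block. Tolerating two simultaneous failures, as in the two-parity variant, would need a second, independent parity code and is not covered by this sketch.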

Description

Method and System for Fault Tolerance in a Computing System Including a Plurality of Processing Units

The present disclosure relates to a fault-tolerance method and system in a computational system comprising a plurality of computational units and, more specifically, to a fault-tolerance method and system that can recover from failures without checkpointing by storing computation results in a distributed manner.

In computing systems that include multiple computational units, particularly graphics processing units (GPUs), individual computational units may fail during long-running computations. When such a failure occurs, either the results of the entire computation are lost, or the process reverts to a checkpoint saved before the failure and re-executes from there, resulting in a significant loss of time and resources.

Conventional GPU failure handling widely relies on periodically saving the intermediate state of the computation as checkpoints. Excluding the time required to save checkpoints, the time required for GPU failure recovery can be broadly divided into three segments within the time between failures (fault-to-fault latency). The first segment (A) is the execution time from the previous checkpoint until the failure occurs; all computation results generated during this segment are lost upon failure. The second segment (B) is the delay from detecting the failure until a restart is performed; if identifying the problem and initiating the re-execution procedure are delayed, the total recovery time increases. The third segment (C) is the time from the restart until the last checkpoint has been retrieved and recovery is complete; the length of this segment is determined by the size of the recovery data and the I/O speed.
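
The three segments can be combined into a rough estimate of how much of a fault-to-fault interval remains for useful work. The sketch below is only an illustrative model and does not come from the disclosure: it assumes that segment C equals the checkpoint size divided by the I/O bandwidth and that A, B, and C are all subtracted from the same interval. The names `RecoveryCost`, `recovery_cost`, and `useful_fraction` and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryCost:
    lost_work_s: float      # segment A: work done since the last checkpoint
    restart_delay_s: float  # segment B: failure detection until restart
    reload_s: float         # segment C: restart until the checkpoint is reloaded

    @property
    def total_s(self):
        return self.lost_work_s + self.restart_delay_s + self.reload_s


def recovery_cost(since_checkpoint_s, restart_delay_s,
                  checkpoint_bytes, io_bytes_per_s):
    """Split the time lost to one failure into segments A, B and C."""
    # Assumed model for C: grows with data size, shrinks with I/O speed.
    reload_s = checkpoint_bytes / io_bytes_per_s
    return RecoveryCost(since_checkpoint_s, restart_delay_s, reload_s)


def useful_fraction(fault_to_fault_s, cost):
    """Share of one fault-to-fault interval left for useful computation."""
    return max(0.0, fault_to_fault_s - cost.total_s) / fault_to_fault_s


# Hypothetical numbers: 15 min since the last checkpoint, 2 min restart delay,
# a 2 TB checkpoint read back at 5 GB/s.
cost = recovery_cost(since_checkpoint_s=900, restart_delay_s=120,
                     checkpoint_bytes=2_000_000_000_000,
                     io_bytes_per_s=5_000_000_000)
print(cost.total_s)                 # 1420.0
print(useful_fraction(7200, cost))  # roughly 0.80 with a failure every 2 hours
print(useful_fraction(1800, cost))  # roughly 0.21 with a failure every 30 minutes
```

Under these assumed numbers, the same per-failure cost consumes a much larger share of the interval once failures become more frequent, even though B is only a small part of the total.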
Existing technologies reduce the time spent in segment B by, for example, introducing automatic restart functions or strengthening fault monitoring, but they are fundamentally limited in reducing the time lost in segments A and C. In particular, in large-scale training tasks that use a very large number of GPUs, failures occur more frequently, so the fault-to-fault latency becomes significantly shorter, which can drastically reduce overall computational efficiency.

The aforementioned background technology is information that the inventors possessed or acquired in the process of deriving the contents of the present disclosure, and it cannot be regarded as prior art disclosed to the general public before the filing of this application.

Embodiments of the present disclosure will be described with reference to the accompanying drawings described below, in which similar reference numerals indicate similar elements, but the embodiments are not limited thereto.

FIG. 1 is a diagram showing examples of a main process and a sub-process for fault tolerance in a computational system including a plurality of computational units according to one embodiment of the present disclosure.
FIG. 2 is a block diagram showing the internal configuration of an information processing system according to one embodiment of the present disclosure.
FIG. 3 is a flowchart of a fault-tolerance method in a computational system including a plurality of computational units according to one embodiment of the present disclosure.
FIG. 4 is a diagram illustrating, by way of example, a first distributed storage method according to one embodiment of the present disclosure.
FIG. 5 is a diagram illustrating, by way of example, a second distributed storage method according to one embodiment of the present disclosure.
FIG. 6 is a diagram illustrating, by way of example, a third distributed storage method according to one embodiment of the present disclosure.
FIG. 7 is a diagram illustrating, by way of example, a fourth distributed storage method according to one embodiment of the present disclosure.
FIG. 8 is a diagram illustrating, by way of example, a fifth distributed storage method according to one embodiment of the present disclosure.
FIG. 9 is a flowchart of a method for distributing and storing data according to the number of computational units according to one embodiment of the present disclosure.

Hereinafter, specific details for implementing the present disclosure will be described with reference to the attached drawings. However, in the following description, detailed descriptions of widely known functions or configurations will be omitted where they might unnecessarily obscure the gist of the present disclosure. In the attached drawings, identical or corresponding components are assigned the same reference numerals. In addition, in the description of the following embodiments, repeated descriptions of identical or corresponding components may be omitted. However, even if a description of a component is omitted, it is not intended that such