CN-121979764-A - Artificial intelligence cluster fault processing method, device, equipment and medium

Abstract

The invention discloses an artificial intelligence cluster fault processing method, device, equipment and medium. The method comprises: generating an environment snapshot of the current running environment of an artificial intelligence cluster when the artificial intelligence cluster fails; determining, in a resource pool corresponding to the artificial intelligence cluster, target hardware resources and a target software environment that match the environment snapshot; and performing fault reproduction of the artificial intelligence cluster according to the target hardware resources and the target software environment. The method provides an operable reproduction scheme for cluster faults that are difficult to reproduce, gives developers a relatively stable debugging basis, and forms a complete flow from fault information collection to on-site reproduction. It helps improve the success rate of fault reproduction, shorten the mean time to repair, and improve the operation and maintenance efficiency of the artificial intelligence cluster.

Inventors

Request for anonymity

Assignees

摩尔线程智能科技(北京)股份有限公司

Dates

Publication Date: 2026-05-05
Application Date: 2025-12-12

Claims (12)

1. An artificial intelligence cluster fault processing method, the method comprising: generating an environment snapshot of the current running environment of an artificial intelligence cluster when the artificial intelligence cluster fails; determining, in a resource pool corresponding to the artificial intelligence cluster, a target hardware resource and a target software environment that match the environment snapshot; and performing fault reproduction of the artificial intelligence cluster according to the target hardware resource and the target software environment.
2. The artificial intelligence cluster fault processing method according to claim 1, wherein the environment snapshot comprises hardware configuration snapshot information and software configuration snapshot information, and the determining, in a resource pool corresponding to the artificial intelligence cluster, a target hardware resource and a target software environment that match the environment snapshot comprises: determining, in the resource pool corresponding to the artificial intelligence cluster, a target hardware resource that matches the hardware configuration snapshot information, and determining a target software environment that matches the software configuration snapshot information.
3. The artificial intelligence cluster fault processing method according to claim 2, wherein the software configuration snapshot information comprises code information and/or software stack information, and the hardware configuration snapshot information comprises hardware context information and/or configuration parameter information of the artificial intelligence cluster.
4. The artificial intelligence cluster fault processing method according to claim 3, wherein the target hardware resource comprises a first target hardware resource, and the determining, in the resource pool corresponding to the artificial intelligence cluster, a target hardware resource that matches the hardware configuration snapshot information comprises: determining the model of at least one processing unit, the number of the processing units and the network topology relationship of the processing units according to the hardware context information; and determining, according to a preset resource matching rule, a first target hardware resource in the resource pool corresponding to the artificial intelligence cluster that matches the model of the at least one processing unit, the number of the processing units and the network topology relationship of the processing units, wherein the preset resource matching rule specifies how resource matching is performed.
5. The artificial intelligence cluster fault processing method according to claim 4, wherein the target hardware resource comprises a second target hardware resource, and the determining, in the resource pool corresponding to the artificial intelligence cluster, a target hardware resource that matches the hardware configuration snapshot information comprises: when determining, according to the preset resource matching rule, a first target hardware resource in the resource pool corresponding to the artificial intelligence cluster that matches the model of the at least one processing unit, the number of the processing units and the network topology relationship of the processing units fails, determining the model of at least one candidate processing unit, the number of the candidate processing units and the network topology relationship of the candidate processing units according to the code information; and determining, according to the preset resource matching rule, a second target hardware resource in the resource pool corresponding to the artificial intelligence cluster that matches the model of the at least one candidate processing unit, the number of the candidate processing units and the network topology relationship of the candidate processing units.
6. The artificial intelligence cluster fault processing method according to claim 5, wherein the determining the model of at least one candidate processing unit, the number of the candidate processing units and the network topology relationship of the candidate processing units according to the code information comprises: acquiring historical error reporting information, together with the models of historical processing units, the numbers of the historical processing units and the network topology relationships of the historical processing units that correspond one-to-one to the historical error reporting information; determining error reporting information according to the code information; determining, in the historical error reporting information, target historical error reporting information that matches the error reporting information; and taking the model of the historical processing units, the number of the historical processing units and the network topology relationship of the historical processing units corresponding to the target historical error reporting information as the model of the at least one candidate processing unit, the number of the candidate processing units and the network topology relationship of the candidate processing units.
7. The artificial intelligence cluster fault processing method according to claim 3, wherein the determining a target software environment that matches the software configuration snapshot information comprises: determining a target driver version according to the software stack information; determining target code according to the code information; and determining the target software environment according to the target driver version and the target code.
8. The artificial intelligence cluster fault processing method according to claim 1, wherein the method further comprises: after the fault reproduction of the artificial intelligence cluster is completed, releasing the target hardware resource when a release condition is met, wherein the release condition comprises that the idle time of the target hardware resource reaches a preset threshold or that a manual release instruction is received.
9. The artificial intelligence cluster fault processing method according to claim 2, further comprising: after the fault reproduction of the artificial intelligence cluster succeeds according to the target hardware resource and the target software environment, recording the fault of the artificial intelligence cluster as a historical fault; and taking a software environment repaired for the historical fault as a new target software environment, and performing fault repair verification on the historical fault according to the new target software environment and the target hardware resource.
10. An artificial intelligence cluster fault processing device, the device comprising: an environment snapshot generating module, configured to generate an environment snapshot of the current running environment of an artificial intelligence cluster when the artificial intelligence cluster fails; a resource and environment determining module, configured to determine, in a resource pool corresponding to the artificial intelligence cluster, a target hardware resource and a target software environment that match the environment snapshot; and a fault reproduction module, configured to perform fault reproduction of the artificial intelligence cluster according to the target hardware resource and the target software environment.
11. An electronic device comprising a processor, a memory, and a program or instruction stored on the memory and executable on the processor, wherein the program or instruction, when executed by the processor, implements the steps of the artificial intelligence cluster fault processing method according to any one of claims 1-9.
12. A readable storage medium, characterized in that it has stored thereon a program or instruction which, when executed by a processor, implements the steps of the artificial intelligence cluster fault processing method according to any one of claims 1-9.
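
As a non-limiting illustration, the two-tier hardware matching of claims 4 to 6 might be sketched in Python as follows. Everything named here is hypothetical and not taken from the application: the `UnitSpec` and `HistoryRecord` types, the string encoding of the network topology, the "same model, same topology, enough free units" rule standing in for the preset resource matching rule, and the case- and whitespace-insensitive comparison standing in for the matching of error reporting information.

```python
from typing import List, NamedTuple, Optional, Tuple


class UnitSpec(NamedTuple):
    model: str      # processing-unit model
    count: int      # number of processing units
    topology: str   # network topology label (hypothetical encoding)


class HistoryRecord(NamedTuple):
    error_text: str  # historical error reporting information
    spec: UnitSpec   # hardware associated with that historical error


def find_in_pool(spec: UnitSpec, pool: List[UnitSpec]) -> Optional[UnitSpec]:
    """Assumed matching rule: same model, same topology, enough units."""
    for entry in pool:
        if (entry.model == spec.model
                and entry.topology == spec.topology
                and entry.count >= spec.count):
            return entry
    return None


def normalize(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def candidate_from_history(error_text: str,
                           history: List[HistoryRecord]) -> Optional[UnitSpec]:
    """Return the hardware spec of the first historical error that matches."""
    wanted = normalize(error_text)
    for record in history:
        if normalize(record.error_text) == wanted:
            return record.spec
    return None


def select_hardware(snapshot_spec: UnitSpec,
                    error_text: str,
                    history: List[HistoryRecord],
                    pool: List[UnitSpec]) -> Optional[Tuple[str, UnitSpec]]:
    """Try the snapshot's own spec first, then fall back to a historical one."""
    first = find_in_pool(snapshot_spec, pool)
    if first is not None:
        return ("first", first)
    candidate = candidate_from_history(error_text, history)
    if candidate is None:
        return None
    second = find_in_pool(candidate, pool)
    return ("second", second) if second is not None else None
```

In this sketch `error_text` stands for the error reporting information derived from the code information; how that derivation is done is left open here, as it is in the claims.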

Description

Artificial intelligence cluster fault processing method, device, equipment and medium

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to an artificial intelligence cluster fault processing method, device, equipment and medium.

Background

With the rapid development of artificial intelligence technology, artificial intelligence clusters serve as the core infrastructure for large-scale model training and inference tasks, and their stable operation is critical to business continuity. However, an artificial intelligence cluster is generally composed of a large number of heterogeneous hardware resources, and its running environment involves complex software stacks, dependency libraries and configuration parameters. In actual operation, faults are therefore easily caused by factors such as hardware failures, software conflicts, resource contention or configuration anomalies. Because the cluster running environment is complex and the conditions at the moment of failure are transient, the fault scenario cannot be accurately reproduced in a local or test environment when the cause is later investigated. Fault diagnosis therefore takes longer, which seriously affects the operation and maintenance efficiency and the service recovery speed of the artificial intelligence cluster.
Disclosure of Invention

In view of the above problems, embodiments of the present invention provide an artificial intelligence cluster fault processing method, device, equipment and medium that overcome, or at least partially solve, the problem that a fault scenario cannot be accurately reproduced in a local or test environment, which prolongs fault diagnosis and seriously affects the operation and maintenance efficiency and service recovery speed of an artificial intelligence cluster.

In a first aspect, an embodiment of the present invention provides an artificial intelligence cluster fault processing method, the method comprising: generating an environment snapshot of the current running environment of an artificial intelligence cluster when the artificial intelligence cluster fails; determining, in a resource pool corresponding to the artificial intelligence cluster, a target hardware resource and a target software environment that match the environment snapshot; and performing fault reproduction of the artificial intelligence cluster according to the target hardware resource and the target software environment.

Optionally, the environment snapshot includes hardware configuration snapshot information and software configuration snapshot information, and determining, in the resource pool corresponding to the artificial intelligence cluster, a target hardware resource and a target software environment that match the environment snapshot includes: determining, in the resource pool corresponding to the artificial intelligence cluster, a target hardware resource that matches the hardware configuration snapshot information, and determining a target software environment that matches the software configuration snapshot information.
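
As a non-limiting illustration, the flow of the first aspect (environment snapshot, then matching in the resource pool, then a reproduction plan) might be sketched in Python as follows. The field names, the `PoolNodeGroup` type, the exact-match rule and the dictionary returned by `plan_reproduction` are all assumptions made for the example; the text above does not prescribe any particular data layout.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class HardwareSnapshot:
    unit_model: str   # model of the processing units at fault time
    unit_count: int   # number of processing units in use
    topology: str     # network topology label (hypothetical encoding)


@dataclass(frozen=True)
class SoftwareSnapshot:
    driver_version: str  # taken from the software stack information
    code_revision: str   # taken from the code information


@dataclass(frozen=True)
class EnvironmentSnapshot:
    hardware: HardwareSnapshot
    software: SoftwareSnapshot


@dataclass(frozen=True)
class PoolNodeGroup:
    name: str
    unit_model: str
    free_units: int
    topology: str


def match_hardware(snapshot: HardwareSnapshot,
                   pool: List[PoolNodeGroup]) -> Optional[PoolNodeGroup]:
    """Return the first pool entry that satisfies the assumed matching rule."""
    for group in pool:
        if (group.unit_model == snapshot.unit_model
                and group.free_units >= snapshot.unit_count
                and group.topology == snapshot.topology):
            return group
    return None


def match_software(snapshot: SoftwareSnapshot) -> Dict[str, str]:
    """Describe the software environment to rebuild for the reproduction."""
    return {"driver": snapshot.driver_version, "code": snapshot.code_revision}


def plan_reproduction(snapshot: EnvironmentSnapshot,
                      pool: List[PoolNodeGroup]) -> Optional[dict]:
    """Combine both matches into a plan, or None if no hardware matches."""
    hardware = match_hardware(snapshot.hardware, pool)
    if hardware is None:
        return None
    return {
        "hardware": hardware.name,
        "units": snapshot.hardware.unit_count,
        "software": match_software(snapshot.software),
    }
```

A caller would build the `EnvironmentSnapshot` when the fault is detected and pass the resulting plan to whatever component actually launches the reproduction job; that launcher is outside the scope of this sketch.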
Optionally, the software configuration snapshot information comprises code information and/or software stack information, and the hardware configuration snapshot information comprises hardware context information and/or configuration parameter information of the artificial intelligence cluster.

Optionally, the target hardware resource includes a first target hardware resource, and determining, in the resource pool corresponding to the artificial intelligence cluster, a target hardware resource that matches the hardware configuration snapshot information includes: determining the model of at least one processing unit, the number of the processing units and the network topology relationship of the processing units according to the hardware context information; and determining, according to a preset resource matching rule, a first target hardware resource in the resource pool corresponding to the artificial intelligence cluster that matches the model of the at least one processing unit, the number of the processing units and the network topology relationship of the processing units, wherein the preset resource matching rule specifies how resource matching is performed.

Optionally, the target hardware resource comprises a second target hardware resource, and the method further comprises: determining the model of at least one candidate processing unit, the number of the candidate processing units and the network topology relationship of the candidate processing units according to the code information in the case where a first target hardware resource matching the model of the at least one processing unit, the number of the processing units and the