CN-122027458-A - Fault detection method and device for intelligent computing platform, electronic equipment and medium
Abstract
The disclosure provides a fault detection method and device for an intelligent computing platform, electronic equipment and a medium. The method comprises: obtaining heterogeneous data of at least one of an infrastructure layer, a platform service layer and a business system layer in the intelligent computing platform, and performing fusion processing on the heterogeneous data to obtain a target data stream; decomposing at least one of a fault analysis task, a fault positioning task and a fault recovery task of the intelligent computing platform according to the target data stream to obtain at least one subtask; determining a diagnosis context according to the target data stream, and determining at least one model service through the diagnosis context; and obtaining a fault detection result or a fault recovery result of the intelligent computing platform according to the diagnosis context, the at least one subtask and the at least one model service.
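As a rough illustration of the fusion step, the sketch below normalizes records from two layers into one schema and merges them by timestamp. The record fields, `UnifiedRecord`, and the function names are hypothetical, and the inputs are assumed to be time-ordered already; the disclosure does not prescribe this representation.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class UnifiedRecord:
    """Hypothetical unified data model shared by all layers."""
    timestamp: float  # seconds
    layer: str        # e.g. "infrastructure" or "platform"
    source: str
    payload: str


def normalize_infra(rows: Iterable[dict]) -> Iterator[UnifiedRecord]:
    # Assumed infrastructure format: {"ts_ms": ..., "device": ..., "metric": ...}
    for r in rows:
        yield UnifiedRecord(r["ts_ms"] / 1000.0, "infrastructure",
                            r["device"], r["metric"])


def normalize_platform(rows: Iterable[dict]) -> Iterator[UnifiedRecord]:
    # Assumed platform format: {"time": seconds, "service": ..., "log": ...}
    for r in rows:
        yield UnifiedRecord(float(r["time"]), "platform",
                            r["service"], r["log"])


def fuse(*streams: Iterable[UnifiedRecord]) -> Iterator[UnifiedRecord]:
    """Merge time-ordered streams into one time-ordered target stream."""
    return heapq.merge(*streams, key=lambda rec: rec.timestamp)


infra = [
    {"ts_ms": 1000, "device": "npu-0", "metric": "temp=91"},
    {"ts_ms": 3000, "device": "npu-0", "metric": "temp=95"},
]
platform = [{"time": 2, "service": "scheduler", "log": "job retry"}]

target_stream = list(fuse(normalize_infra(infra), normalize_platform(platform)))
```

Merging lazily with `heapq.merge` keeps memory use flat for long streams, but it relies on each input being sorted, which is why an alignment step would have to run before it when source clocks disagree.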
Inventors
- WU JIXIN
- CHANG YUE
- WANG JIELI
- FENG YUANYUAN
- HOU JINFENG
- YUAN JING
- XU LI
- WANG JIANG
- GU MING
- LUO XINMEI
- YANG HAI
- LI HUI
- ZHAO YU
- ZHENG QING
- ZHOU YIFEI
Assignees
- 中国移动通信集团设计院有限公司
- 中国移动通信集团有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-24
Claims (13)
- 1. A fault detection method for an intelligent computing platform, comprising: acquiring heterogeneous data of at least one of an infrastructure layer, a platform service layer and a business system layer in the intelligent computing platform, and performing fusion processing on the heterogeneous data to obtain a target data stream; decomposing at least one of a fault analysis task, a fault positioning task and a fault recovery task of the intelligent computing platform according to the target data stream to obtain at least one subtask; determining a diagnosis context according to the target data stream, and determining at least one model service through the diagnosis context; and acquiring a fault detection result or a fault recovery result of the intelligent computing platform according to the diagnosis context, the at least one subtask and the at least one model service.
- 2. The method of claim 1, wherein the performing fusion processing on the heterogeneous data to obtain the target data stream comprises: performing standardization processing on the heterogeneous data through a unified data model and an interface specification to obtain at least one piece of processed heterogeneous data; performing alignment processing on the at least one piece of processed heterogeneous data to obtain at least one piece of aligned heterogeneous data; and fusing the at least one piece of aligned heterogeneous data to obtain the target data stream.
- 3. The method of claim 2, wherein the performing alignment processing on the at least one piece of processed heterogeneous data to obtain at least one piece of aligned heterogeneous data comprises: determining a data source characteristic corresponding to the at least one piece of processed heterogeneous data, wherein the data source characteristic comprises at least one of clock accuracy, network delay, timestamp quality, and whether an identifiable event characteristic exists in the data source; determining an adaptive time alignment method according to the data source characteristic; and performing alignment processing on the at least one piece of processed heterogeneous data through the adaptive time alignment method to obtain the at least one piece of aligned heterogeneous data.
- 4. The method according to claim 1, wherein the method further comprises: constructing a multi-agent cooperative architecture based on an agent-to-agent (A2A) communication protocol, wherein the multi-agent cooperative architecture comprises a central agent and a plurality of post agents, the central agent is used for calling at least one post agent based on an asynchronous task interface of the A2A protocol, and the at least one post agent is used for executing at least one of the fault analysis task, the fault positioning task and the fault recovery task.
- 5. The method of claim 4, wherein the central agent and the post agents are located in a centralized service, one model service corresponds to one model node, at least one model node is located in at least one heterogeneous computing center, and the model service supports the Model Context Protocol (MCP).
- 6. The method of claim 4, wherein the constructing a multi-agent cooperative architecture based on the A2A protocol comprises: creating an agent card for each agent, wherein the agent is the central agent or a post agent, and the agent card comprises at least one of the name of the agent, the capability of the agent, the input data format of the agent, the output data format of the agent and the service endpoint address of the agent; constructing a service endpoint for the post agent and exposing the service endpoint as a task submission interface supporting the A2A protocol; and receiving, by the post agent through the service endpoint, a task call request sent by the central agent or another post agent, and returning a task state or an execution result.
- 7. The method of claim 6, wherein the method further comprises: receiving an initial task request input by a user, and performing intention recognition and semantic supplementation on the initial task request to obtain a processed task request, wherein the initial task request is used for triggering execution of at least one of the fault analysis task, the fault positioning task and the fault recovery task; acquiring the agent cards of the registered post agents, and processing the processed task request according to the content described by the agent cards to obtain task description information; determining an optimal agent cooperation path for executing the at least one subtask according to the task description information and a first capability topological graph, wherein the first capability topological graph is used for describing the capabilities of the registered post agents and the connection relations between different post agents, and the optimal agent cooperation path comprises the at least one post agent; and dynamically updating the first capability topological graph according to the task state or the execution result to obtain a second capability topological graph.
- 8. The method of claim 7, wherein the method further comprises: monitoring a response time index and/or a success rate index of at least one call link of the at least one post agent, and identifying an abnormal link from the at least one call link according to the response time index and/or the success rate index; automatically triggering a circuit-breaking mechanism for the abnormal link, so as to interrupt subsequent task requests related to the abnormal link and to distribute no new task to the post agent related to the abnormal link within a preset duration; and updating the optimal agent cooperation path based on the second capability topological graph to obtain an updated agent cooperation path.
- 9. The method of claim 7, wherein the method further comprises: defining an MCP protocol specification, wherein the MCP protocol specification is used for transmitting the heterogeneous data, the diagnosis context, intermediate results, and the fault detection result or the fault recovery result between different agents and between the agents and a data collector, and the data collector is used for collecting the heterogeneous data; constructing an MCP service core component, wherein the MCP service core component is used for implementing registration and discovery of model services, data sources and data collectors, and the data source is at least one of the infrastructure layer, the platform service layer and the business system layer; packaging a fault diagnosis capability as the model service and providing a feature extraction application programming interface (API), wherein the feature extraction API supports an accessed model service or an external tool in extracting data features from the target data stream on demand, and injecting features related to the fault detection result or the fault recovery result into a feature library of the intelligent computing platform; and normalizing the interface of the agent and the interface of the data collector so that they output data in a format conforming to the MCP protocol specification.
- 10. A fault detection device for an intelligent computing platform, comprising: an acquisition module, used for acquiring heterogeneous data of at least one of an infrastructure layer, a platform service layer and a business system layer in the intelligent computing platform, and performing fusion processing on the heterogeneous data to obtain a target data stream; a decomposition module, used for decomposing at least one of a fault analysis task, a fault positioning task and a fault recovery task of the intelligent computing platform according to the target data stream to obtain at least one subtask; a determination module, used for determining a diagnosis context according to the target data stream and determining at least one model service through the diagnosis context; and a detection module, used for acquiring a fault detection result or a fault recovery result of the intelligent computing platform according to the diagnosis context, the at least one subtask and the at least one model service.
- 11. An electronic device comprising a processor and a memory communicatively coupled to the processor, wherein the memory stores computer-executable instructions, and the processor executes the computer-executable instructions stored in the memory to implement the method of any one of claims 1-9.
- 12. A non-transitory computer readable storage medium having stored therein computer executable instructions which when executed by a processor are for implementing the method of any of claims 1-9.
- 13. A computer program product comprising a computer program which, when executed by a processor, implements the method of any of claims 1-9.
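The mechanism of claim 8 that stops dispatching tasks over an abnormal link (commonly called a circuit breaker) can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the class names, the thresholds, and the reset of statistics after tripping are assumptions not found in the claims, and re-planning the cooperation path from the capability topological graph is left out.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class LinkStats:
    calls: int = 0
    failures: int = 0
    total_latency: float = 0.0
    open_until: float = 0.0  # no new tasks are dispatched before this time


@dataclass
class CircuitBreaker:
    max_avg_latency: float = 2.0   # seconds; assumed threshold
    min_success_rate: float = 0.8  # assumed threshold
    min_calls: int = 5             # samples required before judging a link
    cooldown: float = 30.0         # the "preset duration", in seconds
    clock: Callable[[], float] = time.monotonic
    links: Dict[str, LinkStats] = field(default_factory=dict)

    def record(self, agent: str, latency: float, ok: bool) -> None:
        """Record one call to a post agent and trip the breaker if needed."""
        s = self.links.setdefault(agent, LinkStats())
        s.calls += 1
        s.total_latency += latency
        if not ok:
            s.failures += 1
        if s.calls < self.min_calls:
            return
        success_rate = 1 - s.failures / s.calls
        avg_latency = s.total_latency / s.calls
        if (success_rate < self.min_success_rate
                or avg_latency > self.max_avg_latency):
            # Abnormal link: block it for the cooldown and restart the stats.
            self.links[agent] = LinkStats(
                open_until=self.clock() + self.cooldown)

    def allows(self, agent: str) -> bool:
        """Return True if new tasks may be dispatched to this agent."""
        s = self.links.get(agent)
        return s is None or self.clock() >= s.open_until
```

A dispatcher would call `allows(agent)` before sending a task and `record(...)` after each call completes. Injecting `clock` makes the cooldown testable without sleeping.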
Description
Fault detection method and device for intelligent computing platform, electronic equipment and medium
Technical Field
The disclosure relates to the technical field of intelligent operation and maintenance, and in particular to a fault detection method and device for an intelligent computing platform, electronic equipment and a medium.
Background
To ensure long-term stable operation of an intelligent computing platform (also called an intelligent computing center), the platform must be maintained and managed according to an operation and maintenance scheme. Typical failure scenarios for intelligent computing platforms include degradation of high-speed optical modules and damage to a neural network processor (Neural Processing Unit, NPU) or a graphics processor (Graphics Processing Unit, GPU). In the related art, most links in the operation and maintenance process for such faults depend on manual operation. As a result, fault handling for the intelligent computing platform is neither timely nor efficient, which affects the performance stability of the platform.
Disclosure of Invention
The present disclosure aims to solve, at least to some extent, one of the technical problems in the related art. To this end, the disclosure provides a fault detection method and device for an intelligent computing platform, an electronic device, a non-transitory computer readable storage medium and a computer program product, which can effectively improve the timeliness and efficiency of fault handling of the intelligent computing platform and enhance the performance stability of the intelligent computing platform.
An embodiment of a first aspect of the present disclosure provides a fault detection method for an intelligent computing platform, including: obtaining heterogeneous data of at least one of an infrastructure layer, a platform service layer and a business system layer in the intelligent computing platform, and performing fusion processing on the heterogeneous data to obtain a target data stream; decomposing at least one of a fault analysis task, a fault positioning task and a fault recovery task of the intelligent computing platform according to the target data stream to obtain at least one subtask; determining a diagnosis context according to the target data stream, and determining at least one model service through the diagnosis context; and obtaining a fault detection result or a fault recovery result of the intelligent computing platform according to the diagnosis context, the at least one subtask and the at least one model service.
An embodiment of a second aspect of the present disclosure provides a fault detection device for an intelligent computing platform, which comprises an acquisition module, a decomposition module, a determination module and a detection module. The acquisition module is used for acquiring heterogeneous data of at least one of an infrastructure layer, a platform service layer and a business system layer in the intelligent computing platform, and performing fusion processing on the heterogeneous data to obtain a target data stream. The decomposition module is used for decomposing at least one of a fault analysis task, a fault positioning task and a fault recovery task of the intelligent computing platform according to the target data stream to obtain at least one subtask. The determination module is used for determining a diagnosis context according to the target data stream and determining at least one model service through the diagnosis context. The detection module is used for acquiring a fault detection result or a fault recovery result of the intelligent computing platform according to the diagnosis context, the at least one subtask and the at least one model service.
An embodiment of a third aspect of the present disclosure provides an electronic device, including a processor and a memory communicatively connected to the processor, where the memory stores computer-executable instructions, and the processor executes the computer-executable instructions stored in the memory to implement the fault detection method for an intelligent computing platform provided in the embodiment of the first aspect of the present disclosure.
An embodiment of a fourth aspect of the present disclosure provides a non-transitory computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the fault detection method for an intelligent computing platform provided in the embodiment of the first aspect of the present disclosure.
An embodiment of a fifth aspect of the present disclosure provides a computer program product comprising a computer program which, when executed by a processor, implements the fault detection method for an intelligent computing platform provided in the embodiment of the first aspect of the present disclosure.
The fault detecti