CN-121996448-A - Method and device for fault diagnosis

Abstract

The present disclosure provides a method and apparatus for fault diagnosis, a computing device, a computer-readable storage medium, a computer program product, and a chip. In some embodiments, diagnostic information corresponding to input information describing a fault may be determined through retrieval-augmented generation (RAG) using a fault database that stores causal chains related to fault information. Because the fault database stores richer and more comprehensive fault-related information, the root cause of a fault can be determined quickly from the causal chains during retrieval-augmented generation of the diagnostic information. This speeds up fault location, improves fault-handling efficiency, reduces dependence on operation and maintenance personnel, and further reduces cost.

Inventors

  • LIANG FEI
  • XU JIACHEN
  • LI JUN
  • CHANG JIA
  • LIU ZIQI
  • PAN SHIJIE
  • WANG YAOYUAN
  • ZHANG ZIYANG

Assignees

  • Huawei Technologies Co., Ltd.

Dates

Publication Date
2026-05-08
Application Date
2024-11-08

Claims (13)

  1. A method for fault diagnosis, comprising: acquiring input information describing a fault; and determining, using retrieval-augmented generation (RAG) based on a fault database storing causal chains related to fault information, diagnostic information corresponding to the input information describing the fault.
  2. The method of claim 1, wherein the fault database is obtained by: acquiring data associated with system operation; and generating the fault database based on the data using causal analysis.
  3. The method of claim 2, wherein the data comprises at least one of log document data, trace call data, monitored key performance indicator (KPI) data, or historical fault diagnosis data.
  4. The method of claim 2 or 3, wherein acquiring data associated with system operation comprises: acquiring data associated with system operation that is obtained using an event extraction method.
  5. The method of any one of claims 2 to 4, wherein the causal analysis comprises Bayesian causal inference or score-based causal discovery.
  6. The method of any one of claims 1 to 5, wherein determining the diagnostic information comprises: generating, based on the fault database, prompt information related to the input information describing the fault; and generating, based on the prompt information and using a RAG large model, the diagnostic information corresponding to the input information describing the fault.
  7. The method of claim 6, wherein generating the prompt information comprises: converting the input information describing the fault into structured input information using a word embedding model; determining, by vector retrieval, a plurality of information items associated with the structured input information from the fault database; and generating the prompt information by reordering the plurality of information items.
  8. The method of any one of claims 1 to 7, wherein the fault database is updated based on new fault information.
  9. The method of any one of claims 1 to 8, wherein the causal chains related to fault information are used to represent causal relationships between different faults.
  10. An apparatus for fault diagnosis, comprising means for implementing the method of any one of claims 1 to 9.
  11. A computing device, comprising: one or more processors; and a memory storing instructions that, when executed by the one or more processors, cause the computing device to perform the method of any one of claims 1 to 9.
  12. A computer program product storing instructions that, when executed, cause an apparatus to perform the method of any one of claims 1 to 9.
  13. A chip or chip system configured to perform the method of any one of claims 1 to 9.
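
As an illustration only, the causal chains recited in claims 1 and 9 can be pictured as directed cause-to-effect links between faults, with a root cause being a fault that has no recorded cause. The following minimal sketch assumes that representation; the fault names, the `CAUSAL_LINKS` list, and both helper functions are hypothetical and are not taken from the disclosure, which does not specify a storage format.

```python
from collections import defaultdict

# Hypothetical causal links stored as (cause, effect) pairs.
# The fault names are illustrative only.
CAUSAL_LINKS = [
    ("disk_full", "db_write_failure"),
    ("db_write_failure", "api_timeout"),
    ("api_timeout", "login_error"),
    ("network_jitter", "api_timeout"),
]


def build_cause_index(links):
    """Map each fault to the faults recorded as its direct causes."""
    causes = defaultdict(list)
    for cause, effect in links:
        causes[effect].append(cause)
    return causes


def root_cause_chains(fault, causes):
    """Return every causal chain that ends at `fault`, root cause first."""
    chains = []

    def walk(node, path):
        # Skip causes already on the path so that a cyclic record cannot loop.
        parents = [p for p in causes.get(node, []) if p not in path]
        if not parents:
            # No further recorded cause: `node` is treated as a root cause.
            chains.append(list(reversed(path)))
            return
        for parent in parents:
            walk(parent, path + [parent])

    walk(fault, [fault])
    return chains


if __name__ == "__main__":
    index = build_cause_index(CAUSAL_LINKS)
    for chain in root_cause_chains("login_error", index):
        print(" -> ".join(chain))
```

Walking backwards from the observed fault keeps the lookup proportional to the size of the relevant chains rather than the whole database, which is one plausible reading of why stored causal chains speed up root-cause determination.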

Description

Method and device for fault diagnosis

Technical Field

Embodiments of the present disclosure relate generally to the field of computers and, more particularly, to a method and apparatus for fault diagnosis.

Background

As clouds, computing clusters, and similar systems grow in scale, the manual maintenance cost caused by faults increases greatly. Fault diagnosis based on intelligent operation and maintenance can help operators find faults quickly and thereby improve the efficiency of fault resolution. However, clouds and computing clusters continue to grow, the managed objects have extended from physical devices to virtual machines, and the scale of network element management is also expanding. More effective schemes for fault diagnosis are therefore needed.

Disclosure of Invention

The present disclosure provides a scheme for fault diagnosis in which diagnostic information is determined using a fault database that stores causal chains related to fault information, so that faults can be located more quickly and fault-handling efficiency is improved.

In a first aspect of the present disclosure, a method for fault diagnosis is provided. The method includes obtaining input information describing a fault and determining, using retrieval-augmented generation (RAG) based on a fault database, diagnostic information corresponding to the input information describing the fault, wherein the fault database stores causal chains related to fault information.
In this way, because the fault database stores causal chains related to fault information, the fault-related information it holds is richer and more comprehensive. During retrieval-augmented generation of the diagnostic information, the root cause of a fault can be determined quickly from the causal chains, which speeds up fault location, improves fault-handling efficiency, reduces dependence on operation and maintenance personnel, and further reduces cost.

In some implementations, determining the diagnostic information includes generating, based on the fault database, prompt information related to the input information describing the fault, and generating, based on the prompt information and using a RAG large model, the diagnostic information corresponding to the input information describing the fault. In this way, embodiments of the present disclosure can generate prompt information based on the fault database, which adapts the large model to fault diagnosis problems and yields more accurate diagnostic information.

In some implementations, generating the prompt information includes converting the input information describing the fault into structured input information using a word embedding model, determining a plurality of information items associated with the structured input information from the fault database via vector retrieval, and generating the prompt information by reordering the plurality of information items. Optionally, the reordering is performed using a reordering model or a large language model. In this way, reordering allows the top-ranked information items to be selected, which improves the generated prompt information. Because causal chains are stored in the fault database, the efficiency and accuracy of generating prompt information through reordering in the online diagnosis stage can be improved.
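
The three steps described above (embedding the input, vector retrieval from the fault database, and reordering to form the prompt information) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the disclosed implementation: the bag-of-words `embed` function stands in for a real word embedding model, the cosine-based `rerank` stands in for a reordering model or large language model, and the `FAULT_DB` entries and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical fault database: each entry holds a fault and its causal chain.
FAULT_DB = [
    {"fault": "api timeout",
     "chain": ["disk full", "database write failure", "api timeout"]},
    {"fault": "login error",
     "chain": ["certificate expired", "login error"]},
    {"fault": "high cpu usage",
     "chain": ["runaway job", "high cpu usage"]},
]


def embed(text):
    """Toy bag-of-words vector; a stand-in for a word embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, db, k=2):
    """Vector retrieval: keep the k entries whose fault text best matches."""
    q = embed(query)
    ranked = sorted(db, key=lambda e: cosine(q, embed(e["fault"])), reverse=True)
    return ranked[:k]


def rerank(query, items):
    """Reorder candidates by similarity to their full causal chain
    (an assumed scoring rule standing in for a reordering model)."""
    q = embed(query)
    return sorted(items,
                  key=lambda e: cosine(q, embed(" ".join(e["chain"]))),
                  reverse=True)


def build_prompt(query, items):
    """Assemble prompt information from the reordered causal chains."""
    lines = ["Known causal chains:"]
    lines += ["- " + " -> ".join(e["chain"]) for e in items]
    lines.append("Fault report: " + query)
    lines.append("Identify the most likely root cause.")
    return "\n".join(lines)


if __name__ == "__main__":
    report = "api timeout after database write failure"
    candidates = rerank(report, retrieve(report, FAULT_DB))
    print(build_prompt(report, candidates))
```

The resulting prompt string would then be passed to the large model; that call is omitted here because the disclosure does not name a specific model or interface.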
Using reordering in retrieval-augmented generation can thus improve the accuracy and quality of answers.

In some implementations, the causal chains related to fault information may represent causal relationships of different fault times. Therefore, the causal chains can be used to quickly determine the root cause of a fault, enabling efficient fault location.

In some implementations, the fault database is obtained by acquiring data associated with system operation and generating the fault database from the data using causal analysis. In this way, causal analysis lets the fault database store richer information, which facilitates subsequent queries of the fault database and the management and analysis of faults in complex systems.

In some implementations, the data includes at least one of log document data, trace call data, monitored key performance indicator (KPI) data, or historical fault diagnosis data.

In some implementations, the data includes data from actual fault scenarios as well as data from fault scenarios generated by binary instrumentation or fault injection. In this way, richer data can be acquired, which helps ensure that a fault database built from the data records fault-related information more completely.

In some implementations, acquiring data associated with system operation includes acquiring data as