
CN-122001746-A - Fault diagnosis method and device based on causal state space prompt, medium and electronic equipment


Abstract

The application provides a fault diagnosis method, device, medium and electronic equipment based on causal state space prompts, and relates to the field of data processing. The method comprises: mapping each IT operation and maintenance event into an event feature vector; inputting the event feature vectors into an online state space model in time order and outputting the latent hidden state corresponding to each time step; in response to receiving an IT operation and maintenance diagnosis request, mapping the latent hidden state corresponding to the current time step into a plurality of discrete state tokens; splicing the problem description corresponding to the IT operation and maintenance service interruption, the IT operation and maintenance events occurring within a target time window, and the discrete state tokens into a target prompt; and inputting the target prompt into a frozen large language model to obtain a fault diagnosis result. The application retains the natural language reasoning capability of the large language model, avoids hallucinations caused by a limited context window and missing key causal information, and, because the large language model is used frozen, greatly reduces training cost and compute consumption.
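
The final splicing step can be illustrated with a minimal sketch. It assumes the target time window runs from a preset duration before the service interruption to a preset duration after it, as stated in claim 1. The `Event` class, the function name `build_target_prompt`, the 30-minute defaults, the prompt layout, and token strings such as `<TRIG_3>` are hypothetical and are not specified by the application.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Event:
    timestamp: datetime
    text: str


def build_target_prompt(problem, events, state_tokens, outage_time,
                        before=timedelta(minutes=30), after=timedelta(minutes=30)):
    """Splice problem description, in-window events, and state tokens into one prompt."""
    # Window bounds: preset duration before and after the interruption (assumed values).
    start, end = outage_time - before, outage_time + after
    in_window = sorted(
        (e for e in events if start <= e.timestamp <= end),
        key=lambda e: e.timestamp,
    )
    lines = [
        f"Problem: {problem}",
        "State tokens: " + " ".join(state_tokens),
        "Events in window:",
    ]
    lines += [f"- {e.timestamp.isoformat()} {e.text}" for e in in_window]
    return "\n".join(lines)


# Toy data (hypothetical): the 09:00 change lies outside the 30-minute window.
outage = datetime(2026, 1, 1, 12, 0)
events = [
    Event(datetime(2026, 1, 1, 9, 0), "config change pushed to gateway"),
    Event(datetime(2026, 1, 1, 11, 50), "gateway 5xx rate rising"),
    Event(datetime(2026, 1, 1, 12, 10), "checkout service timeout"),
]
prompt = build_target_prompt("Checkout unavailable", events, ["<TRIG_3>", "<PROP_7>"], outage)
print(prompt)
```

In this toy data the early configuration change falls outside the window and never reaches the prompt as raw text. Evidence of that kind is what the discrete state tokens are meant to carry into the frozen model.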

Inventors

  • REN WENCAN
  • ZHU JUN
  • RAO YUAN
  • CHEN DAOCUN
  • LIU JIE
  • XIE XIANHONG
  • HOU DONGFU
  • WANG FENGLIANG

Assignees

  • Anhui Agricultural University (安徽农业大学)

Dates

Publication Date
2026-05-08
Application Date
2026-04-08

Claims (10)

  1. A fault diagnosis method based on causal state space prompts, the method comprising: collecting a plurality of IT operation and maintenance events; mapping each IT operation and maintenance event into a corresponding event feature vector; continuously inputting the event feature vectors into an online state space model in time order, and outputting the latent hidden state corresponding to each time step, wherein the online state space model has bounded memory capacity, the latent hidden state represents the fault causal dependency history features in the event stream formed by the event feature vectors up to the corresponding time step, and the latent hidden state is maintained based on an update equation with a content-dependent gating mechanism; in response to receiving an IT operation and maintenance diagnosis request, mapping the latent hidden state corresponding to the current time step into a plurality of discrete state tokens, wherein each discrete state token corresponds to a vector space; and splicing the problem description corresponding to the IT operation and maintenance service interruption, the IT operation and maintenance events occurring within a target time window, and the plurality of discrete state tokens into a target prompt, and inputting the target prompt into a frozen large language model to obtain a fault diagnosis result, wherein the start time of the target time window is the time point a preset duration before the IT operation and maintenance service interruption, and the end time is the time point a preset duration after the IT operation and maintenance service interruption.
  2. The fault diagnosis method based on causal state space prompts according to claim 1, wherein mapping each IT operation and maintenance event into a corresponding event feature vector comprises: mapping each IT operation and maintenance event into a structured tuple by a preset encoder, wherein the structured tuple comprises an event type, an entity set and an additional attribute description, and the loss of the preset encoder comprises a template prediction loss and a contrastive invariance loss; mapping the structured tuple to a normalized discrete event ID; and converting the normalized discrete event ID into an event feature vector.
  3. The fault diagnosis method based on causal state space prompts according to claim 1, wherein the latent hidden state satisfies the following condition: $s_t = \alpha_t \odot s_{t-1} + \beta_t \odot \phi(u_t)$; wherein $s_t$ is the latent hidden state corresponding to the t-th time step, $u_t$ is the normalized event ID corresponding to the t-th time step, $\phi(u_t)$ is the event feature vector corresponding to the t-th time step, $s_{t-1}$ is the latent hidden state corresponding to the (t-1)-th time step, $\alpha_t$ and $\beta_t$ are element-wise gating vectors, and $\odot$ denotes element-wise multiplication.
  4. The fault diagnosis method based on causal state space prompts according to claim 3, wherein $\alpha_t$ and $\beta_t$ are not fixed, and $(\alpha_t, \beta_t)$ satisfies the following condition: $(\alpha_t, \beta_t) = \sigma(W[\phi(u_t); s_{t-1}])$; wherein $[\phi(u_t); s_{t-1}]$ is the vector obtained by concatenating the event feature vector corresponding to the t-th time step and the latent hidden state corresponding to the (t-1)-th time step, $W$ is a learnable weight matrix, and $\sigma$ is the Sigmoid activation function.
  5. The fault diagnosis method based on causal state space prompts according to claim 1, wherein the evolution trajectory between latent hidden states is supervised by a causal loss function, and the causal loss function satisfies the following condition: $L_{\text{causal}} = \sum_{i,j} \mathrm{BCE}(A_{ij}, g(s_i, s_j))$; wherein $L_{\text{causal}}$ is the causal loss value, BCE is the binary cross-entropy loss, $A_{ij}$ represents the degree of influence of IT operation and maintenance event i on IT operation and maintenance event j, $g(\cdot)$ is a learnable scorer that indicates whether a causal relation exists between IT operation and maintenance event i and IT operation and maintenance event j, and $s_i$ and $s_j$ are the latent hidden states corresponding to IT operation and maintenance event i and IT operation and maintenance event j, respectively.
  6. The fault diagnosis method based on causal state space prompts according to claim 1, wherein, in response to receiving an IT operation and maintenance diagnosis request, mapping the latent hidden state corresponding to the current time step into a plurality of discrete state tokens comprises: mapping the latent hidden state corresponding to the current time step to a plurality of different vector spaces to obtain a plurality of projection vectors; inputting each projection vector into a learnable vector quantization module to find the prototype vector nearest to that projection vector; and mapping each prototype vector into a discrete state token by a preset decoder head.
  7. The fault diagnosis method based on causal state space prompts according to claim 1, wherein the discrete state tokens have a diversity mechanism based on codebook-level diversity processing and slot-level structure allocation processing, wherein the codebook-level diversity processing maximizes the codebook usage entropy based on a commitment penalty, the slot-level structure allocation processing performs token slot allocation, and the token slot allocation result comprises trigger evidence slots, propagation evidence slots and constraint evidence slots.
  8. A fault diagnosis device based on causal state space prompts, the device comprising: a collection unit for collecting a plurality of IT operation and maintenance events; a mapping unit for mapping each IT operation and maintenance event into a corresponding event feature vector; an output unit for continuously inputting the event feature vectors into an online state space model in time order and outputting the latent hidden state corresponding to each time step, wherein the online state space model has bounded memory capacity, the latent hidden state represents the fault causal dependency history features in the event stream formed by the event feature vectors up to the corresponding time step, and the latent hidden state is maintained based on an update equation with a content-dependent gating mechanism; a token acquisition unit for, in response to receiving an IT operation and maintenance diagnosis request, mapping the latent hidden state corresponding to the current time step into a plurality of discrete state tokens, wherein each discrete state token corresponds to a vector space; and a result acquisition unit for splicing the problem description corresponding to the IT operation and maintenance service interruption, the IT operation and maintenance events occurring within a target time window, and the plurality of discrete state tokens into a target prompt, and inputting the target prompt into a frozen large language model to obtain a fault diagnosis result, wherein the start time of the target time window is the time point a preset duration before the IT operation and maintenance service interruption, and the end time is the time point a preset duration after the IT operation and maintenance service interruption.
  9. A non-transitory computer readable storage medium storing at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement the method of any one of claims 1-7.
  10. An electronic device comprising a processor and the non-transitory computer readable storage medium of claim 9.
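
The gated state update of claims 3 and 4 and the quantization step of claim 6 can be sketched together in plain Python. This is an illustration under stated assumptions, not the claimed implementation. It assumes the update has the form s_t = alpha_t * s_{t-1} + beta_t * e_t with (alpha_t, beta_t) = sigmoid(W [e_t; s_{t-1}]), where e_t stands for the event feature vector. It also assumes a single codebook shared by all projections and replaces the decoder head with simple string formatting. All function names, the toy weights, and the `<STATE_k>` token format are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def gated_update(s_prev, e_t, W):
    """One state step with content-dependent gates (assumed form).

    (alpha, beta) = sigmoid(W [e_t; s_prev]); s_t = alpha * s_prev + beta * e_t.
    W has shape (2d, 2d), where d = len(s_prev) = len(e_t).
    """
    d = len(s_prev)
    gates = [sigmoid(z) for z in matvec(W, e_t + s_prev)]  # concatenation [e_t; s_prev]
    alpha, beta = gates[:d], gates[d:]
    return [a * s + b * e for a, s, b, e in zip(alpha, s_prev, beta, e_t)]


def nearest_prototype(vec, codebook):
    """Index of the prototype closest to vec in squared Euclidean distance."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, proto)) for proto in codebook]
    return dists.index(min(dists))


def state_to_tokens(s_t, projections, codebook):
    """Project the state into several spaces and quantize each projection."""
    return [f"<STATE_{nearest_prototype(matvec(P, s_t), codebook)}>" for P in projections]


# Toy run with d = 2: zero gate weights give alpha = beta = 0.5 everywhere.
W = [[0.0] * 4 for _ in range(4)]
state = [0.0, 0.0]
for e_t in ([1.0, 0.0], [0.0, 1.0], [0.0, 1.0]):
    state = gated_update(state, e_t, W)  # the state keeps a fixed size (bounded memory)

projections = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]  # identity and swap
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
tokens = state_to_tokens(state, projections, codebook)
print(state, tokens)
```

However many events are consumed, the state stays a vector of length d, which is the bounded-memory property the claims rely on. In a trained model W, the projections, and the codebook would be learned rather than fixed by hand.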

Description

Fault diagnosis method and device based on causal state space prompt, medium and electronic equipment

Technical Field

The application belongs to the field of data processing, and particularly relates to a fault diagnosis method, device, medium and electronic equipment based on causal state space prompts.

Background

Modern IT services produce large volumes of heterogeneous telemetry data, including low-level system logs, service-level structured logs, traces, metric-based alerts, and manually written trouble tickets. During a service outage, operation and maintenance personnel must infer the root cause and choose mitigation measures under uncertainty, time pressure, and stringent business constraints. Instruction-following large language models make automating or assisting this workflow increasingly feasible, because a single model can unify log parsing, hypothesis generation, and tool-based remediation through natural language. However, failures in real production often exceed what modern large language models (LLMs) can process within a limited context window. A single failure may span several hours of continuous event flow, constantly changing topology, cascading retries, partial rollbacks, and cross-service dependencies. Thus, the technical bottleneck is not only understanding a single record, but also maintaining consistent long-horizon reasoning as evidence continues to arrive.

The operation and maintenance fault scenario differs significantly from existing general long-text benchmark scenarios, and these differences systematically stress sequence models along the following dimensions. The first is the time dimension: root causes may occur early, while symptoms appear much later (e.g., configuration drift, progressive resource leakage, certificate expiration).
The second is causal sparsity: a key trigger may be rare and weakly expressed, buried in repeated normal noise. The third is distribution shift and drift: templates, fields, and identifiers vary with version and deployment, and under overload the monitoring tools themselves may degrade. The fourth is action constraints: diagnosis and mitigation are inseparable, and proposed operations must comply with security policies, approval procedures, and operation and maintenance manuals. Because of these characteristics, a strategy of simply adding more tokens does not match the actual requirement of fault response, namely preserving the causal chain of the fault in bounded memory.

Much work has sought to extend the effective context length of the Transformer through more efficient attention mechanisms, sparse or linear attention variants, or explicit recurrence and segment-level memory. Positional extrapolation and long-context fine-tuning further improve length generalization. Streaming inference methods retain a fixed set of sink tokens to stabilize model behavior beyond the training window. At the same time, retrieval-augmented prompting is widely used to bypass context limits by retrieving relevant evidence. Despite significant progress, these approaches have drawbacks: retrieval may miss causal triggers when the query is symptom-oriented; sliding-window and sink-token heuristics cannot guarantee that the retained evidence preserves the dependency structure; and extended attention mechanisms are too computationally expensive for continuously running streaming telemetry systems. Domain-specific log analysis has also shifted from traditional parsing and sequence modeling to foundation model interfaces.
Log parsers and anomaly detectors improve robustness and interpretability, but most still assume that the sequence length is limited or rely on a fixed window. Furthermore, when a fault spans a long horizon, the dominant failure mode is often that a trigger is forgotten or wrongly weighted over the long horizon, rather than a lack of local pattern recognition capability. Even highly capable LLM reasoners may hallucinate or over-commit when key evidence is truncated, incorrect, or drowned in noise. This motivates an interface that (i) preserves causal evidence in bounded memory and (ii) exposes that evidence in a stable, interpretable form to a frozen reasoner. Recent studies on tool use and verification have shown that structured interfaces and constraint checking can reduce unconstrained generation, but these approaches still require a memory mechanism that reliably preserves long-horizon causes. Therefore, designing a bounded memory framework that maintains causal relationships under streaming updates becomes the key to solving the current technical bottleneck.

Disclosure of Invention

In view of the above technical problems, the application provides a fault diagnosis method, a dev