CN-121979708-A - LM-driven operation and maintenance system management agent optimization method


Abstract

The invention discloses an LM-driven operation and maintenance system management agent optimization method. The method comprises the following steps: S1, intent recognition basic optimization, namely fine-tuning an LM language model on an operation and maintenance scenario corpus, introducing context association weight factors, and constructing a multi-dimensional intent recognition model, so as to accurately parse fuzzy operation and maintenance instructions. Through five-stage optimization covering intent recognition, fault diagnosis, automated operation, knowledge precipitation, and cross-system integration, the invention solves the problems of existing operation and maintenance agents: poor parsing of fuzzy instructions, single-dimensional fault diagnosis, high security risk in automated operation, inefficient knowledge precipitation, and difficult cross-system collaboration. The method achieves a fuzzy instruction parsing accuracy of at least 90%, a fault localization accuracy of at least 85%, early warning at least 1 hour in advance, a high-risk operation mis-execution rate of at most 0.1%, a knowledge query response time of at most 1 second, and a cross-system data interoperability accuracy of at least 99%, thereby greatly improving operation and maintenance efficiency.

Inventors

WENG ZHIMING
Lin Zunjie
HUANG CHUANHUI
GUO JIANYU
YANG LIGUO
JIANG HAN

Assignees

福建省数字福建云计算运营有限公司

Dates

Publication Date: 20260505
Application Date: 20251212

Claims (9)

1. An LM-driven operation and maintenance system management agent optimization method, characterized by comprising the following steps:
S1, intent recognition basic optimization: fine-tuning an LM language model on an operation and maintenance scenario corpus, introducing context association weight factors, and constructing a multi-dimensional intent recognition model, so as to accurately parse fuzzy operation and maintenance instructions;
S2, fault diagnosis framework construction: based on the optimized intent recognition capability, building a multi-level diagnosis flow of log preprocessing, abnormal feature extraction, multi-dimensional data fusion, and root cause reasoning, and improving fault localization accuracy and prediction timeliness through transfer learning on a historical fault case library;
S3, automated operation optimization: based on the intent analysis and fault diagnosis results, designing a full-flow mechanism of script generation, security verification, permission grading, secondary confirmation, and log audit, and introducing a dynamic permission adaptation module, so as to achieve secure automated execution in different operation and maintenance scenarios;
S4, knowledge precipitation optimization: based on fault handling data and automated operation experience, constructing a closed loop of task processing, experience extraction, structured storage, and intelligent updating, and adopting a semantic similarity matching algorithm to achieve efficient retrieval and adaptive updating of a knowledge base;
S5, cross-system integration optimization: designing a standardized API adaptation layer and a data conversion module that support seamless integration with multiple classes of operation and maintenance tools, including a monitoring system, a CMDB, and a work order system, so as to achieve cross-platform data interoperability and flow collaboration.
2. The LM-driven operation and maintenance system management agent optimization method according to claim 1, wherein the specific steps of S1 are as follows:
S101, operation and maintenance scenario corpus construction: collecting the enterprise's historical operation and maintenance sessions, operation manuals, and public corpora; screening valid data containing "instruction description-intent type-scenario label"; and, after "manual + machine" cross labeling, dividing the data into a training set, a validation set, and a test set at a ratio of 7:2:1 to form a structured corpus;
S102, base LM language model selection and pre-adaptation: selecting the LM language model according to enterprise scale, with small and medium-sized enterprises selecting GPT-4Mini/Llama3 and large enterprises selecting GPT-4, and performing lightweight pre-training of the selected model on Linux and K8S operation and maintenance documents so that it adapts to operation and maintenance domain terminology;
S103, LM language model fine-tuning: training the model pre-adapted in S102 with the training set and the validation set, setting a cosine annealing learning rate and BatchSize = 8-16, applying dropout regularization to suppress overfitting, and stopping training when the intent recognition accuracy on the validation set has not improved for 3 consecutive rounds, while ensuring a validation set accuracy ≥ 85%;
S104, context association weight factor design: constructing a context association total weight calculation model:
total weight = session round weight × 0.5 + keyword frequency weight × 0.3 + scene complexity weight × 0.2;
wherein the session round weight is calculated as 0.8^(n-1), n being the current session round; the keyword frequency weight is calculated as min(keyword occurrence count × 0.2, 0.6), that is, the weight is capped at 0.6 when the count exceeds 3; the scene complexity weight is set to 0.8 for basic, 1.0 for medium, and 1.2 for complex scenarios according to scenario difficulty; and the total weight is fused into the model's attention mechanism to improve multi-round conversation understanding;
S105, fuzzy instruction clarification mechanism design: having the model output its Top-3 intent candidates, taking "Top-1 confidence < 0.8" as the criterion for judging an instruction to be fuzzy, and generating closed, guiding clarification prompts for fuzzy instructions to ensure accurate parsing of the instruction;
S106, model verification: verifying model performance on the test set, requiring an intent recognition accuracy ≥ 90% and a fuzzy instruction clarification success rate ≥ 95%, and finally outputting the optimized intent recognition model.
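By way of non-limiting illustration, the weight calculation of S104 and the fuzzy instruction judgment of S105 may be sketched as follows. The coefficients and thresholds are those recited above; the function names, the scene labels used as dictionary keys, and the (intent, confidence) tuple layout are assumptions made only for this sketch, and the fusion of the total weight into the attention mechanism is not shown.

```python
# Sketch of S104 (context association total weight) and S105 (fuzzy check).
# Names and data layout are hypothetical; coefficients come from the claim.

SCENE_COMPLEXITY_WEIGHT = {"basic": 0.8, "medium": 1.0, "complex": 1.2}


def session_round_weight(n: int) -> float:
    """Return 0.8 ** (n - 1) for the n-th session round (n starts at 1)."""
    if n < 1:
        raise ValueError("session round n must be >= 1")
    return 0.8 ** (n - 1)


def keyword_frequency_weight(count: int) -> float:
    """Return count * 0.2, capped at 0.6 (the cap applies once count > 3)."""
    if count < 0:
        raise ValueError("keyword occurrence count must be >= 0")
    return min(count * 0.2, 0.6)


def total_weight(n: int, keyword_count: int, scene: str) -> float:
    """Combine the three weights with the 0.5 / 0.3 / 0.2 coefficients."""
    return (
        session_round_weight(n) * 0.5
        + keyword_frequency_weight(keyword_count) * 0.3
        + SCENE_COMPLEXITY_WEIGHT[scene] * 0.2
    )


def is_fuzzy(candidates, threshold: float = 0.8) -> bool:
    """Judge an instruction fuzzy when the Top-1 confidence is below 0.8.

    candidates: iterable of (intent, confidence) pairs from the model.
    Only the three highest-confidence candidates are considered.
    """
    top3 = sorted(candidates, key=lambda c: c[1], reverse=True)[:3]
    if not top3:
        raise ValueError("at least one intent candidate is required")
    return top3[0][1] < threshold
```

For example, the first round of a session (n = 1) with two keyword hits in a medium scene gives 1.0 × 0.5 + 0.4 × 0.3 + 1.0 × 0.2, that is, approximately 0.82.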
3. The LM-driven operation and maintenance system management agent optimization method according to claim 1, wherein the specific steps of S2 are as follows:
S201, log preprocessing: connecting to system logs and application logs, performing word segmentation with the Jieba word segmentation library, deduplicating by log ID, marking the abnormal fields "ERROR" and "Timeout", removing redundant data, and retaining valid log entries;
S202, abnormal feature extraction: using a random forest algorithm to extract core features, namely error codes and abnormal keywords, from the preprocessed logs, and constructing a fault abnormal feature vector library;
S203, multi-dimensional data fusion: integrating monitoring indexes and system configuration data to form a "log + monitoring + configuration" multi-dimensional data source, overcoming the limitation of a single data source;
S204, root cause reasoning and fault prediction: locating the fault root cause based on an existing causal relationship graph model combined with transfer learning on the historical fault case library; training an existing fault risk assessment model on historical fault data; and calculating a fault risk value according to the formula:
risk value = α × historical fault frequency + β × real-time monitoring index deviation degree + γ × configuration change influence degree,
wherein α + β + γ = 1 and β > α > γ, so that real-time indexes are given priority; the real-time monitoring index deviation degree is calculated according to the formula: [formula not reproduced in this text]; and early warning of potential faults is thereby realized;
S205, verification: the fault risk assessment model is required to achieve a fault localization accuracy ≥ 85% and a fault prediction lead time ≥ 1 hour, after which the optimized fault diagnosis framework is output.
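By way of non-limiting illustration, the fault risk value of S204 may be sketched as follows. Only the constraints α + β + γ = 1 and β > α > γ are taken from the text; the default coefficients 0.3 / 0.5 / 0.2, the function name, and the assumption that the three inputs are already normalized to the range 0 to 1 are hypothetical, and the calculation of the deviation degree itself is not shown.

```python
# Sketch of the S204 risk value. Default coefficients are assumed values
# that merely satisfy the recited constraints; inputs are assumed to be
# pre-normalized to [0, 1].

def fault_risk_value(
    historical_fault_frequency: float,
    monitoring_deviation: float,
    config_change_impact: float,
    alpha: float = 0.3,
    beta: float = 0.5,
    gamma: float = 0.2,
) -> float:
    """Return alpha * history + beta * deviation + gamma * config impact."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    if not (beta > alpha > gamma):
        raise ValueError("coefficients must satisfy beta > alpha > gamma")
    return (
        alpha * historical_fault_frequency
        + beta * monitoring_deviation
        + gamma * config_change_impact
    )
```

Because β is the largest coefficient, a large real-time deviation dominates the result, which is the stated intent of giving real-time indexes priority.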
4. The LM-driven operation and maintenance system management agent optimization method according to claim 1, wherein the specific steps of S3 are as follows:
S301, operation and maintenance script generation: based on the user's natural-language operation and maintenance instruction, calling the LM language model optimized in S1 to generate a Shell/Python/Ansible script, and simultaneously outputting a description of the script logic for operation and maintenance personnel to check;
S302, security verification: first performing script syntax checking with the ShellCheck tool, then performing logic verification in dry-run mode in a simulation environment, and finally assessing the task risk level according to the formula:
risk score = impact range weight × A + operation reversibility weight × B,
wherein A is the impact range score (1 point for a single machine, 3 points for a cluster, 5 points for the whole system), B is the operation reversibility score (1 point for recoverable, 3 points for semi-recoverable, 5 points for unrecoverable), impact range weight = 0.6, and operation reversibility weight = 0.4; and risk levels 1-5 are divided according to the risk score, the bands being given as low risk 1-2, medium risk 3-4, high risk 5-8, and high risk 9-10;
S303, permission grading and secondary confirmation: connecting to the enterprise's LDAP user system and setting a three-level permission system, namely ordinary operation and maintenance personnel for level 1-2 tasks, a second tier for level 3-4 tasks, and administrators for level 5 tasks, wherein level 4-5 dangerous tasks can be executed only after secondary confirmation, consisting of manual confirmation by an administrator and SMS verification code verification, is completed;
S304, operation log audit: recording a full log of "operator-operation time-instruction content-script content-execution result-risk level", storing it in a tamper-proof PostgreSQL database, and supporting retrieval and tracing by dimension;
S305, dynamic permission adaptation: adjusting permission rules according to the enterprise organizational structure to keep permission control flexible;
S306, verification: executing 200 automation tasks, requiring an operation success rate ≥ 98% and a high-risk operation mis-execution rate ≤ 0.1%.
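By way of non-limiting illustration, the risk score of S302 and the permission gate of S303 may be sketched as follows. The scores and the 0.6 / 0.4 weights are those recited above. The mapping from score to a 1-5 level by rounding is an assumption of this sketch, since the bands recited in the text extend beyond the 1-5 range that the formula can produce; the role name "senior" for the second tier and all function names are likewise hypothetical.

```python
# Sketch of S302 (risk score) and S303 (permission gate).
# Score-to-level rounding and the role names are assumptions.

IMPACT_SCORE = {"single_machine": 1, "cluster": 3, "whole_system": 5}
REVERSIBILITY_SCORE = {"recoverable": 1, "semi_recoverable": 3, "unrecoverable": 5}
IMPACT_WEIGHT = 0.6
REVERSIBILITY_WEIGHT = 0.4

# Highest task level each permission tier may execute.
MAX_LEVEL_BY_ROLE = {"operator": 2, "senior": 4, "admin": 5}


def risk_score(impact: str, reversibility: str) -> float:
    """Return 0.6 * A + 0.4 * B, which lies between 1.0 and 5.0."""
    return (
        IMPACT_WEIGHT * IMPACT_SCORE[impact]
        + REVERSIBILITY_WEIGHT * REVERSIBILITY_SCORE[reversibility]
    )


def risk_level(score: float) -> int:
    """ASSUMPTION: map the score to levels 1-5 by rounding to nearest."""
    return max(1, min(5, round(score)))


def can_execute(
    role: str,
    level: int,
    admin_confirmed: bool = False,
    sms_verified: bool = False,
) -> bool:
    """Apply the three-tier permission rule and the secondary confirmation.

    Level 4-5 tasks additionally require both the administrator's manual
    confirmation and a verified SMS code.
    """
    if level > MAX_LEVEL_BY_ROLE[role]:
        return False
    if level >= 4:
        return admin_confirmed and sms_verified
    return True
```

Under these assumptions, an unrecoverable operation on the whole system scores 0.6 × 5 + 0.4 × 5 = 5.0 and is executed only by an administrator after both confirmations.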
5. The LM-driven operation and maintenance system management agent optimization method according to claim 1, wherein the specific steps of S4 are as follows:
S401, knowledge data acquisition: automatically capturing the fault diagnosis reports output by the fault diagnosis framework of S2 and the automated operation records of S3, while also allowing operation and maintenance personnel to manually enter special scenario experience, wherein a fault diagnosis report comprises the fault phenomenon, the root cause, and the solution, and special scenario experience includes but is not limited to handling methods for rare faults;
S402, knowledge structuring: using a BERT-based named entity recognition algorithm to extract the core information "problem description-fault cause-operation steps-verification method" from the data acquired in S401, formatting and storing it according to a unified Markdown template, marking scenario labels, and building a knowledge index based on Elasticsearch that supports keyword and semantic retrieval;
S403, adaptive knowledge base updating: when a knowledge entry is added, generating the entry's word vector with the BERT model and calculating its semantic similarity with existing knowledge base entries according to the cosine similarity formula:
Sim(A, B) = (A · B) / (‖A‖ × ‖B‖),
wherein A and B are the word vectors of the new entry and an existing entry respectively; if Sim(A, B) ≥ 0.7, the existing entry is updated and supplemented with the new solution; if Sim(A, B) < 0.7, the entry is added as a new entry; and redundant entries, namely those with similarity ≥ 0.9, are automatically detected every month, with only the latest complete entry retained;
S404, knowledge reuse function development: adding a knowledge query entry to the agent's interaction interface and ensuring the query response time, wherein response time = index matching time + data return time ≤ 1 second, and supporting one-click invocation of solutions;
S405, verification: randomly initiating 100 knowledge queries, requiring a response time ≤ 1 second, a knowledge matching accuracy ≥ 90%, and a new knowledge update delay ≤ 2 hours.
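By way of non-limiting illustration, the similarity decision of S403 may be sketched as follows. The 0.7 threshold and the cosine similarity measure are those recited above; the plain Python lists standing in for BERT vectors, the entry identifiers, the function names, and the reading that an entry below the threshold is stored as a new entry are assumptions of this sketch.

```python
# Sketch of S403: cosine similarity and the update-or-add decision.
# Vectors are plain lists here; in practice they would come from BERT.
import math

UPDATE_THRESHOLD = 0.7


def cosine_similarity(a, b) -> float:
    """Return (a . b) / (|a| * |b|); 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def decide_action(new_vector, existing_entries):
    """Return ("update", entry_id) or ("add", None).

    existing_entries: dict mapping a (hypothetical) entry id to its vector.
    The most similar existing entry is updated when its similarity reaches
    the 0.7 threshold; otherwise the new entry is added on its own.
    """
    best_id, best_sim = None, -1.0
    for entry_id, vector in existing_entries.items():
        sim = cosine_similarity(new_vector, vector)
        if sim > best_sim:
            best_id, best_sim = entry_id, sim
    if best_id is not None and best_sim >= UPDATE_THRESHOLD:
        return ("update", best_id)
    return ("add", None)
```

The monthly redundancy check could reuse cosine_similarity with the 0.9 threshold over pairs of stored entries; that pass is omitted here for brevity.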
6. The LM-driven operation and maintenance system management agent optimization method according to claim 1, wherein the specific steps of S5 are as follows:
S501, developing an API gateway supporting the RESTful, SOAP, and gRPC protocols, and wrapping the native APIs of the monitoring system, the CMDB, the work order system, and CI/CD platform tools into a unified "agent interface", thereby reducing calling complexity;
S502, data conversion module development: converting the heterogeneous data of each system into the agent's common JSON format, defining unified data fields, and filling missing values and filtering outliers in the converted data to ensure data accuracy;
S503, integration flow configuration: developing a visual configuration interface that supports "drag-and-drop" configuration of cross-system flows, with the trigger mechanism set to "timed trigger + event trigger";
S504, fault tolerance and performance optimization: designing an API call failure retry mechanism, sending an alarm to operation and maintenance personnel after failure, and optimizing call efficiency in high-concurrency scenarios according to the formula: [formula not reproduced in this text], ensuring an API call success rate ≥ 99.5% and a data transmission delay ≤ 1 second;
S505, verification: completing integration with 3-5 classes of the enterprise's core operation and maintenance systems and verifying according to the formula: [formula not reproduced in this text], requiring a data interoperability accuracy ≥ 99%, a new-system integration period ≤ 3 days, and a flow execution success rate ≥ 98%.
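By way of non-limiting illustration, the retry-then-alarm behaviour of S504 may be sketched as follows. The text recites only that failed API calls are retried and that an alarm is sent after failure; the retry count, the exponential back-off delays, and the function and parameter names are assumptions of this sketch.

```python
# Sketch of S504: retry a failing API call, then alert and re-raise.
# max_retries, base_delay and the back-off schedule are assumed values.
import time


def call_with_retry(call, alert, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Invoke call(); on exception retry up to max_retries more times.

    call:  zero-argument callable wrapping one API request.
    alert: callable taking a message string, used to notify personnel
           once every attempt has failed.
    sleep: injectable for testing; waits base_delay * 2**attempt seconds
           between attempts.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception as exc:  # any failure counts as a failed call
            last_error = exc
            if attempt < max_retries:
                sleep(base_delay * 2 ** attempt)
    alert(f"API call failed after {max_retries + 1} attempts: {last_error}")
    raise last_error
```

Passing the sleep function as a parameter keeps the sketch testable without real delays; a deployment would normally leave it at the default.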
7. The LM-driven operation and maintenance system management agent optimization method according to claim 2, wherein in S104 the keywords are core vocabulary associated with operation and maintenance intents, including "restart", "log", "fault", "performance", "deployment", and "configuration", and are automatically extracted by statistics from the operation and maintenance scenario corpus and periodically updated.
8. The LM-driven operation and maintenance system management agent optimization method according to claim 5, wherein in S402 the scenario labels adopt a hierarchical "primary label-secondary label" labeling mode; the primary labels include "server management", "application deployment", "fault handling", "performance monitoring", and "configuration adjustment"; the secondary labels are subdivided by specific operation and maintenance scenario, including but not limited to "server management-Linux restart" and "fault handling-database connection timeout"; and the label system supports custom extension according to enterprise operation and maintenance requirements.
9. The LM-driven operation and maintenance system management agent optimization method according to claim 6, wherein in S503 the time interval of the timed trigger is adjustable from 1 to 30 minutes through the visual configuration interface; the event trigger supports custom trigger conditions, including but not limited to "trigger work order creation when the fault risk value is ≥ 0.8" and "trigger a knowledge base update after an automated operation executes successfully"; and all integration flow configurations support export for backup and one-click import for restoration.
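By way of non-limiting illustration, the trigger configuration of S503 may be sketched as follows. The 1-30 minute interval range and the two example conditions are those recited above; the class name, the event dictionary keys, and the action names are assumptions of this sketch, and the visual configuration interface and the export and import functions are not shown.

```python
# Sketch of the S503 triggers: a validated timed interval plus two
# event rules. Event keys and action names are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedTrigger:
    """Timed trigger whose interval must lie between 1 and 30 minutes."""

    interval_minutes: int

    def __post_init__(self):
        if not 1 <= self.interval_minutes <= 30:
            raise ValueError("interval must be between 1 and 30 minutes")


# Each rule pairs a condition on an event dict with the action it triggers.
EVENT_RULES = [
    (lambda e: e.get("fault_risk_value", 0.0) >= 0.8, "create_work_order"),
    (lambda e: e.get("automation_succeeded") is True, "update_knowledge_base"),
]


def actions_for(event, rules=EVENT_RULES):
    """Return the actions whose conditions hold for the given event."""
    return [action for condition, action in rules if condition(event)]
```

Custom conditions can be added by appending further (condition, action) pairs to the rule list.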

Description

LM-driven operation and maintenance system management agent optimization method

Technical Field

The invention relates to the technical field of operation and maintenance system management, and in particular to an LM-driven operation and maintenance system management agent optimization method.

Background

As enterprise IT architectures grow more complex, the traditional operation and maintenance mode suffers from low efficiency, high cost, and dependence on manual experience. Although existing LM-driven operation and maintenance agents realize partial automation, they still have the following defects: 1, intent recognition adapts poorly to fuzzy instructions, and context understanding is insufficient; 2, fault diagnosis mostly depends on a single data source, so localization precision and prediction capability are limited; 3, automated operation lacks a sound security mechanism and easily introduces operational risk; 4, knowledge precipitation is mostly passive storage, with low updating efficiency and inconvenient retrieval; 5, cross-system integration adapts poorly, making it difficult to cooperate with an enterprise's existing operation and maintenance tools. The application therefore provides an LM-driven operation and maintenance system management agent optimization method.

Disclosure of Invention

In view of the technical problems described in the background, the invention provides an LM-driven operation and maintenance system management agent optimization method.
The invention provides an LM-driven operation and maintenance system management agent optimization method, which comprises the following steps:
S1, intent recognition basic optimization: fine-tuning an LM language model on an operation and maintenance scenario corpus, introducing context association weight factors, and constructing a multi-dimensional intent recognition model, so as to accurately parse fuzzy operation and maintenance instructions;
S2, fault diagnosis framework construction: based on the optimized intent recognition capability, building a multi-level diagnosis flow of log preprocessing, abnormal feature extraction, multi-dimensional data fusion, and root cause reasoning, and improving fault localization accuracy and prediction timeliness through transfer learning on a historical fault case library;
S3, automated operation optimization: based on the intent analysis and fault diagnosis results, designing a full-flow mechanism of script generation, security verification, permission grading, secondary confirmation, and log audit, and introducing a dynamic permission adaptation module, so as to achieve secure automated execution in different operation and maintenance scenarios;
S4, knowledge precipitation optimization: based on fault handling data and automated operation experience, constructing a closed loop of task processing, experience extraction, structured storage, and intelligent updating, and adopting a semantic similarity matching algorithm to achieve efficient retrieval and adaptive updating of a knowledge base;
S5, cross-system integration optimization: designing a standardized API adaptation layer and a data conversion module that support seamless integration with multiple classes of operation and maintenance tools, including a monitoring system, a CMDB, and a work order system, so as to achieve cross-platform data interoperability and flow collaboration.
Preferably, the specific steps of S1 are as follows:
S101, operation and maintenance scenario corpus construction: collecting the enterprise's historical operation and maintenance sessions, operation manuals, and public corpora; screening valid data containing "instruction description-intent type-scenario label"; and, after "manual + machine" cross labeling, dividing the data into a training set, a validation set, and a test set at a ratio of 7:2:1 to form a structured corpus;
S102, base LM language model selection and pre-adaptation: selecting the LM language model according to enterprise scale, with small and medium-sized enterprises selecting GPT-4Mini/Llama3 and large enterprises selecting GPT-4, and performing lightweight pre-training of the selected model on Linux and K8S operation and maintenance documents so that it adapts to operation and maintenance domain terminology;
S103, LM language model fine-tuning: training the model pre-adapted in S102 with the training set and the validation set, setting a cosine annealing learning rate and BatchSize = 8-16, applying dropout regularization to suppress overfitting, and stopping training when the intent recognition accuracy on the validation set has not improved for 3 consecutive rounds, while ensuring a validation set accuracy ≥ 85%; s10