
CN-122021271-A - Reinforcement learning-based intelligent maintenance strategy optimization method for cable accessories

CN122021271A

Abstract

The invention discloses a reinforcement learning-based intelligent maintenance strategy optimization method for cable accessories, comprising the following steps: S1, obtaining data and constructing a state feature sequence, an action sequence and a degradation observation sequence; S2, training a hidden semi-Markov model to obtain a hidden state transition relation and a duration expression; S3, constructing a semi-Markov decision process according to the hidden state transition relation and the duration expression to generate a simulation track; S4, extracting time-weighted features of the expert tracks and the simulation track and constructing a maximum entropy inverse reinforcement learning optimization target; S5, iteratively solving the reward function parameters; S6, executing incremental updating of the reward function; S7, inputting the reward function into a state evolution model and outputting the maintenance strategy. The invention realizes accurate degradation identification of cable accessories and dynamic optimization of the maintenance strategy, and improves maintenance timeliness, accuracy and operation reliability.

Inventors

  • Wei Kai
  • Sun Guanyue
  • Sun Xiaowei

Assignees

  • Shandong Qixing Electric Technology Development Co., Ltd. (山东七星电气科技发展有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-01-17

Claims (8)

  1. A reinforcement learning-based intelligent maintenance strategy optimization method for cable accessories, characterized by comprising the following steps: S1, acquiring cable accessory operation data, environmental parameters, monitoring records and expert maintenance tracks, and constructing a state feature sequence and an action sequence to form a degradation stage observation sequence; S2, training a hidden semi-Markov model based on the degradation stage observation sequence, and setting a degradation stage set, a stage transition probability matrix and a stage duration distribution to obtain a hidden state transition relation and a duration expression; S3, constructing a semi-Markov decision process according to the hidden state transition relation and the duration expression, setting a state space, an action space, a state transition rule and an action duration constraint to form a state evolution model, and generating a simulation track based on the state evolution model; S4, defining a time-weighted feature set containing states, actions and action durations, extracting time-weighted feature information from the expert maintenance tracks and the simulation track, and constructing a maximum entropy inverse reinforcement learning optimization target; S5, initializing reward function parameters, calculating the expert track probability distribution based on maximum entropy inverse reinforcement learning, executing gradient iteration to solve the reward function parameters, and generating a current reward function; S6, receiving a newly added expert maintenance track, introducing a track time tag and a data weight attenuation factor, extracting newly added time-weighted feature information, executing incremental updating of the reward function parameters based on maximum entropy inverse reinforcement learning, and generating an updated reward function; S7, inputting the reward function into the state evolution model, executing strategy solving iteration by adopting a reinforcement learning algorithm, and outputting a cable accessory maintenance strategy.
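As an illustration only (not part of the disclosure), the expert and simulated tracks of S1/S3 can be represented by a small container holding the three aligned sequences; all names below are hypothetical Python:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MaintenanceTrajectory:
    """One expert or simulated track: aligned state, action, and action-duration sequences."""
    states: List[int] = field(default_factory=list)     # degradation stage index per step
    actions: List[int] = field(default_factory=list)    # e.g. 0=patrol, 1=overhaul, 2=pre-test, 3=replace
    durations: List[int] = field(default_factory=list)  # action duration in time steps

    def append(self, state: int, action: int, duration: int) -> None:
        if duration < 1:
            raise ValueError("action duration must be at least one time step")
        self.states.append(state)
        self.actions.append(action)
        self.durations.append(duration)

    def __len__(self) -> int:
        return len(self.states)
```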
  2. The reinforcement learning-based intelligent maintenance strategy optimization method for cable accessories according to claim 1, wherein S1 specifically comprises: S11, acquiring cable accessory operation data, environmental parameters and monitoring records, and aligning current amplitude, load level, conductor temperature, environmental humidity, sheath current and partial discharge detection data in time sequence to form a state feature sequence; S12, acquiring an expert maintenance track, arranging patrol actions, replacement actions, overhaul actions and pre-test actions according to maintenance start time and maintenance end time, and constructing an action sequence; S13, performing degradation-related feature extraction based on the partial discharge detection data, the insulation resistance test data, the conductor temperature and the environmental humidity, and arranging the extracted features in time sequence to form a degradation stage observation sequence.
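A minimal sketch of the time-sequence alignment in S11, assuming each monitoring channel arrives as timestamped samples (the function name and data layout are illustrative, not from the patent):

```python
def build_state_sequence(channels):
    """Align monitoring channels: keep only timestamps present in every
    channel and emit one feature vector per retained timestamp.

    channels: dict mapping channel name -> dict of {timestamp: value}.
    Returns a list of (timestamp, feature_vector) in time order.
    """
    common = sorted(set.intersection(*(set(c) for c in channels.values())))
    names = sorted(channels)  # a fixed channel order defines the feature layout
    return [(t, [channels[n][t] for n in names]) for t in common]
```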
  3. The reinforcement learning-based intelligent maintenance strategy optimization method for cable accessories according to claim 2, wherein S2 specifically comprises: S21, setting a degradation stage set of the hidden semi-Markov model based on the degradation stage observation sequence, and dividing degradation into an initial degradation stage, a slow degradation stage, an accelerated degradation stage and a critical degradation stage in a preset order to form the degradation stage set; S22, setting a state duration distribution function for each degradation stage in the degradation stage set, and constructing the duration distribution of each degradation stage's duration characteristics according to the statistical distribution of degradation features in each time period of the degradation stage observation sequence, to form a duration distribution set; S23, calculating state transition frequencies among the degradation stages according to the changes of degradation features in adjacent time periods of the degradation stage observation sequence, and generating a degradation stage transition probability matrix based on the transition frequencies; S24, inputting the degradation stage observation sequence, the degradation stage set, the duration distribution set and the degradation stage transition probability matrix into the hidden semi-Markov model, and performing model training to obtain the hidden semi-Markov model parameters; S25, based on the trained hidden semi-Markov model, performing hidden state inference on the degradation stage observation sequence to generate a hidden degradation stage sequence; S26, using the degradation stage sequence and the corresponding durations as the hidden state transition relation and the duration expression.
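The stage statistics of S22-S23 can be sketched by counting run lengths and transitions in an observed stage sequence (a simplified stand-in for full hidden semi-Markov model training; the function below is hypothetical):

```python
import numpy as np

STAGES = ["initial", "slow", "accelerated", "critical"]

def fit_stage_statistics(stage_seq):
    """Estimate a stage transition probability matrix and mean stage
    durations from a per-time-step stage sequence."""
    n = len(STAGES)
    idx = {s: i for i, s in enumerate(STAGES)}
    counts = np.zeros((n, n))
    runs = {s: [] for s in STAGES}
    run_len, prev = 1, stage_seq[0]
    for s in stage_seq[1:]:
        if s == prev:
            run_len += 1
        else:
            counts[idx[prev], idx[s]] += 1   # transition frequency
            runs[prev].append(run_len)       # completed stage duration
            run_len, prev = 1, s
    runs[prev].append(run_len)               # close the final run
    row = counts.sum(axis=1, keepdims=True)
    trans = np.divide(counts, row, out=np.zeros_like(counts), where=row > 0)
    mean_dur = {s: (float(np.mean(d)) if d else 0.0) for s, d in runs.items()}
    return trans, mean_dur
```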
  4. The reinforcement learning-based intelligent maintenance strategy optimization method for cable accessories according to claim 3, wherein S3 specifically comprises: S31, forming state elements from the degradation stage and state features corresponding to each time, arranging all the state elements in time order to form the state space of the semi-Markov decision process, and adding patrol actions, overhaul actions, pre-test actions and replacement actions into the action space according to preset action numbers; S32, establishing a state transition connection between the initial state element and the target state element corresponding to each stage according to the hidden state transition relation, and writing the transition probability of the corresponding stage in the hidden state transition relation into the state transition connection to form the state transition rule; S33, reading each degradation stage in the degradation stage sequence according to the duration expression, writing the duration distribution corresponding to that degradation stage in the duration expression into the state elements associated with it, and forming the action duration constraint by referring all actions in the action space to the duration distribution; S34, integrating the state space, the action space, the state transition rule and the action duration constraint according to a preset structure to construct the state evolution model, and calling the state evolution model to sequentially execute state reading, action selection, state transition and action duration sampling from any initial time index, generating a state sequence, an action sequence and an action duration sequence to form a simulation track.
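The trajectory generation of S34 can be sketched as a roll-out over the transition rule with a duration constraint (the Gaussian duration model and the function name are assumptions for illustration):

```python
import random

def generate_simulation_track(trans, mean_dur, n_actions, steps, seed=0):
    """Roll out the state evolution model: from an initial stage, repeatedly
    select an action, draw a duration around the stage's mean, and sample the
    next stage from the transition rule. Returns (states, actions, durations)."""
    rng = random.Random(seed)
    s = 0
    states, actions, durations = [], [], []
    for _ in range(steps):
        a = rng.randrange(n_actions)
        d = max(1, round(rng.gauss(mean_dur[s], 1.0)))  # duration constraint: d >= 1
        states.append(s); actions.append(a); durations.append(d)
        row = trans[s]
        total = sum(row)
        if total == 0:                 # absorbing stage (e.g. critical): stay put
            continue
        r, acc = rng.random() * total, 0.0
        for j, p in enumerate(row):    # sample next stage by transition probability
            acc += p
            if r < acc:
                s = j
                break
    return states, actions, durations
```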
  5. The reinforcement learning-based intelligent maintenance strategy optimization method for cable accessories according to claim 4, wherein S4 specifically comprises: S41, setting a time-weighted feature template, determining the state feature dimension and the action code dimension, splicing the state features and the action code corresponding to each time index to generate a basic feature vector, and, for an expert maintenance track, sequentially reading the state sequence, the action sequence and the action duration sequence by time index to construct an expert basic feature vector sequence; S42, reading the corresponding action duration for each basic feature vector in the expert basic feature vector sequence, multiplying each component of the basic feature vector by the action duration as a weight to generate an expert time-weighted feature vector, and accumulating all expert time-weighted feature vectors by feature dimension to generate an expert accumulated time-weighted feature vector; S43, reading the state sequence, the action sequence and the action duration sequence of the simulation track in the same way as S41 and S42, constructing a simulated basic feature vector sequence and a simulated time-weighted feature vector sequence, and accumulating by feature dimension to generate a simulated accumulated time-weighted feature vector; S44, setting the expert accumulated time-weighted feature vector as the expert expected features, setting the simulated accumulated time-weighted feature vector as the model expected features, and combining the expert expected features and the model expected features into a time-weighted reward expression.
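The duration-weighted feature accumulation of S41-S42 can be sketched with one-hot state and action codes (an illustrative encoding; the claim leaves the feature template unspecified):

```python
import numpy as np

def accumulated_time_weighted_features(states, actions, durations, n_states, n_actions):
    """One-hot state and action codes are concatenated into a basic feature
    vector, scaled component by component by the action duration, and
    accumulated over the whole track."""
    phi = np.zeros(n_states + n_actions)
    for s, a, d in zip(states, actions, durations):
        base = np.zeros(n_states + n_actions)
        base[s] = 1.0                 # state one-hot block
        base[n_states + a] = 1.0      # action one-hot block
        phi += d * base               # duration acts as the time weight
    return phi
```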
  6. The reinforcement learning-based intelligent maintenance strategy optimization method for cable accessories according to claim 5, wherein S5 specifically comprises: S51, initializing the reward function parameters, and setting the reward function parameter dimension corresponding to the time-weighted reward expression; S52, calculating the logarithmic probability of the expert track according to the time-weighted reward expression, and multiplying the expert accumulated time-weighted feature vector in the time-weighted reward expression by the reward function parameters component by component and accumulating, to generate an expert expected reward value; S53, calculating the logarithmic probability of the simulation track according to the time-weighted reward expression, and multiplying the model accumulated time-weighted feature vector by the reward function parameters component by component and accumulating, to generate a model expected reward value; S54, performing a difference operation on the expert expected reward value and the model expected reward value according to a preset solving structure to generate the gradient components of the reward function parameters, performing gradient iteration on the reward function parameters according to the iteration step size, and updating the reward function parameters; S55, repeatedly executing S52 to S54 until the reward function parameters meet the convergence condition, forming the current reward function.
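The gradient iteration of S52-S55 reduces, in maximum entropy inverse reinforcement learning, to stepping the parameters along the expert-minus-model feature expectation difference; below is a hedged sketch in which `model_expectation` stands in for re-estimating the model features under the current reward:

```python
import numpy as np

def fit_reward_parameters(mu_expert, model_expectation, theta0, lr=0.1,
                          tol=1e-6, max_iter=2000):
    """Maximum-entropy IRL gradient ascent: the gradient is the difference
    between expert and model expected (time-weighted) features; iterate
    until the parameter change falls below `tol`."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        grad = np.asarray(mu_expert) - np.asarray(model_expectation(theta))
        theta_new = theta + lr * grad
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```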
  7. The reinforcement learning-based intelligent maintenance strategy optimization method for cable accessories according to claim 6, wherein S6 specifically comprises: S61, receiving a newly added expert maintenance track, reading its state sequence, action sequence and action duration sequence, splicing the state features and action code corresponding to each time index to form a newly added expert basic feature vector, and multiplying the newly added expert basic feature vector component by component by the action duration to generate a newly added expert time-weighted feature vector; S62, setting a time tag for the newly added expert maintenance track, converting the time tag into an attenuation coefficient, multiplying the newly added expert time-weighted feature vector by the attenuation coefficient component by component, and accumulating by feature dimension to generate a newly added expert accumulated time-weighted feature vector; S63, adding the newly added expert accumulated time-weighted feature vector and the expert accumulated time-weighted feature vector by feature dimension to form a superimposed expert accumulated time-weighted feature vector, and arranging the superimposed expert accumulated time-weighted feature vector and the model accumulated time-weighted feature vector by feature dimension to form a new time-weighted reward expression; S64, updating the reward function parameters according to the new time-weighted reward expression: subtracting the model accumulated time-weighted feature vector from the superimposed expert accumulated time-weighted feature vector component by component to form the gradient vector of the reward function parameters, adding the gradient vector, scaled by the iteration step size, to the reward function parameters component by component, and updating the reward function parameters; S65, repeatedly executing S61 to S64 until the difference between two consecutive reward function parameter updates falls below a set threshold, obtaining the updated reward function.
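The incremental update of S61-S64 can be sketched as one decayed superposition plus one gradient step (the exponential half-life decay is an assumed form of the attenuation coefficient, which the claim does not fix):

```python
import numpy as np

def incremental_reward_update(theta, mu_expert, mu_model, phi_new, age,
                              half_life=5.0, lr=0.05):
    """The new track's time tag becomes an attenuation coefficient, its
    time-weighted features are superimposed onto the stored expert features,
    and one gradient step updates the reward parameters."""
    decay = 0.5 ** (age / half_life)                  # older tracks weigh less
    mu_superimposed = np.asarray(mu_expert) + decay * np.asarray(phi_new)
    grad = mu_superimposed - np.asarray(mu_model)     # expert minus model features
    theta_new = np.asarray(theta) + lr * grad
    return theta_new, mu_superimposed
```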
  8. The reinforcement learning-based intelligent maintenance strategy optimization method for cable accessories according to claim 7, wherein S7 specifically comprises: S71, writing the reward function into the state evolution model, reading each state element in the state space, and calculating the reward values of all actions in the corresponding action space to form a state-action reward set; S72, reading the state transition rule according to the state-action reward set, and combining the action reward value corresponding to each state element with the state transition probability to generate a state-action value expression; S73, referring to the action duration constraint according to the state-action value expression, and combining the action duration and the state-action value to form a state-action value sequence; S74, according to the state-action value sequence, performing action selection for each state element in the state space to generate the maintenance action of the corresponding state, and arranging the maintenance actions corresponding to all the state elements to form the cable accessory maintenance strategy.
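The strategy solving of S71-S74 can be read as value iteration on the semi-Markov decision process, with action durations entering through the discount; the shapes and discounting scheme below are illustrative assumptions:

```python
import numpy as np

def extract_policy(reward, trans, durations, gamma=0.95, sweeps=100):
    """Semi-MDP value iteration: each action's value combines its immediate
    reward, the transition rule, and its duration (longer actions discount the
    future more). Shapes: reward and durations are (n_states, n_actions);
    trans is (n_states, n_states)."""
    n_states, n_actions = reward.shape
    v = np.zeros(n_states)
    for _ in range(sweeps):
        # Q(s, a) = r(s, a) + gamma**d(s, a) * sum_s' P(s' | s) * V(s')
        q = reward + (gamma ** durations) * (trans @ v)[:, None]
        v = q.max(axis=1)
    return q.argmax(axis=1)  # one maintenance action per state element
```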

Description

Reinforcement learning-based intelligent maintenance strategy optimization method for cable accessories

Technical Field

The invention relates to the technical field of intelligent operation and maintenance of power equipment, and in particular to a reinforcement learning-based intelligent maintenance strategy optimization method for cable accessories.

Background

Cable accessories are key components for connection, insulation and terminal sealing in a power system, and their running state directly influences the reliability and safety of a power distribution network. In the prior art, maintenance of cable accessories mainly depends on inspection records, partial discharge detection, insulation resistance tests, environment monitoring and the like; the degradation degree is judged by manual experience and a maintenance plan is formulated accordingly. The traditional method has limitations in degradation stage identification, maintenance behavior association analysis and strategy generation. The degradation process is hidden and time-correlated; existing methods mostly adopt fixed thresholds or simple trend analysis, cannot express the hidden state change rule of the degradation stages, and have difficulty accurately describing the duration characteristics of each stage, so degradation state inference is unstable. Existing maintenance strategy generation is mainly built on manual experience, rule bases or static models; it can neither exploit the differing effects of actions across degradation stages nor account for the influence of action duration on degradation evolution, so the resulting strategy lacks pertinence and time-sequence rationality.
When learning from expert maintenance tracks, the prior art generally treats all tracks with equal weight and does not consider the knowledge attenuation caused by track timeliness, so newly added data cannot be effectively integrated into the model and dynamic updating of the maintenance strategy is difficult to realize. Therefore, how to provide a reinforcement learning-based intelligent maintenance strategy optimization method for cable accessories is a problem that needs to be solved by those skilled in the art.

Disclosure of Invention

The invention aims to provide a reinforcement learning-based intelligent maintenance strategy optimization method for cable accessories, which constructs a degradation stage sequence, expresses action duration characteristics, generates a time-weighted reward expression and realizes incremental updating of the reward function, forming a maintenance strategy that can be dynamically adjusted with the operation state, with the advantages of accurate degradation identification, reasonable action time sequence and timely strategy updating.
According to an embodiment of the invention, the reinforcement learning-based intelligent maintenance strategy optimization method for cable accessories comprises the following steps: S1, acquiring cable accessory operation data, environmental parameters, monitoring records and expert maintenance tracks, and constructing a state feature sequence and an action sequence to form a degradation stage observation sequence; S2, training a hidden semi-Markov model based on the degradation stage observation sequence, and setting a degradation stage set, a stage transition probability matrix and a stage duration distribution to obtain a hidden state transition relation and a duration expression; S3, constructing a semi-Markov decision process according to the hidden state transition relation and the duration expression, setting a state space, an action space, a state transition rule and an action duration constraint to form a state evolution model, and generating a simulation track based on the state evolution model; S4, defining a time-weighted feature set containing states, actions and action durations, extracting time-weighted feature information from the expert maintenance tracks and the simulation track, and constructing a maximum entropy inverse reinforcement learning optimization target; S5, initializing reward function parameters, calculating the expert track probability distribution based on maximum entropy inverse reinforcement learning, executing gradient iteration to solve the reward function parameters, and generating a current reward function; S6, receiving a newly added expert maintenance track, introducing a track time tag and a data weight attenuation factor, extracting newly added time-weighted feature information, executing incremental updating of the reward function parameters based on maximum entropy inverse reinforcement learning, and generating an updated reward function; S7, inputting the reward function into the state evolution model, executing strategy solving iteration by adopting a reinforcement learning algorithm, and outputting a cable accessory maintenance strategy.