CN-121998620-A - Equipment cluster maintenance decision-making method based on multi-agent deep reinforcement learning

Abstract

The invention discloses an equipment cluster maintenance decision-making method based on multi-agent deep reinforcement learning, addressing the collaborative maintenance of a geographically distributed equipment cluster under dynamically changing task demands. Based on the health state, performance level and task requirements of the equipment, the method models the equipment cluster maintenance decision process as a Markov decision process and adopts a multi-agent deep reinforcement learning framework with centralized training and distributed execution: a local policy network is constructed for each geographic deployment site, and a centralized state value evaluation network evaluates the overall operating state of the equipment cluster, so that maintenance decisions for equipment deployed across sites are optimized collaboratively. The multi-agent policy is trained in a simulation environment and executed online in actual operation, so that each agent independently generates maintenance decisions from its local equipment states while the overall task capacity and the maintenance cost are jointly optimized at the system level. The method enables adaptive optimization of the equipment cluster maintenance strategy under complex conditions of large equipment numbers, high state dimension and dynamically changing task demands, thereby improving the overall task performance of the equipment cluster and reducing the maintenance cost.

Inventors

ZHANG QIN
CHEN TONG
WU JUN
LIU YU
HUANG HONGZHONG

Assignees

University of Electronic Science and Technology of China (电子科技大学)

Dates

Publication Date: 2026-05-08
Application Date: 2026-01-27

Claims (10)

1. An equipment cluster maintenance decision-making method based on multi-agent deep reinforcement learning, characterized by comprising the following steps: Step 1, acquiring the spatial distribution structure of the equipment cluster, the number of equipment units and the task cooperation relationships among the equipment, acquiring performance indexes and condition monitoring data of each equipment, and evaluating the health state and corresponding performance level of each equipment based on the condition monitoring data; Step 2, constructing an equipment cluster maintenance decision model according to the health state and task requirements of each equipment in the cluster, modeling the equipment cluster maintenance decision model as a Markov decision process, defining a system state space, a maintenance action space, a state transition probability function, a reward function and a Bellman equation, and describing the dynamic relationship among the equipment state, the maintenance action and the system performance; Step 3, solving for an optimal maintenance policy of the equipment cluster by adopting a multi-agent deep reinforcement learning algorithm, taking each geographic deployment site in the equipment cluster as an independent agent, constructing for each agent a policy network based on a deep neural network that outputs maintenance actions according to local equipment states, and constructing a centralized value evaluation network that evaluates the long-term benefit of joint maintenance decisions according to the global equipment state, so that collaborative optimization among the agents is realized; Step 4, constructing a simulation environment for equipment cluster operation and degradation; in an offline training stage, each agent interacts with the simulation environment to collect sample data including equipment states, maintenance actions, system rewards and next-time states, an advantage function of the joint policy is calculated based on the centralized value evaluation network, and the policy network parameters of each agent are updated by a policy gradient method with clipped optimization, so that the agents gradually learn an optimal collaborative maintenance policy; and Step 5, collecting state information of each equipment in real time during actual operation of the equipment cluster, each agent independently outputting a local maintenance decision based on its trained policy network, and continuously and adaptively updating the maintenance policy as the equipment states and task requirements change dynamically.
2. The multi-agent deep reinforcement learning-based equipment cluster maintenance decision method according to claim 1, wherein step 2 is specifically as follows: the equipment cluster is represented as a system of $N$ geographically deployed sites, where each site $n \in \{1, \dots, N\}$ comprises $K_n$ equipment units. Each equipment is a multi-state unit whose health state set is $\mathcal{H} = \{0, 1, \dots, M\}$, wherein a larger health state value indicates a higher equipment health level and better performance availability. The state of equipment $k$ at site $n$ at time $t$ is defined as $s_{n,k}^{t} \in \mathcal{H}$. State transitions of the equipment follow a discrete-time Markov process, and the probability of transitioning from state $i$ to state $j$ is:
$$p_{ij} = \Pr\{ s_{n,k}^{t+1} = j \mid s_{n,k}^{t} = i \} \qquad (1)$$
Further, the one-step transition probability matrix of the equipment states is defined as:
$$\mathbf{P} = [\, p_{ij} \,]_{i, j \in \mathcal{H}} \qquad (2)$$
For each health state $i \in \mathcal{H}$, a corresponding performance function $g(i)$ is defined to characterize the contribution of the equipment to the system task capacity in that health state. The overall performance of the equipment at site $n$ at time $t$ is then defined as:
$$G_n^{t} = \sum_{k=1}^{K_n} g(s_{n,k}^{t}) \qquad (3)$$
The overall performance of the whole equipment cluster system at time $t$ is defined as:
$$G^{t} = \sum_{n=1}^{N} G_n^{t} \qquad (4)$$
The task demand to be met by the equipment cluster at time $t$ is a random variable $D^{t}$ that obeys the normal distribution $\mathcal{N}(\mu^{t}, \sigma^{2})$, whose mean $\mu^{t}$ is time-varying and follows a discrete-time Markov process with state space $\{\mu_1, \dots, \mu_L\}$, wherein $L$ represents the number of discrete levels of the task demand mean. The state transition of the task demand mean level is defined by a one-step transition probability matrix:
$$\mathbf{Q} = [\, q_{uv} \,], \quad q_{uv} = \Pr\{ \mu^{t+1} = \mu_v \mid \mu^{t} = \mu_u \} \qquad (5)$$
The maintenance decision variable at time $t$ is defined as $a_{n,k}^{t} \in \{0, 1\}$, where $a_{n,k}^{t} = 1$ represents that equipment $k$ at site $n$ is maintained at time $t$, and $a_{n,k}^{t} = 0$ indicates that the equipment continues to operate. When an equipment is selected for maintenance, it is restored to the optimal health state $M$ at the next moment.
The goal of the equipment cluster maintenance decision is to set, over a finite task period $T$, a maintenance policy based on the equipment states that maximizes the task profit of the equipment cluster. The problem is modeled as a Markov decision process, and the state space, action space, state transition probability function, reward function and Bellman equation are specifically defined as follows:
(a) State space: the state of the equipment cluster at time $t$ is defined as the set of all equipment health states:
$$S^{t} = \{ s_{n,k}^{t} : n = 1, \dots, N;\ k = 1, \dots, K_n \} \qquad (6)$$
Correspondingly, the state space is defined as:
$$\mathcal{S} = \mathcal{H}^{\sum_{n=1}^{N} K_n} \qquad (7)$$
(b) Action space: the action of the equipment cluster at time $t$ is defined as the set of maintenance actions for all equipment:
$$A^{t} = \{ a_{n,k}^{t} : n = 1, \dots, N;\ k = 1, \dots, K_n \} \qquad (8)$$
Correspondingly, the action space is defined as:
$$\mathcal{A} = \{0, 1\}^{\sum_{n=1}^{N} K_n} \qquad (9)$$
(c) State transition probability function: the state transition of the equipment cluster system is determined by the maintenance decision for each equipment. If an equipment is maintained at time $t$, it is restored to the optimal health state $M$ at the next moment; otherwise its state degrades according to the state transition probability matrix $\mathbf{P}$. The state transition probability function of equipment $k$ at site $n$ can be expressed as:
$$\Pr\{ s_{n,k}^{t+1} = j \mid s_{n,k}^{t} = i,\ a_{n,k}^{t} \} = \begin{cases} 1, & a_{n,k}^{t} = 1,\ j = M \\ 0, & a_{n,k}^{t} = 1,\ j \neq M \\ p_{ij}, & a_{n,k}^{t} = 0 \end{cases} \qquad (10)$$
The degradation of each equipment in the cluster is independent, and the state transition probability function of the cluster system is defined as:
$$\Pr\{ S^{t+1} \mid S^{t}, A^{t} \} = \prod_{n=1}^{N} \prod_{k=1}^{K_n} \Pr\{ s_{n,k}^{t+1} \mid s_{n,k}^{t},\ a_{n,k}^{t} \} \qquad (11)$$
(d) Reward function: the task demand $D^{t}$ at time $t$ is a random variable and the total performance of the equipment cluster system is $G^{t}$, so the task completion amount and the task demand gap are defined as:
$$Y^{t} = \min(G^{t}, D^{t}) \qquad (12)$$
and
$$Z^{t} = \max(D^{t} - G^{t},\ 0) \qquad (13)$$
Defining the unit task completion benefit coefficient as $c_r$ and the unit task demand gap penalty coefficient as $c_p$, the task completion benefit and the task demand gap penalty of the equipment cluster at time $t$ are respectively:
$$R_Y^{t} = c_r\, Y^{t} \qquad (14)$$
and
$$R_Z^{t} = c_p\, Z^{t} \qquad (15)$$
The maintenance cost of the equipment cluster includes single-equipment maintenance costs and fixed costs. The cost of performing a maintenance action on a single equipment is $c_m$; if at least one equipment at site $n$ is maintained at time $t$, a fixed cost $c_f$ is incurred. Let
$$\delta_n^{t} = \begin{cases} 1, & \sum_{k=1}^{K_n} a_{n,k}^{t} \geq 1 \\ 0, & \text{otherwise} \end{cases} \qquad (16)$$
The total maintenance cost is:
$$C_M^{t} = \sum_{n=1}^{N} \Big( c_m \sum_{k=1}^{K_n} a_{n,k}^{t} + c_f\, \delta_n^{t} \Big) \qquad (17)$$
Based on the definitions of the task benefit, gap penalty and maintenance cost of the equipment cluster system, the system reward function is defined as:
$$r^{t} = R_Y^{t} - R_Z^{t} - C_M^{t} \qquad (18)$$
(e) Bellman equation: over the finite planning horizon $T$, the optimal state value function from time $t$ onward is defined as:
$$V_t^{*}(S^{t}) = \max_{\pi}\ \mathbb{E}\Big[ \sum_{\tau = t}^{T} r^{\tau} \,\Big|\, S^{t} \Big] \qquad (19)$$
which satisfies the finite-horizon Bellman optimality equation:
$$V_t^{*}(S^{t}) = \max_{A^{t} \in \mathcal{A}} \Big\{ \mathbb{E}\big[ r^{t} \mid S^{t}, A^{t} \big] + \sum_{S^{t+1} \in \mathcal{S}} \Pr\{ S^{t+1} \mid S^{t}, A^{t} \}\, V_{t+1}^{*}(S^{t+1}) \Big\} \qquad (20)$$
At the end of the planning horizon $T$, the system state value function is determined by the system reward at that time:
$$V_T^{*}(S^{T}) = \max_{A^{T} \in \mathcal{A}}\ \mathbb{E}\big[ r^{T} \mid S^{T}, A^{T} \big] \qquad (21)$$
The corresponding optimal maintenance policy is:
$$\pi_t^{*}(S^{t}) = \arg\max_{A^{t} \in \mathcal{A}} \Big\{ \mathbb{E}\big[ r^{t} \mid S^{t}, A^{t} \big] + \sum_{S^{t+1} \in \mathcal{S}} \Pr\{ S^{t+1} \mid S^{t}, A^{t} \}\, V_{t+1}^{*}(S^{t+1}) \Big\} \qquad (22)$$
3. The multi-agent deep reinforcement learning-based equipment cluster maintenance decision method according to claim 1, wherein step 3 is specifically as follows: the geographic deployment sites $n = 1, \dots, N$ in the equipment cluster are each modeled as an independent agent, and each agent can only observe the operating state of the equipment at its own site. At time $t$, the local observation of the $n$-th agent is defined as the set of states of the $K_n$ equipment units deployed at site $n$:
$$o_n^{t} = \{ s_{n,k}^{t} : k = 1, \dots, K_n \} \qquad (23)$$
A local policy network $\pi_{\theta_n}$ is built for each agent, which takes the local observation $o_n^{t}$ and the time index $t$ as input and outputs the maintenance action vector for the equipment at the site:
$$a_n^{t} = (a_{n,1}^{t}, \dots, a_{n,K_n}^{t}) \sim \pi_{\theta_n}(\,\cdot \mid o_n^{t}, t) \qquad (24)$$
The local maintenance actions of the agents together form the joint maintenance decision of the equipment cluster at time $t$:
$$A^{t} = (a_1^{t}, \dots, a_N^{t}) \qquad (25)$$
A centralized value evaluation network $V_{\phi}$ is further built, which takes the global state $S^{t}$ of the equipment cluster and the time index $t$ as input and outputs the state value of the system.
4. The multi-agent reinforcement learning algorithm is executed under a centralized training and distributed execution framework; all agents share the system-level value signal given by the centralized value evaluation network, which characterizes the contribution of each site's maintenance decisions to the overall performance of the equipment cluster, so that the maintenance decisions for equipment across sites are kept cooperatively consistent.
5. The multi-agent deep reinforcement learning-based equipment cluster maintenance decision method according to claim 1, wherein step 4 is specifically as follows: in the equipment cluster operation simulation environment, a multi-agent proximal policy optimization algorithm with centralized training and distributed execution is adopted to jointly train the policy network of each agent and the centralized value evaluation network. Based on the interaction between the current agent policy networks and the simulation environment, a trajectory sample sequence $\{ (S^{t}, A^{t}, r^{t}, S^{t+1}) \}$ over the task period is collected. An advantage function is calculated based on the centralized state value function network $V_{\phi}$; the advantage function measures the improvement in benefit brought by the joint maintenance action in a given system state, and is expressed in generalized advantage estimation form as:
$$\hat{A}^{t} = \sum_{l = 0}^{T - t} (\gamma \lambda)^{l}\, \delta^{t + l} \qquad (26)$$
wherein
$$\delta^{t} = r^{t} + \gamma\, V_{\phi}(S^{t+1}, t+1) - V_{\phi}(S^{t}, t) \qquad (27)$$
wherein $\gamma$ is the discount factor and $\lambda$ is the advantage estimation parameter.
6. For each agent $n$, a probability ratio is constructed from the action probability output by its local policy network:
$$\rho_n^{t}(\theta_n) = \frac{\pi_{\theta_n}(a_n^{t} \mid o_n^{t}, t)}{\pi_{\theta_n^{\text{old}}}(a_n^{t} \mid o_n^{t}, t)} \qquad (28)$$
wherein $\pi_{\theta_n^{\text{old}}}$ is the historical policy network used when the trajectories were sampled.
7. Each agent's policy network is updated by adopting a clipped policy objective function, which for the probability ratio $\rho_n^{t}(\theta_n)$ and the advantage estimate $\hat{A}^{t}$ is defined as:
$$L_n^{\text{CLIP}}(\theta_n) = \mathbb{E}_t \Big[ \min\big( \rho_n^{t}(\theta_n)\, \hat{A}^{t},\ \operatorname{clip}(\rho_n^{t}(\theta_n),\ 1 - \epsilon,\ 1 + \epsilon)\, \hat{A}^{t} \big) \Big] \qquad (29)$$
wherein $\epsilon$ is the clipping threshold.
8. The centralized state value function network parameters $\phi$ are updated using a value regression loss function, which is defined as:
$$L^{V}(\phi) = \mathbb{E}_t \Big[ \big( V_{\phi}(S^{t}, t) - \hat{R}^{t} \big)^{2} \Big] \qquad (30)$$
wherein $\hat{R}^{t}$ is the value regression target constructed from the trajectory sample returns.
9. The collected trajectory samples are divided into a plurality of mini-batches and optimized over multiple iteration rounds, alternately updating the policy network parameters $\theta_n$ of each agent and the centralized state value function network parameters $\phi$, thereby yielding a coordinated maintenance policy for the equipment cluster.
10. The multi-agent deep reinforcement learning-based equipment cluster maintenance decision method according to claim 1, wherein step 5 is specifically as follows: during actual operation of the equipment, the trained networks are used to output the optimal maintenance policy for online decision-making, comprising the following steps: Step 51, collecting health state data of the equipment at each geographic deployment site in real time during actual operation of the equipment cluster, and constructing the local observation $o_n^{t}$ of the agent at each site $n$; Step 52, inputting the local observation $o_n^{t}$ into the policy network $\pi_{\theta_n}$ of the corresponding site, each agent independently outputting the maintenance actions $a_n^{t}$ for the equipment at its site; Step 53, combining the maintenance actions output by the agents to form the equipment cluster joint maintenance decision $A^{t} = (a_1^{t}, \dots, a_N^{t})$; Step 54, based on the joint maintenance decision $A^{t}$, maintaining the corresponding equipment in the equipment cluster, and continuing to operate the equipment cluster until the next decision time; Step 55, updating the equipment states according to the equipment cluster operation data, and executing steps 51 to 54 in a rolling manner, so that the equipment cluster continuously performs adaptive maintenance scheduling over the whole task period according to the trained collaborative maintenance policy.
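The per-step dynamics and reward described in claim 2 can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the numeric coefficients and the three-level health scale are hypothetical, and it assumes the reward is task benefit minus gap penalty minus maintenance cost, with performance evaluated on the states observed at the decision time.

```python
import random

def step_equipment(state, repair, P, rng):
    """Advance one unit by one step (hypothetical helper).

    A repaired unit returns to the best health state (the largest index);
    otherwise the next state is drawn from row `state` of the matrix P.
    """
    best = len(P) - 1
    if repair:
        return best
    return rng.choices(range(len(P)), weights=P[state])[0]

def cluster_reward(states, actions, perf, demand, c_r, c_p, c_m, c_f):
    """Reward for one decision step (assumed form: benefit - penalty - cost).

    states, actions: one list per site, one entry per equipment unit.
    perf[i]: performance contributed by a unit in health state i.
    """
    total_perf = sum(perf[s] for site in states for s in site)
    completed = min(total_perf, demand)      # task completion amount
    gap = max(demand - total_perf, 0.0)      # unmet task demand
    cost = 0.0
    for site_actions in actions:
        n_repairs = sum(site_actions)
        # per-unit cost, plus a fixed cost once per site with any repair
        cost += c_m * n_repairs + (c_f if n_repairs > 0 else 0.0)
    return c_r * completed - c_p * gap - cost

# Two sites with two units each; health states 0 (worst) to 2 (best).
states = [[2, 1], [0, 2]]
actions = [[0, 0], [1, 0]]   # repair only the failed unit at the second site
reward = cluster_reward(states, actions, perf=[0.0, 0.5, 1.0], demand=3.0,
                        c_r=10.0, c_p=4.0, c_m=3.0, c_f=2.0)
print(reward)  # 10*2.5 - 4*0.5 - (3 + 2) = 18.0
```

The fixed cost is charged once per site rather than once per repair, which is what makes grouping repairs at one site cheaper than spreading them over time.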
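The training quantities named in claims 5 to 8 (generalized advantage estimation, the clipped policy objective and the value regression loss) can be sketched in plain Python as follows. This is a minimal illustration using the standard textbook forms of these quantities; the function names and default hyperparameters are assumptions, and a real implementation would use a deep learning framework with automatic differentiation.

```python
import math

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    `values` holds one more entry than `rewards`; the last entry is the
    value estimate of the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate objective for one agent (to be maximized)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)            # probability ratio
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

def value_loss(values, targets):
    """Mean squared error between value estimates and regression targets."""
    return sum((v - g) ** 2 for v, g in zip(values, targets)) / len(values)

# Two-step trajectory with a zero-initialised critic, no discounting.
adv = gae([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
print(adv)  # [2.0, 1.0]
```

Under centralized training with distributed execution, each agent would evaluate `clipped_objective` with log-probabilities from its own policy network, while the advantages come from the shared centralized critic.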

Description

Equipment cluster maintenance decision-making method based on multi-agent deep reinforcement learning

Technical Field

The invention belongs to the technical field of equipment operation and maintenance support, and particularly relates to an equipment cluster maintenance decision-making method based on multi-agent deep reinforcement learning.

Background

With the development of high-end equipment toward systematization, networking and clustering, modern tasks are often completed cooperatively by multiple pieces of equipment distributed across different geographic locations, forming equipment clusters with spatially distributed features and task coupling relationships, such as aircraft equipment clusters, ship formations and distributed manufacturing systems. Performance degradation and random faults of the equipment units occur continuously during task execution, and the operating state of each equipment unit directly influences the overall task capacity of the cluster. In this context, how to make efficient, collaborative and intelligent maintenance decisions for distributed equipment clusters under complex task requirements and uncertain degradation conditions has become a key technical problem in the field of intelligent equipment operation and maintenance.

Existing equipment cluster maintenance decisions usually depend on rule-based maintenance strategies, threshold-based condition-based maintenance methods or centralized optimization scheduling methods, and often take a single equipment as the decision object, so that the coupling between the equipment states and task capacities at different sites of the cluster is difficult to take fully into account.
Meanwhile, for cluster scenarios with a large number of equipment, a huge state space and dynamically changing task demands, centralized modeling and optimization methods are prone to the curse of dimensionality: the computational complexity grows exponentially with the equipment scale, and real-time decision-making is difficult to achieve. In addition, most existing methods rely on manually set rules or static model parameters, lack the capability to adaptively learn and continuously optimize cluster maintenance strategies from operation data, and struggle to meet long-term task support requirements in complex, uncertain environments. Therefore, a method is urgently needed that realizes adaptive optimization of equipment cluster maintenance decisions through agent cooperation and autonomous learning under distributed equipment cluster conditions, thereby improving the task effectiveness of the equipment cluster.

Disclosure of Invention

In order to solve the above technical problems, the invention provides an equipment cluster maintenance decision-making method based on multi-agent deep reinforcement learning, which enables adaptive optimization of the equipment cluster maintenance strategy under complex conditions of large equipment numbers, high state dimension and dynamically changing task demands, thereby improving the overall task performance of the equipment cluster and reducing the maintenance cost.
The technical solution adopted by the invention is an equipment cluster maintenance decision-making method based on multi-agent deep reinforcement learning, comprising the following specific steps:
Step 1, acquiring the spatial distribution structure of the equipment cluster, the number of equipment units and the task cooperation relationships among the equipment, acquiring performance indexes and condition monitoring data of each equipment, and evaluating the health state and corresponding performance level of each equipment based on the condition monitoring data;
Step 2, constructing an equipment cluster maintenance decision model according to the health state and task requirements of each equipment in the cluster, modeling the equipment cluster maintenance decision model as a Markov decision process, defining a system state space, a maintenance action space, a state transition probability function, a reward function and a Bellman equation, and describing the dynamic relationship among the equipment state, the maintenance action and the system performance;
Step 3, solving for an optimal maintenance policy of the equipment cluster by adopting a multi-agent deep reinforcement learning algorithm, taking each geographic deployment site in the equipment cluster as an independent agent, constructing for each agent a policy network based on a deep neural network that outputs maintenance actions according to local equipment states, and constructing a centralized value evaluation network that evaluates the long-term benefit of joint maintenance decisions according to the global equipment state, so that collaborative optimization among the agents is realized;
Step 4, constructing a simulation environment for equipment cluster operation and degradation; in an offline training stage, each agent interacts with the simulatio