CN-121999623-A - Traffic signal control method based on regional layered multi-agent reinforcement learning

CN121999623A

Abstract

The invention discloses a traffic signal control method based on regional hierarchical multi-agent reinforcement learning, belonging to the technical field of intelligent traffic systems. The method divides the target traffic network into several regions, configures a management-layer agent in each region and control-layer agents at every intersection within the region, forming a two-layer hierarchical architecture. The management-layer agent periodically generates a region control target from the region joint state observation; this target is held fixed over several lower-layer control periods. In each control period, the control-layer agent collects local state observations, receives the region control target, dynamically corrects the target according to the local congestion state, and outputs signal light phase control actions. The control layer and the management layer each compute hierarchical rewards and store the experience data in an experience replay buffer, and the two layers' policy functions are cooperatively optimized with off-policy training combined with double Q-networks, Huber loss, delayed updates, and an action entropy regularization mechanism.
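Not part of the patent text: a minimal Python sketch of the two-layer architecture the abstract describes, under assumed details. Claim 2 names K-Means as the partitioning algorithm; the management agent's averaging target, the control agent's correction rule, and the placeholder phase policy are illustrative inventions, since the patent's actual formulas are not reproduced in this text.

```python
import random


def kmeans_regions(coords, k, iters=50, seed=0):
    """Partition intersections into k regions by geographic coordinates (plain k-means)."""
    rng = random.Random(seed)
    centers = rng.sample(coords, k)
    assign = [0] * len(coords)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for i, (x, y) in enumerate(coords):
            assign[i] = min(range(k),
                            key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
        # update step: move each center to the mean of its members
        for c in range(k):
            pts = [coords[i] for i in range(len(coords)) if assign[i] == c]
            if pts:
                centers[c] = (sum(p[0] for p in pts) / len(pts),
                              sum(p[1] for p in pts) / len(pts))
    return assign


class ManagementAgent:
    """Upper layer: regenerates a region control target every `period` control
    periods from the region joint observation, holding it fixed in between."""

    def __init__(self, period=5):
        self.period = period
        self.target = 0.0

    def step(self, t, joint_densities):
        if t % self.period == 0:  # target held fixed between management decisions
            self.target = sum(joint_densities) / len(joint_densities)
        return self.target


class ControlAgent:
    """Lower layer: corrects the region target with the local entry-lane density
    and maps it to a signal phase (placeholder policy, not the patent's)."""

    def __init__(self, n_phases=4):
        self.n_phases = n_phases

    def act(self, local_density, region_target, alpha=0.5):
        # illustrative dynamic correction: pull the region target toward local state
        corrected = region_target + alpha * (local_density - region_target)
        phase = int(corrected * 10) % self.n_phases
        return phase, corrected
```

The key structural point mirrored here is the timescale separation: the management agent decides once per `period` low-level steps, while each control agent corrects and acts every step.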

Inventors

  • Yang Kangkang
  • Wang Zhiwen
  • Li Long
  • Ling Guobi
  • Wang Haoxu
  • Miao Wei
  • Li Shuangjun
  • Liu Xiangzhen

Assignees

  • Lanzhou University of Technology (兰州理工大学)

Dates

Publication Date
2026-05-08
Application Date
2026-03-04

Claims (8)

  1. A traffic signal control method based on regional hierarchical multi-agent reinforcement learning, characterized by comprising the following steps: partitioning the target traffic network into regions, configuring a management-layer agent for each region, and configuring a control-layer agent for each intersection within the region; in each control period, each control-layer agent collecting the lane density of the entry lanes at its intersection and receiving the region control target issued by the management-layer agent; each control-layer agent dynamically correcting the received region control target based on the entry-lane density of its intersection to obtain a corrected control target, outputting a signal light phase control action through a control-layer policy function based on the corrected control target, and executing the action; each control-layer agent calculating the control-layer reward according to the control effect and storing the experience data in an experience replay buffer; the management-layer agent aggregating the state observations of all control-layer agents within its region to form the region joint state observation; the management-layer agent outputting, through a management-layer policy function based on the region joint state observation, a region control target for guiding the control layer, and periodically calculating the management-layer reward according to the overall traffic performance of the region; and updating the control-layer policy function and the management-layer policy function respectively, using the data in the experience replay buffer in an off-policy training mode, so as to optimize overall traffic signal control performance.
  2. The method of claim 1, wherein partitioning the target traffic network into regions specifically comprises: abstracting the target traffic network into a graph structure in which nodes represent intersections and edges represent roads; obtaining the geographic coordinates of each intersection; and partitioning all intersections into several regions using the K-Means clustering algorithm with the intersections' geographic coordinates as features.
  3. The method of claim 1, wherein the dynamic correction is computed by a formula (not reproduced in this text) in which the corrected control target is obtained from the region control target and the lane densities of the intersection entry lane at the two successive time instants.
  4. The method of claim 1, wherein the control-layer rewards comprise a social-vehicle reward and an emergency-vehicle reward (formulas not reproduced in this text): the total reward of a control-layer agent at time t combines the social-vehicle reward and the emergency-vehicle reward through a weight coefficient, and the emergency-vehicle reward depends on the number of emergency vehicles entering the agent's control range at time t; the terms of the formulas involve the lane speed limit, the minimum driving-state speed, the average speed of social vehicles, the average speed of emergency vehicles, the number of lanes, the queue lengths at the preceding and following time instants, the lane vehicle densities at the preceding and following time instants, and the associated weight coefficients.
  5. The method of claim 1, wherein each item of experience data (layout not reproduced in this text) collected under the control-agent policy comprises the state observations at times t and t+1, the region control targets at times t and t+1, and the control-layer agent's total reward.
  6. The method of claim 1, wherein the state observations of the control-layer agents comprise one or more of: the boundary density of the intersection, the vehicle queue length of each lane at the intersection, the maximum/minimum green-light phase durations, the current signal phase, the lane density of the intersection's lanes, the lane density difference, the number of emergency vehicles on the lanes, the weighted time-varying average speed of emergency vehicles on the lanes, the distance from an emergency vehicle to the intersection, and the waiting time of emergency vehicles.
  7. The method of claim 1, wherein the current management-layer agent balances the inter-region collaboration mechanism by weighting the rewards of neighboring management-layer agents; the management-layer reward (formula not reproduced in this text) sums, over the region's set of control-layer agents, their total rewards at time t, and adds, scaled by a coordination coefficient, the corresponding rewards over the region's set of adjacent regions.
  8. The method of claim 1, wherein the control-layer agent's Critic target network (formula not reproduced in this text) computes its target from the control-layer agent's total reward at time t plus a discount factor times the target Q-value evaluated at the state observation at time t+1, the region control target at time t+1, and the signal light phase control action; the management-layer agent's Critic target network is defined analogously from the management-layer agent's reward at time t and the target Q-value evaluated at the region joint state observation at time t+1.
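Not part of the patent text: the training machinery named in the abstract and in claims 5, 7, and 8 (experience replay, clipped double-Q targets, Huber loss, delayed target updates, neighbor-weighted management rewards) can be sketched as below. All coefficient values (`gamma`, `kappa`, `tau`, `eta`) and exact function shapes are assumptions for illustration, since the patent's formulas are not reproduced here.

```python
import random
from collections import deque, namedtuple

# Claim 5's transition layout: state, region target, action, reward,
# next state, next region target.
Transition = namedtuple("Transition", "s g a r s_next g_next")


class ReplayBuffer:
    """Minimal experience replay buffer for off-policy training."""

    def __init__(self, capacity=10000, seed=0):
        self.buf = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, *fields):
        self.buf.append(Transition(*fields))

    def sample(self, batch_size):
        return self.rng.sample(list(self.buf), batch_size)

    def __len__(self):
        return len(self.buf)


def td_target(reward, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: bootstrap from the smaller of the two
    target critics to curb overestimation."""
    return reward + gamma * min(q1_next, q2_next)


def huber(delta, kappa=1.0):
    """Huber loss on the TD error: quadratic near zero, linear in the tails."""
    a = abs(delta)
    return 0.5 * a * a if a <= kappa else kappa * (a - 0.5 * kappa)


def soft_update(target_params, online_params, tau=0.005):
    """Polyak-style delayed update of target-network parameters."""
    return [(1 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


def management_reward(region, region_rewards, neighbors, eta=0.2):
    """Claim 7's neighbor-weighted management reward: own region's summed
    control-layer reward plus eta times the adjacent regions' (eta assumed)."""
    return region_rewards[region] + eta * sum(region_rewards[n] for n in neighbors[region])
```

Each mechanism targets one of the stability problems the description raises: the `min` over two critics counters Q-value overestimation, the Huber loss bounds the gradient of large TD errors, and the small `tau` in `soft_update` keeps the target networks slowly moving.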

Description

Traffic signal control method based on regional layered multi-agent reinforcement learning

Technical Field

The invention relates to the technical field of traffic control, and in particular to a traffic signal control method based on regional hierarchical multi-agent reinforcement learning.

Background

With the acceleration of global urbanization, traffic congestion and environmental pollution are increasingly serious. Under the practical constraint of limited investment in road infrastructure, improving urban traffic control (UTC) efficiency by intelligently optimizing traffic signals has become a core development direction of intelligent traffic systems (ITS). Traditional traffic signal control methods are mainly fixed timing and actuated control: the former relies on a preset cycle derived from historical traffic data and adapts poorly to dynamic traffic environments; the latter can adjust dynamically based on detector data but remains limited by expert rules and struggles with complex scenarios such as sudden congestion and multi-source heterogeneous traffic flows. In recent years, deep reinforcement learning (DRL) has demonstrated significant advantages in traffic signal control (TSC) owing to its strong environmental awareness and adaptive decision-making capability. In particular, multi-agent reinforcement learning (MARL) realizes distributed cooperative control by deploying an independent agent at each intersection, effectively mitigating the curse of dimensionality that single-agent methods face in large-scale road networks.
However, existing MARL schemes still face three key challenges. First, coarse state and reward design leaves policy-learning feedback sparse and convergence slow. Second, the joint state-action space grows exponentially as the road network scales up, making it difficult to balance model scalability and control performance. Third, synchronous policy updates by multiple agents make the environment non-stationary, seriously hindering stable algorithm convergence. To break through these bottlenecks, researchers have attempted to introduce hierarchical reinforcement learning (HRL) frameworks that decompose complex tasks through spatio-temporal abstraction. However, most existing HRL methods either focus only on task decomposition in the time dimension or adopt coarse-grained region partitioning, and fail to establish an effective dynamic collaboration mechanism between high-level target guidance and low-level action execution. In particular, in time-critical scenarios such as priority passage for emergency vehicles, existing methods lack the capability to respond quickly to special traffic demands, purchase local priority at the expense of overall road-network efficiency, and struggle to achieve multi-objective balance. Therefore, how to provide a multi-agent reinforcement learning model based on a small-region hierarchical architecture, constructing a "management layer-control layer" two-layer collaborative system through road-network partitioning in the spatial dimension and a target-correction mechanism in the temporal dimension, is the problem to be solved by those skilled in the art.
Disclosure of Invention

In view of the above, the present invention provides a traffic signal control method based on regional hierarchical multi-agent reinforcement learning that overcomes, or at least partially solves, the above problems. Through a regional hierarchical architecture and a dynamic target-correction mechanism, the method markedly improves the control performance and scalability of large-scale road-network signals, while effectively guaranteeing priority passage for emergency vehicles without significantly affecting social traffic flow, achieving an organic unification of global coordination, local adaptation, and multi-objective optimization. To achieve the above purpose, the present invention adopts the following technical scheme. In a first aspect, a traffic signal control method based on regional hierarchical multi-agent reinforcement learning comprises: partitioning the target traffic network into regions, configuring a management-layer agent for each region, and configuring a control-layer agent for each intersection within the region; in each control period, each control-layer agent collecting the lane density of the entry lanes at its intersection and receiving the region control target issued by the management-layer agent; each control-layer agent dynamically correcting the received region control target based on the entry-lane density of its intersection to obtain a corrected control target, and outputting a signal light phase control action through a control-layer policy function based on the corrected control target