CN-122024468-A - Multi-agent reinforcement learning scheduling method and system under vehicle-road cooperation


Abstract

The invention provides a multi-agent reinforcement learning scheduling method and system under vehicle-road cooperation. A unified space-time coordinate system is established to standardize and align multi-source perception features; a cross-agent target association model matches the multi-source data; sensor blind area information is completed through context reasoning; perception resources are dynamically scheduled according to blind area priority; and data are fused at nearby nodes by means of edge computing. Based on the global perception data output by the cross-agent dynamic perception fusion and blind area compensation technology, a dual-branch model is constructed to detect and quantify, in real time, the distribution shift between the training and deployment environments; MDP parameters are updated online and incrementally through a Bayesian inference framework; sudden environment changes are adapted to quickly in combination with a meta-learning module; and the model is further optimized through multi-scene mixed training and a domain-adaptation regularization term.

Inventors

SONG FUCAI
ZHANG NA
YUE XIAOHAN
GONG YANXUE
YAN HANBING

Assignees

青岛市交通科学研究院

Dates

Publication Date: 20260512
Application Date: 20251223

Claims (10)

1. A multi-agent reinforcement learning scheduling method under vehicle-road cooperation, characterized by comprising the following steps:
establishing a unified space-time coordinate system to standardize and align multi-source perception features, completing multi-source data matching with a cross-agent target association model, completing sensor blind area information through context reasoning, dynamically scheduling perception resources based on blind area priority, and fusing data at nearby nodes by means of edge computing;
constructing, based on the global perception data output by the cross-agent dynamic perception fusion and blind area compensation technology, a dual-branch model that detects and quantifies in real time the distribution shift between the training and deployment environments, updating MDP parameters online and incrementally through a Bayesian inference framework, adapting quickly to sudden environment changes in combination with a meta-learning module, and optimizing through multi-scene mixed training and a domain-adaptation regularization term;
quantifying decision uncertainty by Bayesian modeling, in combination with the environment parameters adapted by the non-stationary MDP dynamic adaptation and generalization enhancement mechanism, and triggering a corresponding decision mechanism; enhancing, through adversarial training and fault injection, the model's robustness to interference factors arising in the cross-agent perception process; and reconstructing the reward function with hard safety constraints incorporated, so as to construct a multi-objective optimization system; and
designing a lightweight intention encoding scheme based on the perception support provided by the cross-agent dynamic perception fusion and blind area compensation technology and on the environment state adapted by the non-stationary MDP dynamic adaptation and generalization enhancement mechanism, sharing agent intentions efficiently through an on-demand broadcast mechanism, constructing a dynamic interaction graph, predicting the outcomes of interactions among agents with a graph neural network, and, in combination with the decision safety guaranteed by the robust training framework, resolving decision conflicts in advance through a distributed conflict resolution algorithm and a cooperative incentive mechanism.
2. The multi-agent reinforcement learning scheduling method under vehicle-road cooperation according to claim 1, wherein establishing the unified space-time coordinate system to standardize and align multi-source perception features, completing multi-source data matching with the cross-agent target association model, completing sensor blind area information through context reasoning, dynamically scheduling perception resources based on blind area priority, and fusing data at nearby nodes by means of edge computing comprises:
selecting a roadside unit node as a reference point and defining a three-dimensional global coordinate system whose horizontal axis is the direction in which the road extends, whose longitudinal axis is the direction perpendicular to the road, and whose vertical axis is the height direction; defining coordinate mapping rules for vehicle-end agents and roadside devices; completing the initial coordinate calibration of each device by matching GPS/BeiDou positioning data against environmental features from the roadside lidar; and establishing a binding between device IDs and global coordinates;
calibrating the timestamps of vehicle-end and roadside sensing devices uniformly with the Network Time Protocol, correcting time deviations introduced during data transmission with a delay compensation algorithm, performing spatial coordinate conversion on the sensing data output by each device, and mapping target position and speed parameters from the vehicle-end local coordinate system into the three-dimensional global coordinate system through a coordinate transformation matrix;
extracting the target features collected by each device, the target features comprising target type, position, speed and confidence; normalizing the continuous position and speed features with min-max scaling so that feature values are mapped to a unified interval; introducing a domain-adaptation regularization term that takes the roadside lidar's perception data as the reference to correct the perception deviations of vehicle-end cameras and radars, thereby eliminating feature deviations caused by device hardware differences; and outputting a standardized target feature quadruple;
designing a target association cost function based on the Hungarian algorithm, the calculation factors of which comprise the position similarity of targets in the global coordinate system, the motion trajectory consistency after fitting historical positions, the target type matching degree and the confidence difference; and constructing a dynamic threshold adjustment module that dynamically adjusts the association matching cost threshold by counting road network traffic density in real time, so as to improve the target matching coverage rate of the multi-source data;
determining the detection blind area range of each device through geometric calculation, based on the hardware parameters of each sensing device and real-time environment detection results, the hardware parameters comprising detection range, angle and resolution and the real-time environment detection results comprising vehicle occlusion, building outlines and road terrain; mapping the blind area positions into the three-dimensional global coordinate system; and marking the boundary and size of each blind area and the distribution of surrounding perceived targets to form a blind area information list;
collecting historical trajectory data of matched targets around a blind area, extracting target motion patterns through polynomial fitting, inputting the surrounding target motion patterns, lane line direction and road topology context information into a pre-trained generative adversarial network to generate a probability distribution heat map of targets within the blind area, determining the position and speed parameters of blind area targets based on the heat map peak regions and motion pattern prediction, and completing the global perception state;
constructing a blind area priority evaluation model with blind area coverage and traffic risk level as evaluation indexes, the blind area coverage comprising the number of affected lanes and the proportion of road network nodes covered, and the traffic risk level comprising scene weights corresponding to intersections, school zones and construction sections; calculating the priority score of each blind area through weighted summation; and marking high-priority blind areas;
sending perception resource scheduling instructions to agents in the corresponding areas according to the blind area priority evaluation results, triggering a perception sampling frequency increase for vehicle-end devices within the coverage of high-priority blind areas, reducing the perception sampling frequency for non-blind areas or low-priority blind areas to reduce the data transmission volume, and dynamically adjusting data transmission priority based on real-time monitoring data of road network bandwidth, so that perception data for high-priority blind areas is transmitted first; and
deploying edge computing nodes at road network positions and dividing the service range of each edge computing node so that vehicle-end and roadside sensing data are accessed at the nearest node; the edge computing nodes receiving the standardized sensing data and blind area completion results uploaded by the agents, fusing multi-source observations of the same target with a multi-source data fusion algorithm, removing outliers and correcting measurement errors, generating a globally unified perception result after fusion, and feeding the globally unified perception result back to the agents and the scheduling system through a low-latency transmission channel.
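The coordinate mapping and min-max scaling steps of claim 2 can be illustrated with a minimal sketch. It assumes a planar (2-D) simplification of the three-dimensional global coordinate system and a device pose given as a global position plus a heading angle; the function names `local_to_global` and `min_max_scale` are hypothetical and do not come from the patent.

```python
import math

def local_to_global(x_local, y_local, device_xy, heading_rad):
    """Map a point from a device's local frame to the global frame.

    Assumption: planar motion, so the transform is a 2-D rotation by the
    device heading followed by a translation to the device's global position.
    """
    cos_h, sin_h = math.cos(heading_rad), math.sin(heading_rad)
    x_global = device_xy[0] + cos_h * x_local - sin_h * y_local
    y_global = device_xy[1] + sin_h * x_local + cos_h * y_local
    return x_global, y_global

def min_max_scale(values, lower=0.0, upper=1.0):
    """Min-max scale a list of continuous feature values into [lower, upper]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Degenerate case: all values identical, map everything to the lower bound.
        return [lower for _ in values]
    span = v_max - v_min
    return [lower + (v - v_min) * (upper - lower) / span for v in values]

# A vehicle at global (100, 20) whose forward axis points along global +y
# observes a target 10 m straight ahead in its own frame.
gx, gy = local_to_global(10.0, 0.0, (100.0, 20.0), math.pi / 2)
print(round(gx, 6), round(gy, 6))
print(min_max_scale([0.0, 5.0, 20.0]))
```

Velocity vectors would only need the rotation part of the transform, since a translation does not change a velocity.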
3. The multi-agent reinforcement learning scheduling method under vehicle-road cooperation according to claim 2, wherein constructing the dynamic threshold adjustment module that dynamically adjusts the association matching cost threshold by counting road network traffic density in real time, so as to improve the target matching coverage rate of the multi-source data, comprises:
extracting the standardized perception target data uploaded by each vehicle-end and roadside device, the data comprising a unique target identifier, position coordinates in the global coordinate system, a timestamp, a target type, a confidence and a historical position sequence over the most recent preset number of time steps; filtering outliers from the perception target data by removing targets whose confidence is below a preset minimum threshold or whose position coordinates fall outside the reasonable range of the road network; and retaining the valid data;
calculating, in the global coordinate system, the Euclidean distance between the position coordinates of targets to be matched from different device sources, and normalizing the Euclidean distance to obtain a position similarity mapped to the interval [0,1], wherein the closer the position similarity is to 1, the stronger the positional correlation of the targets to be matched;
fitting a quadratic polynomial to the historical position sequence of each target to obtain a trajectory fitting curve, calculating the rate of change of the slope of the trajectory fitting curve, and normalizing the absolute difference between the slope change rates of the targets to be matched to obtain a motion trajectory consistency, wherein the closer the motion trajectory consistency is to 1, the stronger the trajectory correlation of the targets to be matched;
constructing a target type matching matrix that assigns corresponding matching degree values to targets of the same type, similar types and different types, and looking up the target type matching matrix to obtain the type matching degree of the targets to be matched;
extracting the confidence values of the targets to be matched, calculating the absolute difference between the two confidence values, and normalizing it to obtain a confidence difference, wherein the closer the confidence difference is to 0, the higher the confidence consistency of the targets to be matched;
constructing a cost function by weighted summation, with the position similarity, motion trajectory consistency, target type matching degree and confidence difference as calculation factors, each calculation factor being assigned a weight coefficient determined through cross-validation, wherein the smaller the cost value calculated by the cost function, the higher the degree of association of the targets to be matched;
dividing grid cells with the service range of an edge computing node as the statistical region, counting the number of targets in each grid cell per unit time from the perception target data uploaded by each device, calculating the average traffic density of the region, and applying moving-average filtering to the average traffic density to obtain a smoothed traffic density value;
dividing three traffic density intervals of low, medium and high density and setting a base matching threshold for each, wherein the base matching threshold is lowered in the high-density interval, kept unchanged in the medium-density interval and raised in the low-density interval, to form a dynamic matching threshold;
constructing a bipartite graph from all targets to be matched, taking the cost values calculated by the cost function as the edge weights of the bipartite graph, invoking the Hungarian algorithm to solve the minimum-weight matching, selecting the matching pairs whose cost values are smaller than the dynamic matching threshold, determining the multi-source data association result for each target, marking targets that are not successfully matched as new targets or temporarily vanished targets, and carrying them into the next matching cycle; and
counting the target matching coverage rate and the false matching rate, dynamically adjusting the matching threshold of the corresponding traffic density interval according to these two indicators, and periodically optimizing the weight coefficients of the cost function based on the matching performance data.
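The cost function, density-dependent threshold and minimum-weight matching of claim 3 can be sketched as follows. Several things here are assumptions rather than the patent's specification: the weight values, the base threshold, the density cut points, and the choice to turn each similarity into a cost as `1 - similarity`. The assignment is solved by brute force over permutations as a small stand-in for the Hungarian algorithm; it yields the same minimum-cost matching but only scales to a handful of targets.

```python
from itertools import permutations

# Hypothetical weights; the claim determines them through cross-validation.
WEIGHTS = {"position": 0.4, "trajectory": 0.3, "type": 0.2, "confidence": 0.1}
BASE_THRESHOLD = 0.5                                       # hypothetical
DENSITY_FACTOR = {"low": 1.2, "medium": 1.0, "high": 0.8}  # hypothetical

def association_cost(pos_sim, traj_cons, type_match, conf_diff):
    """Weighted-sum cost; similarities in [0, 1] are converted to costs (assumption)."""
    return (WEIGHTS["position"] * (1.0 - pos_sim)
            + WEIGHTS["trajectory"] * (1.0 - traj_cons)
            + WEIGHTS["type"] * (1.0 - type_match)
            + WEIGHTS["confidence"] * conf_diff)

def density_interval(density, low_cut=10.0, high_cut=30.0):
    """Classify a smoothed traffic density value; the cut points are illustrative."""
    if density < low_cut:
        return "low"
    if density < high_cut:
        return "medium"
    return "high"

def dynamic_threshold(density):
    """Lower the threshold in dense traffic, raise it in sparse traffic."""
    return BASE_THRESHOLD * DENSITY_FACTOR[density_interval(density)]

def min_cost_matching(cost):
    """Exhaustive minimum-cost assignment; requires len(rows) <= len(cols)."""
    n_rows, n_cols = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for cols in permutations(range(n_cols), n_rows):
        total = sum(cost[r][c] for r, c in enumerate(cols))
        if total < best_total:
            best, best_total = cols, total
    return list(enumerate(best))

def associate(cost, density):
    """Keep only matched pairs whose cost is below the dynamic threshold."""
    limit = dynamic_threshold(density)
    return [(r, c) for r, c in min_cost_matching(cost) if cost[r][c] < limit]

cost_matrix = [[0.10, 0.90],
               [0.80, 0.45]]
print(associate(cost_matrix, density=5.0))   # sparse traffic, looser threshold
print(associate(cost_matrix, density=35.0))  # dense traffic, stricter threshold
```

In the dense-traffic call the second pair is dropped, so that target would be marked as new or temporarily vanished and carried into the next matching cycle.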
4. The multi-agent reinforcement learning scheduling method under vehicle-road cooperation according to claim 2, wherein collecting historical trajectory data of matched targets around a blind area, extracting target motion patterns through polynomial fitting, inputting the surrounding target motion patterns, lane line direction and road topology context information into the pre-trained generative adversarial network to generate the probability distribution heat map of targets within the blind area, determining the position and speed parameters of blind area targets based on the heat map peak regions and motion pattern prediction, and completing the global perception state comprises:
setting a spatial range from the blind area boundary, screening the associated matched targets, extracting historical trajectory data containing global coordinates and timestamps, denoising the trajectory data, filling in missing data to ensure trajectory continuity, and unifying time intervals so that the trajectories are consistent with the time synchronization standard of the global perception data;
performing polynomial fitting on the trajectory of each surrounding target in each dimension to obtain motion, speed and acceleration parameters, and integrating the parameters into a motion pattern feature vector;
extracting lane line information around the blind area through the multi-source sensing devices to obtain lane line direction and extension trend, collecting road topology information for the road section where the blind area is located, standardizing the motion pattern feature vectors, lane line information and road topology information, and concatenating them according to preset rules to form a context feature matrix;
obtaining geometric parameters of the blind area, including boundary coordinates, shape and grid partition, fusing the context feature matrix with the geometric parameters, mapping the fused result through a fully connected layer into an input vector of fixed dimension, and fine-tuning the pre-trained generative adversarial network so that it can receive the input vector and output a target probability distribution heat map corresponding to the blind area grid;
feeding the input vector into the generator of the fine-tuned adversarial network to generate an initial target probability distribution heat map, having the discriminator evaluate the realism of the initial heat map and provide feedback for optimization until the heat map conforms to traffic scene rules, and applying masking based on prior knowledge to remove the probability distribution of implausible regions;
processing the heat map with a non-maximum suppression algorithm to screen peak regions, extracting the center coordinates of each peak region as an initial position, correcting it using the surrounding target motion patterns, lane line direction and road topology constraints, predicting the speed parameter of the target at the current moment, and assigning a confidence to each inferred target to form a target parameter triple of position, speed and confidence;
integrating the target parameter triples into the global perception data set to fill blind area perception gaps, verifying the spatial compatibility of each completed target with surrounding perceived targets and its consistency with the historical motion trend in the time dimension, and, if a position conflict or abrupt motion change exists, readjusting the heat map generation result until the consistency requirements are met; and
monitoring subsequent real perception data for the completed blind area targets, calculating the error between the completed parameters and the real data, updating the adversarial network model parameters with the error as a supervision signal, and retraining the trajectory fitting model and the adversarial network model at a preset period using the accumulated error data and new traffic scene samples, thereby continuously improving the accuracy and robustness of blind area target completion.
5. The multi-agent reinforcement learning scheduling method under vehicle-road cooperation according to claim 2, wherein constructing the blind area priority evaluation model with blind area coverage and traffic risk level as evaluation indexes, the blind area coverage comprising the number of affected lanes and the proportion of road network nodes covered, and the traffic risk level comprising scene weights corresponding to intersections, school zones and construction sections, calculating the priority score of each blind area through weighted summation, and marking high-priority blind areas comprises:
defining calculation rules for the two sub-indexes of blind area coverage, wherein the number of affected lanes is counted from the intersection of the blind area projection with the lanes and the road network node proportion is quantified as the ratio of the number of covered nodes to the total number of nodes in the region; presetting a traffic risk level scene weight dictionary; and taking the maximum weight when a blind area covers multiple scenes;
determining comprehensive index weights using the analytic hierarchy process and the entropy weight method, by constructing a hierarchical structure model comprising a target layer, a criterion layer and a scheme layer, obtaining subjective weights from expert scoring, calculating objective weights from historical data, and fusing the two at a preset ratio to determine each index weight;
acquiring the required data through the multi-source sensing devices, a road network database and real-time bulletins, and removing abnormal data after verification;
calculating, for each blind area in turn, the quantified values of the number of affected lanes, the road network node proportion and the traffic risk level;
constructing a weighted summation formula based on the comprehensive weights, calculating the priority score and normalizing it to a preset interval, wherein a higher score indicates a higher priority;
setting thresholds through K-means clustering or expert consultation to divide high, medium and low priorities, with the threshold intervals subsequently adjusted dynamically;
determining the priority level of each blind area, marking the high-priority blind area information, generating a list, and synchronizing it to the resource scheduling system; and
monitoring scene changes in real time to trigger index recalculation and update the priority scores, fine-tuning the index weights at a preset period, and optimizing the thresholds in light of the scheduling effect, so as to adapt to the dynamic traffic environment.
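The weighted-sum priority scoring of claim 5 can be illustrated with a small sketch. All numeric values (scene weights, index weights, lane cap, level thresholds) and all names are hypothetical placeholders; in the claim the weights come from the analytic hierarchy process and the entropy weight method, and the thresholds from K-means clustering or expert consultation.

```python
# Hypothetical scene weight dictionary for the traffic risk level.
SCENE_WEIGHTS = {"intersection": 0.8, "school_zone": 1.0,
                 "construction": 0.9, "ordinary": 0.3}
# Hypothetical comprehensive index weights (they sum to 1).
INDEX_WEIGHTS = {"lanes": 0.3, "nodes": 0.2, "risk": 0.5}
MAX_LANES = 8  # hypothetical cap used to normalize the lane count

def fuse_weights(subjective, objective, alpha=0.5):
    """Blend expert (subjective) and data-driven (objective) weights at ratio alpha."""
    fused = {k: alpha * subjective[k] + (1.0 - alpha) * objective[k]
             for k in subjective}
    total = sum(fused.values())
    return {k: v / total for k, v in fused.items()}

def priority_score(affected_lanes, covered_nodes, total_nodes, scenes):
    """Weighted sum of normalized sub-indexes; the result lies in [0, 1]."""
    lane_term = min(affected_lanes / MAX_LANES, 1.0)
    node_term = covered_nodes / total_nodes
    # When several scenes overlap the blind area, take the maximum weight.
    risk_term = max(SCENE_WEIGHTS[s] for s in scenes)
    return (INDEX_WEIGHTS["lanes"] * lane_term
            + INDEX_WEIGHTS["nodes"] * node_term
            + INDEX_WEIGHTS["risk"] * risk_term)

def priority_level(score, high=0.7, medium=0.4):
    """Map a score to a level using illustrative fixed thresholds."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"

busy = priority_score(6, 5, 20, ["intersection", "school_zone"])
quiet = priority_score(2, 2, 20, ["ordinary"])
print(priority_level(busy), priority_level(quiet))
```

Because the index weights sum to 1 and every sub-index is scaled into [0, 1], the score needs no further normalization in this sketch.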
6. The multi-agent reinforcement learning scheduling method under vehicle-road cooperation according to claim 1, wherein constructing, based on the global perception data output by the cross-agent dynamic perception fusion and blind area compensation technology, the dual-branch model that detects and quantifies in real time the distribution shift between the training and deployment environments, updating MDP parameters online and incrementally through the Bayesian inference framework, adapting quickly to sudden environment changes in combination with the meta-learning module, and optimizing through multi-scene mixed training and the domain-adaptation regularization term comprises:
extracting state and action features from the global perception data, dividing them into a training reference data set and a deployment real-time data stream, and, after feature processing, generating deployment environment samples of the same format through a sliding time window;
constructing a dual-branch detection network that quantifies state-space shift with the Wasserstein distance and action-space shift with the KL divergence, training the model and setting dynamic thresholds, wherein exceeding a threshold triggers MDP parameter adjustment;
modeling the MDP parameters with conjugate prior distributions, extracting deployment environment interaction samples to construct an incremental data set, and updating the parameters online and incrementally through variational Bayesian optimization;
constructing a meta-learning data set of sudden-change scenes, training a meta-model based on MAML (model-agnostic meta-learning), invoking the meta-model when the shift surges, and fine-tuning it on a small number of samples to adapt quickly to the sudden-change environment;
constructing a diversified scene training library, labeling non-stationarity coefficients, and dynamically adjusting sampling weights to suit training needs using scene rotation and weighted sampling mechanisms;
incorporating a domain-adversarial term into the total MARL loss, constructing a domain discriminator, implementing adversarial training through a gradient reversal layer, and learning general scene features; and
periodically evaluating generalization performance with a cross-scene test set, adjusting training samples, weights and inference methods according to the results, and updating model parameters in combination with deployment feedback to achieve continuous optimization.
7. The multi-agent reinforcement learning scheduling method under vehicle-road cooperation according to claim 6, wherein constructing the dual-branch detection network that quantifies state-space shift with the Wasserstein distance and action-space shift with the KL divergence, training the model and setting dynamic thresholds, wherein exceeding a threshold triggers MDP parameter adjustment, comprises:
splitting the state features and action features of the global perception data, dividing them into a training reference data set and a deployment detection data set, standardizing continuous features and encoding discrete features, and generating detection samples of consistent format through a sliding time window;
constructing a dual-branch network with a shared bottom-layer feature extractor, wherein a feature mapping layer is added to the state branch, the action branch connects directly to the shared feature layer, and each branch outputs its shift value through a single node;
integrating a Wasserstein distance calculation module into the state branch and embedding a KL divergence calculation module in the action branch, constructing a labeled data set, and pre-training and fine-tuning the network with a mean squared error loss function until convergence;
setting an initial threshold according to the 3σ rule based on the mean and standard deviation of historical shift data, dynamically correcting it with an environment complexity coefficient, and iteratively updating the threshold at a preset period; and
inputting deployment real-time samples into the dual-branch network, outputting the two shift values and comparing them with the corresponding thresholds, wherein either value exceeding its threshold generates an MDP parameter adjustment trigger signal and, at the same time, a detection report containing the related information.
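The shift quantities and the 3σ threshold of claim 7 can be illustrated without the neural network. The sketch below computes the two distances directly from samples instead of regressing them with the dual-branch network, which is a deliberate simplification. It further assumes one-dimensional state features, equal-size reference and deployment samples, a discrete action distribution, and a multiplicative environment complexity correction; all function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def wasserstein_1d(reference, live):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples."""
    if len(reference) != len(live):
        raise ValueError("this sketch assumes equal sample sizes")
    return mean(abs(a - b) for a, b in zip(sorted(reference), sorted(live)))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete action distributions over the same actions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def three_sigma_threshold(history, complexity=1.0):
    """Mean plus three standard deviations of past shift values,
    scaled by an environment complexity coefficient (assumed multiplicative)."""
    return (mean(history) + 3.0 * pstdev(history)) * complexity

def shift_triggered(state_shift, action_shift, state_thr, action_thr):
    """Either shift exceeding its threshold triggers MDP parameter adjustment."""
    return state_shift > state_thr or action_shift > action_thr

state_shift = wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
action_shift = kl_divergence([0.9, 0.1], [0.5, 0.5])
state_thr = three_sigma_threshold([0.10, 0.12, 0.08, 0.11, 0.09])
action_thr = three_sigma_threshold([0.30, 0.35, 0.25, 0.32, 0.28], complexity=1.1)
print(shift_triggered(state_shift, action_shift, state_thr, action_thr))
```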
8. The multi-agent reinforcement learning scheduling method under vehicle-road cooperation according to claim 6, wherein constructing the diversified scene training library, labeling non-stationarity coefficients, and dynamically adjusting sampling weights to suit training needs using scene rotation and weighted sampling mechanisms comprises:
organizing the dimensions of vehicle-road cooperative scenes, collecting real and simulated multi-source scene data, preprocessing, classifying and storing the data, and constructing a searchable structured scene training library;
defining feature fluctuation amplitude and scene switching frequency as the quantitative indexes of the non-stationarity coefficient, fusing them to calculate the coefficient, and labeling all scenes with manual verification;
setting a rotation period at the training batch level, grading scenes by coefficient, interleaving them uniformly to form training batches, and randomly shuffling the rotation order to keep scene exposure balanced;
setting base sampling weights, establishing a mapping between sampling weight and non-stationarity coefficient, sampling by weight with a specified method, and capping the maximum sampling amount of any single scene;
monitoring sample utilization, training loss and transfer accuracy, dynamically adjusting the weights when trigger conditions are met, and periodically recalculating the coefficients and updating the sampling weights; and
periodically expanding the training library, performing cross-scene validation batch by batch, adjusting sampling weights in a targeted manner, and optimizing the scene dimensions and the coefficient calculation logic.
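The weighted sampling with a per-scene cap described in claim 8 can be sketched as below. The linear mapping from non-stationarity coefficient to sampling weight, the scene names and the cap value are assumptions made for illustration; the claim leaves the mapping and the sampling method unspecified.

```python
import random

def sampling_weights(coefficients, base=1.0, gain=1.0):
    """Map non-stationarity coefficients to normalized sampling weights.

    Assumption: weight = base + gain * coefficient, then normalized to sum to 1,
    so that less stationary scenes are sampled more often.
    """
    raw = {scene: base + gain * c for scene, c in coefficients.items()}
    total = sum(raw.values())
    return {scene: w / total for scene, w in raw.items()}

def sample_batch(weights, batch_size, max_per_scene, seed=0):
    """Draw scenes by weight while capping how often one scene may appear."""
    rng = random.Random(seed)
    counts = {scene: 0 for scene in weights}
    batch = []
    while len(batch) < batch_size:
        # Scenes that reached the cap are excluded from further draws.
        open_scenes = [s for s in weights if counts[s] < max_per_scene]
        if not open_scenes:
            break
        scene = rng.choices(open_scenes,
                            weights=[weights[s] for s in open_scenes])[0]
        counts[scene] += 1
        batch.append(scene)
    return batch

# Hypothetical scenes with their non-stationarity coefficients.
weights = sampling_weights({"highway": 0.1, "intersection": 0.6, "storm": 0.9})
batch = sample_batch(weights, batch_size=8, max_per_scene=4)
print(batch)
```

The cap prevents a single highly non-stationary scene from dominating a batch, which corresponds to the claim's goal of balanced scene exposure.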
9. The multi-agent reinforcement learning scheduling method under vehicle-road cooperation according to claim 1, wherein quantifying decision uncertainty by Bayesian modeling, in combination with the environment parameters adapted by the non-stationary MDP dynamic adaptation and generalization enhancement mechanism, and triggering the corresponding decision mechanism, enhancing, through adversarial training and fault injection, the model's robustness to interference factors arising in the cross-agent perception process, and reconstructing the reward function with hard safety constraints incorporated, so as to construct the multi-objective optimization system, comprises:
acquiring the environment parameters, namely the state transition probabilities and the reward function, output after non-stationary MDP dynamic adaptation and generalization enhancement, and performing Bayesian modeling on the decision-related parameters, with a Dirichlet distribution as the prior for the state transition probabilities and a Gaussian distribution as the prior for the reward function;
cataloguing the interference factors in the cross-agent perception process, designing a targeted adversarial sample generation method for each type of interference, constructing a mixed training data set of regular and adversarial samples, dynamically adjusting the proportion of adversarial samples, incorporating the adversarial samples into policy network training, and optimizing network parameters with a mixed loss function, forcing the model to learn robust feature representations under interference;
identifying typical fault types of the cross-agent perception system, designing a fault injection mechanism, randomly triggering various faults during model training to simulate real fault scenes, constructing a fault scene training subset, taking the fault-injected perception data as model input, optimizing the policy network's adaptive adjustment capability in fault scenes, and enhancing the model's robustness to various faults;
decomposing the multi-objective optimization dimensions, reconstructing a multi-dimensional reward function, defining rigid hard safety constraints, converting the constraints into two enforcement mechanisms, namely infeasible-action masking and a strong penalty term, and introducing a constraint satisfaction index to ensure that safety constraints take priority over the other objectives;
determining the set of optimization objectives, determining the dynamic weight of each objective using the analytic hierarchy process and the entropy weight method, fixing the safety-related objective weights at the highest level, and dynamically adjusting the remaining objective weights according to environment complexity; and
constructing a comprehensive test data set covering regular, interference, fault and high-uncertainty scenes, setting evaluation indexes, performing a comprehensive evaluation with the test data set each time a preset number of training batches is completed, optimizing in a targeted manner according to the evaluation results, periodically updating the prior distributions of the Bayesian model, the interference type library for adversarial training and the scene coverage of fault injection based on actual operating data from the deployment environment, and continuously iterating on the multi-objective optimization system.
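Two elements of claim 9 lend themselves to a compact illustration: the Dirichlet prior over state transition probabilities and the masking of actions that violate a hard safety constraint. The sketch below uses the standard conjugate update for a Dirichlet prior with categorical observations and the standard Dirichlet mean and variance formulas. It assumes a small discrete state and action space, and the function names are hypothetical.

```python
def dirichlet_update(alpha, next_state_counts):
    """Conjugate update: posterior concentration = prior + observed transition counts."""
    return [a + c for a, c in zip(alpha, next_state_counts)]

def dirichlet_mean(alpha):
    """Posterior mean estimate of the transition probabilities."""
    total = sum(alpha)
    return [a / total for a in alpha]

def dirichlet_variance(alpha):
    """Per-component variance, usable as a simple decision uncertainty measure."""
    total = sum(alpha)
    return [a * (total - a) / (total ** 2 * (total + 1.0)) for a in alpha]

def select_safe_action(action_values, safe_mask):
    """Mask unsafe actions with -inf, then pick the best remaining action."""
    masked = [v if ok else float("-inf")
              for v, ok in zip(action_values, safe_mask)]
    if all(v == float("-inf") for v in masked):
        raise ValueError("no action satisfies the hard safety constraints")
    return max(range(len(masked)), key=masked.__getitem__)

# Uniform prior over three successor states, then ten observed transitions.
posterior = dirichlet_update([1.0, 1.0, 1.0], [8, 1, 1])
print(dirichlet_mean(posterior))
print(dirichlet_variance(posterior))

# Action 0 has the highest value but violates a safety constraint.
print(select_safe_action([5.0, 3.0, 1.0], [False, True, True]))
```

As more transitions are observed the concentration parameters grow and the variance shrinks, which is one way a high-uncertainty decision mechanism could be triggered only while evidence is scarce.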
10. A multi-agent reinforcement learning scheduling system under vehicle-road cooperation, characterized by comprising: a processor; and a machine-readable storage medium storing machine-executable instructions for the processor; wherein the processor is configured to perform the multi-agent reinforcement learning scheduling method under vehicle-road cooperation of any one of claims 1 to 9 by executing the machine-executable instructions.

Description

Multi-agent reinforcement learning scheduling method and system under vehicle-road cooperation

Technical Field

The invention relates to the technical field of pipeline safety monitoring, in particular to a multi-agent reinforcement learning scheduling method and system under vehicle-road cooperation.

Background

In a vehicle-road cooperative system, hardware differences and spatio-temporal desynchronization among multi-source sensing devices (vehicle-mounted cameras, radars, roadside lidars and the like) are prominent problems, leading to deviations in perception features and insufficient data matching accuracy. Sensors are easily occluded by vehicles, buildings and the like, which creates detection blind areas, and conventional methods have difficulty accurately completing the blind area information. Meanwhile, the scheduling of sensing resources lacks priority guidance, so that insufficient sensing of high-risk areas coexists with wasted resources in low-risk areas, and the efficiency of data transmission and fusion can hardly meet real-time scheduling requirements. In addition, intentions are not shared in a timely manner during multi-agent interaction and the accuracy of interaction prediction is low, which further aggravates the incompleteness of global perception and limits the overall efficiency of vehicle-road cooperative scheduling.
Conventional multi-agent reinforcement learning scheduling methods rely on fixed Markov decision process (MDP) parameters and have difficulty adapting to non-stationary characteristics of vehicle-road cooperative scenarios, such as abrupt changes in traffic flow, sudden faults and extreme weather. Shifts in the environment distribution lead to insufficient model generalization, uncertainty in the decision process is insufficiently quantified, and resistance to interference and faults is weak. Moreover, reward function design mostly focuses on a single objective and lacks hard safety constraints and a multi-objective collaborative optimization mechanism, so that the requirements for efficient, safe and robust scheduling in vehicle-road cooperative scenarios are difficult to meet.

Disclosure of Invention

In view of the above problems, according to a first aspect of the present invention, an embodiment of the present invention provides a multi-agent reinforcement learning scheduling method under vehicle-road cooperation, the method comprising:

standardizing and aligning multi-source perception features by establishing a unified space-time coordinate system, completing multi-source data matching by using a cross-agent target association model, completing sensor blind area information by means of context reasoning, dynamically scheduling sensing resources based on blind area priority, and fusing data at nearby nodes by means of edge computing;

based on the global perception data output by the cross-agent dynamic perception fusion and blind area compensation technology, constructing a dual-branch model, detecting and quantifying in real time the distribution shift between the training and deployment environments, updating the MDP parameters online and incrementally through a Bayesian inference framework, quickly adapting to sudden environment changes by means of a meta-learning module, and optimizing through multi-scenario mixed training and domain-adaptation regularization terms;

using the environment parameters adapted by the non-stationary MDP dynamic adaptation and generalization enhancement mechanism, quantifying decision uncertainty by Bayesian modeling and triggering a corresponding decision mechanism, enhancing the robustness of the model to interference factors arising in cross-agent perception through adversarial training and fault injection, and reconstructing the reward function and incorporating hard safety constraints into it, so as to construct a multi-objective optimization system;

based on the perception support provided by the cross-agent dynamic perception fusion and blind area compensation technology and the environment state adapted by the non-stationary MDP dynamic adaptation and generalization enhancement mechanism, designing a lightweight intention coding scheme, sharing agent intentions efficiently through an on-demand broadcasting mechanism, constructing a dynamic interaction graph, predicting the results of interactions among agents by using a graph neural network, and, in combination with the decision safety ensured by the robust training framework, resolving decision conflicts in advance through a distributed conflict resolution algorithm and a cooperative incentive mechanism.

In still another aspect, an embodiment of the present invention further provides a multi-agent reinforcement learning scheduling system under vehicle-road cooperation, characterized by comprising: a processor, a