CN-121982288-A - Multi-agent cooperative sensing method and system based on space-time joint selection
Abstract
The invention relates to a multi-agent collaborative perception method based on space-time joint selection, comprising the steps of: acquiring raw perception data from multiple agents; constructing a spatial sparse region selection mask; generating a joint sparse feature map; obtaining a fusion feature map at the ego agent; performing intermediate-level three-dimensional target detection; and performing late fusion with the independent detection results of the collaborating agents to obtain the final collaborative perception detection result. By constructing the spatial sparse region selection mask, the method maintains high perception performance even in scenes with extremely limited bandwidth and reduces the communication burden; transmission is focused on the dynamic targets that genuinely contribute to collaborative perception, improving information utilization; detection accuracy is improved by exploiting complementary multi-view information; and a late-stage compensation mechanism reduces the risk of missed and false detections caused by delay or occlusion in intermediate fusion, finally yielding a more accurate and complete 3D target detection result.
Inventors
- Pang Denghao
- Xiong Shuangyang
- Zhao Daichao
- Tang Xiwen
- Luo Huishu
- Yang Bo
Assignees
- Anhui University (安徽大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-23
Claims (7)
- 1. A multi-agent cooperative sensing method based on space-time joint selection, characterized by comprising the following sequential steps: (1) acquiring raw perception data from multiple agents; (2) performing feature coding on each agent's raw perception data to obtain multi-scale bird's-eye-view (BEV) features, and constructing a spatial sparse region selection mask based on supply-demand relations; (3) introducing temporal dynamic information on top of the spatial sparse region selection mask to further reduce cross-time redundancy and generate a joint sparse feature map; (4) applying sparse feature compression and quantization coding to the joint sparse feature map to obtain each agent's sparse features, which are transmitted among the agents through a communication module; (5) the ego agent performing unified multi-scale, multi-time-frame space-time attention fusion on the received dynamic sparse features from the other agents and its local historical static features to obtain a fusion feature map; (6) performing intermediate-level three-dimensional target detection based on the fusion feature map, and performing late fusion with the independent detection results of the collaborating agents to obtain the final cooperative sensing detection result, which includes target three-dimensional bounding boxes, categories, confidences, and multi-agent shared supplementary information.
- 2. The method for collaborative sensing of multiple agents based on spatiotemporal joint selection according to claim 1, wherein step (2) comprises the following sequential steps: (2a) inputting the raw perception data into the backbone of the feature coding network to obtain the BEV feature map F_j^t of agent j, where j = 1, …, N, N is the number of agents, t denotes the current time, and H, W and C are the height, width and number of feature channels of the BEV feature map, respectively; (2b) for each collaborating agent j, constructing a supply mask M_sup based on perception confidence information, which characterizes the collaborating agent's ability to provide effective perception information in the corresponding spatial region: M_sup = 1(c > τ_sup), where 1(·) is the indicator function, c denotes the confidence of the point-cloud pillar features or image features, and τ_sup is the supply threshold; (2c) for the ego agent, constructing a demand mask M_dem based on point-cloud density and occlusion conditions: M_dem = 1(ρ < τ_dem), where ρ is the local point-cloud density and τ_dem is a preset demand threshold; if ρ is less than τ_dem, the corresponding region is judged to have a high collaborative perception demand, otherwise the region is considered weakly demanded or not demanded; (2d) generating the spatial sparse region selection mask M_sel from the supply mask and the demand mask: M_sel = M_sup ⊙ M_dem, where ⊙ denotes element-wise multiplication; that is, at time t, collaborating agent j generates the spatial sparse region selection mask from its own supply mask and the ego agent's demand mask.
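The supply-demand masking of step (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation; the function name `build_selection_mask` and the threshold values are hypothetical defaults chosen for the example:

```python
import numpy as np

def build_selection_mask(conf_map, density_map, tau_sup=0.5, tau_dem=0.3):
    """Sketch of the supply-demand spatial selection mask (claim 2).

    conf_map:    (H, W) perception confidence of the collaborating agent.
    density_map: (H, W) local point-cloud density of the ego agent.
    tau_sup / tau_dem are illustrative thresholds, not values from the patent.
    """
    supply_mask = (conf_map > tau_sup).astype(np.float32)    # agent can supply reliable info here
    demand_mask = (density_map < tau_dem).astype(np.float32) # ego agent needs info here
    # Element-wise product: transmit only where supply meets demand.
    return supply_mask * demand_mask
```

A cell is selected only when the collaborating agent perceives it confidently and the ego agent's own observation of it is sparse, which is how the claim confines transmission to useful regions.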
- 3. The multi-agent cooperative sensing method based on space-time joint selection according to claim 1, wherein step (3) specifically comprises the following sequential steps: (3a) for collaborating agent j, first computing a saliency map S_j^t = σ(Φ_cls(F_j^t)), where Φ_cls denotes a classification head shared with the detection module, σ is the Sigmoid function, F_j^t is the collaborating agent's BEV feature map, and H and W are the height and width of the BEV feature map; subsequently, computing a dynamic intensity d that characterizes the degree of variation of each local region in the time dimension, obtained as the saliency difference between adjacent time frames: d = |S_j^t − S_j^{t−1}|, where S_j^{t−1} denotes the saliency map at time t−1; when the dynamic intensity of a region on the BEV feature map of collaborating agent j is higher than a preset threshold, the region is considered to have significant temporal variation, i.e. it is a dynamic region, otherwise it is a static region; (3b) generating a binary dynamic mask from the dynamic intensity: M_dyn = 1(d > τ_dyn), where 1(·) is the indicator function, τ_dyn is the dynamic intensity threshold distinguishing dynamic regions from static regions, and a parameter α controls the weight of the dynamic information; (3c) applying a joint constraint to the spatial sparse region selection mask and the binary dynamic mask to generate a joint mask: M_joint = M_sel ⊙ M_dyn; (3d) generating the joint sparse feature map: F_sparse = M_joint ⊙ F_j^t, where F_j^t is the agent's BEV feature map and C is the number of feature channels of the BEV feature map.
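Steps (3a)-(3d) above can be sketched as follows. This is a simplified NumPy illustration under assumed inputs (precomputed saliency maps); the function name `joint_sparse_features` and the threshold value are hypothetical:

```python
import numpy as np

def joint_sparse_features(feat, sel_mask, sal_t, sal_prev, tau_dyn=0.2):
    """Sketch of claim 3: joint space-time masking of BEV features.

    feat:     (H, W, C) BEV feature map of one agent.
    sel_mask: (H, W) spatial sparse region selection mask from claim 2.
    sal_t / sal_prev: (H, W) saliency maps at times t and t-1.
    tau_dyn is an illustrative dynamic-intensity threshold.
    """
    dyn_intensity = np.abs(sal_t - sal_prev)                 # saliency change across frames
    dyn_mask = (dyn_intensity > tau_dyn).astype(np.float32)  # binary dynamic mask
    joint_mask = sel_mask * dyn_mask                         # joint constraint of both masks
    return feat * joint_mask[..., None]                      # zero out unselected cells
```

Only cells that are both supply-demand selected and temporally dynamic survive, so static background is dropped from the transmission even inside demanded regions.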
- 4. The multi-agent cooperative sensing method based on space-time joint selection according to claim 1, wherein step (4) specifically comprises the following sequential steps: (4a) inputting the joint sparse feature map into a sparse convolution encoder for feature compression; the sparse convolution encoder adopts a multi-layer sparse convolution structure whose kernel size, stride and number of output channels are configured according to actual requirements, with the following structure: the first layer uses a 3×3 kernel with stride 2 and 64 output channels, the second layer uses a 3×3 kernel with stride 2 and 128 output channels, the third layer uses a 3×3 kernel with stride 2 and 256 output channels, and the fourth layer uses a 1×1 kernel with 32 compressed output channels; (4b) performing channel-wise Min-Max normalization on the channel features output by the sparse convolution encoder, mapping the feature values of each channel c to the [0,1] interval to obtain the normalized features: x̂_c = (x_c − min(x_c)) / (max(x_c) − min(x_c) + ε), where c denotes the channel index and ε is a small positive number preventing division by zero; subsequently, performing uniform quantization on the normalized features: q_c = ⌊x̂_c · (2^b − 1)⌋, where ⌊·⌋ denotes rounding down, q_c is the quantized encoded value of channel c, and b is the quantization bit width; the normalized and uniformly quantized features are further compressed and encoded by an autoencoder to obtain the agent's sparse features; the autoencoder comprises an encoder and a decoder, wherein the encoder maps the quantized sparse features into a low-dimensional latent representation to reduce the amount of data to be transmitted, and the decoder, deployed at the receiving end, reconstructs the corresponding feature representations from the received latent representations.
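The normalization and uniform quantization of step (4b) can be sketched directly from the formulas above. A minimal NumPy sketch; the function name `quantize_features` is hypothetical, and the 8-bit width is an illustrative choice:

```python
import numpy as np

def quantize_features(x, bits=8, eps=1e-6):
    """Sketch of claim 4(b): channel-wise Min-Max normalization + uniform quantization.

    x:    (H, W, C) feature map from the sparse convolution encoder.
    bits: quantization bit width b (8 here, illustrative).
    eps:  small positive number preventing division by zero.
    Returns the integer codes and the (min, max) needed for dequantization.
    """
    mn = x.min(axis=(0, 1), keepdims=True)          # per-channel minimum
    mx = x.max(axis=(0, 1), keepdims=True)          # per-channel maximum
    x_norm = (x - mn) / (mx - mn + eps)             # map each channel into [0, 1]
    levels = 2 ** bits - 1
    q = np.floor(x_norm * levels).astype(np.int32)  # uniform quantization, floor rounding
    return q, (mn, mx)
```

The receiver would invert the mapping with the transmitted per-channel (min, max) pair before feeding the features to the decoder.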
- 5. The method for collaborative sensing of multiple agents based on spatiotemporal joint selection according to claim 1, wherein step (5) comprises the following steps: (5a) at time t, the ego agent receives the sparse features of agent j; the ego agent reads from its memory bank the reconstruction features of agent j stored at the previous time t−1, and inputs both the received sparse features and the stored reconstruction features into a feature reconstruction network for fusion completion, obtaining the complete reconstruction features of each agent at the current time; the new reconstruction features are then written into the memory bank for the subsequent temporal feature reconstruction process; (5b) injecting agent meta-information into the reconstruction features of each agent for feature alignment, the meta-information comprising the agent's communication delay, motion speed, agent number and agent type, so as to compensate for the differences of different agents in temporal and motion states; then extracting from the memory bank the historical fusion feature map obtained at the previous time by fusing the multi-agent reconstruction features and the ego features through the unified space-time attention fusion module, and on this basis constructing a unified space-time feature queue comprising the delayed historical feature map (delayed by the actual number of communication delay frames), the current-time reconstructed BEV features of each of the N agents, and the ego features; (5c) inputting the ego agent's raw perception data at the current time into the backbone of the feature coding network to obtain the complete BEV features, and taking the feature vector at each spatial location p as the query feature q_p; applying a shared linear transformation to each feature map of the unified space-time feature queue to obtain the value features for sampling: V_k = W F_k, where W is a learnable linear transformation matrix, F_k denotes the k-th feature map of the unified space-time feature queue, and H, W and C are the height, width and number of feature channels of the BEV feature map; for the two-dimensional reference point p corresponding to query feature q_p, on the h-th attention head a multi-layer perceptron directly predicts from q_p the deformable sampling offsets Δp_{hkm} and the corresponding attention weights A_{hkm} on the k-th space-time feature map, where Δp_{hkm} denotes the two-dimensional offset of the m-th sampling point relative to the reference point p, A_{hkm} is the corresponding attention weight, the prediction network belongs to the h-th attention head, and M is the number of deformable sampling points predicted and sampled for each query feature on a single attention head and a single feature map; (5d) on the h-th attention head, the aggregation result at reference point p is obtained by a weighted sum over all feature maps of the unified space-time feature queue and their corresponding sampling points: h_h(p) = Σ_k Σ_m A_{hkm} · W_h · F_k(p + Δp_{hkm}), where W_h is the linear transformation matrix of the h-th attention head and F_k(p + Δp_{hkm}) denotes the feature value sampled from the k-th space-time feature map by bilinear interpolation; the outputs of all attention heads are concatenated along the channel dimension, and the fusion feature map is obtained through an output projection and a feedforward network: F_fuse = FFN(Proj(Concat(h_1(p), …, h_{N_h}(p)))), where Concat denotes the concatenation along the feature channel dimension restoring the channel count from C/N_h to the full channel count C, FFN denotes a feedforward network module that further improves the expressive capability of the fused features, h_h denotes the output features of the h-th attention head at all query positions obtained by deformable weighted sampling over the unified space-time feature queue, and N_h is the total number of attention heads.
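The core of step (5d) is a weighted sum of bilinearly sampled features over the space-time queue. Below is a single-head NumPy sketch under assumed shapes (no learned projection, offsets and weights passed in rather than predicted); the names `bilinear_sample` and `deform_aggregate` are hypothetical:

```python
import numpy as np

def bilinear_sample(fmap, y, x):
    """Bilinear interpolation of fmap (H, W, C) at an in-bounds fractional (y, x)."""
    H, W, _ = fmap.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    y0, x0 = max(y0, 0), max(x0, 0)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * fmap[y0, x0] + (1 - wy) * wx * fmap[y0, x1]
            + wy * (1 - wx) * fmap[y1, x0] + wy * wx * fmap[y1, x1])

def deform_aggregate(queue, ref, offsets, weights):
    """Single-head sketch of claim 5(d): weighted sum over deformable samples.

    queue:   list of K (H, W, C) space-time feature maps.
    ref:     (y, x) two-dimensional reference point p.
    offsets: (K, M, 2) predicted sampling offsets Δp relative to ref.
    weights: (K, M) attention weights A, assumed to sum to 1.
    """
    out = 0.0
    for k, fmap in enumerate(queue):
        for m in range(offsets.shape[1]):
            dy, dx = offsets[k, m]
            out = out + weights[k, m] * bilinear_sample(fmap, ref[0] + dy, ref[1] + dx)
    return out
```

In the full method the offsets and weights come from per-head MLPs conditioned on the query, and each head's result passes through its linear transform before concatenation and the FFN.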
- 6. The multi-agent cooperative sensing method based on space-time joint selection according to claim 1, wherein step (6) specifically comprises the following sequential steps: (6a) generating an intermediate detection result based on the fusion feature map: the fusion feature map is input to a three-dimensional target detection head to generate the ego agent's intermediate detection result at the current time, which includes the target's three-dimensional bounding box, category and confidence; at the same time, independent detection results are received from the other collaborating agents; each collaborating agent generates its independent detection result locally by inputting its current local BEV features into a three-dimensional target detection head with the same structure as the ego agent's detection head, and sends it to the ego agent over the communication link; (6b) screening the collaborating agents' independent detection results according to their confidences, discarding those whose confidence is lower than a preset threshold; for the detection results remaining after screening, multiplying the corresponding confidence by a suppression coefficient to obtain the suppressed independent detection results; the suppressed independent detection results of all N collaborating agents are recorded as a set; (6c) merging the ego agent's intermediate detection result with this set, and performing three-dimensional non-maximum suppression on the merged detection results in a unified coordinate system to obtain the final collaborative perception detection result.
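The late-fusion logic of steps (6b)-(6c) can be sketched as follows. For simplicity this sketch uses axis-aligned BEV boxes and greedy 2D NMS in place of the full 3D non-maximum suppression; the function names and the threshold/suppression values are hypothetical:

```python
def iou(a, b):
    """Axis-aligned IoU for boxes (x1, y1, x2, y2, score)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda d: (d[2] - d[0]) * (d[3] - d[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def late_fuse(ego_dets, coop_dets, tau=0.4, lam=0.8, iou_thr=0.5):
    """Sketch of claim 6: confidence screening, suppression, merge, greedy NMS.

    ego_dets / coop_dets: lists of (x1, y1, x2, y2, score) BEV boxes.
    tau:  confidence threshold below which coop detections are discarded.
    lam:  suppression coefficient applied to surviving coop confidences.
    All three values are illustrative, not taken from the patent.
    """
    kept = [(x1, y1, x2, y2, s * lam)               # damp coop confidences
            for (x1, y1, x2, y2, s) in coop_dets if s >= tau]
    dets = sorted(ego_dets + kept, key=lambda d: d[4], reverse=True)
    out = []
    while dets:                                      # greedy NMS on the merged set
        best = dets.pop(0)
        out.append(best)
        dets = [d for d in dets if iou(best, d) < iou_thr]
    return out
```

Because ego intermediate detections keep their full confidence while coop detections are damped by the suppression coefficient, the ego result wins when both agents report the same object, and coop detections fill in only what the ego agent missed.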
- 7. A system for implementing the multi-agent collaborative sensing method based on spatiotemporal joint selection of any of claims 1-6, comprising: a feature coding module, which performs feature extraction and coding on the raw perception data acquired by each agent and generates multi-scale bird's-eye-view (BEV) features for subsequent cooperative processing, so as to realize a unified spatial representation of environmental information; a supply-demand sparse selection module, which constructs a supply mask based on the perception confidence information of the collaborating agents and a demand mask based on the ego agent's point-cloud density or occlusion conditions, so as to describe the collaborative-perception supply-demand relations of different spatial regions; a space-time dynamic detection module, which analyzes the change in feature saliency between adjacent time frames, detects the temporal dynamic intensity of each spatial region, and distinguishes dynamic regions from static regions so as to reduce the transmission of cross-time redundant information; a joint mask generation module, which applies a joint constraint to the supply-demand sparse selection result and the space-time dynamic detection result to generate a joint mask, and screens out the joint sparse feature regions requiring cooperative transmission according to the joint mask; a sparse compression and communication module, which performs sparse convolution compression, normalization, quantization and coding on the joint sparse features, and transmits the compressed feature information among the agents over the communication link to reduce the communication overhead of collaborative perception; a unified space-time attention fusion module, which constructs a unified space-time feature queue from the received collaborating-agent features, the ego vehicle's historical features and the current perception features, and realizes the alignment and fusion of multi-agent, multi-time-frame features through a unified space-time attention mechanism; a confidence-weighted middle-late fusion detection module, which generates an intermediate detection result based on the fused features, screens, weights and suppresses the independent detection results from the collaborating agents according to their confidences, and completes middle-late fusion detection in combination with the ego detection results; and a result output module, which post-processes the fusion detection results and outputs the final collaborative perception detection result.
Description
Multi-agent cooperative sensing method and system based on space-time joint selection
Technical Field
The invention relates to the technical field of intelligent transportation and automatic driving, and in particular to a multi-agent cooperative sensing method and system based on space-time joint selection.
Background
With the continuous development of automatic driving technology and intelligent traffic systems, a vehicle's ability to perceive its surrounding environment has become a key factor in guaranteeing driving safety and traffic efficiency. The traditional single-vehicle sensing mode relies mainly on on-board sensors to acquire environmental information; limited by sensor detection range, viewing angle and occlusion factors in complex traffic scenes, it is difficult to achieve accurate and complete perception of the global environment, and perception performance degrades markedly in scenes with dense multi-vehicle interaction or severe occlusion. To overcome the limitations of single-vehicle sensing, multi-agent cooperative sensing techniques have been developed. Through vehicle-to-vehicle (V2V) or vehicle-to-infrastructure (V2I) communication, these techniques share perception information among multiple agents in real time, fusing environmental information from different viewing angles and positions; in theory this can significantly expand the perception range, eliminate occlusion blind spots, and improve the accuracy and robustness of target detection.
However, existing collaborative sensing schemes have the following defects. First, the perceptual supply-demand relations among multiple agents are not fully exploited, so that many regions which the ego vehicle can already perceive reliably, or which remain static over time, are transmitted repeatedly. Second, dynamic target regions and static background regions are not effectively distinguished, and a large amount of redundant background information still has to be transmitted across time, further increasing the communication burden. Third, the impact of factors such as communication delay on space-time consistency in real communication environments is not fully considered, and traditional multi-stage or multi-level space-time fusion mechanisms have high computational complexity. Fourth, in scenarios with extremely limited communication bandwidth, intermediate-fusion accuracy drops significantly and an effective post-compensation or error-correction mechanism is lacking, making the stability and reliability of the collaborative sensing result difficult to guarantee.
Disclosure of the Invention
In order to solve the problems that existing multi-agent cooperative sensing technology suffers from severe redundancy of perception information, fails to effectively distinguish dynamic and static targets, is strongly affected by delay in real communication environments, and loses significant intermediate-fusion accuracy under low-bandwidth conditions, the primary aim of the invention is to provide a multi-agent cooperative sensing method based on space-time joint selection which can significantly reduce the amount of transmitted data, maintain high perception performance even in scenes with extremely limited bandwidth, effectively alleviate the space-time misalignment caused by delay in real V2V/V2I communication, and improve detection accuracy by exploiting complementary multi-view information.
To achieve this aim, the invention adopts the following technical scheme: a multi-agent collaborative sensing method based on space-time joint selection, comprising the following sequential steps: (1) acquiring raw perception data from multiple agents; (2) performing feature coding on each agent's raw perception data to obtain multi-scale bird's-eye-view (BEV) features, and constructing a spatial sparse region selection mask based on supply-demand relations; (3) introducing temporal dynamic information on top of the spatial sparse region selection mask to further reduce cross-time redundancy and generate a joint sparse feature map; (4) applying sparse feature compression and quantization coding to the joint sparse feature map to obtain each agent's sparse features, which are transmitted among the agents through a communication module; (5) the ego agent performing unified multi-scale, multi-time-frame space-time attention fusion on the received dynamic sparse features from the other agents and its local historical static features to obtain a fusion feature map; (6) performing intermediate-level three-dimensional target detection based on the fusion feature map, and performing late fusion with the independent detection results of the collaborating agents to obtain the final cooperative sensing detection result. The final coo