CN-121999460-A - Bimodal information detection system and method based on automatic driving scene
Abstract
The invention belongs to the technical field of computer vision and automatic driving, and particularly discloses a bimodal information detection system and method based on an automatic driving scene. The system comprises an image acquisition module, a preprocessing module, a feature extraction module, a cross-scale fusion module, a detection head module and an output module which are connected in sequence. The image acquisition module synchronously acquires RGB and Depth images; the preprocessing module completes data standardization; the feature extraction module is a feature enhancement fusion module based on a YOLO11 backbone network, which realizes robust bimodal feature extraction through a parallel three-branch structure; the cross-scale fusion module is a PAFPN fusion structure with an embedded CSF module, which optimizes the cross-scale fusion effect through a dynamic weight allocation mechanism; the detection head module completes target classification and position regression; and the output module provides perception data for the automatic driving decision system. The bimodal collaboration mechanism constructed by the invention effectively solves the problems of limited single-modality detection performance, the imbalance between feature extraction efficiency and robustness, and information loss in cross-scale fusion.
Inventors
- LI BO
- WANG GUOHUI
- CHEN XI
Assignees
- 西安工业大学 (Xi'an Technological University)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-04-10
Claims (9)
- 1. A bimodal information detection system based on an automatic driving scene, comprising: an image acquisition module, used for synchronously acquiring RGB images and Depth images in the automatic driving scene; a preprocessing module, used for normalizing, resizing and registering the acquired RGB image and Depth image to obtain standardized bimodal data (an illustrative preprocessing sketch follows the claims); a feature extraction module, which is a feature enhancement fusion module based on a YOLO11 backbone network and is used for respectively performing feature extraction and enhancement fusion on the standardized bimodal data to obtain single-mode enhancement features; a cross-scale fusion module, which is a PAFPN-CSF fusion structure based on a path aggregation feature pyramid network with a cross-scale feature fusion module embedded at its cross-scale core fusion nodes, and is used for outputting a fusion feature map covering targets of different scales through cross-scale feature aggregation and enhancement of the input single-mode enhancement features; a detection head module, used for performing target classification and position regression based on the fusion feature map to obtain a target detection result; and an output module, used for outputting the target detection result and providing environment perception data support for automatic driving decisions.
- 2. The automatic driving scene-based bimodal information detection system according to claim 1, wherein the feature enhancement fusion module comprises a YOLO11 backbone network; a C3k2 module of the YOLO11 backbone network comprises a direct transfer branch and a deep extraction branch arranged in parallel, and the output end of the deep extraction branch is connected with a feature enhancement extraction module. The feature enhancement extraction module comprises a residual connection branch, a multi-scale feature extraction branch and a deformable convolution branch in parallel (a sketch of this three-branch structure follows the claims): the residual connection branch applies a standard convolution module to the initial features input into the feature enhancement extraction module to obtain residual features; the multi-scale feature extraction branch extracts target features of different scales from the input features using convolution kernels of different sizes, and fuses the extracted features to obtain multi-scale features; the deformable convolution branch extracts local detail features from the input features based on deformable convolution to obtain shape-adaptive features; and the residual features, the multi-scale features and the shape-adaptive features are fused to obtain the single-mode enhancement features.
- 3. The bimodal information detection system based on an automatic driving scene according to claim 2, wherein the feature extraction and enhancement fusion performed respectively on the standardized bimodal data to obtain the single-mode enhancement features comprises: the bimodal data input into the YOLO11 backbone network is split into two paths, which are respectively input into the direct transfer branch and the deep extraction branch arranged in parallel; the direct transfer branch preserves the original input features, and the deep extraction branch passes its output features into the feature enhancement extraction module; the feature enhancement extraction module respectively extracts residual features, multi-scale features and shape-adaptive features from the input features, fuses them, and outputs enhancement features; the features output by the direct transfer branch are fused with the enhancement features output by the feature enhancement extraction module to obtain the single-mode enhancement features; and this process is executed to complete feature extraction on the standardized bimodal data and obtain all the single-mode enhancement features.
- 4. The bimodal information detection system based on an automatic driving scene according to claim 3, wherein the PAFPN-CSF fusion structure further comprises a top-down feature transfer path and a bottom-up feature transfer path; the top-down path realizes semantic guidance of high-dimensional features over low-dimensional features, and the bottom-up path realizes detail supplementation of low-dimensional features to high-dimensional features; and the cross-scale feature fusion module is used for performing feature enhancement, noise filtering and adaptive fusion on the single-mode enhancement features entering the cross-scale core fusion nodes.
- 5. The bimodal information detection system based on an automatic driving scene according to claim 4, wherein the process of cross-scale aggregation and enhancement of the single-mode enhancement features by the cross-scale fusion module is as follows (a data-flow sketch follows the claims): the single-mode enhancement features are input into the top-down feature transfer path, and semantic guidance of high-dimensional features over low-dimensional features is realized through upsampling; at the cross-scale core fusion nodes, the cross-scale feature fusion module enhances and noise-filters the single-mode enhancement features, and dynamically allocates the fusion weights of the RGB modality and the Depth modality based on global feature statistics; the shallow features enhanced by the cross-scale feature fusion module are passed into the bottom-up path, and detail supplementation of low-dimensional features to high-dimensional features is realized through downsampling; and after enhancement by the top-down feature transfer path, the bottom-up feature transfer path and the cross-scale feature fusion module, a cross-scale bimodal fusion feature map covering the P3, P4 and P5 scales is output.
- 6. The bimodal information detection system based on an automatic driving scene according to claim 5, wherein the cross-scale feature fusion module is internally provided with a receptive-field coordinate attention mechanism and comprises a grouped convolution unit, a coordinate attention unit, a resolution alignment unit and a weighted fusion unit which are connected in sequence (a sketch of this unit chain follows the claims): the grouped convolution unit performs grouped convolution on the input bimodal features according to a preset number of channels and extracts spatial features within different receptive-field ranges; the coordinate attention unit generates spatial attention weights based on the coordinate information of the feature map, strengthening the features of the region where the target is located while suppressing background interference noise; the resolution alignment unit upsamples or downsamples features of different resolutions through a bilinear interpolation algorithm, so as to ensure that the resolutions of the modal features and the scale features entering the weighted fusion unit are completely consistent; and the weighted fusion unit is internally provided with a global average pooling module for extracting global statistics of the bimodal features, performs adaptive weighted summation on the aligned features based on these statistics, and outputs the enhanced bimodal fusion features.
- 7. A bimodal information detection method based on an automatic driving scene, comprising the following steps (an end-to-end sketch follows the claims): S1, synchronously acquiring an RGB image and a Depth image in an automatic driving scene; S2, normalizing, resizing and registering the acquired RGB image and Depth image to obtain standardized bimodal data; S3, respectively performing feature extraction and enhancement fusion on the standardized bimodal data output in step S2 to obtain single-mode enhancement features; S4, outputting a fusion feature map covering targets of different scales through cross-scale feature aggregation and enhancement of the input single-mode enhancement features; S5, performing target classification and position regression based on the fusion feature map to obtain a target detection result; and S6, outputting the target detection result and providing environment perception data support for automatic driving decisions.
- 8. A computing device, comprising: one or more processors; and storage means for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the method of claim 7.
- 9. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a program which, when executed by a processor, implements the method according to claim 7.
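The sketches below, referenced from the claims above, are illustrative only and are not the patented implementation. First, a minimal sketch of the preprocessing module of claim 1 (step S2): normalization, resizing and registration of a synchronized RGB/Depth pair. The function name, the 640x640 input resolution, and the use of a calibration homography for registration are assumptions not fixed by the patent.

```python
import cv2
import numpy as np

def preprocess_pair(rgb, depth, size=(640, 640), homography=None):
    """Return a standardized bimodal pair (fixed size, values in [0, 1])."""
    # Registration: warp the Depth image into the RGB camera frame using a
    # 3x3 homography from extrinsic calibration (assumed to be known).
    if homography is not None:
        depth = cv2.warpPerspective(depth, homography, (rgb.shape[1], rgb.shape[0]))
    # Size adjustment: both modalities are resized to the network input size.
    rgb = cv2.resize(rgb, size, interpolation=cv2.INTER_LINEAR)
    depth = cv2.resize(depth, size, interpolation=cv2.INTER_NEAREST)  # avoid blending depths
    # Normalization: scale each modality into [0, 1].
    rgb = rgb.astype(np.float32) / 255.0
    d = depth.astype(np.float32)
    depth = (d - d.min()) / (d.max() - d.min() + 1e-6)
    return rgb, depth
```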
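Next, a minimal PyTorch sketch of the parallel three-branch feature enhancement extraction module and the C3k2 split of claims 2 and 3. Channel counts, the 3x3/5x5 kernel sizes in the multi-scale branch, summation as the fusion operator, and the use of torchvision's DeformConv2d for the deformable convolution branch are all assumptions; the patent fixes the branch structure but not these details.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureEnhance(nn.Module):
    """Parallel three-branch feature enhancement extraction module (claim 2)."""
    def __init__(self, channels):
        super().__init__()
        # Residual connection branch: a standard convolution on the initial features.
        self.residual = nn.Conv2d(channels, channels, 3, padding=1)
        # Multi-scale branch: convolution kernels of different sizes, fused by summation.
        self.k3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.k5 = nn.Conv2d(channels, channels, 5, padding=2)
        # Deformable convolution branch: per-location offsets predicted from the input.
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)
        self.deform = DeformConv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        res = self.residual(x) + x               # residual features
        multi = self.k3(x) + self.k5(x)          # multi-scale features
        shaped = self.deform(x, self.offset(x))  # shape-adaptive features
        return res + multi + shaped              # fused single-mode enhancement features

class C3k2Enhanced(nn.Module):
    """C3k2-style split: direct transfer branch + deep extraction branch (claim 3)."""
    def __init__(self, channels):
        super().__init__()
        self.deep = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  FeatureEnhance(channels))

    def forward(self, x):
        return x + self.deep(x)  # fuse direct-transfer features with enhanced features
```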
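A data-flow sketch of the PAFPN-CSF structure of claims 4 and 5: the top-down path upsamples for semantic guidance, the bottom-up path downsamples for detail supplementation, and a CSF block (sketched next) sits at each core fusion node. It assumes all pyramid levels share one channel count; nearest-neighbor upsampling and max-pooling downsampling are placeholder choices.

```python
import torch.nn.functional as F

def pafpn_csf(p3, p4, p5, csf):
    """p3/p4/p5: single-mode enhancement features at strides 8/16/32 (same channels)."""
    # Top-down path: high-dimensional semantics guide lower levels via upsampling.
    t4 = csf(p4, F.interpolate(p5, scale_factor=2, mode="nearest"))
    t3 = csf(p3, F.interpolate(t4, scale_factor=2, mode="nearest"))
    # Bottom-up path: enhanced shallow features supplement detail via downsampling.
    b4 = csf(t4, F.max_pool2d(t3, 2))
    b5 = csf(p5, F.max_pool2d(b4, 2))
    # Cross-scale fusion feature maps covering the P3, P4 and P5 scales.
    return t3, b4, b5
```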
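A minimal sketch of the cross-scale feature fusion (CSF) module of claim 6: grouped convolution, coordinate attention, bilinear resolution alignment, and adaptive weighted fusion driven by global average pooling. The coordinate-attention form (direction-aware average pooling followed by a shared 1x1 convolution) and the two-way softmax weighting are assumed realizations; the patent names the units but not their equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSF(nn.Module):
    """Cross-scale feature fusion block: fuses two feature maps a and b."""
    def __init__(self, channels, groups=4):
        super().__init__()
        # Grouped convolution over a preset number of channel groups.
        self.grouped = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        # Shared 1x1 convolution producing the coordinate attention weights.
        self.ca = nn.Conv2d(channels, channels, 1)
        # Global statistics -> one fusion weight per input feature map.
        self.weigh = nn.Linear(2 * channels, 2)

    def coord_attn(self, x):
        # Direction-aware pooling keeps coordinate information along H and W.
        ah = x.mean(dim=3, keepdim=True)   # (N, C, H, 1)
        aw = x.mean(dim=2, keepdim=True)   # (N, C, 1, W)
        attn = torch.sigmoid(self.ca(ah)) * torch.sigmoid(self.ca(aw))
        return x * attn                    # strengthen target regions, damp background

    def forward(self, a, b):
        # Resolution alignment: bilinearly resample b onto a's grid.
        b = F.interpolate(b, size=a.shape[2:], mode="bilinear", align_corners=False)
        a = self.coord_attn(self.grouped(a))
        b = self.coord_attn(self.grouped(b))
        # Adaptive weighting from global average-pooled statistics.
        g = torch.cat([a.mean(dim=(2, 3)), b.mean(dim=(2, 3))], dim=1)  # (N, 2C)
        w = torch.softmax(self.weigh(g), dim=1)                         # (N, 2)
        wa = w[:, 0].view(-1, 1, 1, 1)
        wb = w[:, 1].view(-1, 1, 1, 1)
        return wa * a + wb * b
```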
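Finally, an end-to-end sketch tying steps S1-S6 of claim 7 together. Every callable here is a placeholder; only the ordering of the steps comes from the claim.

```python
def detect(frame_source, preprocess, backbone, neck, head, decision_sink):
    rgb, depth = frame_source.grab_pair()   # S1: synchronized RGB/Depth acquisition
    rgb, depth = preprocess(rgb, depth)     # S2: normalize, resize, register
    feats = backbone(rgb, depth)            # S3: per-modality enhanced features
    p3, p4, p5 = neck(*feats)               # S4: cross-scale aggregation (PAFPN-CSF)
    result = head(p3, p4, p5)               # S5: target classification + box regression
    decision_sink.publish(result)           # S6: feed the automatic driving decision system
    return result
```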
Description
Bimodal information detection system and method based on automatic driving scene

Technical Field

The invention belongs to the technical field of computer vision and automatic driving, and particularly relates to a bimodal information detection system and method based on an automatic driving scene.

Background

An automatic driving environment perception system is a core component for ensuring the safe running of vehicles; its core requirement is to accurately identify, in real time, the various traffic targets in dynamic and complex road traffic scenes. At present, mainstream target detection schemes for automatic driving environment perception mostly depend on single-modality RGB image data. Such schemes can capture the semantic texture information of a target, but are strongly influenced by environmental illumination conditions: in low-illumination scenes, image detail is severely lost and the target miss rate rises markedly; in severe weather such as rain, snow and haze, image contrast is reduced, the target is easily confused with the background, and detection precision drops sharply. To compensate for the defects of the single RGB modality, the Depth modality has gradually been introduced into the field of target detection. The Depth modality can effectively relieve the interference caused by illumination changes by acquiring the spatial geometric information of the target, but it has obvious shortcomings: semantic features are missing, so target categories are difficult to distinguish accurately; the geometric features of small targets are not salient and are prone to misjudgment; and the target contour cannot be fully recovered when the target is occluded. Therefore, RGB-D bimodal fusion has become a key technical direction for improving target detection performance in complex scenes, and feature-level fusion has become the mainstream fusion strategy because it can fully exploit the complementarity of the bimodal data. However, the existing RGB-D bimodal fusion target detection technology still has the following problems: firstly, the limited performance of single-modality detection has not been fundamentally resolved, since existing fusion schemes do not fully exploit the complementarity of RGB semantic texture information and Depth spatial geometric information, and high-precision detection in complex scenes is difficult to achieve by relying on either modality alone or on simple fusion; secondly, feature extraction efficiency and robustness are unbalanced, as existing feature extraction modules struggle to satisfy both the capture of multi-scale and irregularly shaped targets and the requirement of a lightweight network, suffering either from heavy computational redundancy or from insufficient feature expression capability; thirdly, information is lost during cross-scale fusion. Accordingly, there is a need to devise a bimodal information detection system and method based on an automatic driving scene that ameliorates the above problems.
Disclosure of Invention

In order to solve the problems in the prior art, the present invention provides a bimodal information detection system based on an automatic driving scene, comprising: an image acquisition module, used for synchronously acquiring RGB images and Depth images in the automatic driving scene; a preprocessing module, used for normalizing, resizing and registering the acquired RGB image and Depth image to obtain standardized bimodal data; a feature extraction module, which is a feature enhancement fusion module based on a YOLO11 backbone network and is used for respectively performing feature extraction and enhancement fusion on the standardized bimodal data to obtain single-mode enhancement features; a cross-scale fusion module, which is a PAFPN-CSF fusion structure based on a path aggregation feature pyramid network with a cross-scale feature fusion module embedded at its cross-scale core fusion nodes, and is used for outputting a fusion feature map covering targets of different scales through cross-scale feature aggregation and enhancement of the input single-mode enhancement features; a detection head module, used for performing target classification and position regression based on the fusion feature map to obtain a target detection result; and an output module, used for converting the target detection result into a standardized format and transmitting it to the automatic driving decision system, thereby providing environment perception data support for automatic driving decisions. Further, the feature enhancement fusion module comprises a YOLO11 backbone network; a C3k2 module of the YOLO11 backbone network comprises a direct transfer branch and a deep extraction branch arranged in parallel, and the output end of the deep extraction branch is connected with a feature enhancement extraction module.