CN-121999242-A - Traffic shielding visual large model light-weight method and target detection device

Abstract

The application discloses a method for lightweighting a large vision model for occluded traffic scenes, and a target detection device, and belongs to the technical field of intelligent transportation and computer vision. The method comprises: adaptively adjusting convolution sampling positions and weights through a dynamic deformable convolution module to enhance feature extraction for occluded targets; fusing low-level detail information through a P-SOCC multi-scale occluded small-target detection layer to improve the feature representation of small and occluded targets; and integrating these modules into a pre-trained large model and achieving lightweight deployment through a staged training strategy, thereby improving computational efficiency while maintaining detection accuracy.
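
As an illustration of what "adjusting convolution sampling positions and weights" can mean, the following is a minimal pure-Python sketch of a modulated deformable convolution evaluated at a single output position. It assumes the commonly used form in which each regular-grid sampling point is shifted by a fractional offset, read by bilinear interpolation, and scaled by a modulation factor. The function names, the 3x3 grid, and the hand-supplied `offsets` and `modulation` values are hypothetical; in the described method these values would be predicted from the input feature map by a lightweight sub-network, which is not modelled here.

```python
import math

def bilinear(fmap, y, x):
    """Bilinearly sample a 2-D list at fractional (y, x); outside the map reads as 0."""
    h, w = len(fmap), len(fmap[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1 - abs(y - yi)) * (1 - abs(x - xi))
                value += weight * fmap[yi][xi]
    return value

# Regular 3x3 sampling grid (row, col) relative to the output position (assumed size).
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def modulated_deform_conv_at(fmap, kernel, offsets, modulation, cy, cx):
    """Output at (cy, cx): sum over k of kernel[k] * modulation[k] * fmap(p0 + p_k + offset_k)."""
    total = 0.0
    for k, (dy, dx) in enumerate(GRID):
        oy, ox = offsets[k]  # fractional shift of sampling point k (hypothetical input)
        total += kernel[k] * modulation[k] * bilinear(fmap, cy + dy + oy, cx + dx + ox)
    return total

# Toy 4x4 feature map whose value grows linearly with row and column.
fmap = [[float(4 * r + c) for c in range(4)] for r in range(4)]
kernel = [1.0 / 9] * 9  # averaging kernel, chosen only for readability

# Zero offsets and unit modulation reduce to an ordinary 3x3 convolution.
plain = modulated_deform_conv_at(fmap, kernel, [(0.0, 0.0)] * 9, [1.0] * 9, 1, 1)
# Shifting every sampling point by half a pixel moves the sampled region.
shifted = modulated_deform_conv_at(fmap, kernel, [(0.5, 0.5)] * 9, [1.0] * 9, 1, 1)
print(plain, shifted)
```

With zero offsets the result is approximately 5.0, the plain 3x3 average around position (1, 1). With every sampling point shifted by half a pixel it is approximately 7.5. Setting a modulation factor to zero removes that sampling point's contribution entirely, which is one way such a module could suppress samples that fall on an occluding object.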

Inventors

LAN JINHUI
Sun Yongao

Assignees

University of Science and Technology Beijing (北京科技大学)

Dates

Publication Date: 2026-05-08
Application Date: 2026-01-04

Claims (10)

1. A method for lightweighting a large vision model for occluded traffic scenes, characterized by comprising the following steps: deploying a dynamic deformable convolution module at the middle and low layers of the feature extraction backbone network of a target detection model to replace standard convolution layers, the dynamic deformable convolution module being configured to generate sampling-point offsets and modulation factors from an input feature map so as to dynamically adjust the sampling positions and sampling weights of the convolution operation; introducing a multi-scale occluded small-target detection layer, P-SOCC, into the feature fusion network of the target detection model, wherein the P-SOCC module is configured to extract features from a low-level, high-resolution feature map in the feature fusion network, process them through spatial pyramid dilated convolution to extract multi-scale fine-grained structural information, perform multi-branch fusion on the processed features through a CSP feature integration module, and inject the fused features into a high-level feature layer of the feature fusion network; and integrating the dynamic deformable convolution module and the P-SOCC module into a pre-trained large-scale target detection model to form a lightweight target detection model, and training the lightweight target detection model with a staged training strategy.
2. The method of claim 1, wherein, in the convolution operation performed by the dynamic deformable convolution module, the value at each position of the output feature map is obtained by sampling the input feature map and computing a weighted sum of the samples; the sampling positions are adjusted, starting from a predefined regular grid, by offsets adaptively predicted from the input features; the contribution of each sampling point is weighted by a modulation factor adaptively predicted from the input features; and the offsets and modulation factors are dynamically generated by a lightweight sub-network.
3. The method of claim 1 or 2, wherein the offset and the modulation factor in the dynamic deformable convolution module are dynamically generated by a lightweight sub-network according to the input feature map.
4. The method according to claim 1, wherein the CSP feature integration module comprises at least three parallel branches that transform input features using large-receptive-field convolution kernels, wide-kernel convolution, and small-kernel convolution, respectively, to extract global, wide-scale, and local features simultaneously.
5. The method according to claim 4, wherein the output of the CSP feature integration module is obtained by concatenating the features transformed by the parallel branches and applying a linear fusion convolution.
6. The method of claim 1, wherein the staged training strategy specifically comprises: a first stage of freezing the network parameters of the pre-trained large-scale target detection model other than those of the dynamic deformable convolution module and the P-SOCC module, and training the dynamic deformable convolution module and the P-SOCC module; and a second stage of unfreezing some or all of the network parameters of the pre-trained large-scale target detection model and training them jointly with the dynamic deformable convolution module and the P-SOCC module.
7. The method of claim 1, wherein the lightweight target detection model is optimized using a combined loss function that fuses a Focal Loss component and a Distance-IoU component.
8. A visual target detection device for occluded traffic scenes, for implementing the method of any one of claims 1 to 7, comprising: a feature extraction backbone network, in which a dynamic deformable convolution module is arranged at a lower layer, the dynamic deformable convolution module being configured to adaptively generate sampling-point offsets and modulation factors from an input feature map so as to dynamically adjust the sampling positions and sampling weights of the convolution operation; a feature fusion network, into which a multi-scale occluded small-target detection layer, P-SOCC, is integrated, the P-SOCC module being configured to extract features from a low-level, high-resolution feature map in the feature fusion network, process them through spatial pyramid dilated convolution to extract multi-scale fine-grained structural information, perform multi-branch fusion on the processed features through a CSP feature integration module, and inject the fused features into a high-level feature layer of the feature fusion network; and a training and integration unit, configured to integrate the feature extraction backbone network and the feature fusion network into a pre-trained large-scale target detection model to form a lightweight target detection model, and to train the lightweight target detection model with a staged training strategy.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1-7 when executing the computer program.
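
The combined loss of claim 7 can be sketched as follows. This is a minimal pure-Python illustration, assuming the standard binary Focal Loss and the standard Distance-IoU loss (one minus IoU plus the squared centre distance divided by the squared diagonal of the smallest enclosing box). The claim does not state how the two components are weighted, so the weights `w_cls` and `w_box`, the defaults `alpha=0.25` and `gamma=2.0`, and all function names are assumptions made for illustration only.

```python
import math

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p and a 0/1 target."""
    p = min(max(p, 1e-7), 1 - 1e-7)  # clamp so that log() stays finite
    p_t = p if target == 1 else 1 - p
    alpha_t = alpha if target == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

def diou_loss(box, gt):
    """Distance-IoU loss for two boxes given as (x1, y1, x2, y2)."""
    # Intersection over union.
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_box + area_gt - inter
    iou = inter / union if union > 0 else 0.0
    # Squared distance between the two box centres.
    cdx = (box[0] + box[2]) / 2 - (gt[0] + gt[2]) / 2
    cdy = (box[1] + box[3]) / 2 - (gt[1] + gt[3]) / 2
    center_dist2 = cdx ** 2 + cdy ** 2
    # Squared diagonal of the smallest box enclosing both.
    ew = max(box[2], gt[2]) - min(box[0], gt[0])
    eh = max(box[3], gt[3]) - min(box[1], gt[1])
    diag2 = ew ** 2 + eh ** 2
    penalty = center_dist2 / diag2 if diag2 > 0 else 0.0
    return 1.0 - iou + penalty

def combined_loss(p, target, box, gt, w_cls=1.0, w_box=1.0):
    """Weighted sum of the classification and box terms (weights are assumed)."""
    return w_cls * focal_loss(p, target) + w_box * diou_loss(box, gt)

# A confident, well-localized prediction yields a small loss.
print(combined_loss(0.9, 1, (0, 0, 2, 2), (0, 0, 2, 2)))
```

The focal term down-weights easy examples, so training concentrates on hard ones such as heavily occluded targets. The distance penalty in the box term still provides a non-zero signal when the predicted and ground-truth boxes do not overlap.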

Description

Traffic shielding visual large model light-weight method and target detection device

Technical Field

The application relates to the technical field of intelligent transportation and computer vision, and in particular to a method for lightweighting a large vision model for occluded traffic scenes based on dynamic deformable convolution and multi-scale feature extraction, a visual target detection device for occluded traffic scenes, and an electronic device.

Background

In intelligent-transportation visual perception tasks, target detection under severe occlusion and in complex scenes has always been an important research direction. Mutual occlusion between vehicles, partial occlusion of pedestrians, and similar phenomena common in traffic scenes place higher demands on the feature extraction network. However, the technical path to solving this problem faces a core trade-off between model size and computational efficiency. A large model, with a parameter count on the order of hundreds of millions and a deep network structure, has strong feature representation capability and can capture fine semantic associations and contextual dependencies between an occluding object and the occluded target. By learning rich prior knowledge of occlusion patterns from massive data, a large model can effectively improve detection robustness in complex scenes. However, its high computational complexity and memory footprint make real-time operation difficult when it is deployed on edge computing devices. A small model, through a simplified network structure and parameter quantization, reduces power consumption while maintaining a response time of tens of milliseconds, but its feature expression capability is limited, and it struggles with the feature fragmentation and semantic incompleteness caused by occlusion.
For specific high-speed traffic scenarios, model selection must weigh both occlusion complexity and real-time constraints. On the one hand, the feature fragmentation and semantic incompleteness of occluded targets require the model to have deep contextual understanding and cross-target relationship modeling capability, which naturally favors a large model architecture. Particularly in high-speed, highly dynamic, and high-risk scenes such as urban intersections, dense traffic flows, and accident-prone road sections, the deep semantic reasoning capability of a large model is critical for distinguishing occlusion boundaries, completing target contours, and identifying abnormal behavior. On the other hand, the strict power consumption and latency limits of vehicle-mounted edge computing platforms and roadside units impose a rigid constraint on model complexity. In contrast, in applications with relatively simple scene structures and strict real-time requirements, such as free-flowing expressway traffic, parking lot monitoring, and traffic signal control, a small model can better meet the requirements of low-latency, low-cost deployment. Therefore, neither a large model alone nor a small model alone is the optimal solution, and the core contradiction is how to deploy a large model in lightweight form while preserving detection accuracy, or how to strengthen a small model's handling of occlusion while retaining its efficiency. Current mainstream solutions attempt to alleviate this contradiction along two dimensions, feature extraction and lightweighting, but obvious limitations remain. In terms of feature extraction, ordinary convolution, the basic approach, uses a fixed-size convolution kernel and a regular sampling grid, and extracts features through a sliding weighted-sum operation.
This approach is computationally simple and lends itself to parallel hardware optimization, but its fixed receptive field is difficult to adjust adaptively when facing occluded targets. To improve adaptability to occluded targets, researchers have proposed improvements such as deformable convolution, which dynamically adjusts the spatial positions of sampling points through a parallel offset prediction network: a conventional convolution layer predicts an offset field, bilinear interpolation sampling is performed, and the sampled features are then processed. Although this design enhances adaptability to spatial transformations to some extent, under severe occlusion the flexibility of local sampling adjustment is still insufficient to establish global contextual associations across the occlusion, and it cannot fundamentally resolve the conflict between the computational overhead of large models and the insufficient representation capability of small models. In terms of feature pyramid design, path aggregation networks and their variants combine lateral connections to fuse deep semantic information and