CN-121686401-B - Vehicle-mounted non-blind area multi-target intrusion sensing system based on vision
Abstract
The invention relates to a vision-based vehicle-mounted non-blind-area multi-target intrusion sensing system, belongs to the technical field of target recognition, and addresses the poor night-time target recognition capability of the prior art. The system comprises an image acquisition module and a far-field environment sensing module. The image acquisition module comprises a far-field image collector for acquiring RGB images, infrared images and depth images of objects in the far field. The far-field environment sensing module preprocesses the RGB, infrared and depth images acquired by the far-field image collector, inputs them into a trained multi-modal feature extraction and fusion network for multi-modal feature extraction and fusion, and the network outputs the spatial positioning parameters of each object, from which the size and position of each object are obtained. Accurate multi-target intrusion sensing is thereby achieved.
Inventors
- DONG YANPENG
- WANG JIAYU
- LIU LEI
- WANG XIAOJUN
Assignees
- Beijing Institute of Mechanical Equipment (北京机械设备研究所)
Dates
- Publication Date
- 20260508
- Application Date
- 20250313
Claims (4)
- 1. A vision-based vehicle-mounted non-blind-area multi-target intrusion sensing system, characterized by comprising an image acquisition module and a far-field environment sensing module, wherein: the image acquisition module comprises a far-field image collector and is used for acquiring RGB images, infrared images and depth images containing targets in the far field; the far-field environment sensing module is used for respectively preprocessing the RGB image, the infrared image and the depth image acquired by the far-field image collector and inputting them into a trained multi-modal 3D target detection network for multi-modal feature extraction and fusion, the multi-modal feature extraction and fusion network outputting the spatial positioning parameters of each target object; the multi-modal feature extraction and fusion network comprises a first backbone network module, a second backbone network module, a third backbone network module and a feature fusion module, wherein the first backbone network module performs multi-scale feature extraction on the preprocessed RGB image based on the interaction information between the RGB image and the infrared image, the second backbone network module performs multi-scale feature extraction on the preprocessed infrared image based on the interaction information between the RGB image and the infrared image, and the third backbone network module performs multi-scale feature extraction on the preprocessed depth image; the first, second and third backbone networks each comprise five convolution fusion modules, which respectively produce the first to fifth RGB feature maps, infrared feature maps and depth feature maps, wherein the first convolution fusion module comprises a ConvBNSiLU layer, the second to fourth convolution fusion modules each comprise a ConvBNSiLU layer and a BottleneckCSP layer connected in sequence, the fifth convolution fusion modules of the first and second backbone network modules comprise a ConvBNSiLU layer and a BottleneckCSP layer, the fifth convolution fusion module of the third backbone network module comprises a ConvBNSiLU layer, and the outputs of the fifth convolution fusion modules of the first and second backbone network modules are added and then input into an SPPF layer; the ConvBNSiLU layers are used for extracting features from their inputs, the BottleneckCSP layers are used for preventing gradient vanishing during back-propagation, and the SPPF layer is used for extracting features from the added RGB and infrared feature maps; the feature fusion module comprises first to fourth feature fusion modules, wherein the first feature fusion module comprises a first Concat layer, which splices the added third RGB and infrared feature maps with the depth feature map of the corresponding scale to obtain a first spliced feature map, the second feature fusion module comprises a second Concat layer, which splices the added fourth RGB and infrared feature maps with the depth feature map of the corresponding scale to obtain a second spliced feature map, and the third feature fusion module comprises a third Concat layer, which splices the feature map output by the SPPF layer with the depth feature map of the corresponding scale to
obtain a third spliced feature map, and the fourth feature fusion module is used for further extracting fusion features from the first to third spliced feature maps to obtain the spatial positioning parameters of each object; performing multi-scale feature extraction on the preprocessed RGB image and the preprocessed infrared image based on the interaction information comprises: providing a Transformer module at the output of each of the second, third and fourth convolution fusion modules in the first backbone network and the second backbone network; using the Transformer module to perform information interaction between the RGB feature map and the infrared feature map of the same scale, so as to obtain a fused RGB feature map and a fused infrared feature map carrying the interaction information; adding the fused RGB feature map to the RGB feature map input to the Transformer module to serve as the input of the next-stage convolution fusion module in the first backbone network, and adding the fused infrared feature map to the infrared feature map input to the Transformer module to serve as the input of the next-stage convolution fusion module in the second backbone network (an illustrative sketch of this cross-modal interaction is given after the claims); the fourth feature fusion module comprises an FPN module, a PAN module and a detection module; the FPN module comprises a first ConvBNSiLU layer, a first upsampling layer, a fourth Concat layer, a first BottleneckCSP layer, a second ConvBNSiLU layer, a second upsampling layer, a fifth Concat layer and a second BottleneckCSP layer connected in sequence, wherein the first ConvBNSiLU layer performs feature extraction on the input third spliced feature map, the other input of the fourth Concat layer is the second spliced feature map, and the other input of the fifth Concat layer is the first spliced feature map; the PAN module comprises a third ConvBNSiLU layer, a sixth Concat layer, a third BottleneckCSP layer, a fourth ConvBNSiLU layer, a seventh Concat layer and a fourth BottleneckCSP layer connected in sequence, wherein the other input of the sixth Concat layer is the output of the second ConvBNSiLU layer in the FPN module, and the other input of the seventh Concat layer is the output of the first ConvBNSiLU layer in the FPN module; the detection module comprises a first detection module, a second detection module and a third detection module, wherein the input of the first detection module is the output of the second BottleneckCSP layer in the FPN module, the input of the second detection module is the output of the third BottleneckCSP layer in the PAN module, the input of the third detection module is the output of the fourth BottleneckCSP layer in the PAN module, and the outputs of the first to third detection modules are the spatial positioning parameters used for regressing each object; the spatial positioning parameters comprise the target category, the center point coordinates, the center point depth, the center point offset, the bounding box and the yaw angle of each target object.
- 2. The vision-based vehicle-mounted non-blind-area multi-target intrusion sensing system according to claim 1, wherein the image acquisition module further comprises a near-field image collector for obtaining fisheye images containing target objects in the near field; the near-field image collector consists of fisheye cameras vertically arranged on the side faces of the vehicle body, each used to obtain a fisheye image containing target objects in the near field, wherein one fisheye camera is arranged at the center of each of the front and rear of the vehicle body, two fisheye cameras are evenly arranged on each of the left and right sides of the vehicle, and all fisheye cameras are mounted at the same height.
- 3. The vision-based vehicle-mounted non-blind-area multi-target intrusion sensing system according to claim 2, further comprising a near-field environment sensing module for obtaining 3D information of the target objects from the fisheye images based on the BEVFormer algorithm.
- 4. The vision-based vehicle-mounted non-blind-area multi-target intrusion sensing system according to claim 1, wherein the far-field image collector is a centralized panoramic camera; the base of the centralized panoramic camera is a regular hexagonal prism vertically arranged at the center of the top of the vehicle, the distance from the center of the base to the vehicle is fixed, and one side face of the base faces the front of the vehicle; each side face of the base carries a low-light RGB camera, an infrared camera and a TOF camera, which respectively collect RGB images, infrared images and depth images of objects in the far field, and each camera is perpendicular to the side face of the base on which it is mounted (the far-field and near-field sensor layout is restated as a configuration sketch after the claims).
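The cross-modal interaction step in claim 1 (a Transformer module after the second to fourth convolution fusion modules that lets same-scale RGB and infrared feature maps exchange information, with each fused map added back to the map that entered the module) can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration of the idea, not the patented implementation: the class name `CrossModalTransformer`, the single encoder layer, the head count and the token construction by flattening the two feature maps are choices made here for brevity.

```python
import torch
import torch.nn as nn

class CrossModalTransformer(nn.Module):
    """Illustrative RGB/infrared interaction block (not the patented implementation).

    Flattens same-scale RGB and infrared feature maps into one token sequence,
    runs a small Transformer encoder so the two modalities exchange information,
    and returns each fused map added back to the map that entered the block,
    ready for the next-stage convolution fusion module of its backbone.
    """

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor):
        b, c, h, w = rgb.shape
        # (B, C, H, W) -> (B, H*W, C) token sequences for each modality.
        rgb_tok = rgb.flatten(2).transpose(1, 2)
        ir_tok = ir.flatten(2).transpose(1, 2)
        # A joint sequence lets RGB tokens attend to infrared tokens and vice versa.
        fused = self.encoder(torch.cat([rgb_tok, ir_tok], dim=1))
        rgb_fused, ir_fused = fused.split(h * w, dim=1)
        rgb_fused = rgb_fused.transpose(1, 2).reshape(b, c, h, w)
        ir_fused = ir_fused.transpose(1, 2).reshape(b, c, h, w)
        # Residual addition described in the claim: fused map + input map.
        return rgb + rgb_fused, ir + ir_fused

if __name__ == "__main__":
    block = CrossModalTransformer(channels=128, num_heads=4)
    rgb = torch.rand(1, 128, 40, 40)
    ir = torch.rand(1, 128, 40, 40)
    rgb_next, ir_next = block(rgb, ir)  # inputs to the next convolution fusion modules
```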
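Claims 2 and 4 specify a physical sensor layout rather than an algorithm; the small Python configuration below simply restates that layout as data for orientation. The field names, the zero-based face numbering and the yaw convention are assumptions of this sketch, not values taken from the patent.

```python
# Illustrative sensor layout from claims 2 and 4; names and angles are assumed here.

# Far field (claim 4): a centralized panoramic camera whose base is a regular
# hexagonal prism mounted vertically at the center of the vehicle roof, one face
# toward the front; every face carries a low-light RGB camera, an infrared camera
# and a TOF (depth) camera, each perpendicular to its face.
FAR_FIELD_RIG = [
    {"face": i, "yaw_deg": 60 * i, "sensors": ("rgb_low_light", "infrared", "tof_depth")}
    for i in range(6)
]

# Near field (claim 2): fisheye cameras mounted vertically on the vehicle body at a
# common height -- one at the center of the front and rear, two evenly spaced on
# each of the left and right sides.
NEAR_FIELD_FISHEYES = [
    {"id": "front_center", "side": "front"},
    {"id": "rear_center",  "side": "rear"},
    {"id": "left_1",  "side": "left"},
    {"id": "left_2",  "side": "left"},
    {"id": "right_1", "side": "right"},
    {"id": "right_2", "side": "right"},
]
```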
Description
Vehicle-mounted non-blind area multi-target intrusion sensing system based on vision
Technical Field
The invention relates to the technical field of target recognition, in particular to a vision-based vehicle-mounted non-blind-area multi-target intrusion sensing system.
Background
Sites that require long-term supervision, such as mining areas, protected areas, unmanned substations and military grounds, must record and report any personnel intrusion and notify on-duty personnel to patrol. Traditional on-duty schemes mostly rely on electronic fences and pan-tilt camera platforms. An electronic fence can detect that an intrusion has occurred, but cannot identify the intruder's position or direction of movement, and cannot determine the exact number of intruders when there are many. A camera can identify an intruder's position and direction of movement and roughly estimate the number of intruders, but its limited field of view cannot cover all areas, leaving blind zones in the footage, and the intruder's geometric dimensions are difficult to judge. In addition, cameras are strongly affected by illumination; in poorly lit environments it is difficult to effectively detect obstacles around the vehicle body.
Disclosure of Invention
In view of the above analysis, embodiments of the invention aim to provide a vision-based vehicle-mounted non-blind-area multi-target intrusion sensing system that solves the poor night-time target recognition capability of the prior art. An embodiment of the invention provides a vision-based vehicle-mounted non-blind-area multi-target intrusion sensing system comprising an image acquisition module and a far-field environment sensing module, wherein the image acquisition module comprises a far-field image collector and is used for acquiring RGB images, infrared images and depth images containing targets in the far field; the far-field environment sensing module preprocesses the RGB, infrared and depth images acquired by the far-field image collector, inputs them into a trained multi-modal 3D target detection network for multi-modal feature extraction and fusion, obtains the spatial positioning parameters of each target object output by the multi-modal feature extraction and fusion network, and uses those parameters to obtain the size and position of each target object.
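A rough sketch of the far-field sensing flow just described (preprocess the three modalities, run the trained multi-modal 3D detection network, then recover each object's size and position from its spatial positioning parameters) is given below. Every function and parameter name is a placeholder, and the decoding step shows only one plausible way to turn a center point, center depth, bounding box and yaw angle into a 3D position and size under a pinhole assumption; it is not the patent's exact formulation.

```python
import torch

def preprocess(img: torch.Tensor) -> torch.Tensor:
    # Placeholder normalization; the patent applies modality-specific preprocessing.
    return img.float() / 255.0

def far_field_perception(rgb, ir, depth, network, intrinsics):
    """Hypothetical far-field flow: three modalities in, per-object 3D info out."""
    inputs = [preprocess(x) for x in (rgb, ir, depth)]
    # The trained multi-modal network returns one record of spatial positioning
    # parameters per object: category, center point, center-point depth,
    # center-point offset, bounding box and yaw angle (claim 1).
    objects = []
    for det in network(*inputs):
        cu, cv = det["center"]
        du, dv = det["center_offset"]
        u, v = cu + du, cv + dv           # refined image-plane center
        z = det["center_depth"]
        # Back-project the center into camera coordinates (pinhole assumption).
        x = (u - intrinsics["cx"]) * z / intrinsics["fx"]
        y = (v - intrinsics["cy"]) * z / intrinsics["fy"]
        objects.append({
            "category": det["category"],
            "position": (x, y, z),        # where the object is
            "size": det["box_size"],      # how large the object is
            "yaw": det["yaw"],
        })
    return objects
```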
In a further improvement, the multi-modal feature extraction and fusion network comprises a first backbone network module, a second backbone network module, a third backbone network module and a feature fusion module. The first backbone network module extracts multi-scale features from the preprocessed RGB image based on the interaction information between the RGB image and the infrared image; the second backbone network module extracts multi-scale features from the preprocessed infrared image based on the same interaction information; the third backbone network module extracts multi-scale features from the preprocessed depth image. The feature fusion module adds the RGB and infrared feature maps of the same scale, splices the result with the depth feature map of the corresponding scale to obtain spliced feature maps, and further extracts fusion features from the spliced feature maps to obtain the spatial positioning parameters of each object. In a further improvement, the first, second and third backbone networks each comprise five convolution fusion modules, which respectively produce the first to fifth RGB feature maps, infrared feature maps and depth feature maps. The first convolution fusion module comprises a ConvBNSiLU layer; the second to fourth convolution fusion modules each comprise a ConvBNSiLU layer and a BottleneckCSP layer connected in sequence; the fifth convolution fusion modules of the first and second backbone network modules comprise a ConvBNSiLU layer and a BottleneckCSP layer; the fifth convolution fusion module of the third backbone network module comprises a ConvBNSiLU layer; and the outputs of the fifth convolution fusion modules of the first and second backbone network modules are added and then input into the SPPF layer. The ConvBNSiLU layer is used for carrying out feature extraction on the input image, the BottleneckCSP layer is used for preventing gradient vanishing during back-propagation, and the SPPF layer is used for extracting features from the added RGB and infrared feature maps.
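ConvBNSiLU, BottleneckCSP and SPPF are standard YOLOv5-style building blocks named in the description; a compact PyTorch rendition is sketched below for orientation. Kernel sizes, channel widths and the bottleneck depth are assumptions made here, not values specified in the patent.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Convolution + batch norm + SiLU: the basic feature-extraction layer."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class BottleneckCSP(nn.Module):
    """Cross-stage-partial bottleneck; the shortcut path helps keep gradients
    flowing during back-propagation, which is the role the description assigns it."""
    def __init__(self, c):
        super().__init__()
        self.part1 = ConvBNSiLU(c, c // 2, k=1)
        self.part2 = ConvBNSiLU(c, c // 2, k=1)
        self.bottleneck = nn.Sequential(ConvBNSiLU(c // 2, c // 2, k=1),
                                        ConvBNSiLU(c // 2, c // 2, k=3))
        self.merge = ConvBNSiLU(c, c, k=1)

    def forward(self, x):
        y1 = self.part1(x)
        y1 = y1 + self.bottleneck(y1)      # residual (shortcut) branch
        y2 = self.part2(x)                 # cross-stage branch
        return self.merge(torch.cat([y1, y2], dim=1))

class SPPF(nn.Module):
    """Spatial pyramid pooling (fast): repeated max-pooling enlarges the receptive
    field over the summed RGB + infrared feature map before the fusion neck."""
    def __init__(self, c, k=5):
        super().__init__()
        self.reduce = ConvBNSiLU(c, c // 2, k=1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.expand = ConvBNSiLU(c * 2, c, k=1)

    def forward(self, x):
        x = self.reduce(x)
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.expand(torch.cat([x, p1, p2, p3], dim=1))
```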