CN-122024139-A - Monitoring method, monitoring system, training method, electronic device and computer storage medium
Abstract
The application provides a monitoring method, a monitoring system, a training method, an electronic device and a computer storage medium, and relates to the technical field of computer vision and image processing. The method comprises: collecting image frames and an event stream of a target scene; preprocessing the image frames and the event stream to obtain first image features and first event features; aggregating the first event features based on a plurality of preset time windows to generate fusion event features; performing cross-modal attention calibration and spatial alignment on the first image features and the fusion event features to generate fusion features; and reconstructing and outputting an extreme visual image based on the fusion features and an ambient light level estimated from the image frames and the event stream. An enhanced image with clear details and a wider dynamic range is thus reconstructed under extreme conditions such as extreme darkness and overexposure, improving the perception robustness and target-recognition accuracy of the monitoring system under complex illumination while maintaining processing efficiency, enabling all-weather monitoring.
Inventors
- ZHANG ENZE
- HU ZHIFA
- TAN ZHIYUAN
Assignees
- 成都易瞳科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-28
Claims (12)
- 1. A monitoring method, the method comprising: collecting image frames and an event stream of a target scene; preprocessing the image frames and the event stream to obtain first image features and first event features; aggregating the first event features based on a plurality of preset time windows to generate fusion event features; performing cross-modal attention calibration and spatial alignment on the first image features and the fusion event features to generate fusion features; and reconstructing and outputting an extreme visual image based on the fusion features and an ambient light level estimated from the image frames and the event stream.
- 2. The method of claim 1, wherein aggregating the first event features based on a plurality of preset time windows to generate the fusion event features comprises: converting the first event features into event frames of corresponding scales according to at least two different time windows; and obtaining, through a gating network, fusion weights for the spatial features of each scale, and performing weighted fusion according to the fusion weights to generate the fusion event features (see the aggregation sketch following the claims).
- 3. The method of claim 2, wherein the at least two different time windows comprise a first time window for capturing high-speed motion features and a second time window for capturing low-speed motion features, the second time window being longer than the first time window.
- 4. The method of claim 1, wherein performing cross-modal attention calibration and spatial alignment on the first image features and the fusion event features to generate the fusion features comprises: calculating a first channel attention weight and a first spatial attention weight for the first image features, and calibrating the fusion event features based on the first channel attention weight and the first spatial attention weight to obtain calibrated event features; calculating a second channel attention weight and a second spatial attention weight for the fusion event features, and calibrating the first image features based on the second channel attention weight and the second spatial attention weight to obtain calibrated image features; and concatenating and fusing the calibrated image features and the calibrated event features, aligned by a spatial transformation network, to generate the fusion features (see the calibration sketch following the claims).
- 5. The method of claim 1, wherein reconstructing and outputting the extreme visual image based on the fusion features and the ambient light level estimated from the image frames and the event stream comprises: estimating the ambient light level based on the average brightness of the image frames and the event rate of the event stream; selecting corresponding normalization parameters from a plurality of preset normalization parameter sets according to the ambient light level, and performing conditional normalization on the fusion features; and up-sampling and activating the conditionally normalized features to reconstruct the extreme visual image (see the reconstruction sketch following the claims).
- 6. A monitoring system, characterized by comprising an acquisition module, a preprocessing module, an aggregation module, a calibration module and a reconstruction module; the acquisition module is used for acquiring image frames and an event stream of a target scene; the preprocessing module is used for preprocessing the image frames and the event stream to obtain first image features and first event features; the aggregation module is used for aggregating the first event features based on a plurality of preset time windows to generate fusion event features; the calibration module is used for performing cross-modal attention calibration and spatial alignment on the first image features and the fusion event features to generate fusion features; and the reconstruction module is used for reconstructing and outputting an extreme visual image based on the fusion features and an ambient light level estimated from the image frames and the event stream.
- 7. A training method for a monitoring system, the training method comprising: acquiring a training data set, wherein the training data set comprises a plurality of groups of spatio-temporally aligned image-frame/event-stream sample pairs; and constructing and initializing a neural network model to be trained based on the processing flow of the monitoring method of any one of claims 1 to 5, and training the neural network model in stages, wherein the staged training comprises at least two stages (a training skeleton sketch follows the claims).
- 8. The training method of claim 7, wherein the staged training comprises: training the neural network model using a first loss function in a first training stage; and, in a second training stage, training the neural network model obtained from the first training stage using a second loss function.
- 9. The training method of claim 8, wherein, in a third training stage, the neural network model trained in the second training stage is fine-tuned using a specific subset of scenes in the training data set and a third loss function.
- 10. The training method of claim 9, wherein the specific scene subset comprises samples with illuminance below a first threshold and/or samples with a dynamic range above a second threshold.
- 11. An electronic device comprising a memory and a processor, the memory having stored therein program instructions which, when executed by the processor, perform the steps of the method of any of claims 1 to 5 or 7.
- 12. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein computer program instructions which, when executed by a processor, perform the steps of the method of any of claims 1 to 5 or 7.
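The multi-scale event aggregation of claims 2 and 3 converts the event features into event frames for at least two time windows (a short window for high-speed motion, a longer one for low-speed motion) and fuses them with weights produced by a gating network. Below is a minimal PyTorch sketch of that idea, written under assumptions not stated in the claims: the gate is a global-average-pooling plus 1x1 convolution with a softmax, the per-window features share a channel count, and the module name GatedEventAggregation is hypothetical.

```python
import torch
import torch.nn as nn

class GatedEventAggregation(nn.Module):
    """Aggregate per-window event features with a learned gating network.

    Assumes the raw event stream has already been converted into one feature
    map per time window (e.g. a short window for fast motion and a longer
    window for slow motion), all with the same channel count.
    """

    def __init__(self, channels: int, num_windows: int = 2):
        super().__init__()
        # Gating network: global context -> one fusion weight per window.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels * num_windows, num_windows, kernel_size=1),
            nn.Softmax(dim=1),
        )
        self.num_windows = num_windows

    def forward(self, per_window_feats: list[torch.Tensor]) -> torch.Tensor:
        # per_window_feats: list of (B, C, H, W) tensors, one per time window.
        stacked = torch.stack(per_window_feats, dim=1)            # (B, K, C, H, W)
        weights = self.gate(torch.cat(per_window_feats, dim=1))   # (B, K, 1, 1)
        weights = weights.view(weights.size(0), self.num_windows, 1, 1, 1)
        # Weighted sum over the window dimension -> fused event feature.
        return (stacked * weights).sum(dim=1)                     # (B, C, H, W)


if __name__ == "__main__":
    agg = GatedEventAggregation(channels=32, num_windows=2)
    fast = torch.randn(1, 32, 64, 64)   # e.g. short (high-speed) window
    slow = torch.randn(1, 32, 64, 64)   # e.g. long (low-speed) window
    print(agg([fast, slow]).shape)      # torch.Size([1, 32, 64, 64])
```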
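Claim 4 describes bidirectional cross-modal calibration: attention weights computed from the image features calibrate the event features, and vice versa, after which the calibrated features are aligned and concatenated. The sketch below uses a generic squeeze-and-excitation-style channel gate and a 7x7 convolutional spatial gate as stand-ins for the unspecified attention computations; the spatial transformer alignment is only indicated by a comment and the two modalities are assumed to be pre-registered.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Produce channel and spatial attention maps from a guiding feature."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.channel = nn.Sequential(   # squeeze-and-excite style channel gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(   # per-pixel spatial gate
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, guide: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Calibrate `target` with attention weights computed from `guide`.
        return target * self.channel(guide) * self.spatial(guide)


class CrossModalCalibration(nn.Module):
    """Bidirectional calibration followed by concatenation and fusion."""

    def __init__(self, channels: int):
        super().__init__()
        self.img_guides_evt = ChannelSpatialAttention(channels)
        self.evt_guides_img = ChannelSpatialAttention(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, img_feat: torch.Tensor, evt_feat: torch.Tensor) -> torch.Tensor:
        calib_evt = self.img_guides_evt(img_feat, evt_feat)  # image guides events
        calib_img = self.evt_guides_img(evt_feat, img_feat)  # events guide image
        # A spatial transformer network would align calib_evt to calib_img here;
        # this sketch assumes the modalities are already spatially registered.
        return self.fuse(torch.cat([calib_img, calib_evt], dim=1))


if __name__ == "__main__":
    block = CrossModalCalibration(channels=32)
    out = block(torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64))
    print(out.shape)  # torch.Size([1, 32, 64, 64])
```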
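Claim 5 estimates an ambient light level from the mean frame brightness and the event rate, selects normalization parameters accordingly, and reconstructs the output by up-sampling and activation. The following sketch illustrates one way such light-level-conditioned normalization could look; the three light-level buckets, the thresholds, and the instance-normalization and bilinear-upsampling choices are illustrative assumptions, not the claimed parameterization.

```python
import torch
import torch.nn as nn

def estimate_light_level(image: torch.Tensor, event_rate: float,
                         dark_thresh: float = 0.15, bright_thresh: float = 0.75) -> int:
    """Rough ambient-light bucket from mean image brightness and event rate.

    Returns 0 (dark), 1 (normal) or 2 (bright). The thresholds and the way the
    event rate is folded in are placeholder assumptions.
    """
    mean_brightness = image.mean().item()          # image assumed in [0, 1]
    if mean_brightness < dark_thresh and event_rate < 1e5:
        return 0
    if mean_brightness > bright_thresh:
        return 2
    return 1


class ConditionalReconstruction(nn.Module):
    """Light-level-conditioned normalization followed by up-sampling."""

    def __init__(self, channels: int, num_levels: int = 3):
        super().__init__()
        # One (gamma, beta) pair per preset ambient-light level.
        self.gamma = nn.Parameter(torch.ones(num_levels, channels))
        self.beta = nn.Parameter(torch.zeros(num_levels, channels))
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.decode = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(channels, 3, kernel_size=3, padding=1),
            nn.Sigmoid(),                 # final activation -> image in [0, 1]
        )

    def forward(self, fused: torch.Tensor, light_level: int) -> torch.Tensor:
        g = self.gamma[light_level].view(1, -1, 1, 1)
        b = self.beta[light_level].view(1, -1, 1, 1)
        x = self.norm(fused) * g + b      # conditional normalization
        return self.decode(x)


if __name__ == "__main__":
    frame = torch.rand(1, 3, 128, 128)
    level = estimate_light_level(frame, event_rate=5e4)
    recon = ConditionalReconstruction(channels=32)
    print(level, recon(torch.randn(1, 32, 64, 64), level).shape)
```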
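Claims 7 to 10 describe staged training: a first stage with a first loss function, a second stage with a second loss function, and an optional third stage that fine-tunes on low-illuminance and/or high-dynamic-range samples with a third loss function. The claims do not specify the losses, epoch counts, thresholds or dataset layout, so the skeleton below uses placeholder L1/MSE losses and assumes each sample is a (frames, events, target) triple, purely to show the staging structure.

```python
import torch
from torch.utils.data import DataLoader, Dataset

def train_stage(model, loader, loss_fn, epochs: int, lr: float = 1e-4):
    """Run one training stage with the given loss function (claims 7-9 skeleton)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for frames, events, targets in loader:   # assumed sample layout
            pred = model(frames, events)
            loss = loss_fn(pred, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model


def staged_training(model, full_set: Dataset, extreme_subset: Dataset):
    """Illustrative schedule: stage 1 and 2 on the full set, stage 3 fine-tunes
    on the low-light / high-dynamic-range subset (claim 10)."""
    full_loader = DataLoader(full_set, batch_size=8, shuffle=True)
    extreme_loader = DataLoader(extreme_subset, batch_size=8, shuffle=True)
    model = train_stage(model, full_loader, torch.nn.L1Loss(), epochs=20)    # stage 1
    model = train_stage(model, full_loader, torch.nn.MSELoss(), epochs=10)   # stage 2
    model = train_stage(model, extreme_loader, torch.nn.L1Loss(), epochs=5)  # stage 3
    return model
```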
Description
Monitoring method, monitoring system, training method, electronic device and computer storage medium
Technical Field
The application relates to the technical field of computer vision and image processing, and in particular to a monitoring method, a monitoring system, a training method, an electronic device and a computer storage medium.
Background
Traditional RGB cameras face significant technical bottlenecks in visual perception under extreme-dynamic-range scenes. Specifically, in low-illumination environments (e.g., below 1 lux), the exposure time must be extended to obtain an image with a sufficient signal-to-noise ratio, but this inevitably introduces motion blur and severely limits the ability to capture and analyze fast-moving objects. At the same time, when facing strong direct light, such as oncoming vehicle headlights at night, together with dark road regions whose brightness can be as low as 0.01 lux, the limited inherent dynamic range of conventional cameras makes the image highly prone to local overexposure (loss of information in bright regions) or overall underexposure (loss of dark-region detail), so the stringent requirement of applications such as autonomous driving and security monitoring to image both the bright and dark extremes of a scene clearly at the same time cannot be met. High Dynamic Range (HDR) synthesis techniques can alleviate this problem to some extent through multi-exposure fusion, but they incur high processing latency and are extremely sensitive to relative motion between the scene and the camera, so their utility in dynamic scenes is limited.
The event camera, an asynchronous sensor inspired by biological vision, offers a new way to break through this bottleneck. It responds only to asynchronous changes in pixel brightness, naturally possesses a high dynamic range (over 120 dB) and extremely high temporal resolution, and can capture high-speed motion without blur. However, event cameras also have inherent limitations: in static or flat-textured areas they output sparse or even no information because there is no brightness variation, and they do not directly provide absolute intensity or color information. More importantly, under extremely low illumination, event data is affected by photon shot noise and a large number of noise events are generated; weak and continuous brightness changes can produce a temporal tailing phenomenon; and the spatial non-uniformity of the sensor poses serious challenges for subsequent processing. It is therefore difficult to reconstruct a high-quality, semantically meaningful static scene image from event data alone.
In the prior art, fusing RGB images and event data so that each modality compensates for the other's weaknesses has become a research trend, but most methods do not fully or efficiently exploit the deep complementarity of the two modalities. Early or simple fusion strategies, such as direct feature concatenation along the channel dimension or result-level fusion at the decision layer, cannot achieve fine-grained alignment and mutual enhancement of cross-modal information, and their fusion effect is limited.
Some advanced architectures, such as fusion methods based on Feature Pyramid Networks (FPN), can realize multi-scale feature interaction, but they have large parameter counts and complex computation, making them difficult to deploy on edge devices (such as vehicle-mounted computing units and computers carried by unmanned aerial vehicles) with stringent power and real-time requirements. Although methods based on a bidirectional attention mechanism (such as bidirectional calibration, BDC) can improve fusion quality, their computational cost remains high and they are not specifically optimized for the noise, tailing and non-uniformity problems that plague extreme-dynamic-range scenes, so their generalization ability and robustness are insufficient.
Disclosure of Invention
Accordingly, an objective of the embodiments of the present application is to provide a monitoring method, a monitoring system, a training method, an electronic device and a computer storage medium, so as to address the above problems in the prior art.
In a first aspect, an embodiment of the present application provides a monitoring method, where the method includes collecting image frames and an event stream of a target scene, preprocessing the image frames and the event stream to obtain first image features and first event features, aggregating the first event features based on a plurality of preset time windows to generate fusion event features, performing cross-modal attention calibration and spatial alignment on the first image features and the fusion event features to generate fusion features, and reconstructing and outputting an extreme visual image based on the fusion features and an ambient light level estimated from the image frames and the event stream.
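To make the end-to-end flow of the first aspect concrete, the following is a minimal, illustrative PyTorch sketch that wires the steps together. The submodules here are plain convolutional stand-ins rather than the patented modules; the channel counts, the two-channel polarity event frames and the class name MonitoringPipeline are assumptions for illustration only (in a real system, the aggregation, calibration and reconstruction blocks sketched after the claims would take their places).

```python
import torch
import torch.nn as nn

class MonitoringPipeline(nn.Module):
    """End-to-end sketch of the first aspect: preprocess both modalities,
    aggregate events over two time windows, calibrate/fuse across modalities,
    and reconstruct an output image from the fused features."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.image_encoder = nn.Conv2d(3, channels, 3, padding=1)   # first image features
        self.event_encoder = nn.Conv2d(2, channels, 3, padding=1)   # first event features (per window)
        self.aggregate = nn.Conv2d(2 * channels, channels, 1)       # stand-in for gated aggregation
        self.calibrate = nn.Conv2d(2 * channels, channels, 1)       # stand-in for cross-modal calibration
        self.reconstruct = nn.Sequential(
            nn.Conv2d(channels, 3, 3, padding=1), nn.Sigmoid())     # stand-in for conditional reconstruction

    def forward(self, frame, events_short, events_long):
        img = self.image_encoder(frame)
        evt = self.aggregate(torch.cat([self.event_encoder(events_short),
                                        self.event_encoder(events_long)], dim=1))
        fused = self.calibrate(torch.cat([img, evt], dim=1))
        return self.reconstruct(fused)


if __name__ == "__main__":
    model = MonitoringPipeline()
    out = model(torch.rand(1, 3, 64, 64),
                torch.rand(1, 2, 64, 64),   # short-window event frame (polarity channels)
                torch.rand(1, 2, 64, 64))   # long-window event frame
    print(out.shape)  # torch.Size([1, 3, 64, 64])
```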