CN-121999304-A - Method and device for predicting collision time based on binocular event camera
Abstract
The application provides a collision time prediction method and device based on a binocular event camera, and relates to the technical field of computers. The method comprises: performing supervised training on an RGB teacher network to be trained based on an RGB data set to obtain a trained RGB teacher network; determining a scene depth predicted value and a two-dimensional optical flow predicted value based on the trained RGB teacher network and the RGB data set; generating a sample pair by using an event camera response model and the RGB data set; determining a depth predicted value and an optical flow predicted value from the sample pair, the scene depth predicted value and the two-dimensional optical flow predicted value by using a simulated event network; determining a corresponding depth map and optical flow field from the real-time event stream acquired by the binocular event camera and a trained target event network; and determining the collision time based on a TTC calculation model, the depth map and the optical flow field. The method and device provide timely and robust collision risk perception for scenarios such as autonomous driving, unmanned aerial vehicles and mobile robots.
Inventors
- ZHANG YUNJIAN
- ZHU YAO
- WANG ZIZHE
- LIN YU
Assignees
- Qiyuan Laboratory (启元实验室)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-04-08
Claims (10)
- 1. A binocular event camera based collision time prediction method, comprising: performing supervised training on an RGB teacher network to be trained based on an RGB data set labeled with ground truth in advance to obtain a trained RGB teacher network, and determining a scene depth predicted value and a two-dimensional optical flow predicted value of the RGB domain of the RGB data set based on the trained RGB teacher network and the RGB data set; generating a sample pair of an RGB image and a simulated event tensor by using a preset event camera response model and the RGB data set; determining a depth predicted value and an optical flow predicted value of the event domain from the sample pair, the scene depth predicted value and the two-dimensional optical flow predicted value by using a simulated event network; and, under the condition that the scene depth predicted value and the depth predicted value, and the two-dimensional optical flow predicted value and the optical flow predicted value, meet a preset estimation capability judgment condition, determining a corresponding depth map and optical flow field from a real-time event stream acquired by the binocular event camera and a trained target event network, and determining the collision time of the binocular event camera based on a preset TTC calculation model, the depth map and the optical flow field, wherein the structure of the target event network is the same as that of the simulated event network.
- 2. The method of claim 1, wherein the RGB teacher network to be trained comprises a feature extractor, a depth decoding head, and an optical flow decoding head; and wherein performing supervised training on the RGB teacher network to be trained based on the RGB data set labeled with ground truth in advance to obtain the trained RGB teacher network, and determining the scene depth predicted value and the two-dimensional optical flow predicted value of the RGB domain of the RGB data set based on the trained RGB teacher network and the RGB data set, comprises: performing supervised training on the RGB teacher network to be trained based on the RGB data set labeled with ground truth in advance to obtain the trained RGB teacher network; selecting, from the RGB data set, the left-eye image and the right-eye image corresponding to each time step and to the next time step, and taking them as a complete input sample of the RGB teacher network; performing multi-scale convolutional feature extraction on the complete input sample by using the feature extractor to determine coding features corresponding to the complete input sample; constructing, on the highest-level coding features, a depth cost volume and an optical flow cost volume from the left-eye and right-eye features, and performing feature aggregation and regularization on the depth cost volume and the optical flow cost volume in three-dimensional space by using a preset 3D convolutional neural network to generate disparity-space joint features and optical-flow-related features; and determining the scene depth predicted value and the two-dimensional optical flow predicted value based on an end-to-end differentiable disparity estimation algorithm, the joint features and the optical-flow-related features by using the depth decoding head and the optical flow decoding head.
- 3. The method of claim 2, wherein determining the scene depth predicted value and the two-dimensional optical flow predicted value based on the end-to-end differentiable disparity estimation algorithm, the joint features and the optical-flow-related features using the depth decoding head and the optical flow decoding head comprises: determining a low-resolution disparity map and a low-resolution optical flow estimation result based on the end-to-end differentiable disparity estimation algorithm, the joint features and the optical-flow-related features; concatenating the low-resolution disparity map with the coding features, and inputting the concatenated features to the depth decoding head to determine the scene depth predicted value; and concatenating the low-resolution optical flow estimation result with the optical-flow-related features, and inputting the concatenated features to the optical flow decoding head to output the two-dimensional optical flow predicted value.
- 4. The method of claim 3, wherein the end-to-end differentiable disparity estimation algorithm comprises a normalized exponential transformation operation, a disparity regression operation, a multi-channel three-dimensional convolution operation, and a channel mapping convolution operation; and wherein determining the low-resolution disparity map and the low-resolution optical flow estimation result based on the end-to-end differentiable disparity estimation algorithm, the joint features and the optical-flow-related features comprises: processing the joint features based on the normalized exponential transformation operation and the disparity regression operation to determine the low-resolution disparity map; and processing the optical-flow-related features based on the multi-channel three-dimensional convolution operation and the channel mapping convolution operation to determine the low-resolution optical flow estimation result.
- 5. The method of claim 1, wherein determining the depth predicted value and the optical flow predicted value of the event domain from the sample pair, the scene depth predicted value and the two-dimensional optical flow predicted value by using the simulated event network comprises: selecting, based on the sample pair, the simulated event tensor of the RGB image corresponding to the scene depth predicted value and the two-dimensional optical flow predicted value; and inputting the simulated event tensor to the simulated event network to output the depth predicted value and the optical flow predicted value.
- 6. The method of claim 1, wherein, under the condition that the scene depth predicted value and the depth predicted value, and the two-dimensional optical flow predicted value and the optical flow predicted value, meet the preset estimation capability judgment condition, determining the corresponding depth map and optical flow field from the real-time event stream acquired by the binocular event camera and the trained target event network, and determining the collision time of the binocular event camera based on the preset TTC calculation model, the depth map and the optical flow field, comprises: dividing and encoding the real-time event stream according to a preset fixed time window, under the condition that the preset estimation capability judgment condition is met, to generate an event tensor sequence; inputting each event tensor frame in the event tensor sequence into the target event network to output the depth map and the optical flow field; and inputting the depth map and the optical flow field to the TTC calculation model to output the collision time of the binocular event camera.
- 7. The method as recited in claim 1, further comprising: performing forward inference on real event samples in real event data acquired by the binocular event camera by using the simulated event network to output depth and optical flow pseudo-labels on the real event domain; inputting the real event samples to the target event network to output corresponding sample depths and sample optical flows; determining difference information by comparing the depth and optical flow pseudo-labels with the sample depths and the sample optical flows; and constructing a cross-domain distillation loss based on the difference information, and updating parameters of the target event network.
- 8. A binocular event camera based collision time prediction apparatus, comprising: a primary prediction module, configured to perform supervised training on an RGB teacher network to be trained based on an RGB data set labeled with ground truth in advance to obtain a trained RGB teacher network, and to determine a scene depth predicted value and a two-dimensional optical flow predicted value of the RGB domain of the RGB data set based on the trained RGB teacher network and the RGB data set; a sample pair generating module, configured to generate a sample pair of an RGB image and a simulated event tensor by using a preset event camera response model and the RGB data set; an event prediction module, configured to determine a depth predicted value and an optical flow predicted value of the event domain from the sample pair, the scene depth predicted value and the two-dimensional optical flow predicted value by using a simulated event network; and a collision time output module, configured to determine a corresponding depth map and optical flow field from a real-time event stream acquired by the binocular event camera and a trained target event network under the condition that the scene depth predicted value and the depth predicted value, and the two-dimensional optical flow predicted value and the optical flow predicted value, meet a preset estimation capability judgment condition, and to determine the collision time of the binocular event camera based on a preset TTC calculation model, the depth map and the optical flow field, wherein the structure of the target event network is the same as that of the simulated event network.
- 9. An electronic device, comprising: a processor; and a memory storing a computer program which, when executed by the processor, causes the processor to perform the method of any of claims 1-7.
- 10. A non-transitory computer-readable storage medium having stored thereon computer-readable instructions that, when executed by a processor, cause the processor to perform the method of any of claims 1-7.
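The following sketches illustrate several of the mechanisms recited in claims 1-7. Claims 2-4 construct disparity and optical flow cost volumes from left-eye and right-eye features and recover a low-resolution disparity map via a normalized exponential transformation followed by disparity regression. The claims do not give the exact formulation; the sketch below assumes the standard softmax / soft-argmin construction over a GC-Net-style concatenation volume (a minimal PyTorch illustration; function names and shapes are assumptions, not the patent's disclosed design):

```python
import torch
import torch.nn.functional as F

def build_disparity_cost_volume(feat_left, feat_right, max_disp):
    """Concatenation cost volume over candidate disparities.

    feat_left, feat_right: (B, C, H, W) encoder features of the
    left-eye and right-eye images.
    Returns a (B, 2C, max_disp, H, W) volume for 3D-conv aggregation.
    """
    b, c, h, w = feat_left.shape
    volume = feat_left.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, :c, d] = feat_left
            volume[:, c:, d] = feat_right
        else:
            # shift the right features by d pixels before pairing
            volume[:, :c, d, :, d:] = feat_left[:, :, :, d:]
            volume[:, c:, d, :, d:] = feat_right[:, :, :, :-d]
    return volume

def soft_argmin_disparity(cost, max_disp):
    """Normalized exponential transformation (softmax over the disparity
    axis) followed by differentiable disparity regression (expectation).

    cost: (B, max_disp, H, W) aggregated matching cost, lower = better.
    """
    prob = F.softmax(-cost, dim=1)
    disp_values = torch.arange(max_disp, dtype=cost.dtype,
                               device=cost.device).view(1, max_disp, 1, 1)
    return (prob * disp_values).sum(dim=1)  # (B, H, W) low-res disparity
```

The low-resolution disparity map would then be concatenated with the coding features and passed to the depth decoding head, as in claim 3.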
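Claim 6 divides the real-time event stream into fixed time windows and encodes each window as an event tensor. The encoding itself is not specified in the claims; a common choice is a time-binned voxel grid, sketched here in NumPy (the binning scheme and names are assumptions):

```python
import numpy as np

def events_to_tensor(xs, ys, ts, ps, h, w, num_bins, t_start, t_end):
    """Encode events falling in [t_start, t_end) as a (num_bins, H, W) grid.

    xs, ys: integer pixel coordinates; ts: timestamps; ps: polarities
    in {-1, +1}. Each event's polarity is accumulated into its time bin.
    """
    tensor = np.zeros((num_bins, h, w), dtype=np.float32)
    mask = (ts >= t_start) & (ts < t_end)
    xs, ys, ts, ps = xs[mask], ys[mask], ts[mask], ps[mask]
    bins = ((ts - t_start) / (t_end - t_start) * num_bins).astype(int)
    bins = np.clip(bins, 0, num_bins - 1)
    np.add.at(tensor, (bins, ys, xs), ps)
    return tensor

def stream_to_tensor_sequence(xs, ys, ts, ps, h, w, num_bins, window):
    """Split the stream into fixed-duration windows (claim 6), encode each."""
    starts = np.arange(ts.min(), ts.max(), window)
    return [events_to_tensor(xs, ys, ts, ps, h, w, num_bins, s, s + window)
            for s in starts]
```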
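Claim 1 gates deployment on a preset estimation capability judgment condition comparing the RGB-domain and event-domain predictions, but leaves the condition itself unspecified. One plausible reading, given only as a hypothetical illustration with placeholder thresholds:

```python
import torch

def estimation_capability_met(scene_depth_rgb, depth_event,
                              flow_rgb, flow_event,
                              depth_tol=0.05, flow_tol=0.5):
    """Return True when the event-domain predictions track the RGB-domain
    teacher within fixed tolerances: mean relative depth error and mean
    flow end-point error. Flows are (B, 2, H, W); thresholds are
    illustrative placeholders, not values from the patent.
    """
    depth_err = torch.mean(torch.abs(depth_event - scene_depth_rgb)
                           / scene_depth_rgb.clamp(min=1e-6))
    flow_err = torch.mean(torch.norm(flow_event - flow_rgb, dim=1))
    return bool(depth_err < depth_tol and flow_err < flow_tol)
```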
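The claims treat the TTC calculation model as a preset black box taking the depth map and optical flow field as inputs. Two common geometric baselines are sketched below as assumptions, not as the patent's disclosed model: per-pixel TTC from the depth change across consecutive windows, and TTC from the divergence of the flow field (exact for a fronto-parallel surface approached along the optical axis):

```python
import numpy as np

def ttc_from_depth_pair(depth_prev, depth_curr, dt, eps=1e-6):
    """Per-pixel TTC = Z / (-dZ/dt) from depth maps of consecutive windows."""
    approach_speed = (depth_prev - depth_curr) / dt   # > 0 when closing in
    return np.where(approach_speed > eps,
                    depth_curr / np.maximum(approach_speed, eps),
                    np.inf)                           # receding or static

def ttc_from_flow_divergence(flow, eps=1e-6):
    """TTC ~ 2 / div(u) for an approaching fronto-parallel surface.

    flow: (H, W, 2) optical flow field in pixels per second.
    """
    du_dx = np.gradient(flow[..., 0], axis=1)
    dv_dy = np.gradient(flow[..., 1], axis=0)
    div = du_dx + dv_dy
    return np.where(div > eps, 2.0 / div, np.inf)
```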
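Claim 7 adapts the target event network to real data by distilling the simulated event network's depth and optical flow pseudo-labels. A minimal PyTorch training step, under the assumption that the cross-domain distillation loss is a weighted L1 discrepancy (the loss form, weights and names are illustrative):

```python
import torch
import torch.nn.functional as F

def cross_domain_distillation_step(sim_event_net, target_event_net,
                                   optimizer, real_event_tensor,
                                   w_depth=1.0, w_flow=1.0):
    """One distillation update on a real event sample (claim 7).

    sim_event_net: frozen teacher trained on simulated events.
    target_event_net: student with the same architecture (claim 1).
    """
    with torch.no_grad():                      # teacher forward inference
        depth_pseudo, flow_pseudo = sim_event_net(real_event_tensor)

    depth_pred, flow_pred = target_event_net(real_event_tensor)

    # Cross-domain distillation loss from the pseudo-label discrepancies.
    loss = (w_depth * F.l1_loss(depth_pred, depth_pseudo)
            + w_flow * F.l1_loss(flow_pred, flow_pseudo))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```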
Description
Method and device for predicting collision time based on binocular event camera
Technical Field
The application relates to the technical field of computers, and in particular to a collision time prediction method and device based on a binocular event camera.
Background
In autonomous motion systems such as automatic driving, intelligent robots and unmanned aerial vehicles, timely and reliable perception of potential collision risks is one of the key links for guaranteeing safety. In practice, the time to collision (Time to Collision, TTC) is typically used to characterize the remaining time before an object collides with the ego body under the current motion state. TTC can serve not only as a trigger condition for emergency braking and obstacle avoidance decisions, but also as an important input for upper-layer planning and control strategy design. If TTC can be estimated continuously and stably in complex dynamic scenes, the system can realize various active safety behaviors such as early deceleration, lane changing or detouring, thereby significantly reducing the risks of traffic accidents and equipment damage.
Conventional TTC estimation is based on the depth estimation and optical flow estimation of a frame camera: on the one hand, the three-dimensional geometry of the scene is recovered from a sequence of successive images; on the other hand, the pixel-level motion field is estimated; the TTC of each object is then estimated based on geometric constraints. However, visual perception based on the traditional frame imaging paradigm still faces significant challenges. First, the sampling frequency of a conventional frame camera is limited by its fixed frame rate, so detail changes during high-speed motion are difficult to capture in time. In addition, owing to the limited dynamic range of a conventional frame camera, severe overexposure and underexposure readily occur in strong-light and high-contrast environments, making scene structure and motion information difficult to recover reliably. Under complex scenes such as high-speed motion, strong illumination change or high dynamic range, these factors severely restrict the stability and precision of frame-camera-based depth and optical flow estimation, so the practical requirements of high safety and low delay are difficult to meet.
The event camera, also called a dynamic vision sensor, is a novel biologically inspired vision sensor. Unlike a traditional camera, which outputs complete frame images at a fixed frequency, each pixel in an event camera works independently: when the light intensity change at the pixel exceeds a preset threshold, an event is triggered, and information comprising spatial position, timestamp and polarity is output. Event cameras typically have microsecond time resolution, very high dynamic range and very low perception latency, and can continuously provide stable temporal and spatial information in complex scenes such as high-speed motion, strong illumination variation and high contrast.
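The response model just described, in which a pixel fires an event whenever its log-intensity change crosses a contrast threshold, is also what the preset event camera response model of claim 1 would need to emulate to synthesize events from RGB frames. The sketch below follows the widely used ESIM-style per-pixel formulation as an assumption; the patent does not disclose its simulator:

```python
import numpy as np

def simulate_events(frame, t_prev, t_curr, ref_log, c_thresh=0.15):
    """Emit events as each pixel's log intensity crosses multiples of
    c_thresh relative to its per-pixel reference level (ESIM-style).

    frame: current intensity image; ref_log: per-pixel reference log
    intensities, updated and returned. Event timestamps are linearly
    interpolated across (t_prev, t_curr].
    """
    log_curr = np.log(frame.astype(np.float64) + 1e-6)
    events = []
    h, w = log_curr.shape
    for y in range(h):
        for x in range(w):
            delta = log_curr[y, x] - ref_log[y, x]
            n = int(abs(delta) // c_thresh)     # threshold crossings
            if n == 0:
                continue
            pol = 1 if delta > 0 else -1        # polarity of the change
            for k in range(1, n + 1):
                frac = k * c_thresh / abs(delta)
                events.append((x, y, t_prev + frac * (t_curr - t_prev), pol))
            ref_log[y, x] += pol * n * c_thresh # reset toward new level
    return events, ref_log
```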
Carrying out geometric perception and motion estimation based on event camera data can significantly improve the robustness and real-time performance of depth and optical flow estimation in such scenes, making it particularly suitable for safety-critical tasks with extremely high requirements on response speed and reliability, such as collision time prediction. As event camera hardware has matured, academia and industry have conducted a great deal of research around the geometric perception of event data, in both the depth estimation and optical flow estimation directions. In terms of depth estimation, event depth estimation usually aggregates discrete events in time and inputs the aggregated events into a depth network, or performs multi-modal training in combination with frame images so as to recover scene structure from event data alone in the inference stage. In terms of optical flow estimation, event optical flow estimation mostly exploits the high time resolution of the event stream to model the pixel-level motion field within a short time window, thereby obtaining more stable motion estimation under high-speed motion and complex illumination conditions. In general, these methods demonstrate that event cameras can provide high-quality three-dimensional geometric information and pixel-level motion information in difficult scenes, laying an important foundation for TTC estimation based on joint depth and optical flow reasoning. In the related art, most depth estimation and optical flow estimation methods based on event cameras adopt a supervised learning framework and rely heavily on a large amount of high-quality event depth or optical flow labels, while the acquisition and labeling cost of event data in real scenes is high.