CN-122023462-A - Target tracking method, system and medium based on image-event stream
Abstract
The invention relates to a target tracking method, system and medium based on an image-event stream. The method comprises: obtaining an RGB image and an original event stream acquired by an event camera; performing heterogeneous encoding on the original event stream to synchronously construct a high-resolution two-dimensional image-space event representation and a low-resolution three-dimensional voxel-space event representation; inputting the RGB image into a single-stream tracking backbone network based on a vision transformer to extract RGB features; converting the high-resolution two-dimensional image-space event representation into an initial event prompt through patch embedding; inputting the low-resolution three-dimensional voxel-space event representation into a sparse convolution network to extract sparse voxel motion features; dynamically evolving the initial event prompt based on the sparse voxel motion features to generate a dynamic prompt with temporal perception capability; performing multi-modal feature fusion on the dynamic prompt and the RGB features; and determining the target tracking position based on the fused features. The invention realizes efficient RGB-Event (RGBE) feature fusion and target localization.
Inventors
- LU YAO
- XU RUIHUI
- ZHENG TIANQI
- ZHOU QI
- LU GUANGMING
Assignees
- Harbin Institute of Technology (Shenzhen) (Harbin Institute of Technology Shenzhen Institute of Science and Technology Innovation)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-10
Claims (10)
- 1. A target tracking method based on an image-event stream, the method comprising the following steps: step S10, acquiring multi-modal data, wherein the multi-modal data comprises an RGB image and an original event stream acquired by an event camera; step S20, performing heterogeneous encoding on the original event stream, and synchronously constructing a high-resolution two-dimensional image-space event representation and a low-resolution three-dimensional voxel-space event representation, to obtain event representations that are complementary in two information dimensions; step S30, extracting multi-modal features, wherein step S30 specifically comprises the following steps: step S301, inputting the RGB image into a single-stream tracking backbone network based on a vision transformer, and extracting RGB features; step S302, converting the high-resolution two-dimensional image-space event representation into an initial event prompt through patch embedding; step S303, inputting the low-resolution three-dimensional voxel-space event representation into a sparse convolution network, and extracting sparse voxel motion features; step S40, dynamically evolving the initial event prompt based on the sparse voxel motion features, to generate a dynamic prompt with temporal perception capability; step S50, performing multi-modal feature fusion on the dynamic prompt and the RGB features; and step S60, performing target center localization, center offset regression and scale prediction regression based on the fused features, outputting a tracking bounding box of the target, and determining the target tracking position.
- 2. The image-event stream based target tracking method according to claim 1, wherein step S20 comprises: step S201, cropping the RGB image to obtain a template image and a search-region image, and extracting from the original event stream an event-stream segment aligned with the RGB exposure time window; step S202, encoding the extracted event-stream segment into a two-dimensional image space with 256×256 resolution and a three-dimensional voxel space with 8×16×16 resolution, so as to obtain two event representations that are complementary in information dimension (a minimal encoding sketch is given after the claims).
- 3. The image-event stream based target tracking method according to claim 2, wherein step S301 comprises: step S3012, concatenating the cropped template image and the search-region image into a unified input sequence; step S3013, partitioning the concatenated RGB image sequence into fixed-size patches, linearly projecting the patches into a token sequence, and adding positional encodings to preserve spatial structure information (see the patch-embedding sketch after the claims).
- 4. The image-event stream based target tracking method according to claim 3, wherein step S40 comprises: step S401, heterogeneous feature alignment, wherein the sparse voxel motion features are mapped into a semantic space compatible with the initial event prompt; step S402, scene-adaptive gated fusion, wherein fusion weights are adaptively computed according to the dynamic and static characteristics of the scene; step S403, normalizing the fused features in the semantic space to eliminate modality differences and ensure seamless injection into the RGB backbone (see the gated-fusion sketch after the claims).
- 5. The image-event stream based target tracking method according to claim 4, wherein step S50 comprises: completing key-information screening through spatial saliency attention that focuses on salient objects and through dimensionality reduction in a bottleneck layer, so that the high-frequency motion-edge information of the event modality is integrated into the RGB feature stream, effectively compensating for the information loss of RGB under motion blur or low illumination (see the fusion sketch after the claims).
- 6. The image-event stream based target tracking method according to claim 5, wherein the step of performing target center localization in step S60 comprises: step S601, progressively compressing the input features through a progressive convolutional decoding unit, gradually extracting center-localization semantics through five levels of nonlinear transformation to obtain a normalized confidence heatmap, wherein the heatmap, after processing by a normalized activation function, represents the spatial probability distribution of the target, and the peak position corresponds to the predicted target center (a combined sketch of the prediction heads in claims 6 to 8 is given after the claims).
- 7. The image-event stream based target tracking method according to claim 6, wherein the step of center offset regression in step S60 comprises: step S602, adopting a convolution path parallel to step S601 with the same progressive decoding strategy to output a two-dimensional sub-pixel offset vector, wherein the offset vector compensates for the quantization error of the feature-map grid, raising the localization accuracy from feature-stride level to sub-pixel level.
- 8. The image-event stream based target tracking method according to claim 7, wherein the step of scale prediction regression in step S60 comprises: step S603, directly regressing the normalized width and height of the target box, and limiting the output range to the [0, 1] interval through a constraint function to ensure the physical feasibility of the predicted scale.
- 9. An image-event stream based target tracking system, comprising a memory, a processor, and an image-event stream based target tracking program stored in the memory and executable on the processor, wherein the image-event stream based target tracking program, when executed by the processor, performs the steps of the method according to any one of claims 1 to 8.
- 10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores an image-event stream based target tracking program which, when executed by a processor, performs the steps of the method according to any one of claims 1 to 8.
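A minimal sketch of the dual event encoding in claim 2, assuming raw events arrive as NumPy arrays of integer pixel coordinates (xs, ys), timestamps ts and polarities ps; the resolutions (256×256 and 8×16×16) follow the claim, while the function and variable names are hypothetical:

```python
import numpy as np

def encode_events(xs, ys, ts, ps, sensor_hw, t0, t1):
    """Hypothetical dual encoding of an event segment aligned with one
    RGB exposure window [t0, t1] (claim 2, steps S201-S202): returns a
    high-resolution 2-D event image and a low-resolution 3-D voxel grid."""
    H, W = sensor_hw
    # High-resolution 2-D image: signed polarity accumulation at 256x256.
    img = np.zeros((256, 256), dtype=np.float32)
    ix = (xs * 256 // W).clip(0, 255)
    iy = (ys * 256 // H).clip(0, 255)
    np.add.at(img, (iy, ix), np.where(ps > 0, 1.0, -1.0))
    # Low-resolution 3-D voxel grid: 8 time bins over a 16x16 spatial grid.
    vox = np.zeros((8, 16, 16), dtype=np.float32)
    it = ((ts - t0) * 8 / max(t1 - t0, 1e-9)).astype(np.int64).clip(0, 7)
    vx = (xs * 16 // W).clip(0, 15)
    vy = (ys * 16 // H).clip(0, 15)
    np.add.at(vox, (it, vy, vx), 1.0)
    return img, vox
```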
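A minimal PyTorch sketch of the patch embedding in claim 3 (steps S3012-S3013), assuming a 16×16 patch size, a 768-dimensional embedding, and 128×128 template / 256×256 search crops; the claim fixes none of these values:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Sketch of claim 3: the template and search images are concatenated
    into one token sequence with learned positional encodings."""
    def __init__(self, patch=16, dim=768, n_tokens=64 + 256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))

    def forward(self, template, search):
        # (B,3,128,128) -> 64 tokens; (B,3,256,256) -> 256 tokens
        zt = self.proj(template).flatten(2).transpose(1, 2)
        zs = self.proj(search).flatten(2).transpose(1, 2)
        tokens = torch.cat([zt, zs], dim=1)          # unified input sequence
        return tokens + self.pos[:, :tokens.shape[1]]  # preserve spatial structure
```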
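A sketch of the dynamic prompt evolution in claim 4, assuming the sparse voxel motion features have already been flattened to the prompt's token layout; the layer shapes and the sigmoid gate are assumptions:

```python
import torch
import torch.nn as nn

class DynamicPromptEvolution(nn.Module):
    """Sketch of claim 4: align the voxel motion features (S401), apply a
    scene-adaptive gate (S402), and normalize for seamless injection into
    the RGB backbone (S403)."""
    def __init__(self, voxel_dim=128, dim=768):
        super().__init__()
        self.align = nn.Linear(voxel_dim, dim)                    # S401
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)                             # S403

    def forward(self, prompt, voxel_feat):
        # prompt: (B,N,dim); voxel_feat: (B,N,voxel_dim), pre-flattened
        motion = self.align(voxel_feat)
        g = self.gate(torch.cat([prompt, motion], dim=-1))        # S402
        return self.norm(prompt + g * motion)                     # dynamic prompt
```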
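A sketch of the fusion step in claim 5; the per-token attention weight and the 4× bottleneck reduction are assumptions standing in for the claimed spatial saliency attention and bottleneck screening:

```python
import torch
import torch.nn as nn

class EventRGBFusion(nn.Module):
    """Sketch of claim 5: spatial attention highlights salient tokens, a
    bottleneck layer screens key information, and the event cue is added
    to the RGB feature stream."""
    def __init__(self, dim=768, reduction=4):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.bottleneck = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.GELU(),
            nn.Linear(dim // reduction, dim))

    def forward(self, rgb_tokens, dyn_prompt):
        w = self.attn(dyn_prompt)                # per-token spatial weight
        return rgb_tokens + self.bottleneck(w * dyn_prompt)
```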
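A combined sketch of the three prediction heads in claims 6 to 8; the five-level channel schedule and the use of sigmoid as both the heatmap normalization and the [0, 1] constraint function are assumptions:

```python
import torch
import torch.nn as nn

def _branch(dim, out_ch):
    # Five-level progressive channel compression (claim 6), assumed widths.
    chans = [dim, 256, 128, 64, 32]
    layers = []
    for cin, cout in zip(chans[:-1], chans[1:]):
        layers += [nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU()]
    return nn.Sequential(*layers, nn.Conv2d(32, out_ch, 1))

class TrackingHead(nn.Module):
    """Sketch of claims 6-8: center heatmap (S601), sub-pixel offset
    (S602) and normalized width/height in [0, 1] (S603)."""
    def __init__(self, dim=768):
        super().__init__()
        self.center = _branch(dim, 1)
        self.offset = _branch(dim, 2)
        self.size = _branch(dim, 2)

    def forward(self, feat):                        # feat: (B, dim, H, W)
        heat = torch.sigmoid(self.center(feat))     # spatial probability map
        off = self.offset(feat)                     # grid quantization compensation
        wh = torch.sigmoid(self.size(feat))         # constrained to [0, 1]
        return heat, off, wh
```

At inference, the peak of the heatmap gives the coarse center, the offset vector refines it to sub-pixel precision, and the normalized width/height is scaled by the search-region size to recover the tracking bounding box (claim 1, step S60).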
Description
Target tracking method, system and medium based on image-event stream

Technical Field

The invention relates to the technical fields of computer vision, deep learning and event-stream data processing, and in particular to a target tracking method, system and medium based on an image-event stream.

Background

Target tracking is a fundamental task in computer vision that aims to continuously estimate the position and scale of a target in subsequent frames, given the target's initial position in the first frame. RGB tracking methods based on deep learning have advanced significantly, and trackers adopting a Convolutional Neural Network (CNN) or Vision Transformer as the backbone network are widely applied in intelligent security, autonomous driving, robot navigation and other fields. However, conventional frame cameras are prone to degradation such as motion blur, overexposure or underexposure in high-speed motion, extreme illumination or high dynamic range scenes, which reduces tracking robustness.

The event camera (Event Camera), as a neuromorphic sensor, asynchronously records brightness changes at the pixel level and offers microsecond temporal resolution, high dynamic range, low latency and other characteristics, providing stable motion-edge information in such degraded scenes. Since event cameras are insensitive to static textures while RGB cameras are sensitive to motion blur, the two modalities are naturally complementary, and image-event (RGB-Event) target tracking schemes that fuse event data with RGB images have emerged in the prior art. Such schemes typically transform the event stream into a specific representation (e.g., an event image or event voxel grid), extract event features, and fuse them with the RGB features to obtain a more robust representation of the target.

Although RGBE tracking schemes improve tracking performance in degraded scenes to a certain extent, owing to the inherent high sparsity, unstructured form and strong temporal character of event-stream data, existing methods still have significant technical defects in event representation construction, feature extraction and multi-modal fusion, mainly in the following three aspects.

1. A single event representation form struggles to balance temporal information retention and spatial alignment with RGB features

Existing RGBE tracking schemes typically employ a single event representation and must trade off temporal information retention against spatial alignment. Specifically, the first type of scheme compresses the event stream into a two-dimensional event image (Event Image) or time surface (Time Surface). Such two-dimensional representations align conveniently with the RGB feature map in the spatial dimension, easing integration with mature RGB tracking frameworks. However, the compression along the time axis loses the fine-grained temporal dynamics of the event data, limiting the system's perception of the target motion trajectory and reducing tracking stability in high-speed motion or rapid appearance-change scenes. The second type of scheme uses three-dimensional representations such as the event voxel grid (Event Voxel Grid) to preserve spatio-temporal dynamics, and extracts motion features through three-dimensional convolution or transformer structures.

However, there are significant differences in resolution, feature morphology and tensor organization between three-dimensional representations and two-dimensional RGB features; the engineering complexity of fusion alignment is high, and feature-mismatch noise is easily introduced. In addition, if the full spatial resolution (H×W) is maintained, computation and memory footprint grow markedly with the time dimension T, which hinders real-time deployment; if the spatial resolution is reduced to control computational cost, spatial information is lost and the fineness of the event features decreases. The above analysis shows that technical routes relying on a single event representation face a significant challenge in satisfying both temporal information integrity and spatial alignment feasibility.

2. Event feature extraction methods fail to effectively exploit data sparsity

The event-stream data output by an event camera is inherently highly sparse, non-uniformly distributed in space-time, and asynchronously triggered. Whatever event representation form is adopted (event voxels, event images or other forms), only part of the spatio-temporal positions in the corresponding tensor contain valid event information, and the event distribution is very uneven. However, the existing event feature extraction method generally adopts a traditional calculatio