CN-122024142-A - Video instance segmentation method and system based on track-guided memory network
Abstract
The invention discloses a video instance segmentation method and system based on a trajectory-guided memory network, in the technical field of computer vision. The method designs a trajectory-appearance joint modeling module that introduces a learnable threshold gating layer into the temporal state modeling branch to filter noisy trajectories, and fuses trajectory features with appearance features through a masked self-attention mechanism to generate robust instance queries. On this basis, a temporal-aware memory updating and management module is designed: historical features are denoised by a channel-spatial collaborative filtering mechanism, a fixed-size memory pool is dynamically updated according to appearance similarity to maintain the feature quality of the pool, and temporal position encodings are introduced to give the historical features temporal awareness. The method and system achieve high-precision segmentation and cross-frame identity association of video instances in complex scenes, and effectively alleviate the instance identity association errors caused by the lack of motion trajectory modeling and by memory noise accumulation.
Inventors
- Gui Yan
- Liu Zhuo
Assignees
- Changsha University of Science and Technology (长沙理工大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-03-13
Claims (7)
- 1. A video instance segmentation method based on a trajectory-guided memory network, the method being performed by a computer and comprising: S1, acquiring a video instance segmentation dataset and a static image dataset, each image in the datasets forming an image-segmentation label pair with its corresponding segmentation label; S2, sampling the video dataset to generate real video clips; expanding the static image dataset into pseudo video clips by copying each single static image into an image sequence and independently applying random data augmentation to each frame of the sequence to simulate inter-frame motion and appearance change; and constructing a mixed training dataset containing the real video clips and the pseudo video clips; S3, constructing a segmentation model comprising a feature extraction module, a trajectory-appearance joint modeling module, a temporal-aware memory updating and management module, and a prediction head; S4, designing a joint loss function comprising a classification loss, a mask prediction loss, and a feature similarity loss; S5, computing the loss function designed in S4 on the mixed training dataset constructed in S2 and training the segmentation model constructed in S3 with the back-propagation algorithm; and S6, using the segmentation model trained in S5 to output target masks with cross-frame identity association.
- 2. The video instance segmentation method based on a trajectory-guided memory network as set forth in claim 1, wherein S2 is implemented as follows: S201, performing pseudo-video sequence generation on each image-segmentation label pair in the static image-segmentation label pair set: copying each single static image into a continuous image sequence and independently applying random data augmentation operations to each frame of the sequence, the operations comprising random scaling, random horizontal flipping, random color jitter, and random grayscale conversion, thereby generating a pseudo-video dataset containing inter-frame motion and appearance change; S202, performing video sampling on the video-segmentation label pair set: sampling consecutive frames from the same video sequence at a preset sampling interval, and applying normalization, cropping, and affine transformation to each sampled frame and its corresponding segmentation label to obtain a real video dataset; and S203, mixing the pseudo-video dataset and the real video dataset in a preset proportion to construct the mixed training dataset.
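The pseudo-video generation of claim 2 can be illustrated with a minimal sketch (not part of the claims): a single static image is copied into a sequence and each frame receives independent random augmentations. The nested-list image representation and the two augmentations shown (horizontal flip, brightness jitter) are simplified stand-ins for the claimed scaling / flip / color-jitter / grayscale operations.

```python
import random

def augment(frame, rng):
    """Independently augment one frame: random horizontal flip and
    random brightness jitter (simplified stand-ins for the claimed
    augmentation operations)."""
    out = [row[:] for row in frame]          # copy the H x W image
    if rng.random() < 0.5:                   # random horizontal flip
        out = [row[::-1] for row in out]
    gain = 1.0 + rng.uniform(-0.2, 0.2)      # random brightness jitter
    return [[min(255.0, max(0.0, px * gain)) for px in row] for row in out]

def make_pseudo_clip(image, num_frames, seed=0):
    """Copy one static image into a sequence and augment each frame
    independently, simulating inter-frame motion/appearance change."""
    rng = random.Random(seed)
    return [augment(image, rng) for _ in range(num_frames)]

img = [[10, 20, 30], [40, 50, 60]]
clip = make_pseudo_clip(img, num_frames=4)
```

Because each frame is augmented independently, consecutive frames of the pseudo clip differ slightly, which is what lets static-image data serve as training material for a temporal model.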
- 3. The video instance segmentation method based on a trajectory-guided memory network as set forth in claim 1, wherein the trajectory-appearance joint modeling module in S3 comprises a temporal state modeling branch and a spatial appearance modeling branch, wherein: the temporal state modeling branch consists of a plurality of stacked attention blocks, each attention block connecting in sequence a cross-attention layer, a learnable threshold gating layer, and a self-attention layer; the branch takes the trajectory tokens of the previous frame as input, generates a soft mask through the cross-attention layer, filters noisy trajectory features through the learnable threshold gating layer, and models the feature associations among the motion trajectories of different instances through the self-attention layer; the spatial appearance modeling branch consists of a plurality of stacked attention blocks, each attention block connecting in sequence a cross-attention layer and a masked self-attention layer; the branch takes the appearance queries of the previous frame as input and aggregates the appearance features of the current frame through the cross-attention layer; the trajectory features output by the temporal state modeling branch are concatenated with the appearance features output by the spatial appearance modeling branch, and feature fusion is performed through a masked self-attention mechanism within the high-confidence associated regions guided by the soft mask, generating an enhanced instance feature representation.
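The soft-mask-guided fusion at the end of claim 3 can be sketched as follows. This is an illustrative simplification, not the patented architecture: a single attention head without learned projections, where the soft mask biases the attention logits via `log(mask)` so that low-confidence positions receive almost no attention weight.

```python
import math

def masked_self_attention(feats, soft_mask):
    """Single-head masked self-attention sketch: attention logits
    between fused feature vectors are biased by log(soft_mask), so
    positions with a low soft-mask value are effectively excluded
    from the fusion."""
    n, d = len(feats), len(feats[0])
    out = []
    for i in range(n):
        # dot-product logits, biased by the (clamped) log soft mask
        logits = [sum(a * b for a, b in zip(feats[i], feats[j]))
                  + math.log(max(soft_mask[j], 1e-9)) for j in range(n)]
        m = max(logits)
        w = [math.exp(l - m) for l in logits]       # stable softmax
        s = sum(w)
        w = [x / s for x in w]
        out.append([sum(w[j] * feats[j][k] for j in range(n))
                    for k in range(d)])
    return out

feats = [[1.0, 0.0], [0.0, 1.0]]
mask = [1.0, 1e-6]                 # second token effectively masked out
fused = masked_self_attention(feats, mask)
```

With the second position masked, each output is dominated by the first feature vector, which is the intended behavior: fusion happens only inside high-confidence regions.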
- 4. The video instance segmentation method based on a trajectory-guided memory network of claim 1, wherein the processing performed by the learnable threshold gating layer of the trajectory-appearance joint modeling module comprises: computing a spatio-temporal correlation matrix between the intermediate trajectory tokens and the frame queries; averaging the correlation matrix to obtain a confidence score for each trajectory token; scaling the difference between the confidence score and a learnable threshold by a temperature coefficient and mapping it to a soft mask through a Sigmoid function; and using the soft mask to perform a weighted fusion of the intermediate trajectory tokens of the current frame with the trajectory tokens of the previous frame, obtaining denoised trajectory tokens.
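The gating rule of claim 4 can be written out in a few lines. In this sketch the threshold and temperature are fixed illustrative constants (in the patent the threshold is learnable), and the correlation matrix is taken as given:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_tokens(curr_tokens, prev_tokens, corr, threshold=0.5, temperature=0.1):
    """Learnable-threshold gating sketch (claim 4).
    corr[i][j] is the spatio-temporal correlation between trajectory
    token i and frame query j.  Confidence c_i is the row average;
    the soft mask m_i = sigmoid((c_i - threshold) / temperature)
    blends the current and previous tokens, suppressing
    low-confidence (noisy) trajectories."""
    denoised = []
    for cur, prev, row in zip(curr_tokens, prev_tokens, corr):
        conf = sum(row) / len(row)                    # average aggregation
        m = sigmoid((conf - threshold) / temperature) # soft mask in (0, 1)
        denoised.append([m * c + (1.0 - m) * p for c, p in zip(cur, prev)])
    return denoised

curr = [[1.0, 1.0], [1.0, 1.0]]
prev = [[0.0, 0.0], [0.0, 0.0]]
corr = [[0.9, 0.9], [0.1, 0.1]]   # token 0 confident, token 1 noisy
out = gate_tokens(curr, prev, corr)
```

A confident token keeps its current-frame value almost unchanged, while a noisy token falls back to the previous-frame trajectory token, which is the denoising effect the claim describes.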
- 5. The video instance segmentation method based on a trajectory-guided memory network as set forth in claim 1, wherein the temporal-aware memory updating and management module in S3 comprises a channel-spatial collaborative filtering mechanism, an appearance-similarity-based memory updating strategy, and a temporal position encoding, wherein: the channel-spatial collaborative filtering mechanism comprises a channel attention submodule and a spatial attention submodule; the channel attention submodule uses a max-pooling layer, an average-pooling layer, and a multi-layer perceptron to extract the cross-channel semantic distribution and generate a channel weight vector; the spatial attention submodule uses multi-scale depthwise separable convolution layers to extract spatial details and generate a spatial weight vector; instance features are weighted sequentially by the channel weight vector and the spatial weight vector, achieving collaborative denoising along the channel and spatial dimensions; the appearance-similarity-based memory updating strategy maintains a fixed-size memory pool: denoised instance features are stored directly while the pool is not full; when the pool is full, the cosine similarity between the current frame features and all historical features in the pool is computed and the historical feature with the lowest similarity is replaced; and the temporal position encoding mechanism generates temporal position encodings through a multi-layer perceptron from the feature offset between the current frame features and the historical features in the memory pool, and superimposes them onto the memory features to enhance temporal awareness.
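The appearance-similarity update rule of claim 5 reduces to a short procedure; the pool capacity and feature dimension below are illustrative choices, and the channel-spatial denoising step is assumed to have already been applied to `feat`:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def update_memory(pool, feat, capacity):
    """Fixed-size memory pool update (claim 5 sketch): append while
    the pool is not full; once full, replace the stored historical
    feature with the lowest cosine similarity to the current
    (denoised) feature."""
    if len(pool) < capacity:
        pool.append(feat)
    else:
        worst = min(range(len(pool)), key=lambda i: cosine(pool[i], feat))
        pool[worst] = feat
    return pool

pool = []
update_memory(pool, [1.0, 0.0], capacity=2)
update_memory(pool, [0.9, 0.1], capacity=2)
update_memory(pool, [0.0, 1.0], capacity=2)   # evicts the least-similar entry
```

Replacing the least-similar entry keeps the pool aligned with the instance's current appearance, so stale or occlusion-corrupted features are gradually evicted.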
- 6. The video instance segmentation method based on a trajectory-guided memory network of claim 1, wherein constructing the joint loss function in S4 comprises: performing bipartite matching between the prediction set and the ground-truth label set with the Hungarian algorithm to determine the optimal matching pairs; and, based on the optimal matching pairs, computing a global optimization objective consisting of a weighted sum of the classification loss, the feature similarity loss, and the mask segmentation loss, wherein the classification loss is computed with a cross-entropy loss function, the mask segmentation loss consists of a binary cross-entropy loss and a Dice loss, and the feature similarity loss adopts a contrastive learning mechanism that constrains the temporal consistency of the features of the same instance across different time steps.
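The matching step of claim 6 can be illustrated with a toy sketch. Two simplifications are assumed here: the Hungarian algorithm is replaced by brute-force enumeration of assignments (feasible only for tiny instance counts), and the matching cost uses only the Dice term rather than the full weighted sum of classification, similarity, and mask losses:

```python
from itertools import permutations

def dice_loss(pred, target, eps=1e-6):
    """Dice loss between a soft predicted mask and a binary target mask."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)

def match(preds, targets):
    """Brute-force optimal bipartite matching (stand-in for the
    Hungarian algorithm).  Returns the assignment of predictions to
    ground-truth masks that minimizes the total Dice loss."""
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(targets))):
        cost = sum(dice_loss(preds[i], targets[j])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost

preds = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]   # two soft masks over 3 pixels
targets = [[0, 0, 1], [1, 0, 0]]              # two binary ground-truth masks
assign, cost = match(preds, targets)
```

In practice the optimal assignment is computed efficiently (e.g. with `scipy.optimize.linear_sum_assignment`), and the full loss is then evaluated only on the matched pairs.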
- 7. A video instance segmentation system based on a trajectory-guided memory network, characterized by comprising the following modules: a video sequence input module for acquiring a video sequence to be segmented and transmitting the frame sequence to a segmentation model, wherein the segmentation model detects and segments the targets of interest in the video without preset instance segmentation labels; a model training module for constructing a mixed training dataset containing pseudo video clips and real video clips and training the segmentation model on it, wherein the segmentation model comprises a feature extraction module, a trajectory-appearance joint modeling module, a temporal-aware memory updating and management module, and a prediction head, and the training process optimizes the model parameters by minimizing a joint objective function containing a classification loss, a mask prediction loss, and a feature similarity loss; and a video instance segmentation module for operating in an online fashion at inference time: it performs instance detection and segmentation on the first frame of the video sequence and initializes a memory pool; from the second frame onward, it feeds the current frame features into the trained segmentation model, propagates the trajectory tokens of the previous frame to the current frame through the trajectory-appearance joint modeling module to guide feature aggregation, retrieves spatio-temporal prior information from the memory pool through the temporal-aware memory updating and management module to assist the instance segmentation and identity association of the current frame, and outputs the segmentation masks and track IDs of all instances over the whole video sequence.
Description
Video instance segmentation method and system based on track-guided memory network

Technical Field

The invention relates to the technical field of computer vision and video analysis, in particular to a method and a system for video instance segmentation in complex scenes through trajectory modeling and memory management.

Background

Video Instance Segmentation (VIS) is a fundamental and central task in computer vision and video analysis, whose goal is to simultaneously achieve pixel-level detection, segmentation, and cross-frame identity association for the instances of interest in a video sequence. Unlike still-image segmentation, VIS must maintain the identity consistency of instances under complex spatio-temporal dynamics (temporal consistency), which requires algorithms not only to segment finely within a single frame, but also to cope effectively with motion blur, deformation, occlusion, target re-identification, and similar problems in video; the task has broad application prospects in video understanding, autonomous driving, video editing, augmented reality, and related fields. In video post-processing and visual content generation, video instance segmentation can precisely delineate the boundaries of dynamic foreground objects by producing temporally consistent pixel-level instance masks, enabling fine foreground-background separation and high-quality video effect synthesis.
For example, in streaming-media interactive applications, pixel-level real-time segmentation and trajectory tracking of foreground characters enables content-aware rendering such as occlusion-aware bullet comments (danmaku), in which the text layer is automatically rendered behind the foreground instance layer, enhancing the interactive experience without interfering with the visual information of the subject. In visual effects production, VIS can support fine-grained operations on individual instances in a video, such as video inpainting, object removal, or virtual environment compositing, significantly raising the automation level and quality of video content production. In recent years, with the development of deep learning, Transformer-based online video instance segmentation methods have advanced significantly. Among them, query-propagation-based methods attempt to maintain temporal consistency by enhancing query features. For example, Lee S et al. propose the CAVIS method (Lee, S., Seo, J., Han, K., Choi, M., Im, S. CAVIS: Context-Aware Video Instance Segmentation. arXiv:2407.03010, 2025), which generates perceptual representations by incorporating object boundary context. However, these techniques rely primarily on the spatial appearance features of instances for cross-frame matching. In complex scenes with fast motion, deformation, or mutual occlusion of targets, identity consistency is difficult to maintain from appearance features alone. The prior art ignores the key role of temporal motion trajectories (trajectory/motion patterns) in distinguishing similar targets, and the lack of effective modeling of motion state information leads to insufficient robustness of identity association in complex scenes.
To address the long-range temporal dependency problem in long video sequences, memory-network-based methods have been introduced to store historical information. For example, Heo M et al. propose the GenVIS method (Heo, M., Hwang, S., Hyun, J., Kim, H., Oh, S.W., Lee, J.Y., Kim, S.J. A Generalized Framework for Video Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 14623-14632), which uses a memory pool to store historical instance features and enhance the temporal representation of queries. Kim H et al. propose the VISAGE method (Kim, H., Kang, J., Heo, M., Hwang, S., Oh, S.W., Kim, S.J. VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement. In Computer Vision – ECCV 2024, Springer Nature Switzerland, Cham, 2025: 93-109), which stores instance-level appearance features and object embeddings in two separate memory pools. Although these memory-network-based approaches improve long-term association to some extent, their memory updating and management mechanisms are inefficient: they typically employ a simple storage strategy and give the same weight to all historical frames when reading the memory. This not only easily introduces background noise or invalid features caused by occlusion, reducing the discriminability of the features, but also lacks temporal awareness of the historical information, so the contribution of the historical information cannot be adaptively adjusted according