CN-122023797-A - Time sequence consistency modeling method oriented to dynamic change scene

CN122023797ACN 122023797 ACN122023797 ACN 122023797ACN-122023797-A

Abstract

The invention discloses a time sequence consistency modeling method for a dynamic change scene, and belongs to the technical field of computer vision and video processing. The method comprises the steps of dynamically capturing semantic similarity between a front frame and a rear frame through a learnable convolution kernel, constructing a feature propagation network based on depth feature flow, carrying out nonlinear mapping and self-adaptive weighting on semantic features of a front frame and current frame features, calculating local semantic similarity through corresponding shallow guiding features of the front frame and the rear frame by utilizing the deep semantic features of the front frame and the rear frame, generating a learnable space self-adaptive interpolation weight, carrying out structural alignment and fusion on adjacent frame features, sequentially carrying out target transformation, target detection and target matching, and calculating cross-over ratio IoU between target frames under a common view angle, so that matching deviation caused by view angle difference can be remarkably reduced, and target association is more regular and geometrically consistent. The invention solves the problems of video semantic prediction flickering, artifacts, texture slippage and the like caused by severe camera movement, non-rigid deformation of objects and frequent shielding.

Inventors

ZHAO WENBO
LIU JINLONG
SONG QINGLIANG
Mayla Umair

Assignees

哈尔滨工业大学

Dates

Publication Date: 20260512
Application Date: 20260123

Claims (10)

1. The time sequence consistency modeling method facing to the dynamic change scene is characterized by comprising the following steps of: Firstly, dynamically capturing semantic similarity between a front frame and a rear frame through a learnable convolution kernel, constructing a feature propagation network based on a depth feature stream, carrying out nonlinear mapping and self-adaptive weighting on semantic features of a front frame and current frame features, effectively integrating semantic information between frames, and generating a current frame feature representation with more discriminant; Based on the captured semantic similarity between the front frame and the rear frame in the first step, calculating local semantic similarity through the corresponding shallow guiding features by utilizing the deep semantic features of the front frame and the rear frame, generating a learnable space self-adaptive interpolation weight, and carrying out structural alignment and fusion on the features of the adjacent frames; And thirdly, based on the structural alignment and fusion of the second step, sequentially performing target transformation, target detection and target matching, and calculating the intersection ratio IoU between target frames under a common view angle, so that the matching deviation caused by the view angle difference can be remarkably reduced, and the target association is more regularized and geometrically consistent.
2. The method according to claim 1, wherein the step one is specifically that a given continuous video frame sequence is assumed to be The backbone feedforward convolution network for extracting the semantic features of the image is as follows Through a convolution network Calculating to obtain semantic features of corresponding time frames , , For calculating the local similarity between adjacent frame and current frame, the shallow layer features are introduced As a semantic guide; Obtained through a shallow convolution network, the expression is as shown in formula (1): Wherein, the Representation of Feature map of temporal frames at shallow convolution layer, Representing a convolution function mapped to the boot space, And The convolution kernel weights and offsets are represented separately, Representing a convolution calculation.
3. The method of modeling timing consistency of claim 2, wherein for Arbitrary pixel position in temporal frame profile Specifying the adjacent area Calculating the semantic similarity of the front and rear frames and the guiding characteristics of the current frame in the region, wherein the front and rear frame similarity calculation expression is as shown in the formula (2): And obtaining a semantic similarity distribution result.
4. A time sequence consistency modeling method according to claim 3, wherein after obtaining the semantic similarity distribution result, a learnable convolution kernel is introduced The characteristics of the front frame and the rear frame are weighted and fused to generate an estimated current frame representation, and the convolution interpolation result of the front frame and the rear frame is as shown in the formula (3): Then The fusion characteristic of the time instants can be expressed as a convolution interpolation average of the preceding and succeeding time instants.
5. The method of modeling timing consistency according to claim 4, wherein the step two is specifically that a semantic segmentation backbone network is assumed to be The calculation complexity ratio of the non-key frame extraction method and the frame-by-frame feature extraction method based on feature fusion reasoning is as shown in formula (4): Wherein, the Representing the computational complexity of the corresponding module, A convolution interpolation module for implementing feature estimation of non-key frames is represented.
6. The method of modeling timing consistency of claim 5, wherein the module Representing a similarity calculation section for performing a similarity estimation calculation between a local area of a current frame and a corresponding position of a neighboring frame due to And All belong to a lightweight module, and the light weight module, Contains only few shallow structures, the computational complexity is far less than that of the backbone network Formula (4) can be simplified as formula (5): Assume that each Selecting one frame as key frame, and the overall speed-up ratio is 。
7. The modeling method of time sequence consistency according to claim 6, wherein the three-objective transformation is specifically that a fast homography estimation module is adopted for efficiently calculating homography transformation matrixes between any adjacent frames, and the module only carries out direct estimation on part of key frame pairs and approximates transformation relations of other frame pairs in an interpolation mode, so that the calculation complexity is remarkably reduced; Assume that the image coordinates of 2 adjacent frames are And Then satisfy Wherein, the method comprises the steps of, Representing that the two are homogeneous coordinates, and obtaining an actual coordinate by dividing by a last element for normalization; representing homography, which is a perspective transformation of two-dimensional plane points between two images Matrix, the change process is as formula (6): 。
8. The method of modeling timing consistency according to claim 7, wherein the step three target detection is specifically that all input frames generate detection frames and initial semantic ID embedded features of corresponding frames through a shared feature extraction network; Firstly, aligning view angles of target features in adjacent frames through a homography matrix output by a rapid homography estimation module, then introducing a plurality of semantic Slot slots based on a structure of Slot Attention, adaptively extracting ID information with consistent target view angles from the aligned features of a plurality of frames through an Attention mechanism, gradually optimizing semantic Slot representation in a plurality of iterative processes, and outputting updated ID features; Suppose an input video image sequence Let the sampling interval be Selecting key frame sequence to directly calculate corresponding homography matrix sequence Wherein, the method comprises the steps of, Representing the will be Conversion of frame image to the first Homography matrix of frame image; selecting 3 key time frames , And Wherein , , And Is a frame of samples.
9. The method for modeling the time sequence consistency according to claim 8, wherein the step three target matching is specifically implemented by adopting a homography matched filter module to realize target projection matching in a space layer among multiple frames, wherein the filter module uniformly projects detection frames of targets in front and rear frames to the same reference view angle plane through homography matrix calculation, calculates an intersection ratio IoU on the space to realize space-level association, and finally inputs all IoU and ID matching information to a matching algorithm to output a multi-semantic target identification result of each frame; in homography between two frames of images For constraint, features in the previous frame are explicitly projected to the view plane of the current frame to realize the alignment of the spatial consistency features, and the target features detected in the previous frame are set as Its projection characteristics in the current frame are expressed as: Wherein, the Representing a geometric resampling function based on homography transformation, and characterizing the original characteristics of the current frame Aligned cross-frame features For joint input, joint attention input is constructed By iterative mechanism of slot Attention Polymerization, setting initial slot characterization as Each iteration of the training process obtains a slot to input attention allocation weight according to the softmax-based attention mechanism as shown in equation (8): For any two frames And (3) with The module is characterized by the target ID of the current frame And groove position For input, calculate the attention profile as formula (9): Wherein, the Representing slave And The object ID features extracted in the time frame, The semantic slot is represented by a representation of the semantic slot, Representing the feature dimension.
10. A time sequence consistency modeling system facing a dynamic change scene, characterized in that the time sequence consistency modeling system uses the time sequence consistency modeling method facing the dynamic change scene according to any one of claims 1-9, and the time sequence consistency modeling system comprises: The convolution interpolation module based on semantic similarity dynamically captures semantic similarity between front and rear frames through a learnable convolution kernel, a feature propagation network based on depth feature flow is constructed, semantic features of a front frame and current frame features are subjected to nonlinear mapping and self-adaptive weighting, semantic information between frames is effectively integrated, and current frame feature representation with more discriminant is generated; The prediction alignment module based on global projection utilizes semantic similarity between the front frame and the rear frame captured by the convolution interpolation module based on semantic similarity, utilizes deep semantic features of the front frame and the rear frame, calculates local semantic similarity through corresponding shallow guide features of the front frame and the rear frame, generates a learnable space self-adaptive interpolation weight, and performs structural alignment and fusion on adjacent frame features; The prediction alignment module based on global projection sequentially performs target transformation, target detection and target matching, calculates the intersection ratio IoU between target frames under a common view angle, can remarkably reduce matching deviation caused by view angle difference, and enables target association to be more regular and consistent with geometry.

Description

Time sequence consistency modeling method oriented to dynamic change scene Technical Field The invention belongs to the technical field of computer vision and video processing, and particularly relates to a time sequence consistency modeling method for a dynamic change scene. Background Time sequence consistency modeling method and system for dynamic change scene. In the fields of computer vision, video processing, and three-dimensional reconstruction (e.g., dynamic NeRF/3D Gaussian Splatting), how to maintain the consistency of video sequences or rendering results in the time dimension (i.e., instant consistency) becomes a core challenge as application scenes expand from still photography to complex dynamic scenes. The closest prior art of the invention is a time sequence constraint method based on dense optical flow and a frame-by-frame transfer method based on a cyclic neural network. (1) The time sequence constraint method based on dense optical flow utilizes a traditional optical flow algorithm (Lucas-Kanade method) or a deep learning optical flow network (FlowNet, RAFT) to calculate pixel motion offset between adjacent frames and transfer semantic information of historical frames through image warping. The optical flow method is based on the "constant brightness" and "no occlusion" assumptions. In dynamic scenes, when object overlapping or de-occlusion occurs, the optical flow cannot accurately capture the newly appearing pixel sources, resulting in the texture of the previous frame being erroneously "stretched" or "pasted" to the current frame, creating blurred smearing and ghosting phenomena. Dense optical flow calculation is a task which consumes extremely resources, and is difficult to meet scenes with high real-time requirements such as low-altitude aerial photography. (2) Frame-by-frame delivery methods based on recurrent neural networks deliver features in time steps through hidden states in an attempt to let the model "remember" the previous semantic information. In an aerial scene, the camera view angle is frequently changed. Because such models lack explicit geometric alignment mechanisms, the projection positions of the historical features in the current frame can be severely shifted, so that the semantic boundaries generated by the models are dithered and flickering. As sequences grow, RNNs tend to have gradient vanishing or information forgetting, failing to maintain long-term semantic consistency. Existing processing schemes are generally divided into two types (1) a frame-by-frame processing method, namely, processing each frame of video as an independent image (such as style migration, super resolution, denoising and the like) (2) a time sequence constraint method, namely, estimating pixel motion of adjacent frames by using a traditional optical flow, and transmitting characteristic information through image warping. In the face of dynamically changing scenes (i.e., camera motion coupled with non-rigid motion of objects, illumination changes, objects occlusion from each other), the prior art suffers from significant drawbacks of (1) severe visual flicker and jitter. Since the frame-by-frame processing ignores the correlation between frames, the feature representation (e.g., color, texture details) of the same object at different times may randomly jump. In a dynamic scene, inconsistencies can be amplified, resulting in intense flickering visible to the naked eye, severely affecting video quality. (2) dynamic occlusion induced artifacts. The traditional optical flow method assumes that scene brightness is constant and no occlusion exists. In dynamic scenes, however, occlusion and de-occlusion frequently occur between objects. (3) non-rigid motion texture slippage. For walking pedestrians, flowing liquids and other non-rigid deformation objects, simple rigid transformation cannot accurately align the features. Resulting in the resultant texture not being able to cling to the object surface. Aiming at the core problems of insufficient continuous frame semantic redundancy mining, prediction dislocation caused by camera motion, high data labeling cost and the like in a low-altitude aerial photographing scene, the invention provides a space-time semantic segmentation scheme fused with time consistency constraint. Disclosure of Invention The invention provides a time sequence consistency modeling method for a dynamic change scene, which solves the problems of video semantic prediction flickering, artifacts, texture slippage and the like caused by severe camera movement, non-rigid deformation of an object and frequent shielding. The invention is realized by the following technical scheme: A time sequence consistency modeling method facing to a dynamic change scene comprises the following steps: Firstly, dynamically capturing semantic similarity between a front frame and a rear frame through a learnable convolution kernel, constructing a feature propagation network based on a depth feature stre