
CN-121981920-A - Video intelligent restoration method and system based on space-time context reasoning

CN 121981920 A

Abstract

The invention belongs to the technical field of digital media processing and discloses an intelligent video restoration method and system based on space-time context reasoning. To address the shortcomings of traditional methods in large-area or continuous damage scenes (insufficient restoration efficiency, dynamic inconsistency caused by single-frame restoration that ignores temporal continuity, and logical conflicts caused by missing semantics), the invention constructs a spatio-temporal multi-dimensional information fusion and reasoning framework and proposes a spatio-temporal context inference generative adversarial network (STCI-GAN). The method acquires reference frames through multi-scale time-window sampling, extracts and fuses spatial, temporal, and semantic features with a semantically enhanced encoder, generates an initial restored frame through semantically constrained hierarchical restoration, and finally optimizes the result through adversarial training against multi-dimensional discriminators, achieving spatial realism, temporal continuity, motion plausibility, and semantic compliance. The invention improves the restoration of large-area damage and continuous-frame occlusion, ensures the realistic smoothness and semantic consistency of dynamic content, and offers strong robustness and generality.
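The multi-scale time-window sampling described above collects reference frames at exponentially increasing temporal distances from the frame being restored. A minimal Python sketch of that index selection follows; the function name, the dyadic offsets, and the `scales` parameter are illustrative assumptions, since the patent does not fix concrete values.

```python
def sample_reference_frames(t, num_frames, scales=4):
    """Collect reference-frame indices at exponentially increasing
    temporal distances from the frame to be restored (index t).

    `scales` (number of dyadic offsets per direction) is an assumed
    parameter; the patent does not publish concrete values.
    """
    offsets = [2 ** k for k in range(scales)]  # 1, 2, 4, 8, ...
    candidates = [t - d for d in offsets] + [t + d for d in offsets]
    # Keep only indices inside the video, ordered nearest-first so
    # short-range motion context precedes long-range scene context.
    return sorted({i for i in candidates if 0 <= i < num_frames},
                  key=lambda i: abs(i - t))
```

For a frame near the start of the video, the backward offsets simply fall away, so the sampler degrades gracefully at sequence boundaries.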

Inventors

  • Chi Huanbin
  • Tian Xuefen
  • Chang Linfeng
  • Liu Hongbo

Assignees

  • 云南开放大学(云南国防工业职业技术学院)

Dates

Publication Date
2026-05-05
Application Date
2026-01-23

Claims (8)

  1. A video intelligent restoration method based on space-time context reasoning, characterized by comprising the following steps:
     (1) acquiring a video sequence to be restored and a corresponding damage-region mask;
     (2) collecting, with a multi-scale time-window sampler, a plurality of reference frames at exponentially increasing temporal distances from the frame to be restored, so as to form a reference-frame set;
     (3) performing spatio-temporal information encoding and fusion on the reference-frame set through a separated-attention encoding module to obtain a spatio-temporal context feature map, wherein the separated-attention encoding module comprises a spatial attention network and a temporal attention network running in parallel, which respectively extract intra-frame structure and texture features and cross-frame object motion trajectories and logical association features, followed by weighted fusion;
     (4) performing hierarchical restoration generation based on the spatio-temporal context feature map: ① extracting structural components from the spatio-temporal context feature map and generating a coarse structural outline and a coarse motion vector field with a structure-guidance network; ② feeding the coarse structural outline and the coarse motion vector field into a texture generator, which generates a high-definition restored frame in combination with the spatio-temporal context feature map;
     (5) performing motion fusion, namely gated convolution fusion of the coarse motion vector field produced by the structure-guidance network and the fine optical flow field predicted by an optical flow network, to obtain the final motion field, wherein the gated convolution fusion specifically comprises: a) concatenating the coarse motion vector field and the fine optical flow field along the channel dimension; b) generating a pixel-wise gating map through a lightweight convolutional network consisting, in order, of Conv 3×3, ReLU, Conv 3×3, and Sigmoid, so that the output gating values lie in the range [0, 1]; c) performing weighted fusion according to the gating map, i.e. preferring the coarse motion vector field in clearly structured regions, preferring the fine optical flow field in unoccluded, texture-rich regions, and balancing automatically in occluded or uncertain regions, the fusion formula being O = g ⊙ V + (1 − g) ⊙ F, where O is the final motion field, V is the coarse motion vector field, F is the fine optical flow field, g is the gating map, and ⊙ denotes element-wise multiplication; the optical flow network is RAFT, PWC-Net, or a comparable optical flow estimation network;
     (6) performing adversarial training and optimization of the generated restored frame with a multi-dimensional discriminator module comprising a spatial discriminator, a temporal discriminator, an optical flow discriminator, and a semantic discriminator, which evaluate quality with respect to single-frame realism, temporal consistency, motion regularity, and semantic logic respectively and feed back to optimize the generator parameters;
     (7) outputting the restored video frame sequence.
  2. The method of claim 1, wherein the reference-frame set obtained by multi-scale time-window sampling comprises short-range neighboring frames, occlusion-boundary frames, and long-range full-pose frames, and wherein sampling at exponentially increasing temporal distances captures short-term motion information and long-term scene information simultaneously.
  3. The method of claim 1, wherein the spatial attention network integrates a semantic segmentation sub-network for learning structure and texture features within a single frame, and wherein the temporal attention network integrates a semantic attribute tracking module for aligning the same object across frames and inferring its motion trajectories and dynamic priors.
  4. The method according to claim 1, wherein, in the hierarchical restoration generation step, the structural components are obtained by edge detection or low-frequency filtering of the spatio-temporal context feature map; the structure-guidance network is a lightweight network that outputs a coarse structural outline and a coarse motion vector field to ensure plausible skeleton, boundaries, and motion trends; and the texture generator is a high-capacity generative network that fills in high-definition texture and color detail under the guidance of the structural outline and the motion vector field.
  5. The method of claim 1, wherein the loss function of the multi-dimensional discriminator module comprises a spatial discriminator loss constraining the visual realism of a single frame, a temporal discriminator loss constraining the motion consistency of a short video segment, an optical flow discriminator loss constraining the optical flow field to conform to natural motion laws, and a semantic discriminator loss constraining the restored content to conform to object semantic rules and cross-frame attribute consistency.
  6. The method according to claim 1, wherein the method is applicable to restoring pedestrian-walking video occluded by a moving mosaic, restoring video with static text regions damaged by camera shake, and other scenarios involving large-area occlusion, continuous-frame damage, random noise removal, and semantic restoration of film and television content.
  7. A video intelligent restoration system based on space-time context reasoning, comprising: an input layer for receiving an original video sequence and a damage-region mask; a multi-scale time-window sampler for collecting multi-scale reference frames; a spatio-temporal information encoding and fusion module, comprising a spatial attention network, a temporal attention network, and a context-feature fusion unit, for generating a spatio-temporal context feature map; a hierarchical restoration generation module, comprising a structure-guidance network and a texture generator, for generating a high-definition restored frame; a motion fusion sub-module for gated convolution fusion of the coarse motion vector field produced by the structure-guidance network and the fine optical flow field predicted by the optical flow network, to obtain the final motion field, wherein the fusion uses a lightweight convolutional network Conv 3×3 → ReLU → Conv 3×3 → Sigmoid to generate a pixel-wise gating map and fuses according to the formula O = g ⊙ V + (1 − g) ⊙ F; a multi-dimensional discriminator module, comprising a spatial discriminator, a temporal discriminator, an optical flow discriminator, and a semantic discriminator, for multi-dimensional quality assessment and adversarial training of the restored frame; and an output layer for outputting the restored video frame sequence.
  8. The system of claim 7, wherein the optical flow network is RAFT, PWC-Net, or an equivalent optical flow estimation network.
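The gated fusion in claims 1 and 7 blends the coarse motion vector field V and the fine optical flow field F through a pixel-wise gate g via O = g ⊙ V + (1 − g) ⊙ F. The NumPy sketch below shows only this fusion step; the gate is passed in as an array rather than produced by the Conv 3×3 → ReLU → Conv 3×3 → Sigmoid network of claim 1, and the function name and shapes are illustrative assumptions.

```python
import numpy as np

def fuse_motion_fields(V, F, g):
    """Gated fusion of motion fields: O = g * V + (1 - g) * F.

    V : coarse motion vector field, shape (H, W, 2)
    F : fine optical flow field,    shape (H, W, 2)
    g : pixel-wise gate in [0, 1],  shape (H, W, 1); in the patent it
        comes from a lightweight Conv3x3 -> ReLU -> Conv3x3 -> Sigmoid
        network, here it is supplied directly (assumption).
    """
    g = np.clip(g, 0.0, 1.0)       # enforce the [0, 1] gate range
    return g * V + (1.0 - g) * F   # element-wise weighted blend
```

A gate near 1 selects the coarse, structure-guided motion (clearly structured regions); a gate near 0 selects the fine optical flow (unoccluded, texture-rich regions); intermediate values blend the two, matching step (5)c of claim 1.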

Description

Video intelligent restoration method and system based on space-time context reasoning

Technical Field

The invention belongs to the technical field of digital media processing, and in particular relates to a video intelligent restoration method and system based on space-time context reasoning.

Background

The core goal of video restoration technology is to recover and reconstruct damaged, missing, or occluded video content while satisfying the requirements of visual realism and temporal continuity. Current mainstream video restoration techniques fall into two categories. The first is traditional restoration, based mainly on optical flow or patch matching; its core mechanism is to search the adjacent frames of the frame to be restored for similar pixel blocks and use them to fill the damaged region. This approach has clear limitations: when the video contains severe motion, abrupt illumination changes, or large-area damage, the matching precision of similar pixel blocks drops sharply, easily producing blurred, distorted results with ghosting artifacts; and for occlusions at the same position across many consecutive frames (such as station logos and watermarks), effective restoration is usually impossible because adjacent frames cannot supply valid reference information. The second is single-frame restoration based on deep learning, which treats video frames as independent images and applies deep learning models from the image inpainting field (such as U-Net and GANs).
The core defect of this approach is that it ignores the temporal continuity inherent in video: a single-frame restoration result may be visually plausible in isolation, yet flickering, jitter, and content inconsistency easily appear during sequential playback (for example, a restored waveform mutating unnaturally between adjacent frames), severely damaging the dynamic realism of the video. In addition, the prior art generally lacks semantic understanding. Traditional methods fill regions based only on the gray-level or texture similarity of pixel blocks, while single-frame deep learning methods focus on generating and optimizing local textures within a single frame; neither establishes a semantic model of the video content. They can identify neither the inherent attributes of an object (such as its category, shape, and functional characteristics) nor cross-frame semantic consistency (such as the temporal continuity of an object's attributes, state, and logical relations), so restoration is limited to logic-free pixel-level filling. Concretely, traditional methods cannot distinguish an object's key semantic parts from background regions and are prone to part mismatch (such as filling tree texture into a human arm region); single-frame methods are prone to semantic paradoxes (such as generating an object with an incomplete structure); and both may produce abrupt changes in the cross-frame attributes of the same object (such as clothing texture changing from stripes to solid color). In professional scenarios such as film restoration and surveillance video optimization, these problems can damage the authenticity of the content or invalidate evidence, failing to meet professional restoration requirements.
In summary, the prior art either depends excessively on local spatio-temporal information and therefore struggles with large-area or continuous damage scenes, or ignores temporal correlation and thus lacks dynamic continuity, or lacks semantic understanding and thus produces logically conflicting restored content. It does not achieve high-level semantic understanding and logical reasoning about video content, and it cannot meet the demand for high-quality video restoration in complex scenes.

Disclosure of the Invention

The invention aims to overcome the defects of existing video restoration technology by providing a video intelligent restoration method and system based on space-time context reasoning. It addresses the restoration bottleneck of traditional methods in large-area or continuous damage scenes, the dynamic inconsistency of single-frame restoration methods, and the logical conflicts in restored content caused by missing semantic understanding, achieving efficient, high-quality, and semantically compliant restoration across a variety of video damage scenarios. The invention focuses on deep-learning-based video data restoration, in particular on the intelligent reconstruction of damaged or missing frame regions using spatio-temporal context information, semantic understanding mechanisms, and generative adversarial networks, and can be applied to multiple video restoration scenarios such as video watermark removal, sta
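The four adversarial losses named in claim 5 (spatial, temporal, optical flow, semantic) are fed back to the generator as a combined objective. A minimal sketch of that combination follows; the function name and the weighting coefficients are illustrative assumptions, as the patent does not publish concrete loss weights.

```python
def generator_loss(l_spatial, l_temporal, l_flow, l_semantic,
                   w=(1.0, 1.0, 0.5, 0.5)):
    """Weighted sum of the four discriminator loss terms from claim 5.

    The weights `w` are assumed for illustration only; in practice
    they would be tuned per dataset and damage scenario.
    """
    terms = (l_spatial, l_temporal, l_flow, l_semantic)
    return sum(wi * li for wi, li in zip(w, terms))
```

Each term would itself be an adversarial loss computed by the corresponding discriminator (single-frame realism, short-clip consistency, motion-field plausibility, and cross-frame semantic compliance), so lowering the sum pushes the generator toward all four quality criteria at once.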