CN-122023560-A - Video coloring model construction method, video coloring method, device and medium
Abstract
The invention discloses a video coloring model construction method, a video coloring method, a device and a medium. The construction method comprises: sampling key frames from a grayscale video sequence; extracting a scene description text and generating a color reference image; constructing a basic video generation model and introducing into it a main branch with frozen parameters and a control branch with trainable parameters; constructing a composite visual sequence and setting a corresponding temporal mask sequence; extracting spatio-temporal control features in the control branch and injecting them into the main branch by residual superposition; training the control branch with an alignment loss function, extracting deep target semantic features of the color reference image, and mapping the intermediate denoising features of the main branch into the semantic space; setting a loss function, jointly constructing a total training objective and minimizing it, and iteratively optimizing the specified parameters to obtain the video coloring model, which then performs the video coloring operation. The invention achieves effective decoupling of single-frame content color editing and long-sequence temporal color propagation, and improves the color fidelity and dynamic temporal consistency of the generated video.
Inventors
- Bao Bingkun
- You Sisi
- Niu Chaochao
Assignees
- Nanjing University of Posts and Telecommunications (南京邮电大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260409
Claims (10)
- 1. A method for constructing a video coloring model, comprising: S1, acquiring a grayscale video sequence to be processed and sampling a key frame; S2, extracting a scene description text of the key frame, and performing semantic-aware coloring based on the key frame and the scene description text to generate a high-fidelity color reference image; S3, constructing a basic video generation model based on a diffusion transformer, and introducing into it a main branch with frozen parameters and a control branch with trainable parameters; S4, splicing the color reference image and the grayscale video sequence along the temporal dimension to construct a composite visual sequence, wherein the color reference image serves as the reference frame and the temporal positions corresponding to the grayscale video sequence serve as the target generation frames, correspondingly setting a temporal mask sequence that distinguishes the reference frame from the target generation frames, inputting the composite visual sequence, the temporal mask sequence and the scene description text into the control branch to extract spatio-temporal control features, and injecting the spatio-temporal control features into the main branch by residual superposition, so as to guide the main branch to perform temporal color propagation while retaining its original motion dynamics; S5, training the control branch, introducing during training an alignment loss function based on teacher-student feature distillation, extracting the deep target semantic features of the color reference image with a frozen teacher network, and simultaneously mapping the intermediate features of the main branch denoising process to the corresponding semantic space with a trainable feature projector to obtain the generated features; S6, setting a flow matching loss function, constructing a total training objective in combination with the alignment loss function, and iteratively optimizing the parameters of the control branch and the feature projector by minimizing the total training objective, to finally obtain the trained video coloring model.
- 2. The method for constructing a video coloring model according to claim 1, wherein step S4 specifically comprises: S41, splicing the color reference image and the grayscale video sequence along the time dimension to generate a composite visual sequence with a unified format, the splicing formula being $V=\mathrm{Concat}_t(I_{\mathrm{ref}},X_{\mathrm{gray}})$, wherein $V$ is the composite visual sequence, $I_{\mathrm{ref}}$ is the color reference image, $X_{\mathrm{gray}}$ is the grayscale video sequence, and $\mathrm{Concat}_t$ denotes the stitching operation along the time dimension; S42, constructing a binary mask sequence strictly aligned with the composite visual sequence in the time dimension as the temporal mask sequence, the construction formula being $M=\mathrm{Concat}_t(\mathbf{0}_{\mathrm{ref}},\mathbf{1}_{\mathrm{gray}})$, wherein $M$ is the temporal mask sequence, $\mathbf{0}_{\mathrm{ref}}$ is an all-zero mask matching the dimensions of the color reference image and assigned to it, indicating an inactive frame that provides a constant color and style constraint, and $\mathbf{1}_{\mathrm{gray}}$ is an all-one mask matching the dimensions of the grayscale video sequence and assigned to it, indicating the target active frames that require color generation; S43, encoding the composite visual sequence $V$ and the temporal mask sequence $M$ into the latent space, splicing them along the channel dimension, and combining them with the scene description text to construct a unified video condition unit that is input to the control branch, the formula being $z_c=\big(\mathrm{Concat}_c(\mathcal{E}(V),\mathcal{E}(M)),\,c_{\mathrm{text}}\big)$, wherein $z_c$ is the unified video condition unit input to the control branch, $\mathcal{E}$ denotes the encoding to the latent space, and $c_{\mathrm{text}}$ is the text prompt condition corresponding to the scene description text; S44, processing the control latent variables input to the control branch to extract spatio-temporal control features, adjusting the feature weights with a zero convolution layer connected to the output of each control-branch network layer, the initial weights of the zero convolution layer being set to zero, and adding the spatio-temporal control features adjusted by the zero convolution layer to the intermediate hidden states of the corresponding levels of the main branch to realize residual injection of the features, the feature residual injection formula being $h'_{l+1}=F_{l+1}\big(h_l+\mathcal{Z}(c_l)\big)$, wherein $h'_{l+1}$ is the next-level feature of the main branch after residual guidance, $h_l$ and $c_l$ are the intermediate hidden states of the $l$-th network layer of the main branch and the control branch respectively, $\mathcal{Z}$ denotes the zero convolution layer, and $F_{l+1}$ denotes the next-level network module of the main branch (a code sketch of this construction and injection is given after the claims).
- 3. The method for constructing a video coloring model according to claim 2, wherein step S5 specifically includes: S51, adopting a pre-trained self-supervised visual model as the frozen teacher network, feature-encoding the color reference image, and extracting highly discriminative high-level semantic features as the deep target semantic features; S52, performing the video denoising diffusion operation, and acquiring the intermediate features output by the main branch at a preset shallow network level during the video denoising diffusion process; S53, performing dimension conversion and spatial mapping on the intermediate features in the noisy latent space using a feature projector formed by lightweight neural network layers, bridging them to the clean target semantic space to obtain the generated features; S54, computing the mean of the negative cosine similarity between each feature patch of the generated features and the corresponding patch of the deep target semantic features, and taking this mean as the alignment loss function (a sketch of this loss is given after the claims).
- 4. The method for constructing a video coloring model according to claim 3, wherein step S6 specifically includes: S61, based on a rectified flow method, predicting, with the video generation model, the velocity field corresponding to the noisy latent variable at a specified time step, given the Gaussian noise and the input condition sequence; S62, computing the mean square error between the predicted velocity field and the true target velocity field as the flow matching loss function; S63, constructing the total training objective by combining the flow matching loss function and the alignment loss function, and applying a minimization constraint to iteratively train the parameters of the control branch and the feature projector in the video coloring model.
- 5. The method for constructing a video coloring model according to claim 4, wherein the total training objective is computed as $\mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{FM}}+\lambda\,\mathcal{L}_{\mathrm{align}}$, wherein $\mathcal{L}_{\mathrm{total}}$ is the total training objective, $\mathcal{L}_{\mathrm{FM}}$ is the flow matching loss function, $\mathcal{L}_{\mathrm{align}}$ is the alignment loss function, and $\lambda$ is the balancing hyperparameter controlling the strength of the deep semantic guidance (a sketch of this objective is given after the claims).
- 6. The method for constructing a video coloring model according to claim 1, further comprising establishing a text-guided video coloring evaluation prompt set, specifically: decomposing the video coloring evaluation into two dimensions, spatial color fidelity and temporal dynamic consistency; subdividing spatial color fidelity into preset instance color control and global scene atmosphere rendering, and subdividing temporal dynamic consistency into static occlusion recovery and dynamic object color tracking; and constructing, based on these dimensions, a test prompt set containing different text descriptions and motion scenes, the test prompt set being used for staged verification during model training and, after training is completed, for comprehensive evaluation of target color propagation accuracy and long-sequence color drift suppression capability.
- 7. A video coloring method, applied to a video coloring model constructed by the method of any one of claims 1 to 6, comprising: receiving a target grayscale video sequence to be processed and a target prompt word describing the colors of the target scene; sampling a representative key frame from the target grayscale video sequence, inputting the key frame and the target prompt word into an image generation tool or coloring tool, and generating a high-fidelity color reference image with the designated color style as the global style anchor; splicing the color reference image and the target grayscale video sequence along the temporal dimension to construct a composite visual sequence, and generating a temporal mask sequence strictly aligned with it, wherein an all-zero mask is assigned to the color reference image and an all-one mask is assigned to the target grayscale video sequence; and inputting the composite visual sequence, the temporal mask sequence and the target prompt word into the video coloring model, using the spatio-temporal control features extracted by the control branch to perform residual guidance on the frozen main branch, and finally generating a target color video that accurately reproduces the reference colors in the spatial dimension and maintains strict temporal consistency in the time dimension (a sketch of this inference flow is given after the claims).
- 8. The video coloring method according to claim 7, wherein, in the process of using the spatio-temporal control features extracted by the control branch to perform residual guidance on the frozen main branch in the video coloring model, a feature residual injection process is further provided, the feature residual injection formula being $h'_{l+1}=F_{l+1}\big(h_l+\mathcal{Z}(c_l)\big)$, wherein $h_l$ and $c_l$ are the intermediate hidden states of the $l$-th network layer of the main branch and the control branch respectively, $\mathcal{Z}$ denotes the zero convolution layer for adjusting the feature injection weights, and $F_{l+1}$ denotes the next-level network module of the main branch.
- 9. A computer readable storage medium, characterized in that a computer program is stored thereon which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 8.
- 10. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 8.
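For reference, a minimal PyTorch sketch of the composite-sequence and mask construction and the zero-convolution residual injection described in claims 2 and 8. The tensor shapes, the helper names (`build_condition_unit`, `ZeroProj`, `inject_residual`) and the use of a zero-initialised linear projection on transformer tokens in place of an actual convolution are illustrative assumptions, not the patent's concrete implementation.

```python
import torch
import torch.nn as nn

def build_condition_unit(ref_img, gray_video, encode, text_cond):
    """Build the unified video condition unit of claim 2 (S41-S43).
    ref_img:    (B, C, H, W)    color reference image I_ref
    gray_video: (B, T, C, H, W) grayscale target frames X_gray
    encode:     assumed latent encoder E(.) for pixel sequences
    """
    # S41: V = Concat_t(I_ref, X_gray) -- reference frame first, targets after it
    v = torch.cat([ref_img.unsqueeze(1), gray_video], dim=1)
    # S42: M = Concat_t(0_ref, 1_gray) -- 0 marks the inactive reference frame,
    # 1 marks the target frames that require color generation
    mask = torch.cat([torch.zeros_like(v[:, :1, :1]),
                      torch.ones_like(v[:, 1:, :1])], dim=1)
    # S43: encode both to the latent space, concatenate along the channel
    # dimension, and pair the result with the text prompt condition c_text
    z_c = torch.cat([encode(v), encode(mask.expand_as(v))], dim=2)
    return z_c, text_cond

class ZeroProj(nn.Module):
    """Zero-initialised projection playing the role of the patent's zero
    convolution layer: the control signal starts as an exact no-op."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x):
        return self.proj(x)

def inject_residual(next_main_block, h_main, c_ctrl, zero_proj):
    """Claim 2 (S44) / claim 8: h'_{l+1} = F_{l+1}(h_l + Z(c_l)).
    h_main, c_ctrl: (B, N, dim) hidden states of the l-th main/control layers."""
    return next_main_block(h_main + zero_proj(c_ctrl))
```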
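The alignment loss of claim 3 can be sketched in the same spirit: a frozen self-supervised encoder embeds the color reference image (the claim only says "pre-trained self-supervision visual model"; DINO-style features are one plausible choice), a lightweight trainable projector maps shallow denoising features into that space, and the loss is the mean negative cosine similarity between corresponding feature patches. All module sizes below are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureProjector(nn.Module):
    """Lightweight MLP bridging noisy-latent intermediate features to the
    clean target semantic space (claim 3, S53)."""
    def __init__(self, in_dim, out_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, x):   # x: (B, N, in_dim) tokens from a shallow main-branch layer
        return self.net(x)

def alignment_loss(generated_tokens, teacher_tokens):
    """Claim 3, S54: mean negative cosine similarity between each generated
    feature patch and the corresponding teacher feature patch.
    generated_tokens, teacher_tokens: (B, N, D)."""
    g = F.normalize(generated_tokens, dim=-1)
    t = F.normalize(teacher_tokens, dim=-1)
    return -(g * t).sum(dim=-1).mean()

# Typical wiring (teacher frozen, projector trainable):
# with torch.no_grad():
#     teacher_tokens = teacher(ref_img)        # S51: deep target semantic features
# gen_tokens = projector(mid_features)         # S52-S53: projected denoising features
# l_align = alignment_loss(gen_tokens, teacher_tokens)
```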
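Claims 4 and 5 combine a rectified-flow velocity-matching loss with the alignment loss. A minimal sketch follows; the linear interpolation schedule and the example λ value are assumptions, since the claims only specify an MSE between predicted and true velocity fields and the weighted sum L_total = L_FM + λ·L_align.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, z0, noise, t, condition):
    """Claim 4 (S61-S62): rectified-flow velocity regression.
    z0: clean video latent; noise: Gaussian noise of the same shape;
    t:  (B,) time steps in [0, 1]; condition: unified video condition unit."""
    t_ = t.view(-1, *([1] * (z0.dim() - 1)))   # broadcast t over latent dims
    z_t = (1.0 - t_) * noise + t_ * z0          # point on the straight path noise -> z0
    v_target = z0 - noise                       # true (constant) velocity along that path
    v_pred = model(z_t, t, condition)           # velocity predicted by the dual-branch model
    return F.mse_loss(v_pred, v_target)

def total_loss(l_fm, l_align, lam=0.5):
    """Claim 5: L_total = L_FM + lambda * L_align; lambda balances the strength of
    deep semantic guidance (0.5 is a placeholder, not a value given in the patent)."""
    return l_fm + lam * l_align
```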
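Finally, the inference flow of claim 7, sketched with hypothetical helpers (`colorize_image`, `encode`, `decode`, `model.sample`) standing in for the image coloring tool, the latent codec, and the trained dual-branch sampler; selecting the first frame as the representative key frame is likewise an assumption.

```python
import torch

def colorize_video(gray_video, target_prompt, colorize_image, model, encode, decode):
    """gray_video: (1, T, C, H, W) grayscale frames; target_prompt: text describing
    the desired scene colors. Returns the colorized target frames."""
    key_frame = gray_video[:, 0]                              # representative key frame (assumed: first frame)
    ref_img = colorize_image(key_frame, target_prompt)        # high-fidelity color reference / global style anchor
    v = torch.cat([ref_img.unsqueeze(1), gray_video], dim=1)  # composite visual sequence
    mask = torch.cat([torch.zeros_like(v[:, :1, :1]),         # all-zero mask for the reference frame
                      torch.ones_like(v[:, 1:, :1])], dim=1)  # all-one mask for the target frames
    latents = model.sample(encode(v), encode(mask.expand_as(v)), target_prompt)  # control branch guides frozen main branch
    color_video = decode(latents)[:, 1:]                      # drop the reference slot, keep the colorized targets
    return color_video
```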
Description
Video coloring model construction method, video coloring method, device and medium

Technical Field

The invention relates to the technical field of computer vision data processing, and in particular to a video coloring model construction method, a video coloring method, a device and a storage medium that adopt diffusion-model-based video processing and generation techniques for text-semantic guidance and spatio-temporal target decoupling.

Background

In the field of multimedia processing, video coloring is a key technology that plays a crucial role in aesthetic enhancement, film and television production, cultural heritage restoration and the like. The main aim of the task is to generate reasonable and vivid color information for a grayscale video sequence. With the rapid development of deep learning, text-guided single-frame image coloring has made significant progress and can provide fine-grained control, yet directly extending these capabilities to video remains a great challenge: high-quality video coloring not only requires high-fidelity, semantically accurate color generation in the spatial dimension, but also strict continuity and consistency in the temporal dimension. Several bottlenecks therefore remain in current video coloring techniques.

First, because of the complexity of video temporal dynamics, a naive frame-by-frame coloring strategy, or the direct application of an existing advanced image coloring model (such as diffusion-model-based image coloring), often causes severe inter-frame inconsistency and produces visually disturbing flicker artifacts in semantic regions such as sky and clothing. Second, the post-processing paradigm of "coloring before smoothing" relies heavily on temporal smoothing algorithms to forcibly align the independently colored frames; in practice this generally reduces color saturation markedly, loses high-frequency chromaticity detail, and easily introduces new structural artifacts in complex scenes. Third, the joint cross-frame modeling paradigm attempts to complete coloring and cross-frame feature interaction simultaneously in a single generative model; this not only incurs extremely high computational cost but also, because of weak object-level spatio-temporal correspondence and insufficient decoupling, easily causes content feature leakage and color drift in long video sequences when facing dynamic videos with fast motion, occlusion or scene switching.

Therefore, the present application provides a video coloring model construction method, a video coloring device and a medium that effectively decouple single-frame content color editing from long-sequence temporal propagation and, by integrating advanced image color priors with the dynamic smoothing characteristics of a video generation model, ensure accurate temporal propagation of high-fidelity colors while eliminating long-term color drift and combination-weight interference, so as to solve the above technical problems.
Disclosure of Invention

The invention mainly aims to provide a video coloring model construction method and a video coloring method to solve the technical problems raised in the background art, namely severe inter-frame flicker, long-sequence color drift and poor generation flexibility in video coloring tasks, so as to achieve effective decoupling of single-frame content color editing from long-sequence temporal color propagation and to improve the color fidelity and dynamic temporal consistency of the generated video.

The invention adopts the following technical scheme to solve these problems: a video coloring model construction method and a video coloring method, executed by computer equipment, the construction method comprising the following steps:

S1, acquiring a grayscale video sequence to be processed, and sampling a key frame from it to provide a spatial-structure foundation;

S2, extracting a scene description text of the key frame using an existing visual language model, inputting the key frame and the scene description text into an existing image coloring model for semantic-aware coloring, and generating a high-fidelity color reference image, which serves as a global style anchor;

S3, constructing a basic video generation model based on a diffusion transformer and introducing a dual-branch architecture into it, the dual-branch architecture comprising a main branch with frozen parameters and a control branch with trainable parameters;

S4, splicing the color reference image and the grayscale video sequence along the temporal dimension to construct a composite visual sequence, wherein the color reference image is used as a
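A minimal sketch of the preparatory steps S1-S3 above, assuming a generic vision-language captioner and image colorization model as the "existing" components the text refers to, and assuming the middle frame as the sampled key frame; none of these choices are fixed by the disclosure.

```python
import torch

def prepare_reference(gray_video, caption_model, image_colorizer):
    """S1-S2: sample a key frame, caption it, and produce the color reference image.
    gray_video: (B, T, C, H, W) grayscale frames."""
    key_frame = gray_video[:, gray_video.shape[1] // 2]   # S1: key frame (middle frame assumed)
    scene_text = caption_model(key_frame)                  # S2: scene description text from a VLM
    ref_img = image_colorizer(key_frame, scene_text)       # S2: semantic-aware coloring -> global style anchor
    return ref_img, scene_text

def setup_dual_branch(base_model, control_branch):
    """S3: freeze the main branch of the diffusion-transformer video model and
    keep only the control branch trainable."""
    for p in base_model.parameters():
        p.requires_grad_(False)
    for p in control_branch.parameters():
        p.requires_grad_(True)
    return base_model, control_branch
```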