CN-122027808-A - Video encoding and decoding method, video processing system and electronic equipment
Abstract
The application discloses a video encoding and decoding method, a video processing system, and an electronic device. The method comprises: obtaining a target video to be transmitted and dividing it into a plurality of video segments; determining the first frame and the last frame of each video segment as key frames and performing compression coding on the key frames to obtain key frame data units; performing feature extraction on the intermediate frames located between the first frame and the last frame of each video segment to obtain latent space data units and contour data units; and encapsulating the key frame data units, the latent space data units, and the contour data units to obtain a mixed code stream corresponding to the target video. The application addresses the technical problem that generative video compression in the related art lacks a standardized, structured code stream encapsulation protocol for cooperatively transmitting latent space features, structure priors, and key frame data.
Inventors
- LI XUELONG
- ZHANG CHI
- WANG WENYI
Assignees
- 中国电信股份有限公司 (China Telecom Corporation Limited)
Dates
- Publication Date
- 20260512
- Application Date
- 20260414
Claims (13)
- 1. A video encoding method, comprising: acquiring a target video to be transmitted, and dividing the target video into a plurality of video segments; determining a first frame and a last frame of each video segment as key frames, and performing compression coding on the key frames to obtain key frame data units, wherein the key frame data units comprise key frame compressed data processed by a compression algorithm; performing feature extraction on an intermediate frame located between the first frame and the last frame in the video segment to obtain a latent space data unit and a contour data unit, wherein the latent space data unit comprises a low-dimensional latent space feature tensor of the intermediate frame after feature extraction, the low-dimensional latent space feature tensor is used for representing motion features and detail features of objects in the intermediate frame, and the contour data unit is used for representing geometric structural features of the objects in the intermediate frame; and encapsulating the key frame data unit, the latent space data unit, and the contour data unit to obtain a mixed code stream corresponding to the target video.
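As a non-normative illustration of the claimed encoding flow, the sketch below splits one segment into key frame units (first and last frame) and per-intermediate-frame latent and contour units. The compression and feature-extraction functions are passed in as stand-ins, since the claim does not fix any particular algorithms; all names are hypothetical:

```python
def encode_segment(frames, compress, extract_latent, extract_contour):
    """Encode one video segment per claim 1: the first and last frames
    become key frame units, and every intermediate frame yields one
    latent space unit and one contour unit."""
    key_units = [compress(frames[0]), compress(frames[-1])]
    mids = frames[1:-1]  # intermediate frames between first and last
    latent_units = [extract_latent(f) for f in mids]
    contour_units = [extract_contour(f) for f in mids]
    return key_units, latent_units, contour_units
```

In a real encoder the three callables would be the compression algorithm of claim 3 and the feature extractors of claims 4 and 5; here they can be any placeholder.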
- 2. The video encoding method of claim 1, wherein dividing the target video into a plurality of video segments comprises: determining a target division mode, wherein the target division mode comprises at least one of a fixed-step division mode, a division mode based on shot-change detection, a division mode driven by motion complexity, and a division mode based on semantic integrity; determining the division boundary of each video segment in the target video according to the target division mode, wherein the last frame of the kth video segment is the first frame of the (k+1)th video segment, and k is a positive integer; and dividing the target video according to the division boundaries to obtain the plurality of video segments.
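Among the division modes listed, the fixed-step mode is the simplest to make concrete. The toy sketch below produces segment boundaries over frame indices such that the last frame of segment k is also the first frame of segment k+1, as the claim requires; the function name and signature are illustrative:

```python
def split_fixed_step(num_frames, step):
    """Divide frame indices [0, num_frames) into fixed-step segments.
    Each (start, end) pair is inclusive, and the end of segment k
    equals the start of segment k+1 (shared boundary key frame)."""
    segments = []
    start = 0
    while start < num_frames - 1:
        end = min(start + step, num_frames - 1)
        segments.append((start, end))
        start = end  # last frame of this segment opens the next one
    return segments
```

For a 10-frame clip with step 4 this yields (0,4), (4,8), (8,9): frames 4 and 8 serve as both a last frame and a first frame, so each boundary key frame is coded once but anchors two segments.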
- 3. The video encoding method of claim 1, wherein performing compression coding on the key frames to obtain key frame data units comprises: determining a target compression algorithm, wherein the target compression algorithm comprises at least one of a standardized video coding compression algorithm, a neural network image compression algorithm, and a downsampling/super-resolution reconstruction compression algorithm; compressing the key frame with the target compression algorithm to obtain key frame compressed data; and constructing a key frame data unit structure, determining the information contained in the header of the key frame data unit structure, and filling the body of the key frame data unit structure with the key frame compressed data to obtain the final key frame data unit, wherein the information contained in the header comprises the frame index of the key frame on the global time axis, the number of key frames contained in the key frame data unit, and the byte length of the body.
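The header fields named in claim 3 (global frame index, key frame count, body byte length) suggest a simple binary layout. The sketch below assumes illustrative field widths (4-byte index, 2-byte count, 4-byte length, big-endian); the patent does not specify them:

```python
import struct

def pack_keyframe_unit(frame_index, keyframe_payloads):
    """Pack a key frame data unit: a header carrying the frame index on
    the global time axis, the number of key frames, and the body byte
    length, followed by the concatenated compressed key frame data.
    Field widths here are assumptions, not taken from the patent."""
    body = b"".join(keyframe_payloads)
    header = struct.pack(">IHI", frame_index, len(keyframe_payloads), len(body))
    return header + body
```

A decoder can read the 10-byte header first and then consume exactly `body length` bytes, which is what lets these units be delimited inside a larger stream.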
- 4. The video encoding method of claim 1, wherein performing feature extraction on an intermediate frame located between the first frame and the last frame of the video segment to obtain a latent space data unit comprises: performing a nonlinear mapping from pixel space to latent space on the intermediate frame sequence to obtain initial latent space features, and downsampling the initial latent space features to obtain compact latent space features, wherein the downsampling is used for reducing the spatial resolution of the initial latent space features; determining a zero offset and a scale factor, and quantizing the compact latent space features according to the zero offset and the scale factor, wherein the quantization is used for mapping floating-point latent space features to an integer representation; entropy-coding the quantized compact latent space features to obtain a binary stream of latent space feature values; and constructing a latent space data unit structure, determining the information contained in the header of the latent space data unit structure, and filling the body of the latent space data unit structure with the binary stream of latent space feature values to obtain the final latent space data unit, wherein the information contained in the header comprises an index identifier of the video segment to which the intermediate frame sequence belongs, the frame length of the latent space feature sequence, the width and height of the downsampled feature map, the number of channels of the latent space feature tensor, the zero offset, and the scale factor.
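The zero-offset/scale-factor quantization of claim 4 is a standard affine mapping from floating-point latents to integers. A minimal sketch, assuming 8-bit integer codes (the bit width is not fixed by the claim); the inverse is what a decoder would apply after entropy decoding, since the header transmits both parameters:

```python
def quantize(latents, zero_offset, scale):
    """Map floating-point latent values to 8-bit integers:
    q = round(x / scale) + zero_offset, clipped to [0, 255]."""
    return [max(0, min(255, round(x / scale) + zero_offset)) for x in latents]

def dequantize(codes, zero_offset, scale):
    """Inverse mapping used at the decoding end: x ≈ (q - zero_offset) * scale."""
    return [(q - zero_offset) * scale for q in codes]
```

The reconstruction error is bounded by half the scale factor (ignoring clipping), which is why the scale factor must travel in the latent space data unit header.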
- 5. The video encoding method of claim 1, wherein performing feature extraction on an intermediate frame of the video segment between the first frame and the last frame to obtain a contour data unit comprises: performing edge detection on the intermediate frame to obtain an edge point set, and performing parametric curve fitting on the edge point set to obtain a parametric curve control point set, wherein the parametric curve fitting is used for converting the dense edge point set into sparse control point parameters so as to reduce the transmission code rate of the contour information; performing spatiotemporal sparsification on the parametric curve control point set to obtain a sparse control point set, and entropy-coding the sparse control point set to obtain contour edge feature data, wherein the spatiotemporal sparsification is used for downsampling the control points in the temporal and spatial dimensions so as to further compress the contour data; and constructing a contour data unit structure, determining the information contained in the header of the contour data unit structure, and filling the body of the contour data unit structure with the contour edge feature data to obtain the final contour data unit, wherein the information contained in the header comprises an index identifier of the video segment to which the intermediate frame belongs, the total number of edge points contained in the contour edge feature data, and a sampling step length, and the sampling step length is used for indicating the sparseness of the edge description.
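The spatiotemporal sparsification of claim 5 downsamples control points in both the time and space dimensions. A minimal sketch, where `time_step` keeps every n-th frame's control points and `space_step` keeps every m-th point within a kept frame (both parameter names are illustrative; the claim's "sampling step length" would be carried in the contour data unit header):

```python
def sparsify_control_points(points_per_frame, time_step, space_step):
    """Spatiotemporal sparsification: keep every `time_step`-th frame
    and every `space_step`-th control point within each kept frame,
    reducing the contour code rate at the cost of edge fidelity."""
    sparse = []
    for t, pts in enumerate(points_per_frame):
        if t % time_step == 0:        # temporal downsampling
            sparse.append(pts[::space_step])  # spatial downsampling
    return sparse
```

The decoder re-densifies the curve by evaluating the fitted parametric curve between surviving control points, so larger steps trade contour accuracy for bit rate.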
- 6. The video encoding method of claim 1, wherein encapsulating the key frame data unit, the latent space data unit, and the contour data unit to obtain the mixed code stream corresponding to the target video comprises: determining a sequence parameter set, wherein the sequence parameter set is used for representing global decoding parameters of the target video; determining a target encapsulation mode, wherein the target encapsulation mode comprises at least one of a NAL unit encapsulation mode and a TLV encapsulation mode; and constructing an encapsulation header according to the target encapsulation mode, and encapsulating the sequence parameter set, the key frame data unit, the latent space data unit, and the contour data unit to obtain the mixed code stream, wherein the encapsulation header is used for representing the type identifier and boundary information of each data unit.
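Of the two encapsulation modes named in claim 6, TLV is straightforward to sketch: each data unit is prefixed with a type identifier and a byte length, which together give the decoder exactly the type identification and boundary information the claim describes. The type codes and field widths below are invented for illustration:

```python
import struct

# Assumed unit-type identifiers (not specified by the patent).
UNIT_TYPES = {"SPS": 0x01, "KEY": 0x02, "LAT": 0x03, "CNT": 0x04}

def tlv_pack(unit_type, payload):
    """TLV encapsulation: Type (1 byte) + Length (4 bytes, big-endian) + Value."""
    return struct.pack(">BI", UNIT_TYPES[unit_type], len(payload)) + payload

def tlv_parse(stream):
    """Walk a TLV stream and return (type, payload) pairs; the length
    field supplies the boundary of every unit."""
    units, i = [], 0
    while i < len(stream):
        t, length = struct.unpack_from(">BI", stream, i)
        i += 5
        units.append((t, stream[i:i + length]))
        i += length
    return units
```

A mixed code stream would then be the SPS unit followed by interleaved key frame, latent space, and contour units, each self-delimited by its length field.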
- 7. The video encoding method of claim 6, wherein determining a sequence parameter set comprises: constructing a sequence parameter set structure and determining the information contained in the sequence parameter set structure, wherein the information comprises a unique identifier of the sequence parameter set, original resolution information of the target video, the number of video segments into which the target video is divided, an algorithm identifier representing the target compression algorithm adopted when compressing the key frames, version information of the condition generation model corresponding to the mixed code stream, and a mode identifier representing the target division mode adopted when dividing the target video into the plurality of video segments, wherein the condition generation model is the model used by the decoding end for reconstructing the target video from the mixed code stream.
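The sequence parameter set of claim 7 can be modeled as a small fixed structure carrying the listed fields. The field widths and ordering below are illustrative assumptions, not taken from the patent:

```python
import struct
from dataclasses import dataclass

@dataclass
class SequenceParameterSet:
    """Global decoding parameters per claim 7; widths in to_bytes()
    are assumed for illustration only."""
    sps_id: int            # unique identifier of this SPS
    width: int             # original video resolution
    height: int
    num_segments: int      # number of video segments
    algo_id: int           # key frame compression algorithm identifier
    model_version: int     # condition generation model version
    division_mode_id: int  # segmentation mode identifier

    def to_bytes(self):
        return struct.pack(">BHHHBBB", self.sps_id, self.width, self.height,
                           self.num_segments, self.algo_id,
                           self.model_version, self.division_mode_id)
```

Because the SPS names the compression algorithm, division mode, and generation model version, the decoder can select matching decoders and the correct model weights before parsing any data unit.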
- 8. A video decoding method, comprising: acquiring a mixed code stream output by an encoding end, and extracting the key frame data unit, the latent space data unit, and the contour data unit encapsulated in the mixed code stream, wherein the key frame data unit comprises key frame compressed data processed by a compression algorithm, the key frames are the first and last frames of each video segment obtained by dividing a target video, the latent space data unit comprises a low-dimensional latent space feature tensor obtained by performing feature extraction on an intermediate frame, the low-dimensional latent space feature tensor is used for representing motion features and detail features of objects in the intermediate frame, the intermediate frame is a video frame located between the first frame and the last frame of the video segment, and the contour data unit is used for representing geometric structural features of the objects in the intermediate frame; decoding the key frame data unit to reconstruct the key frames of each video segment, and mapping the reconstructed key frames to a latent space to obtain key frame feature anchors, wherein the key frame feature anchors are used for representing the color, texture, and object layout references of the video segment; decoding the latent space data unit and restoring the latent space features therein to a resolution matched with the condition generation model to obtain latent space restored features, and decoding the contour data unit to restore initial contour information and mapping it to the latent space to obtain structure-driven prior information, wherein the structure-driven prior information is used for representing the topological structure and boundary evolution information of objects in the intermediate frame; and inputting the key frame feature anchors, the latent space restored features, and the structure-driven prior information into the condition generation model as constraint conditions, obtaining the latent space representation of the target video restored by the condition generation model under the constraint conditions, and mapping the latent space representation to pixel space to obtain the target video.
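As a toy illustration of the anchor-constrained reconstruction in claim 8 (and not the actual condition generation model, which the patent leaves unspecified), the sketch below interpolates between the two key frame anchors in latent space and adds the transmitted per-frame latent features as corrections; all names and the linear form are assumptions:

```python
def reconstruct_latents(anchor_start, anchor_end, frame_latents):
    """Toy stand-in for anchor-constrained generation: linearly
    interpolate between the key frame anchors of a segment and add
    each intermediate frame's transmitted latent features on top.
    Vectors are plain lists of floats for illustration."""
    n = len(frame_latents)
    out = []
    for t, lat in enumerate(frame_latents, start=1):
        alpha = t / (n + 1)  # position of frame t between the anchors
        base = [(1 - alpha) * a + alpha * b
                for a, b in zip(anchor_start, anchor_end)]
        out.append([bv + lv for bv, lv in zip(base, lat)])
    return out
```

The anchors pin both ends of the segment to decoded key frames, which is the mechanism the patent credits with suppressing temporal flicker and drift in purely generative reconstruction.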
- 9. The video decoding method of claim 8, wherein extracting the key frame data unit, the latent space data unit, and the contour data unit encapsulated in the mixed code stream comprises: reading a sequence parameter set in the mixed code stream, wherein the sequence parameter set comprises a unique identifier of the sequence parameter set, original resolution information of the target video, the number of video segments into which the target video is divided, an algorithm identifier representing the target compression algorithm adopted when compressing the key frames, version information of the condition generation model corresponding to the mixed code stream, and a mode identifier representing the target division mode adopted when dividing the target video into the plurality of video segments, wherein the condition generation model is the model used by the decoding end for reconstructing the target video from the mixed code stream; and parsing the mixed code stream according to the sequence parameter set to obtain the key frame data unit, the latent space data unit, and the contour data unit.
- 10. A video processing system, comprising an encoding end and a decoding end, wherein the encoding end is used for: acquiring a target video to be transmitted and dividing the target video into a plurality of video segments; determining the first frame and the last frame of each video segment as key frames and performing compression coding on the key frames to obtain key frame data units, wherein the key frame data units comprise key frame compressed data processed by a compression algorithm; performing feature extraction on the intermediate frames located between the first frame and the last frame of the video segment to obtain a latent space data unit and a contour data unit, wherein the latent space data unit comprises a low-dimensional latent space feature tensor of the intermediate frames after feature extraction and is used for representing motion features and detail features of objects in the intermediate frames, and the contour data unit is used for representing geometric structural features of the objects in the intermediate frames; and the decoding end is used for: obtaining the mixed code stream output by the encoding end and extracting the key frame data unit, the latent space data unit, and the contour data unit encapsulated in the mixed code stream; decoding the key frame data unit to reconstruct the key frames of each video segment and mapping the reconstructed key frames to a latent space to obtain key frame feature anchors, wherein the key frame feature anchors are used for representing the color, texture, and object layout references of the video segments; decoding the latent space data unit and restoring the latent space features therein to a resolution matched with the condition generation model to obtain latent space restored features; decoding the contour data unit to restore initial contour information and mapping it to the latent space to obtain structure-driven prior information, wherein the structure-driven prior information is used for representing the topological structure and boundary evolution information of objects in the intermediate frames; and inputting the key frame feature anchors, the latent space restored features, and the structure-driven prior information into the condition generation model as constraint conditions, obtaining the latent space representation of the target video restored by the condition generation model under the constraint conditions, and mapping the latent space representation to pixel space to obtain the target video.
- 11. An electronic device comprising a memory and a processor for executing a program stored in the memory, wherein the program is executed to perform the video encoding method of any one of claims 1 to 7 or the video decoding method of any one of claims 8 to 9.
- 12. A non-volatile storage medium, characterized in that the non-volatile storage medium comprises a stored computer program, wherein the device in which the non-volatile storage medium is located performs the video encoding method of any one of claims 1 to 7 or the video decoding method of any one of claims 8 to 9 by running the computer program.
- 13. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the video encoding method of any one of claims 1 to 7 or the video decoding method of any one of claims 8 to 9.
Description
Video encoding and decoding method, video processing system and electronic equipment
Technical Field
The present application relates to the field of video coding technologies, and in particular, to a video encoding and decoding method, a video processing system, and an electronic device.
Background
With the popularity of high-definition/ultra-high-definition video applications, the volume of video data grows exponentially, creating a significant challenge for storage and bandwidth. Conventional mainstream video coding standards (e.g., H.265/HEVC, H.266/VVC) employ a block-based hybrid coding framework to eliminate spatial and temporal redundancy in video data through intra/inter prediction, transform, quantization, and entropy coding. These standards achieve good rate-distortion performance at medium and high code rates and are widely applied in scenarios such as broadcast television, streaming media, and high-definition storage. However, in very-low-code-rate communication scenarios (such as satellite communication, ocean monitoring, emergency rescue, and the low-power Internet of Things), conventional encoders face serious performance bottlenecks. In the related art, emerging generative video compression techniques attempt to break through this bottleneck via the paradigm of "trading computing power for bandwidth": the core idea is to transmit only very-low-bit-rate latent space features and structure guidance information, and to use a deep generative model to reconstruct high-quality video at the decoding end, which in experiments has shown code rate savings far exceeding the conventional standards.
However, the generative schemes in the related art are mostly closed prototype systems: they lack a standardized, structured, and extensible code stream encapsulation protocol; there is no unified transmission specification covering latent space features, geometric structure priors (such as contours and skeletons), and key frame data; they cannot be made compatible with existing video transmission systems (such as NAL units); they transmit structure priors directly as high-resolution bitmaps or dense feature maps, without semantic parameterization or sparse expression, incurring unacceptable bandwidth overhead; and they lack an anchor constraint mechanism, so temporal flicker, object deformation, and hallucinated distortion readily occur in the generated result, making the spatiotemporal continuity and semantic fidelity of the video difficult to guarantee. In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiments of the application provide a video encoding and decoding method, a video processing system, and an electronic device, which at least solve the technical problem that generative video compression in the related art lacks a standardized, structured code stream encapsulation protocol for cooperatively transmitting latent space features, structure priors, and key frame data.
According to one aspect of the embodiments of the application, a video encoding method is provided, comprising: obtaining a target video to be transmitted and dividing the target video into a plurality of video segments; determining the first frame and the last frame of each video segment as key frames, and performing compression coding on the key frames to obtain key frame data units, wherein the key frame data units comprise key frame compressed data processed by a compression algorithm; performing feature extraction on an intermediate frame located between the first frame and the last frame of the video segment to obtain a latent space data unit and a contour data unit, wherein the latent space data unit comprises a low-dimensional latent space feature tensor of the intermediate frame after feature extraction and is used for representing motion features and detail features of objects in the intermediate frame, and the contour data unit is used for representing geometric structural features of the objects in the intermediate frame; and encapsulating the key frame data unit, the latent space data unit, and the contour data unit to obtain a mixed code stream corresponding to the target video.
Optionally, dividing the target video into a plurality of video segments comprises: determining a target division mode, wherein the target division mode comprises at least one of a fixed-step division mode, a division mode based on shot-change detection, a division mode driven by motion complexity, and a division mode based on semantic integrity; determining the division boundary of each video segment in the target video according to the target division mode, wherein the last frame of the kth video segment is the first frame of the (k+1)th video segment, and k is a positive integer; and dividing the target video according to the division boundaries to obtain the plurality of video segments. Op