CN-122027798-A - Video encoding method and device, storage medium and electronic equipment
Abstract
The application discloses a video encoding method and apparatus, a storage medium, and an electronic device. The method comprises: obtaining a routing data packet sent by an encoding end, wherein the routing data packet comprises at least a target resolution of a first video and a corresponding first compressed feature representation; upsampling the first compressed feature representation according to the target resolution to obtain a first restored feature representation; and analyzing the first restored feature representation with a hybrid model to obtain a plurality of resolution videos corresponding to the first video, wherein the input of the first video generation model in the hybrid model is the first restored feature representation, and the input of each video generation model after the first is the resolution video output by the previous video generation model together with the first restored feature representation. The method addresses the technical problems in the related art that a single independent code stream cannot be encoded to obtain videos at multiple resolutions, resulting in high video transmission resource consumption and poor encoding flexibility.
Inventors
- Li Xuelong
- Yi Fangqiu
- Zhang Chi
Assignees
- China Telecom Corporation Limited (中国电信股份有限公司)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-14
Claims (10)
- 1. A video encoding method, comprising: obtaining a routing data packet sent by an encoding end, wherein the routing data packet comprises at least a target resolution of a first video and a first compressed feature representation corresponding to the first video; upsampling the first compressed feature representation according to the target resolution to obtain a first restored feature representation; and analyzing the first restored feature representation by using a pre-trained hybrid model to obtain a plurality of resolution videos corresponding to the first video, wherein the hybrid model comprises a plurality of cascaded video generation models, the input of the first video generation model is the first restored feature representation, and the input of each video generation model after the first video generation model is the resolution video output by the previous video generation model together with the first restored feature representation.
- 2. The method of claim 1, wherein the training process of the hybrid model comprises: constructing an initial model consisting of a plurality of cascaded video generation models; obtaining a plurality of groups of training sample data, wherein each group of training sample data comprises a second video and a second compressed feature representation corresponding to the second video; and performing iterative training on the initial model by using the plurality of groups of training sample data to obtain the hybrid model.
- 3. The method of claim 2, wherein obtaining a plurality of groups of training sample data comprises: acquiring a plurality of second videos, wherein the resolution of each second video is different; for each second video, encoding and compressing the second video to obtain a corresponding low-dimensional feature representation, performing channel expansion on the low-dimensional feature representation to obtain a corresponding enhanced feature representation, performing multi-level downsampling on the enhanced feature representation to obtain a corresponding compressed enhanced feature representation, and performing mask processing on the compressed enhanced feature representation to obtain a second compressed feature representation of the second video; and forming the plurality of groups of training sample data from the plurality of second videos and the second compressed feature representation of each second video.
- 4. The method of claim 3, wherein performing multi-level downsampling on the enhanced feature representation to obtain a corresponding compressed enhanced feature representation comprises: compressing the time dimension, the height dimension, and the width dimension of the enhanced feature representation by using a three-dimensional convolution layer with a stride of a first value to obtain a low-level compressed enhanced feature representation; performing feature fusion on a result obtained by downsampling the enhanced feature representation with a three-dimensional convolution layer with a stride of the first value and a result obtained by performing feature enhancement on the low-level compressed enhanced feature representation to obtain a first intermediate compressed enhanced feature representation, and compressing the time dimension, the height dimension, and the width dimension of the first intermediate compressed enhanced feature representation with a three-dimensional convolution layer with a stride of a second value to obtain an intermediate compressed enhanced feature representation; performing feature fusion on a result obtained by downsampling the first intermediate compressed enhanced feature representation with a three-dimensional convolution layer with a stride of the second value and a result obtained by performing feature enhancement on the intermediate compressed enhanced feature representation to obtain a second intermediate compressed enhanced feature representation, and compressing the height dimension and the width dimension of the second intermediate compressed enhanced feature representation with a three-dimensional convolution layer with a stride of a third value to obtain a highly compressed enhanced feature representation; and performing feature fusion on the result obtained by downsampling the enhanced feature representation with a three-dimensional convolution layer whose stride is the product of the first value, the second value, and the third value and the highly compressed enhanced feature representation to obtain the compressed enhanced feature representation.
- 5. The method of claim 3, wherein performing mask processing on the compressed enhanced feature representation to obtain a second compressed feature representation of the second video comprises: masking the compressed enhanced feature representation according to preset masking information to obtain a corresponding mask tensor, wherein the masking information comprises at least one of the number of mask blocks, the spatial size of each mask block, the time span, and the channel span; and multiplying the mask tensor with the compressed enhanced feature representation element by element to obtain the second compressed feature representation of the second video.
- 6. The method of claim 1, wherein analyzing the first restored feature representation by using a pre-trained hybrid model to obtain a plurality of resolution videos corresponding to the first video comprises: inputting the first restored feature representation into the first video generation model in the hybrid model to obtain a first resolution video output by the first video generation model; and for each video generation model after the first video generation model in the hybrid model, upsampling the second resolution video output by the previous video generation model of the current video generation model to obtain a third resolution video, wherein the resolution of the third resolution video is the same as that of the resolution video output by the current video generation model, and inputting the third resolution video and the first restored feature representation into the current video generation model to obtain the second resolution video output by the current video generation model, wherein the resolution of the second resolution video output by the current video generation model is higher than that of the second resolution video output by the previous video generation model.
- 7. The method of claim 1, wherein upsampling the first compressed feature representation according to the target resolution to obtain a first restored feature representation comprises: performing sampling interpolation on the first compressed feature representation in the height dimension and the width dimension to obtain a first restored feature representation whose spatial size matches the target resolution of the first video.
- 8. A video encoding apparatus, comprising: an acquisition module, configured to acquire a routing data packet sent by an encoding end, wherein the routing data packet comprises at least a target resolution of a first video and a first compressed feature representation corresponding to the first video; an upsampling module, configured to upsample the first compressed feature representation according to the target resolution to obtain a first restored feature representation; and an encoding module, configured to analyze the first restored feature representation by using a pre-trained hybrid model to obtain a plurality of resolution videos corresponding to the first video, wherein the hybrid model comprises a plurality of cascaded video generation models, the input of the first video generation model is the first restored feature representation, and the input of each video generation model after the first video generation model is the resolution video output by the previous video generation model together with the first restored feature representation.
- 9. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and a device in which the computer-readable storage medium is located performs the video encoding method according to any one of claims 1 to 7 by running the computer program.
- 10. An electronic device, comprising a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to perform the video encoding method according to any one of claims 1 to 7 by means of the computer program.
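Claim 7 specifies sampling interpolation in the height and width dimensions but leaves the interpolation scheme open. The sketch below illustrates the idea with nearest-neighbor interpolation on a `(C, T, H, W)` feature tensor in NumPy; the function name `upsample_hw` and the choice of nearest-neighbor are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def upsample_hw(feat: np.ndarray, target_h: int, target_w: int) -> np.ndarray:
    """Sampling interpolation of a (C, T, H, W) feature tensor in the
    height and width dimensions only, using nearest-neighbor indexing
    (an illustrative stand-in for the unspecified interpolation)."""
    c, t, h, w = feat.shape
    # Map each target pixel back to its nearest source pixel.
    rows = np.arange(target_h) * h // target_h
    cols = np.arange(target_w) * w // target_w
    return feat[:, :, rows[:, None], cols[None, :]]

# A 4x4 spatial grid restored to 8x8; channel and time dims are untouched.
feat = np.arange(2 * 3 * 4 * 4, dtype=np.float32).reshape(2, 3, 4, 4)
restored = upsample_hw(feat, 8, 8)
assert restored.shape == (2, 3, 8, 8)
```

In a real decoder a smoother kernel (bilinear or trilinear) would likely be used, but the shape contract — only H and W grow to match the target resolution — is the same.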
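The mask processing of claim 5 can be sketched as building a binary mask tensor from the masking information (number of blocks, spatial block size, time span, channel span) and multiplying it element by element with the features. The random placement of the mask blocks below is an assumption for illustration; the patent only fixes the mask parameters, not how block positions are chosen.

```python
import numpy as np

def mask_features(feat, num_blocks, block_hw, t_span, c_span, rng):
    """Zero out `num_blocks` space-time-channel blocks of a (C, T, H, W)
    compressed enhanced feature representation via a mask tensor, then
    return the element-by-element product of mask and features (claim 5).
    Block positions are drawn at random (illustrative assumption)."""
    c, t, h, w = feat.shape
    bh, bw = block_hw
    mask = np.ones_like(feat)
    for _ in range(num_blocks):
        c0 = rng.integers(0, max(c - c_span, 0) + 1)
        t0 = rng.integers(0, max(t - t_span, 0) + 1)
        y0 = rng.integers(0, max(h - bh, 0) + 1)
        x0 = rng.integers(0, max(w - bw, 0) + 1)
        mask[c0:c0 + c_span, t0:t0 + t_span, y0:y0 + bh, x0:x0 + bw] = 0.0
    return mask * feat  # element-by-element multiplication with the mask tensor

rng = np.random.default_rng(0)
feat = np.ones((4, 6, 8, 8), dtype=np.float32)
masked = mask_features(feat, num_blocks=3, block_hw=(2, 2), t_span=2, c_span=1, rng=rng)
assert masked.shape == feat.shape
```

Training on masked features in this way plausibly underlies the robustness-to-packet-loss argument in the description: the decoder learns to reconstruct video from feature representations with missing regions.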
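The net effect of the three-level downsampling in claim 4 on tensor shape can be made concrete with simple stride arithmetic: the first two stages compress time, height, and width by the first and second stride values, while the third stage compresses only height and width, so the skip branch (stride equal to the product of the three values) lands on a shape-compatible tensor for feature fusion. This bookkeeping, including the assumption that every dimension divides evenly, is an illustrative reading of the claim, not code from the patent.

```python
def downsampled_shape(t, h, w, s1, s2, s3):
    """Shape bookkeeping for the three-level downsampling of claim 4.
    Stages 1 and 2 compress time, height, and width by strides s1 and s2;
    stage 3 compresses only height and width by stride s3, giving overall
    factors of s1*s2 in time and s1*s2*s3 in space. Exact divisibility is
    assumed for simplicity."""
    t1, h1, w1 = t // s1, h // s1, w // s1      # low-level representation
    t2, h2, w2 = t1 // s2, h1 // s2, w1 // s2   # intermediate representation
    h3, w3 = h2 // s3, w2 // s3                 # highly compressed (H, W only)
    # The skip branch downsamples the input directly, so its output must
    # match the main path for the final feature fusion to be well defined.
    assert (t2, h3, w3) == (t // (s1 * s2), h // (s1 * s2 * s3), w // (s1 * s2 * s3))
    return t2, h3, w3

# 16 frames at 128x128 with strides 2, 2, 2 -> 4 frames of 16x16 features.
assert downsampled_shape(16, 128, 128, 2, 2, 2) == (4, 16, 16)
```

Keeping more temporal resolution than spatial resolution (time shrinks by s1*s2, space by s1*s2*s3) is consistent with the third stage compressing only the height and width dimensions.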
Description
Video encoding method and device, storage medium and electronic equipment

Technical Field

The present application relates to the field of video coding technologies, and in particular, to a video encoding method and apparatus, a storage medium, and an electronic device.

Background

With the popularization of video applications, user terminal devices (such as mobile phones, tablets, and smart televisions) and network environments (4G, 5G, Wi-Fi, low-bandwidth areas, etc.) exhibit a high degree of heterogeneity, which poses a serious challenge to the adaptability of video coding techniques. Although traditional scalable video coding (Scalable Video Coding, SVC) can achieve resolution adaptation through multi-layer code streams, it still relies on a hand-designed framework based on transforms (such as the discrete cosine transform) and hierarchical quantization. As a result, the encoding end must independently generate and transmit a separate code stream for each resolution level, so the bit-rate redundancy is high, the storage overhead increases substantially, the transmission bandwidth requirement multiplies, and the high-efficiency transmission requirements of mobile edge computing and low-bandwidth scenarios are difficult to meet. Meanwhile, in recently emerging generative video coding methods, for example, end-to-end coding frameworks based on generative adversarial networks or variational autoencoders, although the compression efficiency is clearly better than that of traditional scalable video coding, the generation process depends heavily on a preset fixed output resolution: the target size is locked during training and cannot be dynamically adjusted during inference, so the same code stream can only output video at a single definition and cannot adapt to the differing display capabilities of user terminal devices. More importantly, such methods lack robustness to transmission damage: once packet loss or errors occur in the code stream in the channel, the reconstructed video is highly prone to serious distortions such as blocking artifacts, texture blurring, or spatio-temporal inconsistency. In view of the above problems, no effective solution has been proposed at present.

Disclosure of Invention

The embodiments of the application provide a video encoding method and apparatus, a storage medium, and an electronic device, which at least solve the technical problems in the related art that a single independent code stream cannot be encoded to obtain videos at multiple resolutions, resulting in high video transmission resource consumption and poor encoding flexibility. According to one aspect of the embodiments of the application, a video encoding method is provided, comprising: obtaining a routing data packet sent by an encoding end, wherein the routing data packet comprises at least a target resolution of a first video and a first compressed feature representation corresponding to the first video; upsampling the first compressed feature representation according to the target resolution to obtain a first restored feature representation; and analyzing the first restored feature representation by using a pre-trained hybrid model to obtain a plurality of resolution videos corresponding to the first video, wherein the hybrid model comprises a plurality of cascaded video generation models, the input of the first video generation model is the first restored feature representation, and the input of each video generation model after the first video generation model is the resolution video output by the previous video generation model together with the first restored feature representation. According to another aspect of the embodiments of the application, a video encoding apparatus is provided, comprising an acquisition module, an upsampling module, and an encoding module, wherein the acquisition module is configured to acquire a routing data packet sent by an encoding end, the routing data packet comprising at least a target resolution of a first video and a first compressed feature representation corresponding to the first video; the upsampling module is configured to upsample the first compressed feature representation according to the target resolution to obtain a first restored feature representation; and the encoding module is configured to analyze the first restored feature representation by using a pre-trained hybrid model to obtain a plurality of resolution videos corresponding to the first video, wherein the hybrid model comprises a plurality of cascaded video generation models, the input of the first video generation model is the first restored feature representation, and the input of each video generation model after the first video generation model is the resolution video output by the previous video generation model together with the first restored feature representation.
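The cascaded decoding described above can be sketched as a simple control-flow skeleton: the first model consumes only the restored features, and every later model consumes the previous model's output upsampled to its own resolution ("third resolution video") together with the restored features. The toy stages below just resize and average their inputs; the real video generation models are learned networks, and all function names here are illustrative assumptions.

```python
import numpy as np

def upsample_to(video: np.ndarray, h: int, w: int) -> np.ndarray:
    """Nearest-neighbor spatial upsampling of a (T, H, W) video."""
    rows = np.arange(h) * video.shape[1] // h
    cols = np.arange(w) * video.shape[2] // w
    return video[:, rows[:, None], cols[None, :]]

def make_stage(out_h, out_w):
    """Stand-in for one pre-trained video generation model. A real stage is
    a learned network; this toy resizes the restored features to its output
    resolution and averages them with the upsampled previous video."""
    def stage(prev_video, restored_feat):
        feat = upsample_to(restored_feat, out_h, out_w)
        if prev_video is None:          # first model: restored features only
            return feat
        return 0.5 * (prev_video + feat)
    return stage

def cascade_decode(restored_feat, stages, resolutions):
    """Run the cascade: each stage after the first receives the previous
    stage's output upsampled to its own resolution, together with the
    first restored feature representation."""
    videos, prev = [], None
    for stage, (h, w) in zip(stages, resolutions):
        if prev is not None:
            prev = upsample_to(prev, h, w)  # the "third resolution video"
        prev = stage(prev, restored_feat)
        videos.append(prev)
    return videos

# Three cascaded stages producing progressively higher resolutions
# from one restored feature representation (single decoded stream).
resolutions = [(16, 16), (32, 32), (64, 64)]
stages = [make_stage(h, w) for h, w in resolutions]
restored = np.ones((4, 8, 8), dtype=np.float32)   # (T, H, W) restored features
videos = cascade_decode(restored, stages, resolutions)
assert [v.shape for v in videos] == [(4, 16, 16), (4, 32, 32), (4, 64, 64)]
```

The key property the skeleton demonstrates is the one the claims emphasize: all output resolutions are produced from a single compressed representation, rather than from one code stream per resolution layer as in SVC.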