CN-122027825-A - Video compression method and video decoding method
Abstract
The application discloses a video compression method and a video decoding method. The method comprises: obtaining a screen video frame sequence, wherein the screen video frame sequence reflects graphical user interface content generated by computer rendering; determining key frames from the screen video frame sequence and latent representations corresponding to the screen video frame sequence, wherein the key frames comprise video frames in the screen video frame sequence whose semantic change intensity meets a preset condition, and the latent representations characterize semantic structure information of the graphical user interface content; encoding the key frames and the latent representations to obtain a compressed bitstream; and transmitting the compressed bitstream to a decoding end, wherein the compressed bitstream is used for reconstructing target video frames corresponding to the screen video frame sequence. The application addresses the technical problem that video compression techniques in the related art cannot effectively extract structural features from screen content, resulting in low compression efficiency.
Inventors
- LI XUELONG
- ZHANG CHI
- CHEN XIANGYU
Assignees
- 中国电信股份有限公司 (China Telecom Corporation Limited)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-10
Claims (14)
- 1. A video compression method, comprising: acquiring a screen video frame sequence, wherein the screen video frame sequence reflects graphical user interface content generated by computer rendering; determining a key frame from the screen video frame sequence and determining a latent representation corresponding to the screen video frame sequence, wherein the key frame comprises a video frame in the screen video frame sequence whose semantic change intensity meets a preset condition, and the latent representation characterizes semantic structure information of the graphical user interface content; encoding the key frame and the latent representation to obtain a compressed bitstream; and transmitting the compressed bitstream to a decoding end, wherein the compressed bitstream is used for reconstructing a target video frame corresponding to the screen video frame sequence.
- 2. The method of claim 1, wherein determining the latent representation corresponding to the screen video frame sequence comprises: extracting a global semantic feature vector from the screen video frame sequence, wherein the global semantic feature vector is used at least to characterize content information of the key frame; extracting an edge structure feature vector from the screen video frame sequence, wherein the edge structure feature vector describes edge contours of the graphical user interface content; identifying a repeated region in the screen video frame sequence, wherein the repeated region characterizes a repeating structural feature in the graphical user interface content; and taking the global semantic feature vector, the edge structure feature vector, and the repeated region as the latent representation.
- 3. The method of claim 2, wherein the latent representation is encoded by: applying neural adaptive coding to the global semantic feature vector in the latent representation to obtain a first bitstream, wherein the first bitstream reflects semantic information of the graphical user interface content; applying residual coding to the edge structure feature vector in the latent representation to obtain a second bitstream, wherein the second bitstream reflects edge structure variation between consecutive video frames in the screen video frame sequence; and applying index coding to the repeated region in the latent representation to obtain a third bitstream, wherein the third bitstream reflects user interface controls that appear repeatedly in the screen video frame sequence.
- 4. The method according to claim 2, further comprising: dividing the spatial region corresponding to each video frame in the screen video frame sequence into a first priority region and a second priority region, wherein the first priority region comprises a structural boundary region carrying human-computer interaction semantics, and the second priority region comprises a background or static region; and determining the latent representation from the first priority region.
- 5. The method of claim 1, wherein determining key frames from the screen video frame sequence comprises: determining a structural integrity score corresponding to each candidate video frame in the screen video frame sequence, wherein the structural integrity score quantifies the degree of integrity of user interface control outlines in the graphical user interface content; determining a semantic change score between the candidate video frame and the preceding candidate video frame, wherein the semantic change score quantifies the change intensity of semantic content in text regions of the graphical user interface content; determining a composite score based on the structural integrity score and the semantic change score; and determining candidate video frames whose composite scores meet a preset threshold as the key frames.
- 6. The method of claim 5, further comprising: when none of a preset number of consecutive candidate video frames is determined to be a key frame, determining the candidate video frame with the highest composite score among the preset number of consecutive candidate video frames as a key frame.
- 7. The method of claim 1, wherein the target video frame is reconstructed by using the key frames in the compressed bitstream as spatial reference anchor points that provide real pixel constraints for reconstruction at preset timing points, and using the latent representations as constraints guiding the generated target video frame to match semantic features of the graphical user interface content.
- 8. A video decoding method, comprising: receiving a compressed bitstream corresponding to a screen video frame sequence, wherein the compressed bitstream comprises a key frame and a latent representation corresponding to the screen video frame sequence, the key frame comprises a video frame in the screen video frame sequence whose semantic change intensity meets a preset condition, and the latent representation characterizes semantic structure information of graphical user interface content in the screen video frame sequence; and reconstructing the compressed bitstream using a video generation model to obtain a target video frame.
- 9. The method of claim 8, wherein reconstructing the compressed bitstream using a video generation model to obtain a target video frame comprises: obtaining an initial reconstructed video frame by using the video generation model with the key frame in the compressed bitstream as a spatial reference anchor point, wherein the initial reconstructed video frame reflects the interface layout of the key frame and structural constraints on control positions; and, on the basis of the initial reconstructed video frame, reconstructing the target video frame by using the video generation model with the latent representation as a constraint condition, wherein the constraint condition reflects semantic constraints of the graphical user interface content.
- 10. A video compression apparatus, comprising: an acquisition module, configured to acquire a screen video frame sequence, wherein the screen video frame sequence reflects graphical user interface content generated by computer rendering; a determining module, configured to determine a key frame from the screen video frame sequence and determine a latent representation corresponding to the screen video frame sequence, wherein the key frame comprises a video frame in the screen video frame sequence whose semantic change intensity meets a preset condition, and the latent representation characterizes semantic structure information of the graphical user interface content; an encoding module, configured to encode the key frame and the latent representation to obtain a compressed bitstream; and a transmission module, configured to transmit the compressed bitstream to a decoding end, wherein the compressed bitstream is used for reconstructing a target video frame corresponding to the screen video frame sequence.
- 11. A video decoding apparatus, comprising: a receiving module, configured to receive a compressed bitstream corresponding to a screen video frame sequence, wherein the compressed bitstream comprises a key frame and a latent representation corresponding to the screen video frame sequence, the key frame comprises a video frame in the screen video frame sequence whose semantic change intensity meets a preset condition, and the latent representation characterizes semantic structure information of graphical user interface content in the screen video frame sequence; and a reconstruction module, configured to reconstruct the compressed bitstream using a video generation model to obtain a target video frame.
- 12. An electronic device, comprising a memory for storing program instructions and a processor coupled to the memory, the processor being configured to perform the video compression method of any one of claims 1 to 7 or the video decoding method of any one of claims 8 to 9.
- 13. A non-volatile storage medium, comprising a stored computer program, wherein a device in which the non-volatile storage medium is located performs the video compression method of any one of claims 1 to 7 or the video decoding method of any one of claims 8 to 9 by running the computer program.
- 14. A computer program product comprising computer instructions which, when executed by a processor, implement the video compression method of any one of claims 1 to 7 or the video decoding method of any one of claims 8 to 9.
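The key-frame selection rule of claims 5 and 6 can be sketched in code. The following Python sketch is purely illustrative and is not the patented implementation: the function names (`select_key_frames`, `structural_integrity`, `semantic_change`), the scoring formulas, the weights, the threshold, and the window length are all stand-in assumptions, since the claims do not specify them.

```python
def structural_integrity(frame):
    # Stand-in score: fraction of adjacent pixel pairs that differ,
    # a crude proxy for how much control-outline structure is present.
    edges = sum(1 for a, b in zip(frame, frame[1:]) if a != b)
    return edges / max(len(frame) - 1, 1)

def semantic_change(prev, frame):
    # Stand-in score: fraction of pixels that changed since the previous frame.
    if prev is None:
        return 1.0  # first frame counts as maximal change
    diff = sum(1 for a, b in zip(prev, frame) if a != b)
    return diff / max(len(frame), 1)

def select_key_frames(frames, threshold=0.6, max_gap=8,
                      w_struct=0.5, w_change=0.5):
    """Return indices of key frames (frames given as flat pixel lists).

    A frame becomes a key frame when its composite score (weighted sum of
    structural-integrity and semantic-change scores) meets the threshold;
    if `max_gap` consecutive frames all fall below it, the best-scoring
    frame in that window is promoted instead (the fallback of claim 6).
    """
    keys, window, prev = [], [], None
    for i, frame in enumerate(frames):
        composite = (w_struct * structural_integrity(frame)
                     + w_change * semantic_change(prev, frame))
        if composite >= threshold:
            keys.append(i)
            window = []
        else:
            window.append((i, composite))
            if len(window) >= max_gap:
                keys.append(max(window, key=lambda t: t[1])[0])
                window = []
        prev = frame
    return keys
```

For example, a flat frame followed by an identical frame scores low on both terms, while a frame with many edges and many changed pixels crosses the threshold and is selected.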
Description
Video compression method and video decoding method

Technical Field
The application relates to the field of video processing, and in particular to a video compression method and a video decoding method.

Background
In screen sharing applications such as remote desktop, cloud gaming, and video conferencing, the transmission quality and efficiency of screen content video directly affect the user experience. Because the statistical characteristics of screen content differ significantly from those of natural video, conventional video compression techniques (such as H.264/AVC, H.265/HEVC, and AV1) rely mainly on compressing redundant image information. Their design generally assumes that image content has continuous textures and smooth edges, which is contrary to the high-contrast edges, repeated character structures, and flat areas typical of screen content. As a result, conventional encoding methods cannot effectively extract structural features, and compression efficiency is low. No effective solution to this problem has been proposed so far.

Disclosure of Invention
The embodiments of the application provide a video compression method and a video decoding method, which at least solve the technical problem that video compression techniques in the related art cannot effectively extract structural features from screen content, resulting in low compression efficiency.
According to one aspect of the embodiments of the application, a video compression method is provided, comprising: obtaining a screen video frame sequence, wherein the screen video frame sequence reflects graphical user interface content generated by computer rendering; determining key frames from the screen video frame sequence and latent representations corresponding to the screen video frame sequence, wherein the key frames comprise video frames in the screen video frame sequence whose semantic change intensity meets a preset condition, and the latent representations characterize semantic structure information of the graphical user interface content; encoding the key frames and the latent representations to obtain a compressed bitstream; and transmitting the compressed bitstream to a decoding end, wherein the compressed bitstream is used for reconstructing target video frames corresponding to the screen video frame sequence.

In some embodiments of the application, determining the latent representation corresponding to a screen video frame sequence includes: extracting a global semantic feature vector from the sequence, wherein the global semantic feature vector is used at least to characterize content information of the key frames; extracting an edge structure feature vector from the sequence, wherein the edge structure feature vector describes edge contours of the graphical user interface content; identifying repeated regions in the sequence, wherein the repeated regions characterize repeating structural features in the graphical user interface content; and taking the global semantic feature vector, the edge structure feature vector, and the repeated regions as the latent representation.
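The three-part latent representation described above can be illustrated with a toy sketch. All three extractors below are simplified stand-ins (the document does not define them): `global_semantic_vector` uses a coarse intensity histogram in place of a learned semantic embedding, `edge_feature_vector` counts horizontal intensity jumps, and `repeated_regions` finds identical tiles. Frames are assumed to be 2-D lists of grayscale values in 0..255.

```python
def global_semantic_vector(frame, bins=4):
    # Coarse intensity histogram as a stand-in for a learned semantic embedding.
    flat = [p for row in frame for p in row]
    hist = [0] * bins
    for p in flat:
        hist[min(p * bins // 256, bins - 1)] += 1
    return [h / len(flat) for h in hist]

def edge_feature_vector(frame):
    # Per-row count of large horizontal intensity jumps: a crude edge-contour
    # descriptor of the kind the second stream would carry.
    return [sum(1 for a, b in zip(row, row[1:]) if abs(a - b) > 32)
            for row in frame]

def repeated_regions(frame, tile=2):
    # Map tile content -> positions; tiles occurring more than once stand in
    # for repeated structural features such as duplicated UI controls.
    seen = {}
    h, w = len(frame), len(frame[0])
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            key = tuple(tuple(frame[y + dy][x + dx] for dx in range(tile))
                        for dy in range(tile))
            seen.setdefault(key, []).append((y, x))
    return {k: v for k, v in seen.items() if len(v) > 1}

def latent_representation(frame):
    # Bundle the three components as in the second paragraph above.
    return {
        "semantic": global_semantic_vector(frame),
        "edges": edge_feature_vector(frame),
        "repeats": repeated_regions(frame),
    }
```

A frame that is half black and half white yields a histogram concentrated in the two extreme bins, one edge jump per row, and no repeated tiles; a fully flat frame yields repeated tiles at every position.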
In some embodiments of the application, the latent representation is encoded by: applying neural adaptive coding to the global semantic feature vector in the latent representation to obtain a first bitstream, wherein the first bitstream reflects semantic information of the graphical user interface content; applying residual coding to the edge structure feature vector in the latent representation to obtain a second bitstream, wherein the second bitstream reflects edge structure variation between consecutive video frames in the screen video frame sequence; and applying index coding to the repeated regions in the latent representation to obtain a third bitstream, wherein the third bitstream reflects user interface controls that appear repeatedly in the screen video frame sequence.

In some embodiments of the application, the method further comprises dividing the spatial region corresponding to each video frame in the screen video frame sequence into a first priority region and a second priority region, wherein the first priority region comprises a structural boundary region carrying human-computer interaction semantics and the second priority region comprises a background or static region, and determining the latent representation from the first priority region.

In some embodiments of the application, determining a key frame from a screen video frame sequence includes determining a structural integrity score corresponding to each candidate video frame in the sequence, wherein the structural integrity score quantifies the degree of integrity of user interface control outlines in the graphical user interface content
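The residual and index coding ideas for the second and third streams can be sketched minimally. This is an illustrative assumption, not the document's implementation: the neural adaptive coder for the first stream is not specified in the text and is omitted here, residual coding is shown as plain per-element differencing, and index coding as a dictionary of tiles plus (index, position) references.

```python
def residual_encode(prev_vec, cur_vec):
    # Second stream: store only the per-element differences between the edge
    # feature vectors of consecutive frames (mostly zeros for static UI,
    # which compresses well downstream).
    return [c - p for p, c in zip(prev_vec, cur_vec)]

def residual_decode(prev_vec, residual):
    # Exact inverse of residual_encode.
    return [p + r for p, r in zip(prev_vec, residual)]

def index_encode(regions):
    # Third stream: store each repeated tile's content once in a dictionary
    # and replace every occurrence with an (index, position) reference.
    dictionary, refs = [], []
    for tile, positions in regions.items():
        idx = len(dictionary)
        dictionary.append(tile)
        refs.extend((idx, pos) for pos in positions)
    return dictionary, refs
```

The round trip `residual_decode(prev, residual_encode(prev, cur)) == cur` holds by construction, which is the property a lossless second stream would need.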