EP-4742177-A1 - IMAGE PROCESSING METHOD, NON-TRANSITORY STORAGE MEDIUM, AND ELECTRONIC DEVICE

Abstract

The present disclosure provides an image processing method, a non-transitory storage medium, and an electronic device. The image processing method includes: obtaining a frame group to be processed in a target video; and processing the frame group to be processed by a saliency prediction model to obtain a detection result of the target video frame. The saliency prediction model includes a feature encoder and a feature decoder. The processing the frame group to be processed by a saliency prediction model includes: inputting the frame group to be processed into the feature encoder to perform feature extraction, and outputting multi-layer feature data corresponding to the frame group to be processed; and inputting the multi-layer feature data into the feature decoder to perform feature fusion in a temporal dimension and feature fusion in a spatial dimension, and outputting a detection result of the target video frame.
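
For orientation, the following is a minimal PyTorch sketch of the encoder-decoder flow the abstract describes: a frame group passes through stacked encoder blocks whose intermediate outputs form the multi-layer feature data, and a decoder fuses over time and upsamples in space to produce a saliency map. All class names, layer counts, and shapes here are illustrative assumptions, not the architecture disclosed in the filing.

    # Illustrative sketch only; layer counts, channels, and shapes are assumptions.
    import torch
    import torch.nn as nn

    class SaliencyPredictor(nn.Module):
        def __init__(self):
            super().__init__()
            # Feature encoder: stacked 3D-conv blocks; the outputs of several
            # "preset processing blocks" are kept as multi-layer feature data.
            self.blocks = nn.ModuleList([
                nn.Conv3d(3, 32, 3, stride=(1, 2, 2), padding=1),
                nn.Conv3d(32, 64, 3, stride=(1, 2, 2), padding=1),
                nn.Conv3d(64, 128, 3, stride=(1, 2, 2), padding=1),
            ])
            # Feature decoder head: upsample spatially and predict a
            # single-channel saliency map for the target frame.
            self.head = nn.Sequential(
                nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
                nn.Conv2d(128, 1, kernel_size=3, padding=1),
            )

        def forward(self, frame_group):             # (B, 3, T, H, W)
            feats = []
            x = frame_group
            for block in self.blocks:               # collect multi-layer feature data
                x = torch.relu(block(x))
                feats.append(x)
            fused = feats[-1].mean(dim=2)           # temporal fusion (here: mean over T)
            return torch.sigmoid(self.head(fused))  # (B, 1, H, W) saliency map

    frames = torch.randn(1, 3, 5, 224, 224)         # target frame plus 4 context frames
    saliency = SaliencyPredictor()(frames)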

Inventors

  • YANG, LI
  • ZHAN, Gen
  • ZHANG, Yabin
  • LIAO, Yiting
  • LI, JUNLIN

Assignees

  • Beijing Zitiao Network Technology Co., Ltd.
  • Lemon Inc.

Dates

Publication Date
2026-05-13
Application Date
2025-11-06

Claims (15)

  1. An image processing method, comprising: obtaining a frame group to be processed in a target video, the frame group to be processed comprising a target video frame and contextual video frames of the target video frame, wherein the contextual video frames comprise at least one first video frame preceding the target video frame and at least one second video frame succeeding the target video frame in the target video, and a plurality of video frames in the frame group to be processed are continuous video frames (S110); and processing the frame group to be processed by a saliency prediction model to obtain a detection result of the target video frame, the detection result representing a region of interest in the target video frame (S120), wherein the saliency prediction model comprises a feature encoder and a feature decoder, and wherein the processing the frame group to be processed by a saliency prediction model comprises: inputting the frame group to be processed into the feature encoder to perform feature extraction, and outputting multi-layer feature data corresponding to the frame group to be processed, wherein the multi-layer feature data comprises layer feature data respectively output by a plurality of preset processing blocks of the feature encoder, and each of the layer feature data comprises feature data of a plurality of video frames in the frame group to be processed; and inputting the multi-layer feature data into the feature decoder to perform feature fusion in a temporal dimension and feature fusion in a spatial dimension, and outputting the detection result of the target video frame.
  2. The image processing method according to claim 1, wherein the feature encoder comprises a visual model comprising a plurality of processing blocks, at least one processing block of the plurality of processing blocks is the preset processing block, and the plurality of preset processing blocks comprises at least one intermediate processing block and a last processing block.
  3. The image processing method according to claim 2, wherein the feature decoder comprises a feature up-sampling module connected to the at least one intermediate processing block of the feature encoder, and wherein the method further comprises: inputting the layer feature data output by the at least one intermediate processing block to the feature up-sampling module for up-sampling processing, and outputting the layer feature data that is up-sampled, wherein spatial scales of the respective up-sampled layer feature data are different from each other.
  4. The image processing method according to any of claims 1 to 3, wherein the feature decoder comprises a temporal attention module connected to an output end of the feature encoder, and wherein the method further comprises inputting layer feature data output by the output end of the feature encoder to the temporal attention module, and outputting temporal weight data, wherein the temporal weight data is configured for fusing the feature data of the plurality of video frames in each of the layer feature data in a temporal dimension.
  5. The image processing method according to claim 4, wherein the temporal attention module comprises a plurality of three-dimensional convolution blocks, a reshape layer, and a mean processing layer (an illustrative sketch of such a module follows the claims).
  6. The image processing method according to any preceding claim, wherein the feature decoder comprises a progressive fusion module, the progressive fusion module comprising a plurality of fusion blocks, each of the fusion blocks comprising a 3D convolution layer and an up-sampling layer, and wherein feature data of a first spatial scale is input to a fusion block to perform convolution processing and up-sampling processing so as to output up-sampled data, wherein the up-sampled data is feature data of a second spatial scale, and fusion processing is performed on the up-sampled data and layer fusion feature data of the second spatial scale to obtain spatial fusion feature data, wherein the feature data of the first spatial scale comprises layer fusion feature data of the first spatial scale or spatial fusion feature data of the first spatial scale; and wherein the spatial fusion feature data corresponding to a last fusion block of the progressive fusion module is target fusion feature data (a sketch of this progressive fusion appears after the claims).
  7. The image processing method according to claim 6, wherein the feature decoder further comprises a prediction module, the prediction module comprising an up-sampling block and a 2D convolution block connected in sequence, and wherein the target fusion feature data is input to the prediction module so as to output the detection result of the target video frame.
  8. The image processing method according to any preceding claim, further comprising: determining bitrate information about the target video frame based on the detection result of the target video frame; and/or determining bitrate information about each of the video frames in the target video based on a detection result of at least one of the target video frames in the target video.
  9. An image processing apparatus, comprising: a frame group to be processed obtaining module (210), configured for obtaining a frame group to be processed in a target video, the frame group to be processed including a target video frame and contextual video frames of the target video frame, wherein the contextual video frames comprise at least one first video frame preceding the target video frame and at least one second video frame succeeding the target video frame in the target video, and a plurality of video frames in the frame group to be processed are continuous video frames; and a prediction module (220), configured for processing the frame group to be processed by a saliency prediction model to obtain a detection result of the target video frame, the detection result representing a region of interest in the target video frame, wherein the saliency prediction model comprises a feature encoder and a feature decoder, and the prediction module (220) is configured for: inputting the frame group to be processed into the feature encoder to perform feature extraction, and outputting multi-layer feature data corresponding to the frame group to be processed, wherein the multi-layer feature data comprises layer feature data respectively output by a plurality of preset processing blocks of the feature encoder, and each of the layer feature data comprises feature data of a plurality of video frames in the frame group to be processed; inputting the multi-layer feature data into the feature decoder to perform feature fusion in a temporal dimension and feature fusion in a spatial dimension; and outputting the detection result of the target video frame.
  10. The image processing apparatus according to claim 9, wherein the feature encoder comprises a visual model comprising a plurality of processing blocks, at least one processing block of the plurality of processing blocks is the preset processing block, and the plurality of preset processing blocks comprises at least one intermediate processing block and a last processing block.
  11. The image processing apparatus according to claim 10, wherein the feature decoder comprises a feature up-sampling module connected to the at least one intermediate processing block of the feature encoder, and wherein the apparatus is further configured for: inputting the layer feature data output by the at least one intermediate processing block to the feature up-sampling module for up-sampling processing, and outputting the layer feature data that is up-sampled, wherein spatial scales of the respective up-sampled layer feature data are different from each other.
  12. The image processing apparatus according to any of claims 9 to 11, wherein the feature decoder comprises a temporal attention module connected to an output end of the feature encoder, and wherein the apparatus is further configured for inputting layer feature data output by the output end of the feature encoder to the temporal attention module, and outputting temporal weight data, wherein the temporal weight data is configured for fusing the feature data of the plurality of video frames in each of the layer feature data in a temporal dimension.
  13. The image processing apparatus according to claim 12, wherein the temporal attention module comprises a plurality of three-dimensional convolution blocks, a reshape layer, and a mean processing layer.
  14. An electronic device (500), comprising: one or more processors (501); and a storage device (508) configured for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the image processing method according to any of claims 1-8.
  15. A non-transitory storage medium comprising computer-executable instructions for implementing the image processing method according to any of claims 1-8 when executed by a computer processor.
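
Claims 4-5 describe a temporal attention module built from three-dimensional convolution blocks, a reshape layer, and a mean processing layer, whose temporal weight data fuses per-frame features across time. The sketch below is one possible reading in PyTorch; the block count, channel widths, and the softmax normalization are assumptions not stated in the claims.

    import torch
    import torch.nn as nn

    class TemporalAttention(nn.Module):
        """One reading of claims 4-5: 3D-conv blocks, a reshape, and a mean
        over spatial positions produce one weight per frame (assumed design)."""
        def __init__(self, channels: int):
            super().__init__()
            self.convs = nn.Sequential(                 # "plurality of 3D conv blocks"
                nn.Conv3d(channels, channels, 3, padding=1), nn.ReLU(),
                nn.Conv3d(channels, 1, 3, padding=1),
            )

        def forward(self, feats):                       # (B, C, T, H, W)
            b, c, t, h, w = feats.shape
            logits = self.convs(feats)                  # (B, 1, T, H, W)
            logits = logits.reshape(b, t, h * w)        # "reshape layer"
            weights = logits.mean(dim=-1)               # "mean processing layer" -> (B, T)
            weights = torch.softmax(weights, dim=1)     # normalize to temporal weights
            # Fuse the per-frame features into a single frame using the weights.
            return (feats * weights.view(b, 1, t, 1, 1)).sum(dim=2)   # (B, C, H, W)

    x = torch.randn(2, 64, 5, 28, 28)
    fused = TemporalAttention(64)(x)                    # -> (2, 64, 28, 28)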
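
Claims 6-7 describe a coarse-to-fine progressive fusion: each fusion block applies a 3D convolution and up-sampling, the up-sampled output is fused with the layer features at the next finer spatial scale, and a prediction module (an up-sampling block plus a 2D convolution block) emits the detection result. The following sketch fuses by element-wise addition and collapses time with a mean, both of which are assumptions; the channel widths and spatial scales are likewise illustrative.

    import torch
    import torch.nn as nn

    class FusionBlock(nn.Module):
        """Claim 6, one reading: a 3D conv layer followed by 2x spatial up-sampling."""
        def __init__(self, in_ch: int, out_ch: int):
            super().__init__()
            self.conv = nn.Conv3d(in_ch, out_ch, 3, padding=1)
            self.up = nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear",
                                  align_corners=False)

        def forward(self, x):                      # (B, C, T, H, W)
            return self.up(torch.relu(self.conv(x)))

    # Layer feature data at three spatial scales, coarse to fine
    # (channel widths and scales are illustrative assumptions).
    f8 = torch.randn(1, 128, 5, 28, 28)            # coarsest preset-block output
    f4 = torch.randn(1, 64, 5, 56, 56)
    f2 = torch.randn(1, 32, 5, 112, 112)

    x = FusionBlock(128, 64)(f8) + f4              # upsample, then fuse (here: add)
    x = FusionBlock(64, 32)(x) + f2                # last fusion block -> target fusion data

    # Claim 7's prediction module: an up-sampling block then a 2D conv block.
    target = x.mean(dim=2)                          # collapse time before 2D prediction
    head = nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(32, 1, 3, padding=1),
    )
    saliency = torch.sigmoid(head(target))          # (1, 1, 224, 224)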

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to and the benefits of Chinese Patent Application No. 202411580643.8, filed on November 06, 2024, which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

Embodiments of the present disclosure relate to image processing technology, and more particularly, to an image processing method, a non-transitory storage medium, and an electronic device.

BACKGROUND

Through the visual perception system, human beings can quickly locate a region of interest (ROI) in a video picture while watching the video. To enable computer systems to likewise perceive important objects in complex scenarios quickly, the task of video saliency prediction has emerged. Video saliency information has important application value in many scenarios.

At present, video saliency is generally identified by either a bottom-up or a top-down saliency prediction method. The bottom-up method relies on characteristics of the video frame data itself, using low-level features such as color, contrast, and edges to select regions that stand out from their surroundings, i.e., saliency regions. The top-down method takes human "cognitive factors" as its premise and predicts with knowledge-driven features, treating objects such as faces, cars, and moving objects in the video scenario as the main targets of the saliency region.

Both of the above methods depend on manually designed features, which require expert knowledge and extensive experimentation to extract and tune. The resulting features are not necessarily those that produce the best saliency results, so recognition accuracy is poor.

SUMMARY

The present disclosure provides an image processing method, a non-transitory storage medium, and an electronic device, so as to reduce dependence on manual knowledge and improve the accuracy of image saliency recognition results.
Embodiments of the present disclosure provide an image processing method, which includes: obtaining a frame group to be processed in a target video, the frame group to be processed including a target video frame and contextual video frames of the target video frame, where the contextual video frames include at least one first video frame preceding the target video frame and at least one second video frame succeeding the target video frame in the target video, and a plurality of video frames in the frame group to be processed are continuous video frames; and processing the frame group to be processed by a saliency prediction model to obtain a detection result of the target video frame, the detection result representing a region of interest in the target video frame, where the saliency prediction model includes a feature encoder and a feature decoder, and the processing the frame group to be processed by a saliency prediction model includes: inputting the frame group to be processed into the feature encoder to perform feature extraction, and outputting multi-layer feature data corresponding to the frame group to be processed, where the multi-layer feature data includes layer feature data respectively output by a plurality of preset processing blocks of the feature encoder, and each of the layer feature data includes feature data of a plurality of video frames in the frame group to be processed; and inputting the multi-layer feature data into the feature decoder to perform feature fusion in a temporal dimension and feature fusion in a spatial dimension, and outputting the detection result of the target video frame.

Embodiments of the present disclosure provide an image processing apparatus, which includes: a frame group to be processed obtaining module, configured for obtaining a frame group to be processed in a target video, the frame group to be processed including a target video frame and contextual video frames of the target video frame, where the contextual video frames include at least one first video frame preceding the target video frame and at least one second video frame succeeding the target video frame in the target video, and a plurality of video frames in the frame group to be processed are continuous video frames; and a prediction module, configured for processing the frame group to be processed by a saliency prediction model to obtain a detection result of the target video frame, the detection result representing a region of interest in the target video frame, where the saliency prediction model includes a feature encoder and a feature decoder. The prediction module is configured for inputting the frame group to be processed into the feature encoder to perform feature extraction, and outputting multi-layer feature data corresponding to the frame group to be processed, where the multi-layer feature data includes layer feature data respectively output by a plurality of preset processing blocks of the feature encoder, and each of the layer feature data includes feature data of a plurality of video frames in the frame group to be processed; and inputting the multi-layer feature data into the feature decoder to perform feature fusion in a temporal dimension and feature fusion in a spatial dimension, and outputting the detection result of the target video frame.
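
As a concrete illustration of the frame-group construction described in the summary (a target video frame plus continuous preceding and succeeding context frames), here is a small Python helper; the symmetric window size and the clamping behavior at clip boundaries are assumptions made for illustration only.

    def frame_group(num_frames: int, target: int, k: int = 2) -> list[int]:
        """Indices of a continuous frame group: k first video frames before the
        target, the target itself, and k second video frames after it.
        Clamping at the clip edges is an assumption, not from the filing."""
        start = max(0, min(target - k, num_frames - (2 * k + 1)))
        return list(range(start, start + 2 * k + 1))

    print(frame_group(100, target=50))   # [48, 49, 50, 51, 52]
    print(frame_group(100, target=0))    # [0, 1, 2, 3, 4] (clamped at the clip start)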