CN-121999423-A - Image processing method and device, storage medium and electronic equipment

CN121999423A

Abstract

Embodiments of the disclosure provide an image processing method and device, a storage medium, and electronic equipment. The method comprises: acquiring a frame group to be processed in a target video; and processing the frame group through a saliency prediction model to obtain a detection result for the target video frame. The saliency prediction model comprises a feature encoder and a feature decoder. The frame group is input to the feature encoder for feature extraction, which outputs multi-layer feature data corresponding to the frame group; the multi-layer feature data comprises layer feature data output respectively by a plurality of preset processing blocks of the feature encoder, and each layer of feature data comprises feature data of a plurality of video frames in the frame group. The multi-layer feature data is input to the feature decoder, which performs feature fusion in the temporal dimension and feature fusion in the spatial dimension and outputs the detection result for the target video frame. The saliency prediction model thereby improves feature understanding of the frame group in both the temporal and spatial dimensions, improving the accuracy of the detection result.
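The abstract describes an encoder–decoder data flow: a frame group is encoded into multi-layer (multi-scale, per-frame) feature data, which the decoder fuses over time and space into one saliency map for the target frame. The following is a minimal numpy sketch of that data flow only, with toy stand-ins for the learned modules (2x average pooling as the "encoder" blocks, softmax weights for temporal fusion, nearest-neighbour upsampling); all function names are illustrative and are not from the patent.

```python
import numpy as np

def encode(frames, num_levels=3):
    """Toy 'feature encoder': each preset processing block is mimicked by
    2x average pooling, yielding multi-layer feature data -- one feature
    map per video frame at each of several spatial scales."""
    levels, feats = [], frames
    for _ in range(num_levels):
        t, h, w = feats.shape
        feats = feats.reshape(t, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
        levels.append(feats)                      # shape (T, h/2^k, w/2^k)
    return levels

def temporal_fuse(feats):
    """Toy temporal fusion: per-frame weights from global activation,
    softmax-normalised over the T frames, then a weighted sum."""
    energy = feats.mean(axis=(1, 2))              # (T,)
    w = np.exp(energy) / np.exp(energy).sum()     # temporal weight data
    return (w[:, None, None] * feats).sum(axis=0)

def upsample2x(fmap):
    """Nearest-neighbour upsampling standing in for a learned layer."""
    return fmap.repeat(2, axis=0).repeat(2, axis=1)

def decode(levels):
    """Toy 'feature decoder': fuse each level over time, then fuse
    spatial scales coarse-to-fine, upsampling as we go."""
    fused = temporal_fuse(levels[-1])
    for lvl in reversed(levels[:-1]):
        fused = upsample2x(fused) + temporal_fuse(lvl)
    return upsample2x(fused)                      # back to input resolution

# A frame group: target frame plus context frames (T=5), 16x16 grayscale.
frames = np.zeros((5, 16, 16))
frames[:, 4:12, 4:12] = 1.0                       # a bright "salient" square
saliency = decode(encode(frames))
print(saliency.shape)                             # (16, 16)
```

The sketch only shows the shapes and fusion order; in the patent both encoder and decoder are learned components of the saliency prediction model.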

Inventors

  • Yang Li
  • Zhan Gen
  • Zhang Yabin
  • Liao Yiting
  • Li Junlin

Assignees

  • Beijing Zitiao Network Technology Co., Ltd. (北京字跳网络技术有限公司)
  • Lemon Inc. (脸萌有限公司)

Dates

Publication Date
2026-05-08
Application Date
2024-11-06

Claims (11)

  1. An image processing method, comprising: acquiring a frame group to be processed in a target video, wherein the frame group to be processed comprises a target video frame and context video frames of the target video frame, and the context video frames comprise at least one first video frame located before the target video frame and at least one second video frame located after the target video frame in the target video; and processing the frame group to be processed through a saliency prediction model to obtain a detection result of the target video frame, wherein the detection result represents a target area in the target video frame; wherein the saliency prediction model comprises a feature encoder and a feature decoder; the frame group to be processed is input to the feature encoder for feature extraction, and multi-layer feature data corresponding to the frame group to be processed is output, wherein the multi-layer feature data comprises layer feature data output respectively by a plurality of preset processing blocks of the feature encoder, and each layer of feature data comprises feature data of a plurality of video frames in the frame group to be processed; and the multi-layer feature data is input to the feature decoder for feature fusion in the temporal dimension and feature fusion in the spatial dimension, and the detection result of the target video frame is output.
  2. The method of claim 1, wherein the feature encoder comprises a visual model comprising a plurality of processing blocks, at least one of the plurality of processing blocks being a preset processing block, and the plurality of preset processing blocks comprise at least one intermediate processing block and a last processing block.
  3. The method of claim 2, wherein the feature decoder comprises a feature upsampling module coupled to the at least one intermediate processing block of the feature encoder; the method further comprises: inputting the layer feature data output by the at least one intermediate processing block to the feature upsampling module for upsampling, and outputting upsampled layer feature data, wherein the spatial scales of the upsampled layer feature data differ from one another.
  4. The method according to claim 1 or 3, wherein the feature decoder comprises a temporal attention module connected to an output end of the feature encoder; the method further comprises: inputting the layer feature data output by the output end of the feature encoder to the temporal attention module, and outputting temporal weight data, wherein the temporal weight data is used to fuse, in the temporal dimension, the feature data of the plurality of video frames in each layer of feature data.
  5. The method of claim 4, wherein the temporal attention module comprises a plurality of three-dimensional convolution blocks, a reshaping layer, and a mean processing layer.
  6. The method of claim 1, wherein the feature decoder comprises a progressive fusion module comprising a plurality of fusion blocks, each fusion block comprising a 3D convolutional layer and an upsampling layer; the method comprises: inputting feature data of a first spatial scale to a fusion block for convolution and upsampling, and outputting upsampled data, wherein the upsampled data is feature data of a second spatial scale; the feature data of the first spatial scale comprises layer-fusion feature data of the first spatial scale or spatial-fusion feature data of the first spatial scale; and the spatial-fusion feature data corresponding to the last fusion block of the progressive fusion module is the target fusion feature data.
  7. The method of claim 6, wherein the feature decoder further comprises a prediction module comprising an upsampling block and a 2D convolution block connected in sequence; and the method comprises inputting the target fusion feature data to the prediction module and outputting the detection result of the target video frame.
  8. The method of claim 1, further comprising: determining code rate information of the target video frame based on the detection result of the target video frame; and/or determining code rate information of each video frame in the target video based on the detection result of at least one target video frame in the target video.
  9. An image processing apparatus, comprising: a frame group acquisition module configured to acquire a frame group to be processed in a target video, wherein the frame group to be processed comprises a target video frame and context video frames of the target video frame, and the context video frames comprise at least one first video frame located before the target video frame and at least one second video frame located after the target video frame in the target video; and a prediction module configured to process the frame group to be processed through a saliency prediction model to obtain a detection result of the target video frame, wherein the detection result represents a target area in the target video frame; wherein the saliency prediction model comprises a feature encoder and a feature decoder; and the prediction module is configured to input the frame group to be processed to the feature encoder for feature extraction, output multi-layer feature data corresponding to the frame group to be processed, wherein the multi-layer feature data comprises layer feature data output respectively by a plurality of preset processing blocks of the feature encoder and each layer of feature data comprises feature data of a plurality of video frames in the frame group to be processed, input the multi-layer feature data to the feature decoder for feature fusion in the temporal dimension and feature fusion in the spatial dimension, and output the detection result of the target video frame.
  10. An electronic device, comprising: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the image processing method of any one of claims 1-8.
  11. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the image processing method of any one of claims 1-8.
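Claims 4–7 describe a temporal attention module (3D convolution blocks, a reshaping layer, a mean layer producing per-frame weights) and a progressive fusion module (fusion blocks of 3D convolution plus upsampling). The following numpy sketch shows only the shape of that data flow, with loud simplifications: the learned 3D convolutions are replaced by fixed averaging and a skip addition, and every name is illustrative rather than from the patent.

```python
import numpy as np

def temporal_attention(feats):
    """Sketch of the temporal attention module (claims 4-5): the learned
    3D-convolution blocks are stood in for by a fixed spatial mean, the
    reshaping layer by a flatten, and the mean layer by an average,
    giving one softmax weight per frame (the 'temporal weight data')."""
    t = feats.shape[0]
    scores = feats.reshape(t, -1).mean(axis=1)    # reshape + mean
    e = np.exp(scores - scores.max())
    return e / e.sum()                            # (T,) weights summing to 1

def fusion_block(fmap, skip):
    """Sketch of one fusion block (claim 6): the '3D convolution' is stood
    in for by a skip addition, followed by 2x nearest-neighbour upsampling,
    taking features from a first spatial scale to a second, finer one."""
    fused = fmap + skip                           # combine with layer features
    return fused.repeat(2, axis=0).repeat(2, axis=1)

# Layer feature data for a frame group of T=4 frames at a coarse scale.
T = 4
coarse = np.random.rand(T, 4, 4)                  # deepest encoder output
skip = np.random.rand(4, 4)                       # time-fused skip features
w = temporal_attention(coarse)                    # temporal weight data
time_fused = (w[:, None, None] * coarse).sum(axis=0)   # fuse the T frames
out = fusion_block(time_fused, skip)              # spatial fusion + upsample
print(out.shape)                                  # (8, 8)
```

In the patent, stacking several such fusion blocks progressively restores spatial resolution until the prediction module (claim 7) emits the final detection result.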

Description

Image processing method and device, storage medium and electronic equipment

Technical Field

Embodiments of the disclosure relate to image processing technology, and in particular to an image processing method and device, a storage medium, and electronic equipment.

Background

While watching a video, a human being can quickly locate a target region (Region of Interest, ROI) in the picture through the visual perception system. Video saliency prediction tasks have evolved so that computer systems can, like humans, quickly perceive important objects in complex scenes, and video saliency information has important application value in many scenarios. At present, video saliency is generally identified by either a bottom-up or a top-down saliency prediction approach. The bottom-up approach starts from the characteristics of the video frame data itself, using low-level features such as color, contrast, and edges to screen regions whose contrast with their surroundings is pronounced; these are the salient regions. The top-down approach is guided by human "cognitive factors" and predicts using knowledge-driven features; for example, faces, automobiles, and moving objects in a video scene are treated as primary targets of salient regions. Both approaches depend on manually designed features, which must be extracted and tuned with expert knowledge and a large number of experiments; the final features are not necessarily those that produce the optimal saliency result, so recognition accuracy is poor.
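The bottom-up approach described in the background screens regions by their low-level contrast with the surroundings. A minimal center-surround contrast sketch of that idea, in pure numpy (illustrative only; the patent does not prescribe this implementation):

```python
import numpy as np

def center_surround_saliency(img, k=2):
    """Bottom-up saliency sketch: a pixel is salient to the degree its
    intensity differs from the mean of its (2k+1)x(2k+1) neighbourhood."""
    h, w = img.shape
    pad = np.pad(img, k, mode="edge")
    sal = np.empty_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            surround = pad[i:i + 2 * k + 1, j:j + 2 * k + 1].mean()
            sal[i, j] = abs(img[i, j] - surround)
    return sal

img = np.zeros((8, 8))
img[3:5, 3:5] = 1.0                   # a small high-contrast patch
sal = center_surround_saliency(img)
print(sal[3, 3] > sal[0, 0])          # patch scores above the flat background
```

Such hand-designed contrast features are exactly what the disclosure's learned saliency prediction model is meant to replace.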
Disclosure of Invention

The disclosure provides an image processing method and device, a storage medium, and electronic equipment, so as to reduce dependence on manual knowledge and improve the accuracy of saliency recognition results for images.

In a first aspect, an embodiment of the present disclosure provides an image processing method, comprising: acquiring a frame group to be processed in a target video, wherein the frame group to be processed comprises a target video frame and context video frames of the target video frame, and the context video frames comprise at least one first video frame located before the target video frame and at least one second video frame located after the target video frame in the target video; and processing the frame group to be processed through a saliency prediction model to obtain a detection result of the target video frame, wherein the detection result represents a target area in the target video frame. The saliency prediction model comprises a feature encoder and a feature decoder. The frame group to be processed is input to the feature encoder for feature extraction, and multi-layer feature data corresponding to the frame group to be processed is output; the multi-layer feature data comprises layer feature data output respectively by a plurality of preset processing blocks of the feature encoder, and each layer of feature data comprises feature data of a plurality of video frames in the frame group to be processed. The multi-layer feature data is input to the feature decoder for feature fusion in the temporal dimension and feature fusion in the spatial dimension, and the detection result of the target video frame is output.
In a second aspect, an embodiment of the present disclosure further provides an image processing apparatus, comprising: a frame group acquisition module configured to acquire a frame group to be processed in a target video, wherein the frame group to be processed comprises a target video frame and context video frames of the target video frame, and the context video frames comprise at least one first video frame located before the target video frame and at least one second video frame located after the target video frame in the target video; and a prediction module configured to process the frame group to be processed through a saliency prediction model to obtain a detection result of the target video frame, wherein the detection result represents a target area in the target video frame. The saliency prediction model comprises a feature encoder and a feature decoder. The prediction module is configured to input the frame group to be processed to the feature encoder for feature extraction and output multi-layer feature data corresponding to the frame group to be processed, wherein the multi-layer feature data comprises layer feature data output respectively by a plurality of preset processing blocks of the feature encoder, and each layer of feature data comprises feature data of a plurality of video frames in the frame group to be processed; and to input the multi-layer feature data to the feature decoder for feature fusion in the temporal dimension and feature fusion in the spatial dimension, and output the detection result of the target video frame.