US-20260127720-A1 - MACHINE LEARNING-BASED VIDEO DENOISING WITH ADAPTIVE FUSION

US 20260127720 A1

Abstract

A method includes obtaining, using at least one processing device of an electronic device, a video having a sequence of image frames. During each of multiple iterations, the method also includes identifying, using the at least one processing device, features of a specified image frame in the sequence and features of a denoised version of a preceding image frame. During each of the multiple iterations, the method further includes aligning, using the at least one processing device, the features of the specified image frame in the sequence and the features of the denoised version of the preceding image frame to generate aligned features. In addition, during each of the multiple iterations, the method includes generating, using the at least one processing device, a denoised version of the specified image frame based on the aligned features.

Inventors

  • Hakan Emre Gedik
  • Zhiyuan Mao
  • John W. Glotzbach
  • John Seokjun Lee
  • Hamid R. Sheikh

Assignees

  • SAMSUNG ELECTRONICS CO., LTD.

Dates

Publication Date: 2026-05-07
Application Date: 2024-11-06

Claims (20)

  1. A method comprising: obtaining, using at least one processing device of an electronic device, a video comprising a sequence of image frames; and during each of multiple iterations: identifying, using the at least one processing device, features of a specified image frame in the sequence and features of a denoised version of a preceding image frame; aligning, using the at least one processing device, the features of the specified image frame in the sequence and the features of the denoised version of the preceding image frame to generate aligned features; and generating, using the at least one processing device, a denoised version of the specified image frame based on the aligned features.
  2. The method of claim 1, further comprising: generating an initial denoised image frame based on first and second image frames in the sequence; wherein the iterations are performed to denoise subsequent image frames in the sequence after the first and second image frames, wherein each subsequent image frame is denoised using the denoised version of the preceding image frame.
  3. The method of claim 1, wherein, during each iteration, aligning the features of the specified image frame in the sequence and the features of the denoised version of the preceding image frame comprises: down-sampling filtered versions of the specified image frame and the preceding image frame to generate down-sampled image frames; performing optical flow estimation based on the down-sampled image frames to generate an optical flow; up-sampling the optical flow to generate an up-sampled optical flow; and warping at least some of the features based on the up-sampled optical flow.
  4. The method of claim 1, wherein, during each iteration, generating the denoised version of the specified image frame comprises: performing a fusion of the aligned features to generate fused features; and performing decoding of the fused features to generate the denoised version of the specified image frame.
  5. The method of claim 4, wherein, during each iteration, performing the fusion of the aligned features comprises: generating an attention map based on the aligned features using a trained machine learning model, the attention map identifying how to fuse the aligned features; and performing a weighted combination of the features of the denoised version of the preceding image frame as aligned with the features of the specified image frame in the sequence.
  6. The method of claim 4, wherein, during each iteration, performing the fusion of the aligned features comprises: embedding the aligned features in a first feature space using single-layer convolution; identifying pixel-wise correlations between the aligned features as embedded in the first feature space to generate a correlation map; embedding the correlation map in a second feature space; applying an activation function to the correlation map as embedded in the second feature space to generate an attention map, the attention map identifying how to fuse the aligned features; and performing a weighted combination of the features of the denoised version of the preceding image frame as aligned with the features of the specified image frame in the sequence.
  7. The method of claim 1, wherein processing the specified image frames in conjunction with the denoised versions of the preceding image frames during the iterations ensures temporal coherence in a denoised video sequence comprising the denoised versions of the image frames, thereby avoiding flickering in the denoised video sequence.
  8. An apparatus comprising: at least one processing device configured to: obtain a video comprising a sequence of image frames; and during each of multiple iterations: identify features of a specified image frame in the sequence and features of a denoised version of a preceding image frame; align the features of the specified image frame in the sequence and the features of the denoised version of the preceding image frame to generate aligned features; and generate a denoised version of the specified image frame based on the aligned features.
  9. The apparatus of claim 8, wherein: the at least one processing device is further configured to generate an initial denoised image frame based on first and second image frames in the sequence; the at least one processing device is configured to perform the iterations to denoise subsequent image frames in the sequence after the first and second image frames, wherein each subsequent image frame is denoised using the denoised version of the preceding image frame.
  10. The apparatus of claim 8, wherein, to align the features of the specified image frame in the sequence and the features of the denoised version of the preceding image frame during each iteration, the at least one processing device is configured to: down-sample filtered versions of the specified image frame and the preceding image frame to generate down-sampled image frames; perform optical flow estimation based on the down-sampled image frames to generate an optical flow; up-sample the optical flow to generate an up-sampled optical flow; and warp at least some of the features based on the up-sampled optical flow.
  11. The apparatus of claim 8, wherein, to generate the denoised version of the specified image frame during each iteration, the at least one processing device is configured to: perform a fusion of the aligned features to generate fused features; and perform decoding of the fused features to generate the denoised version of the specified image frame.
  12. The apparatus of claim 11, wherein, to perform the fusion of the aligned features during each iteration, the at least one processing device is configured to: generate an attention map based on the aligned features using a trained machine learning model, the attention map identifying how to fuse the aligned features; and perform a weighted combination of the features of the denoised version of the preceding image frame as aligned with the features of the specified image frame in the sequence.
  13. The apparatus of claim 11, wherein, to perform the fusion of the aligned features during each iteration, the at least one processing device is configured to: embed the aligned features in a first feature space using single-layer convolution; identify pixel-wise correlations between the aligned features as embedded in the first feature space to generate a correlation map; embed the correlation map in a second feature space; apply an activation function to the correlation map as embedded in the second feature space to generate an attention map, the attention map identifying how to fuse the aligned features; and perform a weighted combination of the features of the denoised version of the preceding image frame as aligned with the features of the specified image frame in the sequence.
  14. The apparatus of claim 8, wherein the at least one processing device is configured to process the specified image frames in conjunction with the denoised versions of the preceding image frames during the iterations to ensure temporal coherence in a denoised video sequence comprising the denoised versions of the image frames, thereby avoiding flickering in the denoised video sequence.
  15. A non-transitory machine-readable medium containing instructions that when executed cause at least one processor to: obtain a video comprising a sequence of image frames; and during each of multiple iterations: identify features of a specified image frame in the sequence and features of a denoised version of a preceding image frame; align the features of the specified image frame in the sequence and the features of the denoised version of the preceding image frame to generate aligned features; and generate a denoised version of the specified image frame based on the aligned features.
  16. The non-transitory machine-readable medium of claim 15, further containing instructions that when executed cause the at least one processor to generate an initial denoised image frame based on first and second image frames in the sequence; wherein the instructions when executed cause the at least one processor to perform the iterations to denoise subsequent image frames in the sequence after the first and second image frames, wherein each subsequent image frame is denoised using the denoised version of the preceding image frame.
  17. The non-transitory machine-readable medium of claim 15, wherein the instructions that when executed cause the at least one processor to align the features of the specified image frame in the sequence and the features of the denoised version of the preceding image frame during each iteration comprise: instructions that when executed cause the at least one processor to: down-sample filtered versions of the specified image frame and the preceding image frame to generate down-sampled image frames; perform optical flow estimation based on the down-sampled image frames to generate an optical flow; up-sample the optical flow to generate an up-sampled optical flow; and warp at least some of the features based on the up-sampled optical flow.
  18. The non-transitory machine-readable medium of claim 15, wherein the instructions that when executed cause the at least one processor to generate the denoised version of the specified image frame during each iteration comprise: instructions that when executed cause the at least one processor to: perform a fusion of the aligned features to generate fused features; and perform decoding of the fused features to generate the denoised version of the specified image frame.
  19. The non-transitory machine-readable medium of claim 18, wherein the instructions that when executed cause the at least one processor to perform the fusion of the aligned features during each iteration comprise: instructions that when executed cause the at least one processor to: generate an attention map based on the aligned features using a trained machine learning model, the attention map identifying how to fuse the aligned features; and perform a weighted combination of the features of the denoised version of the preceding image frame as aligned with the features of the specified image frame in the sequence.
  20. The non-transitory machine-readable medium of claim 18, wherein the instructions that when executed cause the at least one processor to perform the fusion of the aligned features during each iteration comprise: instructions that when executed cause the at least one processor to: embed the aligned features in a first feature space using single-layer convolution; identify pixel-wise correlations between the aligned features as embedded in the first feature space to generate a correlation map; embed the correlation map in a second feature space; apply an activation function to the correlation map as embedded in the second feature space to generate an attention map, the attention map identifying how to fuse the aligned features; and perform a weighted combination of the features of the denoised version of the preceding image frame as aligned with the features of the specified image frame in the sequence.
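The correlation-driven fusion recited in claims 6, 13, and 20 (embed the aligned features, correlate them pixel-wise, apply an activation function to obtain an attention map, then blend) can be illustrated with a minimal NumPy sketch. This is not the disclosed implementation: the `w_embed` weights, the 1x1-convolution embedding, and the sigmoid activation are illustrative assumptions, and the claimed second embedding of the correlation map is folded into a single step for brevity.

```python
import numpy as np

def adaptive_fusion(feat_cur, feat_prev_aligned, w_embed):
    """Fuse current-frame features with aligned previous-frame features.

    feat_cur, feat_prev_aligned: (C, H, W) feature maps (hypothetical shapes).
    w_embed: (C, C) weights applied per pixel, a 1x1 stand-in for the
    claimed single-layer convolution embedding.
    """
    # Embed both aligned feature maps in a first feature space.
    emb_cur = np.einsum('dc,chw->dhw', w_embed, feat_cur)
    emb_prev = np.einsum('dc,chw->dhw', w_embed, feat_prev_aligned)
    # Pixel-wise correlation between the embedded features -> correlation map.
    corr = np.sum(emb_cur * emb_prev, axis=0, keepdims=True)  # (1, H, W)
    # Activation (sigmoid here) turns the correlation map into an attention map.
    attn = 1.0 / (1.0 + np.exp(-corr))
    # Weighted combination of previous (aligned) and current features.
    return attn * feat_prev_aligned + (1.0 - attn) * feat_cur
```

Where the embedded features correlate strongly (consistent content across frames), the attention map weights the temporally filtered previous features more heavily; where correlation is weak (motion or occlusion), it falls back toward the current frame.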

Description

TECHNICAL FIELD

This disclosure relates generally to image processing. More specifically, this disclosure relates to machine learning-based video denoising with adaptive fusion.

BACKGROUND

Many mobile electronic devices, such as smartphones and tablet computers, include digital cameras that can be used to capture still and video images. Raw image data captured using digital cameras typically undergoes various processing operations that may collectively be referred to as an image signal processing (ISP) pipeline, which generates final images that can be stored or displayed. One common operation in an ISP pipeline is denoising, which aims to reduce the severity of noise that is inherent in the imaging process or that is introduced during preceding processing operations in the ISP pipeline. The quality of the noise reduction directly impacts the perceptual quality of the resulting images.

SUMMARY

This disclosure relates to machine learning-based video denoising with adaptive fusion. In a first embodiment, a method includes obtaining, using at least one processing device of an electronic device, a video having a sequence of image frames. During each of multiple iterations, the method also includes identifying, using the at least one processing device, features of a specified image frame in the sequence and features of a denoised version of a preceding image frame. During each of the multiple iterations, the method further includes aligning, using the at least one processing device, the features of the specified image frame in the sequence and the features of the denoised version of the preceding image frame to generate aligned features. In addition, during each of the multiple iterations, the method includes generating, using the at least one processing device, a denoised version of the specified image frame based on the aligned features.
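The per-iteration structure of the first embodiment (identify features of the current frame and of the denoised previous frame, align, fuse, and decode) forms a recurrent loop over the sequence. A minimal sketch of that loop, in which `encode`, `align`, `fuse`, and `decode` are hypothetical placeholders for the trained machine learning components rather than the disclosed models:

```python
import numpy as np

def denoise_video(frames, encode, align, fuse, decode):
    """Recurrently denoise a sequence of frames.

    frames: list of (H, W) arrays. encode/align/fuse/decode are stand-ins
    for the trained components described in the disclosure; the first-frame
    initialization here is a trivial placeholder.
    """
    denoised = [frames[0]]  # placeholder initialization for the first frame
    for cur in frames[1:]:
        feat_cur = encode(cur)
        feat_prev = encode(denoised[-1])      # features of denoised previous frame
        aligned = align(feat_cur, feat_prev)  # motion-compensated alignment
        fused = fuse(aligned)                 # adaptive fusion of aligned features
        denoised.append(decode(fused))        # denoised version of current frame
    return denoised
```

Because each output frame is built from the previous *denoised* frame rather than the previous noisy frame, noise reduction accumulates over time and successive outputs remain temporally consistent, which is the temporal-coherence property claimed above.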
In a second embodiment, an apparatus includes at least one processing device configured to obtain a video having a sequence of image frames. The at least one processing device is also configured, during each of multiple iterations, to identify features of a specified image frame in the sequence and features of a denoised version of a preceding image frame, align the features of the specified image frame in the sequence and the features of the denoised version of the preceding image frame to generate aligned features, and generate a denoised version of the specified image frame based on the aligned features.

In a third embodiment, a non-transitory machine-readable medium contains instructions that when executed cause at least one processor to obtain a video having a sequence of image frames. The non-transitory machine-readable medium also contains instructions that when executed cause the at least one processor, during each of multiple iterations, to identify features of a specified image frame in the sequence and features of a denoised version of a preceding image frame, align the features of the specified image frame in the sequence and the features of the denoised version of the preceding image frame to generate aligned features, and generate a denoised version of the specified image frame based on the aligned features.

Any one or any combination of the following features may be used with the first, second, or third embodiment. An initial denoised image frame may be generated based on first and second image frames in the sequence. The iterations may be performed to denoise subsequent image frames in the sequence after the first and second image frames, and each subsequent image frame may be denoised using the denoised version of the preceding image frame.
During each iteration, the features of the specified image frame in the sequence and the features of the denoised version of the preceding image frame may be aligned by down-sampling filtered versions of the specified image frame and the preceding image frame to generate down-sampled image frames, performing optical flow estimation based on the down-sampled image frames to generate an optical flow, up-sampling the optical flow to generate an up-sampled optical flow, and warping at least some of the features based on the up-sampled optical flow.

During each iteration, the denoised version of the specified image frame may be generated by performing a fusion of the aligned features to generate fused features and performing decoding of the fused features to generate the denoised version of the specified image frame. During each iteration, the fusion of the aligned features may be performed by generating an attention map based on the aligned features using a trained machine learning model (where the attention map identifies how to fuse the aligned features) and performing a weighted combination of the features of the denoised version of the preceding image frame as aligned with the features of the specified image frame in the sequence. During each iteration, the fusion of the aligned features may be performed by embedding the aligned features in a first feature space using single-layer convolution, identifying pixel-wise correlations between the aligned features as embedded in the first feature space to generate a correlation map, embedding the correlation map in a second feature space, applying an activation function to the correlation map as embedded in the second feature space to generate an attention map (where the attention map identifies how to fuse the aligned features), and performing a weighted combination of the features of the denoised version of the preceding image frame as aligned with the features of the specified image frame in the sequence.
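The alignment procedure above (down-sample, estimate flow at the reduced resolution, up-sample the flow, warp) can be illustrated with a toy sketch. This is an assumption-laden stand-in, not the disclosed estimator: the "flow" here is a single global integer shift found by brute force, naive striding replaces the claimed filtering plus down-sampling, and `np.roll` serves as the warp.

```python
import numpy as np

def align_by_coarse_flow(cur, prev, factor=2):
    """Warp prev toward cur using a shift estimated at reduced resolution.

    cur, prev: (H, W) frames. A real implementation would estimate a dense
    per-pixel optical flow; here a single global shift plays that role.
    """
    small_cur = cur[::factor, ::factor]    # down-sample (naive striding)
    small_prev = prev[::factor, ::factor]
    best, best_err = (0, 0), np.inf
    for dy in (-1, 0, 1):                  # brute-force "flow estimation"
        for dx in (-1, 0, 1):
            shifted = np.roll(small_prev, (dy, dx), axis=(0, 1))
            err = np.mean((shifted - small_cur) ** 2)
            if err < best_err:
                best, best_err = (dy, dx), err
    dy, dx = best[0] * factor, best[1] * factor   # up-sample the flow
    return np.roll(prev, (dy, dx), axis=(0, 1))   # warp prev toward cur
```

Estimating at low resolution and scaling the flow back up, as the claims describe, keeps the expensive flow-estimation step cheap while the warp is still applied at full feature resolution.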