CN-121982056-A - Image processing method, apparatus, device, medium, and computer program product

CN 121982056 A

Abstract

The present disclosure provides an image processing method, an image processing apparatus and device, a computer-readable storage medium, and a computer program product. The image processing method comprises the steps of: receiving an input image pair comprising an original image and a background image corresponding to a background area in the original image; performing feature extraction on the input image pair to respectively obtain a first feature vector representing the original image and a second feature vector representing the background image; performing feature matching based on feature content similarity on the first feature vector and the second feature vector to calculate a foreground vector representing a foreground area of the original image; decoding the foreground vector; and generating a transparency mask of the original image based on the decoded foreground vector, wherein the value of each element in the transparency mask indicates the probability that the corresponding pixel in the original image belongs to the foreground area.
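The transparency mask described above is the standard alpha matte of image matting: once it is predicted, the foreground can be lifted off the LED background and composited over any new plate via `I = α·F + (1−α)·B`. The following sketch (not part of the patent text; names and toy values are illustrative) shows how such a mask would be consumed downstream:

```python
import numpy as np

def composite(original, new_background, alpha):
    """Blend an original frame onto a new background using an alpha matte.

    alpha has shape (H, W) with values in [0, 1]: 1 means the pixel is
    foreground, 0 means background, and in-between values mark
    semi-transparent detail such as hair or motion blur.
    """
    a = alpha[..., None]                      # (H, W, 1) for broadcasting
    return a * original + (1.0 - a) * new_background

# Toy 2x2 RGB frames: left column is foreground (red), right is background.
original = np.array([[[1.0, 0.0, 0.0], [0.2, 0.2, 0.2]],
                     [[1.0, 0.0, 0.0], [0.2, 0.2, 0.2]]])
new_bg   = np.zeros((2, 2, 3))                # replacement background: black
alpha    = np.array([[1.0, 0.0],
                     [1.0, 0.0]])

out = composite(original, new_bg, alpha)      # foreground kept, background swapped
```

A hard 0/1 matte like this toy one simply selects pixels; the value of a predicted soft matte is precisely the fractional values at object boundaries.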

Inventors

  • HE YUANJIAN
  • ZHANG CHEN
  • LI ZHI
  • CHEN FASHENG
  • CAO JIANGBO

Assignees

  • Tencent Technology (Shenzhen) Company Limited (腾讯科技(深圳)有限公司)

Dates

Publication Date
2026-05-05
Application Date
2026-01-19

Claims (20)

  1. An image processing method, comprising: receiving an input image pair, the input image pair comprising an original image and a background image corresponding to a background region in the original image; performing feature extraction on the input image pair to respectively obtain a first feature vector representing the original image and a second feature vector representing the background image; performing feature matching based on feature content similarity on the first feature vector and the second feature vector to calculate a foreground vector representing a foreground region of the original image; and decoding the foreground vector and generating a transparency mask for the original image based on the decoded foreground vector, the value of each element in the transparency mask indicating the probability that the corresponding pixel in the original image belongs to the foreground region.
  2. The method of claim 1, wherein performing feature matching based on feature content similarity on the first feature vector and the second feature vector comprises: performing the feature matching on the first feature vector and the second feature vector with a plurality of attention layers, each of the plurality of attention layers including at least a cross-attention layer.
  3. The method of claim 2, wherein the feature matching comprises performing cross-attention processing with the cross-attention layer based on key and value vectors derived from the first feature vector of the original image and the second feature vector of the background image.
  4. The method of claim 2, further comprising: before the plurality of attention layers, feature-aligning the first feature vector and the second feature vector to semantically align the first feature vector and the second feature vector.
  5. The method of claim 1, further comprising performing feature upsampling on the original image to obtain a high-frequency information component in the original image, wherein decoding the foreground vector comprises: decoding the foreground vector with at least one decoder layer to generate an initial decoded foreground vector; and generating the decoded foreground vector based on the initial decoded foreground vector and the high-frequency information component.
  6. The method of claim 5, wherein performing feature upsampling on the original image to obtain the high-frequency information component in the original image comprises: generating a query vector and a key vector based on the original image, and generating a value vector based on the first feature vector; and performing cross-attention processing using the query vector, the key vector, and the value vector to obtain the high-frequency information component in the original image.
  7. The method of claim 6, wherein generating the key vector based on the original image further comprises: generating an initial key vector based on the original image; and performing spatial feature transform modulation on the initial key vector, conditioned on the first feature vector, to generate the key vector.
  8. The method of claim 5, wherein generating the decoded foreground vector based on the initial decoded foreground vector and the high-frequency information component comprises: concatenating the initial decoded foreground vector and the high-frequency information component to generate a concatenated vector; and performing fusion processing on the concatenated vector, and taking the fused concatenated vector as the decoded foreground vector.
  9. The method of claim 1, wherein decoding the foreground vector comprises: progressively decoding the foreground vector to progressively increase the resolution of the foreground vector and generate the decoded foreground vector, wherein progressively decoding the foreground vector comprises: progressively decoding the foreground vector with at least one decoder layer, each of which includes an upsampling layer and a convolutional refinement layer.
  10. The method of claim 9, wherein progressively decoding the foreground vector with at least one decoder layer comprises: concatenating and fusing the output of a first decoder layer of the at least one decoder layer with the first feature vector, and then inputting the result to a second decoder layer of the at least one decoder layer for decoding.
  11. The method of claim 1, wherein performing feature extraction on the input image pair comprises: performing feature extraction on the original image and the background image respectively with a feature extraction model; or performing feature extraction on the original image with a first feature extraction model and performing feature extraction on the background image with a second feature extraction model, wherein the first feature extraction model and the second feature extraction model share the same weights.
  12. The method of claim 11, wherein the feature extraction model comprises a plurality of transformer layers, each of the plurality of transformer layers comprising a multi-head self-attention layer, a multi-layer perceptron, and layer normalization.
  13. The method of claim 1, wherein the image processing method is performed using an image processing model comprising a feature extraction model for performing the feature extraction, a feature matching model for performing the feature matching, and a decoder for performing the decoding, wherein the parameters of the feature extraction model are frozen while training the image processing model, and only the feature matching model and the decoder are trained.
  14. The method of claim 13, wherein each training original image and each training background image in a training dataset for training the image processing model is generated by: sampling a foreground image from a foreground dataset and sampling at least one background image from a background dataset; synthesizing the foreground image and the at least one background image to generate a training original image; and synthesizing the at least one background image and a portion of the foreground image to generate a training background image.
  15. The method of claim 13, wherein the loss function for training the image processing model comprises a piecewise loss function, and wherein, in the piecewise loss function, a semi-transparent region of an image is weighted higher than the foreground region and the background region of the image, the semi-transparent region being the portion of the image between the foreground region and the background region.
  16. The method of any of claims 1-15, wherein the original image is an image obtained by capturing a foreground object in front of a virtual background, and the background image is an image of the virtual background rendered at a perspective consistent with the original image.
  17. An image processing apparatus, the apparatus comprising: a receiving unit configured to receive an input image pair including an original image and a background image corresponding to a background region in the original image; a feature extraction unit configured to perform feature extraction on the input image pair to respectively obtain a first feature vector representing the original image and a second feature vector representing the background image; a feature matching unit configured to perform feature matching based on feature content similarity on the first feature vector and the second feature vector to calculate a foreground vector representing a foreground region of the original image; and a decoding unit configured to decode the foreground vector and generate a transparency mask of the original image based on the decoded foreground vector, the value of each element in the transparency mask indicating the probability that the corresponding pixel in the original image belongs to the foreground region.
  18. An image processing device comprising: one or more processors; and one or more memories having stored therein computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-16.
  19. A computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-16.
  20. A computer program product comprising computer-readable instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1-16.
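Claim 15's piecewise loss can be made concrete with a short sketch. The patent does not specify the base loss or the weight values, so the L1 distance and the weights below are assumptions chosen only to illustrate the weighting scheme: pixels whose ground-truth alpha is strictly between 0 and 1 (the semi-transparent transition region) contribute more to the loss than pure foreground or background pixels.

```python
import numpy as np

def piecewise_matting_loss(pred, target, w_fg=1.0, w_bg=1.0, w_trans=4.0):
    """Weighted L1 matting loss in the spirit of claim 15 (weights are
    illustrative, not from the patent).

    target alpha == 1 -> foreground weight, alpha == 0 -> background weight,
    0 < alpha < 1 -> higher transition-region weight.
    """
    weights = np.where(target >= 1.0, w_fg,
               np.where(target <= 0.0, w_bg, w_trans))
    return np.sum(weights * np.abs(pred - target)) / np.sum(weights)

target = np.array([[1.0, 0.5, 0.0]])   # fg, semi-transparent, bg pixels
pred   = np.array([[1.0, 0.0, 0.0]])   # errs only in the transition region
loss = piecewise_matting_loss(pred, target)   # (4 * 0.5) / (1 + 4 + 1) = 1/3
```

The same 0.5 alpha error would contribute four times less if it occurred in a pure foreground or background pixel, which matches the claim's intent: boundary detail such as hair dominates perceived matting quality, so its errors are penalized harder.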

Description

Image processing method, apparatus, device, medium, and computer program product

Technical Field

The present disclosure relates to the field of image processing, and more particularly, to an image processing method, an image processing apparatus and device, a computer-readable storage medium, and a computer program product.

Background

With the development of the video industry, virtual production (VP) technology based on large light-emitting diode (LED) display walls is gradually replacing traditional green-screen shooting. In a virtual production workflow, a 3D scene is projected onto an LED wall behind a foreground object such as an actor by a real-time rendering engine, the inner-frustum technique is used to present a virtual background with correct perspective, and the foreground object is then shot against this virtual background. This technique provides a "what you see is what you get" shooting experience and realistic reflections of ambient light. However, virtual production also introduces new post-production challenges. Since the background picture is photographed directly by the camera, the foreground object and the background are fused together in the raw footage. If the background lighting needs to be adjusted, background assets need to be modified, or color correction needs to be performed in post-production, the foreground objects must be separated from the complex LED background. The foreground objects can be separated by manual frame-by-frame rotoscoping, but the cost is extremely high and cannot meet the production requirements of long takes. Automated, high-quality matting therefore becomes a necessity in virtual production pipelines.
Disclosure of Invention

The present disclosure provides an image processing method, an image processing apparatus and device, a computer-readable storage medium, and a computer program product. According to one aspect of the disclosure, an image processing method is provided, which comprises the steps of: receiving an input image pair, wherein the input image pair comprises an original image and a background image corresponding to a background area in the original image; performing feature extraction on the input image pair to respectively obtain a first feature vector representing the original image and a second feature vector representing the background image; performing feature matching based on feature content similarity on the first feature vector and the second feature vector to calculate a foreground vector representing a foreground area of the original image; decoding the foreground vector; and generating a transparency mask of the original image based on the decoded foreground vector, wherein the value of each element in the transparency mask indicates the probability that the corresponding pixel in the original image belongs to the foreground area. In accordance with one or more embodiments of the present disclosure, performing feature matching based on feature content similarity on the first feature vector and the second feature vector includes performing the feature matching with a plurality of attention layers, each of the plurality of attention layers including at least a cross-attention layer. In accordance with one or more embodiments of the present disclosure, the feature matching includes performing cross-attention processing with the cross-attention layer based on key and value vectors derived from the first feature vector of the original image and the second feature vector of the background image.
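The "feature matching based on feature content similarity" above is scaled dot-product cross-attention between the two feature sets. The patent leaves the exact query/key/value assignment open, so the sketch below adopts one plausible reading: original-image tokens act as queries against background-image tokens, so each original-image location is re-expressed as a similarity-weighted mixture of background features (all names and shapes here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens):
    """Content-similarity matching between two token sets.

    q_tokens:  (N, d) flattened features of the original image (queries).
    kv_tokens: (M, d) flattened features of the background image (keys/values).
    Returns (N, d): each original-image token rewritten as an attention-weighted
    mixture of background tokens. Regions that match the background are
    reconstructed well; foreground regions are not, which is the signal the
    model can exploit to localize the foreground.
    """
    d = q_tokens.shape[-1]
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)   # pairwise content similarity
    attn = softmax(scores, axis=-1)                # each row sums to 1
    return attn @ kv_tokens

rng = np.random.default_rng(0)
orig_feats = rng.standard_normal((4, 8))   # 4 original-image tokens, dim 8
bg_feats   = rng.standard_normal((6, 8))   # 6 background-image tokens, dim 8
matched = cross_attention(orig_feats, bg_feats)
```

In the real model this would be one sub-layer of each attention block, with learned query/key/value projections, multiple heads, and residual connections around it.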
In accordance with one or more embodiments of the present disclosure, the image processing method further includes, prior to the plurality of attention layers, feature-aligning the first feature vector and the second feature vector to semantically and spatially align the first feature vector and the second feature vector. In accordance with one or more embodiments of the present disclosure, the image processing method further comprises performing feature upsampling on the original image to obtain a high-frequency information component in the original image, wherein decoding the foreground vector comprises decoding the foreground vector with at least one decoder layer to generate an initial decoded foreground vector, and generating the decoded foreground vector based on the initial decoded foreground vector and the high-frequency information component. In accordance with one or more embodiments of the present disclosure, performing feature upsampling on the original image to obtain the high-frequency information component includes generating a query vector and a key vector based on the original image and a value vector based on the first feature vector, and performing cross-attention processing using the query vector, the key vector, and the value vector to obtain the high-frequency information component in the original image.
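The final fusion step (claim 8 and the passage above) concatenates the initial decoded foreground features with the high-frequency component and then fuses the result back to the decoder channel width. As a minimal sketch, assuming the fusion is a 1x1-convolution-like linear projection with a learned weight (here replaced by a random placeholder, since the patent does not specify the fusion operator):

```python
import numpy as np

def fuse_with_high_freq(decoded, high_freq, w_fuse):
    """Concatenate ("splice") the initial decoded foreground features with the
    high-frequency component, then fuse them channel-wise.

    decoded:   (H, W, c_dec) initial decoded foreground features.
    high_freq: (H, W, c_hf)  high-frequency detail from feature upsampling.
    w_fuse:    (c_dec + c_hf, c_dec) assumed learned fusion weight; the real
               model would train this (e.g. as a 1x1 convolution).
    """
    spliced = np.concatenate([decoded, high_freq], axis=-1)  # (H, W, c_dec+c_hf)
    return spliced @ w_fuse                                  # (H, W, c_dec)

H, W, c_dec, c_hf = 4, 4, 16, 8
rng = np.random.default_rng(1)
decoded   = rng.standard_normal((H, W, c_dec))
high_freq = rng.standard_normal((H, W, c_hf))
w_fuse    = rng.standard_normal((c_dec + c_hf, c_dec)) * 0.1  # placeholder weight
fused = fuse_with_high_freq(decoded, high_freq, w_fuse)
```

The design intent is that the low-resolution decoded features carry the coarse foreground/background decision while the high-frequency branch reinjects edge and texture detail lost during downsampling, so the fused features can produce a sharp alpha matte.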