CN-122023632-A - Image and video processing method and device
Abstract
The application provides an image and video processing method and device. The method comprises: synchronously collecting a first media stream and a second media stream, wherein the first media stream is collected by a front camera arranged on an electronic device and the second media stream is collected by a rear camera arranged on the electronic device; performing background optimization processing on the second media stream to generate a corresponding third media stream; performing region positioning processing on the first media stream to obtain an iris region, and performing iris segmentation processing on the iris region to generate a corresponding mask set; performing perspective transformation processing and illumination adaptation processing on the third media stream based on the mask set to generate a fourth media stream; and displaying the fourth media stream in real time or statically. The application supports both dynamic (video) and static (image) capture, provides intelligent scene understanding and cleanup, achieves iris tracking with high precision and temporal smoothness, and enhances the realism and adaptivity of the fused result.
Inventors
- Dong Chen
- Yu Ke
- Qu Lingwen
- Teng Lei
- Xu Xiaodong
Assignees
- Beijing University of Posts and Telecommunications (北京邮电大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-27
Claims (10)
- 1. A method of processing images and videos, the method comprising: synchronously collecting a first media stream and a second media stream, wherein the first media stream is collected by a front camera arranged on an electronic device and the second media stream is collected by a rear camera arranged on the electronic device; performing background optimization processing on the second media stream to generate a corresponding third media stream; performing region positioning processing on the first media stream to obtain an iris region, and performing iris segmentation processing on the iris region to generate a corresponding mask set; performing perspective transformation processing and illumination adaptation processing on the third media stream based on the mask set to generate a fourth media stream; and displaying the fourth media stream in real time or statically.
- 2. The method of claim 1, wherein performing background optimization processing on the second media stream to generate a corresponding third media stream comprises: performing noise reduction, color correction, sharpening and fish-eye distortion processing on the second media stream to generate the third media stream.
- 3. The method of claim 1, wherein, when the capture mode is an image capture mode, the second media stream comprises a plurality of second images and the third media stream comprises third images corresponding to the respective second images; correspondingly, after performing the background optimization processing on the second media stream to generate the corresponding third media stream, the method further comprises: selecting, from the plurality of third images, the third image with the least occlusion as a reference image; performing detection and target masking processing on the reference image based on a first neural network to obtain a mask vector; and performing mask-region image completion processing on the reference image using the mask vector based on a second neural network, to generate a target-removed third media stream.
- 4. The method according to claim 3, wherein the first neural network comprises a modified YOLOv network whose total loss function comprises a CIoU loss, a focal loss and a Dice loss.
- 5. The method according to claim 3, wherein the second neural network comprises a Transformer-based image completion network whose total loss function comprises an L1 loss, a perceptual loss representing the semantic-feature-level difference between the target-removed third media stream and the reference image, and a style loss representing the visual-style-feature-level consistency between the target-removed third media stream and the reference image.
- 6. The method of claim 1, wherein, when the capture mode is a video capture mode, the mask sets of the individual frames form a temporal mask sequence; correspondingly, after performing region positioning processing on the first media stream to obtain an iris region and performing iris segmentation processing on the iris region to generate a corresponding mask set, the method further comprises: calculating iris ellipse model parameters based on the temporal mask sequence; and performing temporal correction on the iris ellipse model parameters by combining motion prediction and weighted smoothing, to obtain smoothed iris ellipse model parameters.
- 7. The method of claim 1, wherein the iris segmentation processing is performed using a third neural network based on a U-Net structure, and wherein the total loss function is a weighted sum of a Dice loss and a cross-entropy loss (an illustrative sketch of this weighted loss follows the claims).
- 8. The method of claim 2, wherein the fish-eye distortion processing is performed using a polynomial distortion model, calculated as follows: $r = \sqrt{(x - x_c)^2 + (y - y_c)^2}$, $r_d = r\,(1 + k r^2)$, $x_d = x_c + (x - x_c)(1 + k r^2)$, $y_d = y_c + (y - y_c)(1 + k r^2)$; wherein $r$ represents the distance from the original pixel to the distortion center, $x$ represents the abscissa of the original pixel, $y$ represents the ordinate of the original pixel, $x_c$ represents the abscissa of the distortion center, $y_c$ represents the ordinate of the distortion center, $k$ represents a recommended value of the distortion coefficient calibrated based on human-eye physiological imaging data, $r_d$ represents the distance from the corresponding pixel to the distortion center after distortion, $x_d$ represents the abscissa of the corresponding pixel after distortion, and $y_d$ represents the ordinate of the corresponding pixel after distortion.
- 9. The method of claim 1, wherein the perspective transformation processing is performed based on a perspective matrix calculated from a three-dimensional face pose.
- 10. An electronic device comprising a processor and a memory, wherein the processor implements the image and video processing method of any one of claims 1 to 9 when executing an operating program stored in the memory.
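As an illustration of the weighted segmentation loss named in claim 7, the following is a minimal PyTorch sketch. The function name, tensor shapes and the equal weighting of the two terms are assumptions for illustration only; the patent does not specify them.

```python
import torch
import torch.nn.functional as F

def iris_segmentation_loss(logits, target, dice_weight=0.5, ce_weight=0.5, eps=1e-6):
    """Weighted sum of Dice loss and binary cross-entropy loss for iris segmentation.

    logits: (N, 1, H, W) raw outputs of a U-Net-style network.
    target: (N, 1, H, W) float tensor with values in {0, 1} (the iris mask).
    """
    prob = torch.sigmoid(logits)
    # Soft Dice loss per sample: 1 - 2|P∩T| / (|P| + |T|)
    inter = (prob * target).sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3)) + eps)
    # Pixel-wise binary cross-entropy, averaged per sample
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none").mean(dim=(1, 2, 3))
    return (dice_weight * dice + ce_weight * ce).mean()
```

The Dice term counters the class imbalance between the small iris region and the rest of the frame, while the cross-entropy term keeps per-pixel gradients well behaved; the relative weights would in practice be tuned on validation data.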
Description
Image and video processing method and device

Technical Field

The present application relates to the field of electronic devices, and in particular to a method and an apparatus for processing images and videos.

Background

With the popularization of mobile terminals such as smartphones and tablet computers and the continuous improvement of camera performance, users increasingly pursue interest, originality and aesthetic appeal in photography. In a conventional selfie or portrait, the human eye typically reflects only the photographing device itself or a light source in the surrounding environment, which lacks visual appeal and artistic expression. To improve the interest and aesthetic appeal of the photographed image, the industry has begun to explore techniques for fusing other image content into the iris of the human eye, so that the eye appears to "mirror" the picture of another scene, thereby creating a unique "in-eye viewing" visual effect. However, the prior art has several drawbacks: it supports only a static image mode; it lacks intelligent scene cleanup capability; its iris processing and tracking mechanism is simple, so the result may look unnatural or jittery; and the fused content does not naturally match the curvature of the eyeball or the light and shadow of the surrounding environment.

Disclosure of Invention

In view of the foregoing, embodiments of the present application provide a method and apparatus for processing images and videos which obviate or mitigate one or more of the disadvantages of the related art.

One aspect of the present application provides a method of processing images and videos, the method comprising: synchronously collecting a first media stream and a second media stream, wherein the first media stream is collected by a front camera arranged on an electronic device and the second media stream is collected by a rear camera arranged on the electronic device; performing background optimization processing on the second media stream to generate a corresponding third media stream; performing region positioning processing on the first media stream to obtain an iris region, and performing iris segmentation processing on the iris region to generate a corresponding mask set; performing perspective transformation processing and illumination adaptation processing on the third media stream based on the mask set to generate a fourth media stream; and displaying the fourth media stream in real time or statically.

In some embodiments of the present application, performing background optimization processing on the second media stream to generate a corresponding third media stream comprises: performing noise reduction, color correction, sharpening and fish-eye distortion processing on the second media stream to generate the third media stream. A minimal sketch of the polynomial distortion step is given below.
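The following sketch applies the single-coefficient polynomial (radial) distortion model of claim 8, $r_d = r\,(1 + k r^2)$, to a whole image via inverse sampling with OpenCV. The function name, the normalization by the half-diagonal and the default coefficient are illustrative assumptions, not values taken from the patent.

```python
import numpy as np
import cv2

def apply_polynomial_distortion(img, k=0.2, center=None):
    """Sketch of the polynomial distortion model r_d = r * (1 + k * r^2).

    k and the normalization choice are illustrative; the patent calibrates the
    coefficient from human-eye physiological imaging data.
    """
    h, w = img.shape[:2]
    xc, yc = center if center is not None else (w / 2.0, h / 2.0)

    # Destination pixel grid as offsets from the distortion center,
    # normalized by the half-diagonal so k is resolution independent.
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
    norm = np.hypot(w, h) / 2.0
    dx, dy = (xs - xc) / norm, (ys - yc) / norm
    r = np.hypot(dx, dy)

    # Radial scaling of the polynomial model; each output pixel samples the
    # source image at the radially scaled location (inverse warping via remap).
    scale = 1.0 + k * r * r
    map_x = (xc + dx * scale * norm).astype(np.float32)
    map_y = (yc + dy * scale * norm).astype(np.float32)
    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR, borderMode=cv2.BORDER_REFLECT)
```

In the described pipeline this curvature-like distortion would be applied to the rear-camera stream so that the content fused into the iris follows the spherical appearance of the eyeball rather than looking like a flat sticker.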
In some embodiments of the present application, when the capture mode is an image capture mode, the second media stream includes a plurality of second images; correspondingly, after the background optimization processing is performed on the second media stream to generate a corresponding third media stream, the method further comprises: selecting, from the plurality of third images, the third image with the least occlusion as a reference image; performing detection and target masking processing on the reference image based on a first neural network to obtain a mask vector; and performing mask-region image completion processing on the reference image using the mask vector based on a second neural network, to generate a target-removed third media stream.

In some embodiments of the application, the first neural network comprises a modified YOLOv network whose total loss function includes a CIoU loss, a focal loss and a Dice loss.

In some embodiments of the application, the second neural network comprises a Transformer-based image completion network whose total loss function comprises an L1 loss, a perceptual loss representing the semantic-feature-level difference between the target-removed third media stream and the reference image, and a style loss representing the visual-style-feature-level consistency between the target-removed third media stream and the reference image.

In some embodiments of the application, when the capture mode is a video capture mode, the mask sets of the individual frames form a temporal mask sequence; correspondingly, after performing region positioning processing on the first media stream to obtain an iris region and performing iris segmentation processing on the iris region to generate a corresponding mask set, the method further comprises: calculating iris ellipse model parameters based on the temporal mask sequence; and performing temporal correction on the iris ellipse model parameters by combining motion prediction and weighted smoothing, to obtain smoothed iris ellipse model parameters, as illustrated in the sketch below.
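The following is a minimal sketch of that video-mode step: fit an ellipse to each frame's iris mask, then temporally correct the parameters with a simple constant-velocity prediction blended with weighted (exponential) smoothing. The blending factor `alpha`, the constant-velocity model and the function name are assumptions; the patent only names "motion prediction and weighted smoothing" without specifying the scheme.

```python
import numpy as np
import cv2

def smooth_iris_ellipses(mask_sequence, alpha=0.6):
    """Fit an iris ellipse per frame and temporally smooth its parameters.

    mask_sequence: iterable of binary iris masks (H, W), one per frame.
    Returns one array [cx, cy, axis1, axis2, angle] per frame (None until the
    first successful fit).
    """
    smoothed = []
    prev, velocity = None, None
    for mask in mask_sequence:
        contours, _ = cv2.findContours(mask.astype(np.uint8),
                                       cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        cnt = max(contours, key=cv2.contourArea) if contours else None
        if cnt is None or len(cnt) < 5:        # fitEllipse needs >= 5 points
            smoothed.append(prev)              # carry the last estimate forward
            continue
        (cx, cy), (ax1, ax2), angle = cv2.fitEllipse(cnt)
        params = np.array([cx, cy, ax1, ax2, angle], dtype=np.float32)
        if prev is None:
            prev, velocity = params, np.zeros_like(params)
        else:
            predicted = prev + velocity                         # constant-velocity motion prediction
            blended = alpha * params + (1 - alpha) * predicted  # weighted smoothing toward the prediction
            velocity = blended - prev
            prev = blended
        smoothed.append(prev.copy())
    return smoothed
```

Blending each new measurement with a motion-predicted value suppresses frame-to-frame jitter of the fitted ellipse while still following genuine eye movement, which is what gives the fused content its temporally smooth appearance in video mode.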