US-12621417-B1 - Face-aware relighting of live video content
Abstract
Systems and methods are provided for modifying video content to improve lighting of a person's face depicted within the video content. Pixels depicting skin may be detected in a first frame of the video content. Transformation parameters may then be determined based on intensity values of the pixels depicting skin, where the transformation parameters represent adjusted pixel intensity values determined to improve at least one of brightness or contrast of the pixels depicting skin. Based on the transformation parameters, intensity association data may be generated and stored that associates each possible input pixel intensity value in at least one channel with a corresponding adjusted pixel intensity value. The stored intensity association data as determined with respect to the first frame may then be reused to modify intensity values for a series of frames of the video content.
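For orientation, the stored "intensity association data" behaves like a per-channel lookup table with one entry for every possible input intensity. A minimal sketch in Python/NumPy, assuming 8-bit intensities (the function name is illustrative, not taken from the patent):

```python
import numpy as np

def apply_intensity_lut(channel: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Remap one 8-bit channel of a frame through a 256-entry lookup table.

    Relighting a frame is a single vectorized indexing operation, with no
    per-frame model inference required.
    """
    assert lut.shape == (256,) and lut.dtype == np.uint8
    return lut[channel]
```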
Inventors
- Prerit Jaiswal
- Mehmet Umut Isik
- Srikanth Venkata Tenneti
- Samuel J. Wilson
- Amritpal Singh Saini
- Parisa Rahimzadeh
- Amalavoyal Chari
- Michael Mark Goodwin
Assignees
- Amazon Technologies, Inc.
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-03-31
Claims (18)
- 1 . A system comprising: memory; and at least one computing device configured with computer-executable instructions that, when executed, cause the at least one computing device to: obtain a first frame of video content, wherein the video content depicts at least a human face as captured by a camera, wherein the camera is part of the system or in local communication with the system; determine a bounding region within the first frame that includes depiction of the human face, wherein the bounding region is determined at least in part by providing the first frame as input to a machine learning model trained to detect depiction of human faces within an input image; apply segmentation to image data within the bounding region of the first frame to identify pixels depicting skin; determine transformation parameters based on intensity values of the pixels depicting skin, wherein the transformation parameters represent at least adjusted pixel intensity values determined by the system to improve at least one of brightness or contrast of the pixels depicting skin, wherein the transformation parameters are determined with respect to at least one channel of a color space; store a lookup table based on the transformation parameters, wherein the lookup table comprises, for each possible input pixel intensity value in the at least one channel of the color space, a corresponding adjusted pixel intensity value determined by the system to improve the at least one of brightness or contrast of the pixels depicting skin; generate a relit first frame of the video content, wherein the relit first frame of the video content is generated at least in part by changing an intensity value, for the at least one channel, of each pixel in the first frame to a corresponding adjusted pixel intensity value identified in the lookup table; obtain a second frame of the video content; determine, based at least in part on a comparison of lighting in the second frame relative to the first frame, to reuse the lookup table as stored with respect to the first frame; and based on the determination to reuse the lookup table, generate a relit second frame of the video content, wherein the relit second frame of the video content is generated at least in part by changing an intensity value, for the at least one channel, of each pixel in the second frame to a corresponding adjusted pixel intensity value identified in the lookup table, wherein the relit second frame of the video content is generated without detecting the human face within the second frame and without applying segmentation to the image data of the second frame.
- 2 . The system of claim 1 , wherein the color space is represented as Hue, Saturation, Value (HSV), and wherein the at least one channel for which the adjusted pixel intensity values are determined comprises a Value channel.
- 3 . The system of claim 1 , wherein the computer-executable instructions, when executed, further cause the at least one computing device to: subsequent to generating the relit second frame of the video content using the lookup table, obtain a third frame of the video content; determine, based at least in part on a detected change in lighting between the third frame and at least one of the first frame or the second frame, to modify the lookup table; determine new transformation parameters based on intensity values of pixels depicting skin within the third frame; and store a modified lookup table based at least in part on the new transformation parameters, wherein the modified lookup table comprises, for each possible input pixel intensity value in the at least one channel of the color space, a corresponding newly adjusted pixel intensity value determined to improve the at least one of brightness or contrast of the pixels depicting skin within the third frame.
- 4 . The system of claim 3 , wherein each of the newly adjusted pixel intensity values in the modified lookup table comprises a moving average that is based at least in part on a corresponding value in the lookup table.
- 5 . A computer-implemented method comprising: obtaining a first frame of video content, wherein the video content depicts at least a human face as captured by a camera; applying segmentation to image data of the first frame to identify pixels depicting skin; determining transformation parameters based at least in part on intensity values of the pixels depicting skin, wherein the transformation parameters represent at least adjusted pixel intensity values determined to improve at least one of brightness or contrast of the pixels depicting skin, wherein the transformation parameters are determined with respect to at least one channel of a color space; based on the transformation parameters, storing intensity association data that associates each possible input pixel intensity value in the at least one channel of the color space with a corresponding adjusted pixel intensity value; generating a modified first frame of the video content, wherein the modified first frame of the video content is generated at least in part by changing an intensity value, for the at least one channel, of each pixel in the first frame to a corresponding adjusted pixel intensity value identified in the stored intensity association data; obtaining a second frame of the video content; and generating a modified second frame of the video content, wherein the modified second frame of the video content is generated at least in part by changing an intensity value, for the at least one channel, of each pixel in the second frame to a corresponding adjusted pixel intensity value identified in the stored intensity association data, wherein the modified second frame of the video content is generated without applying segmentation to the image data of the second frame.
- 6 . The computer-implemented method of claim 5 further comprising, prior to applying the segmentation: determining a bounding region within the first frame that includes depiction of the human face, wherein the bounding region is determined at least in part by providing the first frame as input to a machine learning model trained to detect depiction of human faces within an input image, wherein the segmentation to identify pixels depicting skin is performed with respect to image data within the bounding region.
- 7 . The computer-implemented method of claim 5 , wherein the color space is represented as Hue, Saturation, Value (HSV), and wherein the at least one channel for which the adjusted pixel intensity values are determined comprises a Value channel.
- 8 . The computer-implemented method of claim 5 further comprising: subsequent to generating the modified second frame of the video content, obtaining a third frame of the video content; determining, based at least in part on a detected change in lighting between the third frame and at least one of the first frame or the second frame, to modify the stored intensity association data; determining new transformation parameters based on intensity values of pixels depicting skin within the third frame; and storing modified intensity association data based at least in part on the new transformation parameters, wherein the modified intensity association data comprises, for each possible input pixel intensity value in the at least one channel of the color space, a corresponding newly adjusted pixel intensity value determined to improve the at least one of brightness or contrast of the pixels depicting skin within the third frame.
- 9 . The computer-implemented method of claim 8 , wherein each of the newly adjusted pixel intensity values in the modified intensity association data comprises a moving average that is based at least in part on a corresponding value in the stored intensity association data.
- 10 . The computer-implemented method of claim 5 , wherein the computer-implemented method is implemented by a mobile phone or personal computer as the video content is captured by the camera, wherein the camera is part of or in local communication with the mobile phone or personal computer.
- 11 . The computer-implemented method of claim 10 further comprising causing the mobile phone or personal computer to send, over a network, the modified first frame and modified second frame to a videoconference service for presentation to one or more devices participating in a videoconferencing instance.
- 12 . The computer-implemented method of claim 5 , wherein the transformation parameters are determined based at least in part on mean and standard deviation of the intensity values of the pixels depicting skin and a target intensity distribution.
- 13 . One or more non-transitory computer readable media including computer-executable instructions that, when executed by a computing system, cause the computing system to perform operations comprising: obtaining a first frame of video content, wherein the video content depicts at least a human face as captured by a camera; identifying, from among a plurality of pixels within the first frame, pixels depicting skin; determining transformation parameters based at least in part on intensity values of the pixels depicting skin, wherein the transformation parameters are determined with respect to at least one channel of a color space; based on the transformation parameters, storing intensity association data that associates each possible input pixel intensity value in the at least one channel of the color space with a corresponding adjusted pixel intensity value; generating a modified first frame of the video content, wherein the modified first frame of the video content is generated at least in part by changing an intensity value, for the at least one channel, of each pixel in the first frame to a corresponding adjusted pixel intensity value identified in the stored intensity association data; obtaining a second frame of the video content; and generating a modified second frame of the video content, wherein the modified second frame of the video content is generated at least in part by changing an intensity value, for the at least one channel, of each pixel in the second frame to a corresponding adjusted pixel intensity value identified in the stored intensity association data.
- 14 . The one or more non-transitory computer readable media of claim 13 , wherein the operations further comprise, prior to identifying the pixels depicting skin: determining a bounding region within the first frame that includes depiction of the human face, wherein the bounding region is determined at least in part by providing the first frame as input to a machine learning model trained to detect depiction of human faces within an input image, wherein the pixels depicting skin are identified with respect to image data within the bounding region.
- 15 . The one or more non-transitory computer readable media of claim 13 , wherein the operations further comprise determining and storing adjusted intensity association data with reference to a newly obtained frame of the video content in response to a trigger event.
- 16 . The one or more non-transitory computer readable media of claim 15 , wherein the trigger event is based on one of (a) a lighting change detected within the newly obtained frame relative to a prior frame, or (b) a determination that an amount of time that has passed since a previous update to the stored intensity association data meets a threshold.
- 17 . The one or more non-transitory computer readable media of claim 13 , wherein the transformation parameters are determined based at least in part on distribution of the intensity values of the pixels depicting skin and a target intensity distribution.
- 18 . The one or more non-transitory computer readable media of claim 13 , wherein the adjusted pixel intensity values are determined to improve at least one of brightness, contrast, saturation, tint, color, or exposure of the pixels depicting skin.
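Claims 12 and 17 above describe deriving the transformation parameters from the distribution (e.g., mean and standard deviation) of the skin-pixel intensities together with a target intensity distribution. A minimal sketch of one way to realize this as an affine gain/bias fit in Python/NumPy; the target values and names are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def build_lut(skin_values: np.ndarray,
              target_mean: float = 150.0,    # assumed target distribution
              target_std: float = 50.0) -> np.ndarray:
    """Build a 256-entry LUT that moves the skin-pixel intensity distribution
    toward a target mean (brightness) and standard deviation (contrast)."""
    mean = float(skin_values.mean())
    std = float(skin_values.std()) or 1.0    # guard against a flat region
    gain = target_std / std                  # contrast adjustment
    bias = target_mean - gain * mean         # brightness adjustment
    levels = np.arange(256, dtype=np.float32)
    return np.clip(gain * levels + bias, 0, 255).astype(np.uint8)
```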
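Claims 4 and 9 recite that each updated table entry comprises a moving average over the corresponding stored value. A minimal sketch using an exponential moving average (the blend factor is an assumption; the claims do not specify one):

```python
import numpy as np

def update_lut(old_lut: np.ndarray, new_lut: np.ndarray,
               alpha: float = 0.3) -> np.ndarray:
    """Blend a freshly computed LUT into the stored one so the relighting
    adapts smoothly rather than changing abruptly between frames."""
    blended = (alpha * new_lut.astype(np.float32)
               + (1.0 - alpha) * old_lut.astype(np.float32))
    return np.clip(blended, 0, 255).astype(np.uint8)
```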
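Claims 15 and 16 condition re-estimation on a trigger event: a detected lighting change relative to a prior frame, or a time threshold since the last update. A minimal sketch of such a check (both thresholds are illustrative assumptions):

```python
import time
import numpy as np

def should_refresh(value_channel: np.ndarray, prev_mean: float,
                   last_update: float,
                   brightness_delta: float = 10.0,  # assumed thresholds
                   max_age_s: float = 3.0) -> bool:
    """Return True if scene brightness shifted noticeably or the stored
    intensity association data is older than the time threshold."""
    lighting_changed = abs(float(value_channel.mean()) - prev_mean) > brightness_delta
    stale = (time.monotonic() - last_update) > max_age_s
    return lighting_changed or stale
```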
Description
BACKGROUND

Videoconferencing software enables people to communicate in real time via video and audio over the Internet, relying in part on cameras and microphones that may be integrated within or in local communication with participants' computers or smartphones. When a videoconference call is initiated, local and/or remotely executed software may operate to establish a connection (which may be indirect) between the participants' devices and stream video and audio data back and forth, typically via a server or network-accessible videoconferencing service. While existing videoconferencing software enables remote communication and collaboration between people using any of a variety of hardware, including commonly owned devices such as a laptop computer with an integrated camera, participants' faces are not always well lit or even clearly visible in the streamed video content. For example, a participant's face may not be sufficiently viewable by other participants for reasons that may include poor lighting in the user's physical environment, improper placement of lighting (e.g., light sources behind the user's face rather than in front of it), improper camera settings, and/or low camera quality.

BRIEF DESCRIPTION OF DRAWINGS

Embodiments of various inventive features will now be described with reference to the following drawings. The drawings are provided to illustrate example embodiments described herein and are not intended to limit the scope of the disclosure.

FIG. 1 is a block diagram depicting high-level steps and data flow for adjusting video frames of an incoming video stream to improve the apparent lighting of a person's face appearing therein, according to some embodiments.

FIG. 2 depicts an illustrative operating environment for implementing aspects of the present disclosure, according to some embodiments.

FIG. 3 is a flow diagram of an illustrative method for transforming pixel values within video frames to improve the lighting of a person's face appearing within the video content, according to some embodiments.

FIG. 4 graphically depicts a first histogram of an incoming video frame and a corresponding second, adjusted histogram that may be determined in order to apply appropriate pixel value adjustments to the frame to improve lighting and contrast, according to one example.

FIG. 5 is a block diagram depicting an illustrative architecture for a computing system that may implement one or more of the features described herein.

DETAILED DESCRIPTION

Generally described, aspects of the present disclosure relate to machine learning-based approaches for modifying captured video content to improve the lighting of a subject, and specifically the visibility, brightness, and/or contrast of human faces within the video content. The relighting approaches disclosed herein are computationally efficient enough to run in real time or substantially real time, as video is captured by a camera, on consumer-grade devices such as laptop computers or mobile devices. While the lighting adjustments are often discussed herein in the context of improving the lighting of a participant's face during video calls or videoconferences, these features may alternatively be used to improve the lighting of a face or other object of interest in other video capture or processing contexts, such as live video streaming, broadcasting, or storing captured video locally on a user's device.
In some embodiments, the face relighting features described herein may include (i) face detection, such as using a machine learning model, to detect one or more faces in an initial frame of the video content, (ii) skin segmentation, such as using a machine learning model, to identify pixels in the face region that depict skin, and then (iii) histogram adjustment that adjusts the colors of the input frame so that the face appears better lit, such as with improved brightness and/or contrast. For efficiency purposes, steps (i) and (ii) do not need to be executed for every frame, as they are computationally more complex than step (iii), which can be performed efficiently on every frame using a lookup table or other data structure, as will be described in more detail herein. As will be further described below, steps (i) and (ii) may be repeated for subsequent frames only in response to trigger events, such as a certain time duration passing (e.g., determining new transformation parameters every three seconds) or a detected lighting change in the video content (e.g., a comparison of brightness in the new frame relative to a previous frame suggests that a light source moved or changed in the real-world scene, or that a person in the frame moved relative to real-world light sources in their environment). Thus, according to some embodiments, pixels depicting skin may be detected in a first frame of input video content, such as video content received from a camera.
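Putting steps (i) through (iii) together, a per-frame loop might cache the Value-channel lookup table and rerun detection and segmentation only on trigger events. The sketch below reuses the illustrative helpers above (`build_lut`, `update_lut`, `should_refresh`) and treats `detect_face` and `segment_skin` as stand-ins for the machine learning models; none of these names come from the patent:

```python
import time
import cv2

def relight_stream(frames, detect_face, segment_skin):
    """Relight frames (assumed BGR, e.g. from cv2.VideoCapture), running
    detection/segmentation only when the cached LUT needs refreshing."""
    lut, prev_mean, last_update = None, 0.0, 0.0
    for frame in frames:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        h, s, v = cv2.split(hsv)
        if lut is None or should_refresh(v, prev_mean, last_update):
            box = detect_face(frame)           # step (i): ML face detector
            mask = segment_skin(frame, box)    # step (ii): boolean skin mask
            fresh = build_lut(v[mask])
            lut = fresh if lut is None else update_lut(lut, fresh)
            prev_mean, last_update = float(v.mean()), time.monotonic()
        v = lut[v]                             # step (iii): LUT remap of Value channel
        yield cv2.cvtColor(cv2.merge([h, s, v]), cv2.COLOR_HSV2BGR)
```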