EP-4738354-A2 - USING AUDIO CLASSIFICATION TO ENHANCE AUDIO IN VIDEOS


Abstract

A media application obtains a video that includes an audio portion. The media application separates the audio portion into a plurality of channels, where each channel corresponds to a particular audio source. An on-screen classifier model obtains an indication of whether the particular audio source for each channel is depicted in the video. An audio-type classifier model determines an auditory object classification for each channel. The media application determines a respective gain for each channel based on the indication of whether the particular audio source for the channel is depicted in the video and the auditory object classification for the channel. The media application modifies each channel by applying the respective gain. The media application mixes the modified channels with the audio portion to generate a combined audio.
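The abstract's pipeline can be sketched in a few lines of Python. Note that `separate`, `on_screen_clf`, and `audio_type_clf` are hypothetical stand-ins for the models described above, and the gain values and mixing weights are illustrative assumptions, not taken from the disclosure:

```python
import numpy as np

def enhance_audio(frames, audio, separate, on_screen_clf, audio_type_clf):
    """Sketch of the abstract's pipeline: separate, classify, gain, mix."""
    channels = separate(audio)                 # one waveform per audio source
    mixed = np.zeros_like(audio, dtype=float)
    for ch in channels:
        on_screen = on_screen_clf(frames, ch)  # is the source depicted in the video?
        kind = audio_type_clf(ch)              # e.g. "enhancer" or "distractor"
        gain = 1.5 if (on_screen and kind == "enhancer") else 0.5
        mixed += gain * ch                     # modify channel by its gain
    # mix the modified channels back with the original audio portion
    return 0.8 * mixed + 0.2 * audio
```

In this sketch the classifiers are passed in as callables so the control flow (per-channel gain, then a final blend with the unmodified audio) is visible without committing to any particular model architecture.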

Inventors

  • KIM, Moonseok
  • PATROS, Elliot
  • SINGARAJU, Sneh
  • ANSAI, Michelle

Assignees

  • Google LLC

Dates

Publication Date
2026-05-06
Application Date
2023-08-09

Claims (13)

  1. A computer-implemented method (600) comprising: obtaining (602) a video that includes an audio portion; separating (604) the audio portion into a plurality of channels, wherein each channel corresponds to a particular audio source; obtaining (606), with an on-screen classifier model, an indication of whether the particular audio source for each channel is depicted in the video, wherein image embeddings for a plurality of video frames of the video and audio embeddings for the plurality of channels are provided as input to the on-screen classifier model; determining (608), with an audio-type classifier model, an auditory object classification for each channel; providing a user interface to a user for specifying user preferences related to one or more types of audio sources; receiving a selection of the one or more types of audio sources from the user via the user interface; determining (610) a respective gain for each channel based on the indication of whether the particular audio source for the channel is depicted in the video, the auditory object classification for the channel, and the selection of the one or more types of audio sources from the user; modifying (612) each channel by applying the respective gain; and after the modifying, mixing (614) the modified channels with the audio portion to generate a combined audio.
  2. The method of claim 1, wherein: the auditory object classification is one of: an enhancer type or a distractor type; and determining the respective gain for each channel based on the indication of whether the particular audio source for the channel is depicted in the video and the auditory object classification for the channel comprises determining the respective gain to each channel such that a volume level of channels associated with the enhancer type is raised and a volume level of channels associated with the distractor type is lowered.
  3. The method of claim 2, further comprising after receiving the selection of the one or more types of audio sources, classifying the particular audio source as the enhancer type.
  4. The method of claim 2 or 3, wherein separating the audio portion into the plurality of channels is such that each of the plurality of channels is associated with a respective sound type.
  5. The method of claim 4, wherein one or more of the plurality of channels is obtained by combining two or more audio sources in the audio portion that are of a same sound type.
  6. The method of any one of the preceding claims, wherein the user interface includes an option for modifying a multiplier or an offset for respective one or more types of audio sources.
  7. The method of any one of the preceding claims, wherein the user interface includes a field for providing an additional type of an audio source that is not included in the user interface.
  8. The method of any one of the preceding claims, further comprising mixing at least a part of higher-frequency portions of the audio portion in with the combined audio.
  9. The method of any one of the preceding claims, wherein the respective gain for each channel is further based on a confidence associated with the indication and a confidence associated with the auditory object classification.
  10. The method of any one of the preceding claims, further comprising converting, with a resampler, the audio portion to a processing sampling rate, wherein mixing the modified channels with the audio portion includes mixing the modified channels with the converted audio portion.
  11. The method of any one of the preceding claims, wherein the audio embeddings represent respective audio features for each of the plurality of channels.
  12. A non-transitory computer-readable medium with instructions stored thereon that, when executed by one or more computers, cause the one or more computers to perform steps of the method according to any one of the preceding claims.
  13. A computing device comprising: a processor; and a memory coupled to the processor, with instructions stored thereon that, when executed by the processor, cause the processor to perform steps of the method according to any one of the preceding claims 1 to 11.
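The gain determination running through claims 2, 6, and 9 can be sketched as follows. All constants, the off-screen penalty, and the `user_prefs` dictionary shape are illustrative assumptions rather than values taken from the claims:

```python
def channel_gain(on_screen, on_conf, audio_type, type_conf, user_prefs=None):
    """Sketch of per-channel gain: raise enhancers, lower distractors
    (claim 2), weight by classifier confidences (claim 9), and apply an
    optional user multiplier/offset per audio type (claim 6)."""
    base = 1.0
    # claim 2: enhancer channels are raised, distractor channels lowered
    delta = 0.5 if audio_type == "enhancer" else -0.5
    # assumption: off-screen sources get an additional cut
    if not on_screen:
        delta -= 0.25
    # claim 9: scale the adjustment by both classification confidences
    gain = base + (on_conf * type_conf) * delta
    # claim 6: user-specified multiplier and offset for this audio type
    if user_prefs and audio_type in user_prefs:
        mult, offset = user_prefs[audio_type]
        gain = gain * mult + offset
    return max(gain, 0.0)
```

Keeping the confidence weighting multiplicative means a low-confidence classification leaves the channel close to unity gain, which is a conservative failure mode for an automatic enhancement.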

Description

BACKGROUND

Capturing high-quality video on a mobile device is possible given the quality of camera hardware on mobile devices such as smartphones, tablets, etc. However, because mobile devices may not include studio-quality audio hardware, e.g., directional microphones, microphones with tuned sensitivity, etc., capturing high-quality audio is not possible. Due to the small form factor and other limitations (e.g., battery), mobile devices are not large enough to accommodate such hardware. To overcome limitations in capturing high-quality audio when capturing video using a mobile device, professional videographers may use wireless lavalier microphones, shotgun microphones with passive wind screens, shock-absorbing mounts, and the like. However, a casual user who wants to record a video has to rely on the mobile device hardware for audio capture. Manufacturers of mobile devices have tried to provide audio enhancement algorithms to make up for audio hardware deficiencies. However, it may be difficult to obtain high-quality results with such techniques.

The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.

SUMMARY

A computer-implemented method includes obtaining a video that includes an audio portion. The method further includes separating the audio portion into a plurality of channels, where each channel corresponds to a particular audio source.
The method further includes obtaining, with an on-screen classifier model, an indication of whether the particular audio source for each channel is depicted in the video, where image embeddings for a plurality of video frames of the video and audio embeddings for the plurality of channels are provided as input to the on-screen classifier model. The method further includes determining, with an audio-type classifier model, an auditory object classification for each channel. The method further includes determining a respective gain for each channel based on the indication of whether the particular audio source for the channel is depicted in the video and the auditory object classification for the channel. The method further includes modifying each channel by applying the respective gain. The method further includes, after the modifying, mixing the modified channels with the audio portion to generate a combined audio.

In some embodiments, the auditory object classification is one of: an enhancer type or a distractor type, and determining the respective gain for each channel based on the indication of whether the particular audio source for the channel is depicted in the video and the auditory object classification for the channel comprises determining the respective gain for each channel such that a volume level of channels associated with the enhancer type is raised and a volume level of channels associated with the distractor type is lowered. In some embodiments, separating the audio portion into the plurality of channels is such that each of the plurality of channels is associated with a respective sound type. In some embodiments, one or more of the plurality of channels is obtained by performing deduplication to combine two or more audio sources in the audio portion that are of a same sound type. In some embodiments, the image embeddings represent respective local video features for a plurality of regions of a frame of the video.
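One way the on-screen classifier's inputs could be assembled is sketched below. The shapes, mean-pooling over frames, and the function name are assumptions for illustration, not details from the disclosure:

```python
import numpy as np

def build_classifier_input(image_embeds, audio_embeds):
    """Pair video context with per-channel audio features.

    image_embeds: (num_frames, d_img)   -- one embedding per video frame
    audio_embeds: (num_channels, d_aud) -- one embedding per separated channel
    Returns a (num_channels, d_img + d_aud) array: each row concatenates
    the pooled video context with one channel's audio embedding.
    """
    video_ctx = image_embeds.mean(axis=0)  # pool frames into one context vector
    return np.stack([np.concatenate([video_ctx, a]) for a in audio_embeds])
```

An on-screen classifier could then score each row independently, yielding one "is this source depicted?" indication per channel.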
In some embodiments, the audio embeddings represent respective local audio features for each of the plurality of channels. In some embodiments, the respective gain for each channel is based on a confidence associated with the indication and a confidence associated with the auditory object classification. In some embodiments, the method further includes mixing at least a part of the audio portion in with the combined audio. In some embodiments, the method further includes mixing at least a part of higher-frequency portions of the audio portion in with the combined audio. In some embodiments, the separating is performed using an audio-separation model, wherein the audio-separation model uses the image embeddings as a conditioning input, and the conditioning input provides cues to the audio-separation model about audio sources present in the video.

Some embodiments include a non-transitory computer-readable medium with instructions stored thereon that, when executed by one or more computers, cause the one or more computers to perform operations including: obtaining a video that includes an audio portion; separating the audio portion into a plurality of channels, where each channel corresponds to a particular audio source; obtaining, with an on-screen classifier model, an indication of whether the particula