EP-4738351-A1 - REAL-TIME VOCAL REMOVAL FROM AN AUDIO SOURCE


Abstract

Various embodiments disclose a computer-implemented method comprising receiving an audio source for playback by an audio playback system, identifying a left channel and a right channel associated with the audio source, generating a modified left channel comprising the right channel subtracted from the left channel, generating a modified right channel comprising the left channel subtracted from the right channel, causing playback, on a left channel speaker of the audio playback system, of the modified left channel, and causing playback, on a right channel speaker of the audio playback system, of the modified right channel.
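
For illustration only, the channel subtraction summarized above can be sketched in a few lines of Python. The sketch assumes the audio source has already been decoded into two equal-length lists of floating-point samples; the function name `subtract_channels` and the sample values are hypothetical and do not appear in the application. The example further assumes a vocal that is mixed identically into both channels, in which case the vocal cancels in each difference signal while content present in only one channel remains.

```python
from typing import List, Sequence, Tuple


def subtract_channels(
    left: Sequence[float], right: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return (modified_left, modified_right) for one block of samples.

    modified_left  = left - right
    modified_right = right - left
    """
    if len(left) != len(right):
        raise ValueError("left and right channels must have the same length")
    modified_left = [l - r for l, r in zip(left, right)]
    modified_right = [r - l for l, r in zip(left, right)]
    return modified_left, modified_right


# Hypothetical example: a vocal present equally in both channels and a
# guitar present only in the left channel.
vocal = [0.5, -0.25, 0.125]
guitar = [0.25, 0.5, -0.5]
left = [v + g for v, g in zip(vocal, guitar)]
right = list(vocal)

modified_left, modified_right = subtract_channels(left, right)
# The shared vocal cancels; only the guitar remains, with opposite
# polarity in the two modified channels.
```

Because each output sample depends only on the two input samples at the same instant, a computation of this kind can be applied block by block to streamed audio.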

Inventors

Willis, Maxwell B.
Daftuar, Rishi Kumar

Assignees

Harman Becker Automotive Systems GmbH

Dates

Publication Date: 20260506
Application Date: 20251002

Claims (15)

A computer-implemented method comprising: receiving an audio source for playback by an audio playback system; identifying a left channel and a right channel associated with the audio source; generating a modified left channel comprising the right channel subtracted from the left channel; generating a modified right channel comprising the left channel subtracted from the right channel; causing playback, on a left channel speaker of the audio playback system, of the modified left channel; and causing playback, on a right channel speaker of the audio playback system, of the modified right channel.
The computer-implemented method of claim 1, further comprising: generating a center channel comprising the left channel summed with the right channel; and causing playback, on at least one speaker of the audio playback system, of the center channel.
The computer-implemented method of claim 2, further comprising: generating a modified center channel by removing a vocal component from the center channel; and causing playback, on the at least one speaker of the audio playback system, of the modified center channel.
The computer-implemented method of claim 3, wherein generating the modified center channel comprises muting, attenuating, or ducking a mid-band component of the center channel.
The computer-implemented method of claim 3 or 4, wherein generating the modified center channel comprises compressing the center channel by reducing a dynamic range of the center channel to generate the modified center channel.
The computer-implemented method of any of claims 3 to 5, further comprising detecting a vocal input from a microphone coupled to the audio playback system, wherein causing playback of the modified center channel is performed in response to detecting the vocal input.
The computer-implemented method of claim 6, further comprising: detecting a termination of the vocal input; and causing playback, on the at least one speaker of the audio playback system, of the center channel in response to detecting the termination of the vocal input.
The computer-implemented method of claim 6 or 7, wherein detecting the vocal input comprises detecting a user input via a microphone or a user input device.
The computer-implemented method of any of claims 2 to 8, wherein the at least one speaker of the audio playback system comprises a center channel speaker.
The computer-implemented method of any of claims 2 to 9, wherein the at least one speaker of the audio playback system comprises the left channel speaker and the right channel speaker.
The computer-implemented method of any preceding claim, further comprising causing playback, on at least one speaker of the audio playback system, of a vocal input received from at least one microphone coupled to the audio playback system.
One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of a method as mentioned in any of claims 1 to 11.
The one or more non-transitory computer-readable media of claim 12, wherein generating the modified center channel is performed in response to user selection of a karaoke mode.
A system comprising: one or more audio output devices; a memory storing an audio playback application; and a processor coupled to the memory that executes the audio playback application by performing the steps of: receiving an audio source for playback by an audio playback system; identifying a left channel and a right channel associated with the audio source; generating a modified left channel comprising the right channel subtracted from the left channel; generating a modified right channel comprising the left channel subtracted from the right channel; causing playback, on a left channel speaker of the audio playback system, of the modified left channel; and causing playback, on a right channel speaker of the audio playback system, of the modified right channel.
The system of claim 14, wherein the one or more audio output devices, the memory, and the processor are integrated into a vehicle.
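
As a non-limiting illustration of the center channel features recited in claims 2, 3, 6, and 7, the following Python sketch derives a center channel by summing the left and right channels and selects between the center channel and a modified center channel depending on whether a vocal input is detected. All names (`center_channel`, `attenuate`, `select_center`, `duck_gain`) and the default gain of 0.25 are assumptions made for this sketch. Broadband attenuation is used here only as a simple stand-in for removing a vocal component; it is not the mid-band processing of claim 4 or the compression of claim 5.

```python
from typing import List, Sequence


def center_channel(left: Sequence[float], right: Sequence[float]) -> List[float]:
    """Center channel formed as the left channel summed with the right channel."""
    return [l + r for l, r in zip(left, right)]


def attenuate(channel: Sequence[float], gain: float) -> List[float]:
    """Scale every sample by a fixed gain (hypothetical stand-in for vocal removal)."""
    return [gain * s for s in channel]


def select_center(
    left: Sequence[float],
    right: Sequence[float],
    vocal_input_detected: bool,
    duck_gain: float = 0.25,
) -> List[float]:
    """Return the modified center channel while a vocal input is detected,
    and the unmodified center channel otherwise."""
    center = center_channel(left, right)
    if vocal_input_detected:
        return attenuate(center, duck_gain)
    return center


# Hypothetical block of samples.
left = [0.5, 0.25]
right = [0.25, -0.25]

while_singing = select_center(left, right, vocal_input_detected=True)
after_singing = select_center(left, right, vocal_input_detected=False)
```

In this sketch the flag `vocal_input_detected` is supplied by the caller; how a vocal input is detected from a microphone or a user input device is outside the scope of the example.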

Description

BACKGROUND
Field of the Various Embodiments
The various embodiments relate generally to audio processing and, more specifically, to real-time vocal removal from an audio source.
Description of the Related Art
Modern vehicles include in-vehicle infotainment (IVI) systems that receive audio and video inputs from various sources. The IVI system includes various output devices, such as displays and loudspeakers, that are positioned throughout the vehicle. An IVI system obtains an input, such as an audio input, selected by a user from a local or remote audio source, and plays back the audio input using an output device in the vehicle.
Karaoke experiences can be provided by an IVI system and involve one or more users singing along with a prerecorded audio performance that is played back by an audio output device of the IVI system. A user sings along with the prerecorded audio performance and, in some instances, a microphone is utilized to capture the user's voice, which is reproduced using the same audio output device that plays back the prerecorded audio performance. In some cases, users prefer to utilize an audio source from which the primary and/or background vocals have been removed.
Some prerecorded audio performances are created specifically for use with karaoke experiences by preprocessing an audio source to remove vocal components. The preprocessing is generally performed by a person, such as an audio engineer or producer, or by an automated vocal removal algorithm, and the preprocessed audio source is provided as an audio source to an audio playback system. In other examples, a prerecorded audio performance for use with a karaoke experience is created by recording an instrumental version of an audio source without primary and/or secondary vocals. In either scenario, creating a version of an audio source for use in a karaoke experience requires preprocessing or pre-recording the audio source that is used for the karaoke experience.
Another technique for providing a karaoke experience involves playing back an audio source and allowing the user to sing over the unmodified version of the audio source. However, a karaoke experience that is provided using audio sources containing vocals results in a poor karaoke experience for many users.
Some karaoke experiences provide mechanisms for real-time suppression of vocal components of an audio source that is played back during a karaoke experience. One technique for real-time suppression of vocal components is performing mid-band ducking of an audio source, which lowers the volume of the mid-band component of an audio signal, which is where vocal components are often contained. However, with mid-band ducking, components of the audio other than vocal components, such as instrumental components, are also removed, degrading the quality of the karaoke experience.
Additionally, in the case of 5.1, 7.1, or other multi-channel audio sources, vocal components are often included in a center channel of the multi-channel audio source. Therefore, the center channel component can be removed or ducked, which lowers the volume of the channel in which vocal components are often contained. However, 5.1, 7.1, or other multi-channel audio sources are often unavailable.
One drawback with utilizing conventional techniques for removing vocal components from audio sources to provide a karaoke experience is that many vocal removal algorithms cannot be utilized in real-time. Vocal removal algorithms often require significant processing time that prevents the algorithms from being used in a real-time manner, such as on audio sources that are streamed for playback. Additionally, utilizing prerecorded karaoke versions of an audio source does not allow users to have a karaoke experience for all audio sources that are played back by the audio playback system.
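
For illustration only, the conventional mid-band ducking discussed above can be sketched as follows. The sketch makes several assumptions that are not taken from the application: the mid band is approximated as 300 Hz to 3400 Hz, the band is isolated as the difference of two first-order low-pass filters, and the names `one_pole_lowpass`, `duck_mid_band`, and `rms` are hypothetical. A practical system would likely use steeper filters.

```python
import math
from typing import List, Sequence


def one_pole_lowpass(
    x: Sequence[float], cutoff_hz: float, sample_rate_hz: float
) -> List[float]:
    """First-order recursive low-pass filter."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    state = 0.0
    y = []
    for s in x:
        state += alpha * (s - state)
        y.append(state)
    return y


def duck_mid_band(
    x: Sequence[float],
    sample_rate_hz: float,
    low_hz: float = 300.0,    # assumed lower edge of the mid band
    high_hz: float = 3400.0,  # assumed upper edge of the mid band
    gain: float = 0.25,       # 1.0 leaves the signal unchanged, 0.0 removes the band
) -> List[float]:
    """Scale the mid-band component of x by `gain` and leave the rest as is."""
    below_high = one_pole_lowpass(x, high_hz, sample_rate_hz)
    below_low = one_pole_lowpass(x, low_hz, sample_rate_hz)
    out = []
    for s, hi, lo in zip(x, below_high, below_low):
        mid = hi - lo  # approximate mid-band component
        out.append(s - (1.0 - gain) * mid)
    return out


def rms(x: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in x) / len(x))


# Hypothetical test signal: 0.1 s of a 1 kHz tone sampled at 48 kHz.
sample_rate_hz = 48000
tone = [math.sin(2.0 * math.pi * 1000.0 * n / sample_rate_hz) for n in range(4800)]
ducked = duck_mid_band(tone, sample_rate_hz, gain=0.0)
```

The sketch also makes the stated drawback concrete: the 1 kHz tone is attenuated whether it originates from a vocal or from an instrument, because the processing acts on a frequency band rather than on the vocal component itself.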
A drawback of performing mid-band ducking on the left and right channels of an audio source is that components of an audio source other than vocal components are removed by these techniques, which degrades the quality of the karaoke experience. A drawback of performing center channel ducking of an audio source containing a discrete center channel is that a discrete center channel is often unavailable for music.
As the foregoing illustrates, what is needed in the art are more effective techniques for processing audio sources that provide an acceptable karaoke experience for users.
SUMMARY
In various embodiments, a computer-implemented method comprises receiving an audio source for playback by an audio output device, identifying a left channel and a right channel associated with the audio source, causing playback, on a left channel of the audio output device, of a modified left channel comprising the right channel subtracted from the left channel, and causing playback, on a right channel of the audio output device, of a modified right channel comprising the left channel subtracted from the right channel. At least one technical advantage of the