EP-4738340-A1 - SEGMENTATION OF AUDIO SOURCE FOR VOCAL REMOVAL

Abstract

Various embodiments disclose a computer-implemented method comprising receiving an audio source for playback, extracting a first segment of the audio source, the first segment comprising a first portion of the audio source, removing a first vocal component from the first segment to create a first modified segment, and causing playback of at least a subsegment of the first modified segment using one or more audio output devices.

Inventors

WILLIS, Maxwell B.
WINTON, Riley
TRESTAIN, Christopher Michael

Assignees

Harman Becker Automotive Systems GmbH

Dates

Publication Date: 2026-05-06
Application Date: 2025-10-15

Claims (15)

1. A computer-implemented method comprising: receiving an audio source for playback; extracting a first segment of the audio source, the first segment comprising a first portion of the audio source; removing a first vocal component from the first segment to create a first modified segment; and causing playback of at least a subsegment of the first modified segment using one or more audio output devices.
2. The computer-implemented method of claim 1, wherein removing the first vocal component from the first segment comprises executing a vocal removing algorithm on the first segment to produce the first modified segment.
3. The computer-implemented method of claim 1 or 2, further comprising detecting a number of users, wherein the vocal removing algorithm is selected based on the number of users.
4. The computer-implemented method of any of claims 1 to 3, wherein the number of users comprises a number of occupants of a vehicle.
5. The computer-implemented method of any of claims 1 to 4, further comprising: extracting a second segment of the audio source, the second segment comprising a second portion of the audio source that is subsequent to the first portion; removing a second vocal component from the second segment to create a second modified segment; and causing playback of at least a subsegment of the second modified segment subsequent to the first modified segment using one or more audio output devices.
6. The computer-implemented method of any of claims 1 to 5, wherein the first segment and second segment temporally overlap.
7. The computer-implemented method of any of claims 1 to 6, wherein causing playback of the subsegment of the second modified segment subsequent to the subsegment of the first modified segment comprises cross-fading playback of the subsegment of the first segment with the playback of the subsegment of the second segment subsequent to the first segment.
8. The computer-implemented method of any of claims 1 to 7, wherein causing playback of the second modified segment subsequent to the first modified segment comprises causing playback of at least a subsegment of the second modified segment upon completion of playback of at least the subsegment of the first modified segment.
9. The computer-implemented method of any of claims 1 to 8, further comprising selecting a size of the first segment based upon a processing time required to remove the first vocal component from the first segment.
10. The computer-implemented method of any of claims 1 to 9, wherein a size of the second segment is different from a size of the first segment.
11. The computer-implemented method of any of claims 1 to 10, wherein a processing time required to remove a second vocal component from a second segment that is subsequent to the first segment is less than a playback time of the first segment.
12. One or more non-transitory computer-readable media storing instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of: receiving an audio source for playback; extracting a first segment of the audio source, the first segment comprising a first portion of the audio source; removing a first vocal component from the first segment to create a first modified segment; and causing playback of at least a subsegment of the first modified segment using one or more audio output devices.
13. The one or more non-transitory computer-readable media of claim 12, further comprising selecting a size of the first segment based upon a processing time required to remove the first vocal component from the first segment.
14. The one or more non-transitory computer-readable media of claim 12 or 13, wherein playback of the first segment is delayed based on the processing time.
15. A system comprising: one or more audio output devices; a memory storing an audio playback application; and a processor coupled to the memory that executes the audio playback application by performing the steps of the method of any of claims 1 to 11.
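
The segmentation, overlap, cross-fade, and segment-size steps recited in the claims can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function names, the linear fade curve, the `processing_ratio` measure (seconds of compute per second of audio), and the startup-delay budget are all assumptions introduced here for illustration, and `remove_vocals` stands in for whatever vocal removing algorithm is actually used.

```python
from typing import Callable, List, Sequence

Samples = List[float]


def extract_segments(source: Sequence[float], seg_len: int, overlap: int) -> List[Samples]:
    """Split the source into segments; consecutive segments share `overlap` samples."""
    if not 0 <= overlap < seg_len:
        raise ValueError("overlap must satisfy 0 <= overlap < seg_len")
    hop = seg_len - overlap
    segments: List[Samples] = []
    start = 0
    while start < len(source):
        segments.append(list(source[start:start + seg_len]))
        if start + seg_len >= len(source):
            break  # this segment already reaches the end of the source
        start += hop
    return segments


def crossfade_join(segments: List[Samples], overlap: int) -> Samples:
    """Concatenate segments, linearly cross-fading across each overlapping region."""
    out: Samples = list(segments[0]) if segments else []
    for seg in segments[1:]:
        n = min(overlap, len(seg), len(out))
        base = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # fade-in weight of the incoming segment
            out[base + i] = (1.0 - w) * out[base + i] + w * seg[i]
        out.extend(seg[n:])
    return out


def pick_segment_seconds(processing_ratio: float,
                         candidates: Sequence[float],
                         max_startup_delay: float) -> float:
    """Pick the longest candidate duration whose processing time fits the delay budget.

    Hypothetical heuristic: the first segment must be processed before playback
    starts, so its processing time is the startup delay.
    """
    if processing_ratio >= 1.0:
        # Processing the next segment would take longer than playing the current one.
        raise ValueError("vocal removal is slower than playback")
    fitting = [s for s in candidates if s * processing_ratio <= max_startup_delay]
    if not fitting:
        raise ValueError("no candidate segment size fits the startup budget")
    return max(fitting)


def process_source(source: Sequence[float], seg_len: int, overlap: int,
                   remove_vocals: Callable[[Samples], Samples]) -> Samples:
    """Segment the source, remove vocals per segment, and stitch the results.

    A real player would process segment k+1 while segment k is playing;
    this sketch processes everything up front for simplicity.
    """
    modified = [remove_vocals(seg) for seg in extract_segments(source, seg_len, overlap)]
    return crossfade_join(modified, overlap)
```

With an identity function passed as `remove_vocals`, `process_source` returns the original samples to within floating-point rounding, which is a convenient check that the overlap and cross-fade bookkeeping is consistent.
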

Description

BACKGROUND

Field of the Various Embodiments

The various embodiments relate generally to audio processing and, more specifically, to segmentation of an audio source for vocal removal.

Description of the Related Art

Modern vehicles include in-vehicle infotainment (IVI) systems that receive audio and video inputs from various sources. The IVI system includes various output devices, such as displays and loudspeakers, that are positioned throughout the vehicle. An IVI system obtains an input, such as an audio input, selected by a user from a local or remote audio source, and plays back the audio input using an output device in the vehicle.

Karaoke experiences can be provided by an IVI system and involve singing along with a prerecorded audio performance that is played back by an audio output device of the IVI system. A user sings along with the prerecorded audio performance and, in some instances, a microphone is utilized to capture the user's voice, which is reproduced using the same audio output device that plays back the prerecorded audio performance. In some cases, users prefer to utilize an audio source from which the primary and/or background vocals have been removed.

Some prerecorded audio performances are created specifically for use with karaoke experiences by preprocessing a song to remove vocal components. The preprocessing is generally performed by a person, such as an audio engineer or producer, or by an automated vocal removing algorithm, and the preprocessed song is provided as an audio source to an audio playback system. In other examples, a prerecorded audio performance for use with a karaoke experience is created by recording an instrumental version of a song without primary and/or secondary vocals. In either scenario, creating a version of a song for use in a karaoke experience requires preprocessing or pre-recording the song that is used for the karaoke experience.

Another technique for providing a karaoke experience involves playing back a song and allowing the user to sing over the unmodified version of the song. However, singing over an audio source that still contains vocals results in a poor karaoke experience for many users.

Some karaoke experiences provide mechanisms for real-time suppression of the vocal components of a song during playback. One such technique is mid-band ducking, which lowers the volume of the mid-band component of an audio signal, where vocal components are often contained. However, mid-band ducking also suppresses components of the audio other than vocal components, such as instrumental components, degrading the quality of the karaoke experience. Center channel ducking or suppression is a technique that is utilized with 5.1, 7.1, or other multi-channel audio sources having a discrete center channel. However, many audio sources that include music are two-channel audio sources that lack a discrete center channel.

One drawback of utilizing conventional techniques for removing vocal components from audio sources to provide a karaoke experience is that many vocal remover algorithms cannot be utilized in real time. Vocal remover algorithms often require significant processing time, which prevents them from being used in real time on audio sources that are streamed for playback. Additionally, utilizing prerecorded karaoke versions of a song does not allow users to have a karaoke experience for all audio sources that are played back by the audio playback system. A drawback of providing a karaoke experience with unmodified audio sources that contain vocals is a poor karaoke user experience.
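
The side effect of mid-band ducking described above can be illustrated with a small Python sketch. It is an assumption-laden toy, not a technique taken from this document: the 300 to 3400 Hz band, the 0.25 gain, the test tones, and the function names are all chosen here purely for illustration, and a naive DFT is used so that the example needs only the standard library.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (fine for short demo signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum: Sequence[complex]) -> List[float]:
    """Inverse of dft(); returns the real part of the reconstructed signal."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def duck_mid_band(samples: Sequence[float], sample_rate: float,
                  lo_hz: float, hi_hz: float, gain: float) -> List[float]:
    """Scale every frequency bin between lo_hz and hi_hz by `gain`."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        freq = min(k, n - k) * sample_rate / n  # fold negative-frequency bins
        if lo_hz <= freq <= hi_hz:
            spectrum[k] *= gain
    return idft(spectrum)


def tone_amplitude(samples: Sequence[float], sample_rate: float, freq_hz: float) -> float:
    """Estimate the amplitude of a sinusoid at the DFT bin nearest freq_hz."""
    n = len(samples)
    k = round(freq_hz * n / sample_rate)
    coeff = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
    return 2.0 * abs(coeff) / n


def sine(freq_hz: float, sample_rate: float, n: int) -> List[float]:
    """Unit-amplitude sine tone of n samples."""
    return [math.sin(2.0 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


if __name__ == "__main__":
    rate, n = 8000.0, 256
    # Stand-ins: a "vocal" at 1000 Hz, an "instrument" at 1500 Hz, a "bass" at 125 Hz.
    mix = [v + g + b for v, g, b in zip(sine(1000, rate, n), sine(1500, rate, n), sine(125, rate, n))]
    ducked = duck_mid_band(mix, rate, 300.0, 3400.0, 0.25)
    for name, f in (("vocal", 1000), ("instrument", 1500), ("bass", 125)):
        print(name, round(tone_amplitude(ducked, rate, f), 3))
```

In this toy mix the in-band instrument tone is attenuated by the same factor as the vocal tone while the out-of-band bass tone is left untouched, which is the kind of collateral loss the passage attributes to mid-band ducking.
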
A drawback of performing mid-band ducking or center channel ducking is that components of an audio source other than vocal components are removed by these techniques, which degrades the quality of the karaoke experience.

As the foregoing illustrates, what is needed in the art are more effective techniques for processing audio sources that provide an acceptable karaoke experience for users.

SUMMARY

In various embodiments, a computer-implemented method includes receiving an audio source for playback; extracting a first segment of the audio source, the first segment comprising a first portion of the audio source; removing a first vocal component from the first segment to create a first modified segment; and causing playback of at least a subsegment of the first modified segment using one or more audio output devices.

At least one technical advantage of the disclosed techniques relative to the prior art is that, with the disclosed techniques, the vocal components of an audio source, such as a song that contains vocal components and for which a user desires a karaoke experience, are removed substantially in real time. By removing the vocal components of the song substantially in real time, a karaoke experience is