US-12620380-B1 - Systems, devices, and methods for dynamic synchronization of a prerecorded vocal backing track to a vocal performance


Abstract

Disclosed are systems, methods, and devices that overcome timing and self-expression limitations experienced by vocalists when using prerecorded vocal backing tracks. The disclosed systems, devices, and methods dynamically synchronize prerecorded vocal backing tracks with a vocal stream by extracting vocal elements, such as phonemes, vector embeddings, or vocal audio spectra, from the vocal performance. These extracted vocal elements are matched against corresponding timestamped vocal elements previously derived from the prerecorded vocal backing track, enabling precise adjustment and alignment of the backing track timing to the vocalist's performance. Additionally, the system enhances expressive performance by identifying prosody factors, such as pitch, vibrato, accent, stress, dynamics, and level, in the vocal performance and dynamically adjusting corresponding prerecorded prosody factors within predefined ranges. This maintains naturalness and spontaneity in the vocalist's performance, overcoming traditional limitations associated with prerecorded vocal backing tracks.

Inventors

  • Clayton Janes

Assignees

  • Eidol Corporation

Dates

Publication Date
2026-05-05
Application Date
2025-10-06

Claims (20)

  1. A method, comprising: identifying and extracting vocal elements from a vocal stream, by at least one of one or more processors, the vocal stream digitally representing a vocal performance; dynamically controlling timing of a prerecorded vocal backing track using the vocal elements extracted from the vocal stream matched to timestamped vocal elements from the prerecorded vocal backing track by at least one of the one or more processors; and outputting a resulting dynamically controlled prerecorded vocal backing track that is time-synchronized to the vocal stream.
  2. The method of claim 1, further comprising: capturing the vocal performance to produce the vocal stream.
  3. The method of claim 1, further comprising: capturing the vocal performance using analog-to-digital conversion.
  4. The method of claim 1, further comprising: preprocessing the prerecorded vocal backing track before the vocal performance by identifying, extracting, and timestamping backing track vocal elements, creating the timestamped vocal elements.
  5. The method of claim 1, wherein: dynamically controlling the timing of the prerecorded vocal backing track includes using time compression and expansion of the prerecorded vocal backing track based on timing differences between the vocal elements extracted from the vocal stream and the timestamped vocal elements from the prerecorded vocal backing track.
  6. The method of claim 1, wherein: the timestamped vocal elements include timestamped phonemes; the vocal elements include phonemes; identifying and extracting the phonemes from the vocal stream; and dynamically controlling the timing of the prerecorded vocal backing track using the phonemes extracted from the vocal stream matched to the timestamped phonemes from the prerecorded vocal backing track.
  7. The method of claim 1, wherein: the timestamped vocal elements include timestamped vector embeddings; the vocal elements include vector embeddings; identifying and extracting the vector embeddings from the vocal stream; and dynamically controlling the timing of the prerecorded vocal backing track using the vector embeddings extracted from the vocal stream matched to the timestamped vector embeddings from the prerecorded vocal backing track.
  8. The method of claim 1, wherein: the timestamped vocal elements include timestamped vocal audio spectra; the vocal elements include vocal audio spectra; identifying and extracting the vocal audio spectra from the vocal stream; and dynamically controlling the timing of the prerecorded vocal backing track using the vocal audio spectra extracted from the vocal stream matched to the timestamped vocal audio spectra from the prerecorded vocal backing track.
  9. The method of claim 1, wherein: the timestamped vocal elements include timestamped two or more types of vocal elements; the vocal elements include two or more types of vocal elements; identifying and extracting the two or more types of vocal elements from the vocal stream; and dynamically controlling the timing of the prerecorded vocal backing track using the two or more types of vocal elements extracted from the vocal stream matched to the timestamped two or more types of vocal elements from the prerecorded vocal backing track.
  10. The method of claim 9, further comprising: obtaining a confidence weight by comparing the two or more types of vocal elements to the timestamped two or more types of vocal elements by at least one of the one or more processors; and dynamically controlling the timing of the prerecorded vocal backing track based at least in part on whether the confidence weight is above or below a predetermined confidence threshold by at least one of the one or more processors.
  11. A system, comprising: a tangible medium that includes non-transitory computer-readable instructions that, when applied to one or more processors, instruct the one or more processors to perform a method comprising: (a) identifying and extracting vocal elements from a vocal stream by at least one of the one or more processors, the vocal stream digitally representing a vocal performance; and (b) dynamically controlling timing of a prerecorded vocal backing track using the vocal elements extracted from the vocal stream matched to timestamped vocal elements from the prerecorded vocal backing track by at least one of the one or more processors; and outputting a resulting dynamically controlled prerecorded vocal backing track that is time-synchronized to the vocal stream.
  12. The system of claim 11, further comprising: the one or more processors.
  13. The system of claim 11, further comprising: the one or more processors; and an analog-to-digital converter structured to digitally represent the vocal performance as the vocal stream.
  14. The system of claim 11, wherein: the tangible medium instructs at least one of the one or more processors to dynamically control the timing of the prerecorded vocal backing track using time compression and expansion of the prerecorded vocal backing track based on timing differences between the vocal elements extracted from the vocal stream and the timestamped vocal elements from the prerecorded vocal backing track.
  15. The system of claim 11, wherein: the timestamped vocal elements include timestamped phonemes; the vocal elements include phonemes; the tangible medium instructs at least one of the one or more processors to identify and extract the phonemes from the vocal stream; and the tangible medium instructs at least one of the one or more processors to dynamically control the timing of the prerecorded vocal backing track using the phonemes extracted from the vocal stream matched to the timestamped phonemes from the prerecorded vocal backing track.
  16. The system of claim 11, wherein: the timestamped vocal elements include timestamped vector embeddings; the vocal elements include vector embeddings; and the tangible medium further instructs at least one of the one or more processors to dynamically control the timing of the prerecorded vocal backing track using the vector embeddings extracted from the vocal stream matched to the timestamped vector embeddings from the prerecorded vocal backing track.
  17. The system of claim 11, wherein: the timestamped vocal elements include timestamped vocal audio spectra; the vocal elements include vocal audio spectra; and the tangible medium further instructs at least one of the one or more processors to dynamically control the timing of the prerecorded vocal backing track using the vocal audio spectra extracted from the vocal stream matched to the timestamped vocal audio spectra from the prerecorded vocal backing track.
  18. The system of claim 11, wherein: the timestamped vocal elements include timestamped two or more types of vocal elements; the vocal elements include two or more types of vocal elements; and the tangible medium further instructs at least one of the one or more processors to dynamically control the timing of the prerecorded vocal backing track using the two or more types of vocal elements matched to the timestamped two or more types of vocal elements from the prerecorded vocal backing track.
  19. The system of claim 18, wherein: the tangible medium further instructs at least one of the one or more processors to obtain a confidence weight by comparing the two or more types of vocal elements to the timestamped two or more types of vocal elements; and the tangible medium further instructs at least one of the one or more processors to dynamically control the timing of the prerecorded vocal backing track based at least in part on whether the confidence weight is above or below a predetermined confidence threshold.
  20. The system of claim 11, wherein: the tangible medium instructs at least one of the one or more processors to: extract vocal elements in a latent frame from a neural audio codec latent feature space and load the resulting extracted vocal elements into a predictive model; and forecast alignment of the resulting extracted vocal elements in a time interval ahead of a current frame position of the latent frame.
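
The timing-control mechanism recited in claims 5 and 10 can be illustrated with a small sketch. Everything below is an illustrative assumption, not part of the claims: the `VocalElement` structure, the exact-label matching rule, and the sliding `window` are hypothetical simplifications of matching extracted vocal elements against timestamped elements to derive a time compression/expansion factor.

```python
from dataclasses import dataclass

@dataclass
class VocalElement:
    label: str      # e.g. a phoneme symbol (hypothetical representation)
    time_s: float   # onset time in seconds

def playback_rate(live: list[VocalElement],
                  track: list[VocalElement],
                  window: int = 4) -> float:
    """Estimate a time-stretch ratio for the backing track from recently
    matched elements. A ratio below 1.0 means the vocalist is singing
    slower than the prerecorded track, so the track should be expanded
    (slowed down); above 1.0 means it should be compressed (sped up)."""
    # Pair up elements whose labels agree (a stand-in for real matching).
    matched = [(lv.time_s, tr.time_s)
               for lv, tr in zip(live, track) if lv.label == tr.label]
    if len(matched) < 2:
        return 1.0  # not enough evidence: leave the timing untouched
    recent = matched[-window:]
    live_span = recent[-1][0] - recent[0][0]
    track_span = recent[-1][1] - recent[0][1]
    if live_span <= 0:
        return 1.0
    return track_span / live_span
```

A confidence-weighted variant, as in claims 10 and 19, would only apply the estimated rate when the fraction (or weighted score) of matched elements exceeds a predetermined threshold, falling back to a rate of 1.0 otherwise.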

Description

BACKGROUND

Audience enjoyment of live music often hinges on the quality and consistency of the vocalist's performance. Even seasoned professionals frequently encounter various challenges during live performances. These challenges may include vocal strain from rigorous touring schedules, age-related changes in vocal range and stamina, lifestyle factors impacting vocal health, fatigue from travel and from consecutive performances, and illness adversely impacting vocal quality. Such challenges may significantly diminish a vocalist's overall performance quality, undermining their confidence and detracting from the audience experience.

To address such performance challenges, performing artists may utilize prerecorded vocal backing tracks. A prerecorded vocal backing track is a previously captured recording of a vocalist's performance, intended to support, supplement, or entirely replace segments of their live vocal performance. Typically, such tracks may be recorded in controlled settings, such as professional recording studios, to ensure optimal vocal quality. During live performances, a playback engineer manually cues and initiates playback of the prerecorded vocal backing track at precise moments. The front-of-house audio engineer subsequently mixes the prerecorded vocal backing track with the live vocal signal during selected portions of the performance, occasionally substituting the prerecorded track entirely for specific song segments. In scenarios where a prerecorded vocal backing track fully replaces or significantly supplements live vocals, the vocalist often must mime or “lip-sync” their performance so it visually aligns with the prerecorded vocal track.

Prerecorded vocal backing tracks are also used in scenarios where the result is recorded rather than fed to a live audience. For example, motion picture films, television shows, and music videos use prerecorded vocal backing tracks.
A performer in a motion picture film or television show may sing a song or mime singing a song to a prerecorded vocal backing track. Similarly, in a music video, the performer either sings, or pretends to sing, to a prerecorded backing track. In the above scenarios, the final result is generally an audio recording of the prerecorded vocal backing track combined with a visual recording of the performer miming or singing to the prerecorded vocal track.

SUMMARY

The Inventor, through extensive experience in performance technology for major touring acts, has identified significant drawbacks in current prerecorded vocal backing track usage. First, while the prerecorded vocal backing track is in use, the vocalist's timing is critical. The vocalist needs to carefully mime or mimic the performance and make sure that their lip and mouth movements follow the prerecorded vocal backing track. Second, when the prerecorded vocal backing track is used to replace segments of a vocalist's live singing, unique nuances of their live performance, such as deliberate changes in timing, pitch, vibrato, and emphasis, are lost.

In motion picture, television, and music video production, the performer's timing is also critical when using prerecorded vocal backing tracks. While the performer may not necessarily be singing to a live audience, their lip and mouth movements, as they sing to or mime the prerecorded vocal backing track, are captured as motion picture images. For this reason, the same issues described in the immediately preceding paragraph may also apply here. For example, mis-synchronization of the performer's lip movement may require editing out the mis-synchronized portions, or reshooting the scene.

The Inventor's systems, devices, and methods overcome the timing issues discussed above. They do so by dynamically controlling timing of a prerecorded vocal backing track, so it is time-synchronized to a vocal performance.
For example, the timing of the prerecorded vocal backing track may be dynamically controlled by using vocal elements extracted from a vocal stream of the vocalist's performance; then matching the extracted vocal elements to timestamped vocal elements from the prerecorded vocal backing track; and using the matched vocal elements to manipulate the timing of the prerecorded vocal backing track. This can be carried out in realtime, but may also be carried out offline. Examples of vocal elements include phonemes, vector embeddings, or vocal audio spectra.

The Inventor's systems, devices, and methods overcome the self-expression issue by identifying prosody factors such as vibrato, accent, stress, and level (loudness or volume) in the vocal stream of the vocalist's performance. These prosody factors are then applied, within a preset range, to corresponding prosody factors in the prerecorded vocal backing track in realtime or in non-realtime, depending on the application. Typically, the prerecorded vocal backing track may be preprocessed to identify, extract, and timestamp vocal elements such as phonemes, vector embeddings, or v
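
The prosody-adjustment idea described above, applying live prosody factors to the prerecorded track only within a preset range, can be sketched as follows. The function name, arguments, and units are illustrative assumptions, not part of the disclosure:

```python
def apply_prosody(track_value: float,
                  live_value: float,
                  max_delta: float) -> float:
    """Move a prerecorded prosody factor (e.g. level in dB, or pitch
    offset in cents) toward the corresponding live value, but clamp the
    adjustment so it never exceeds the preset range +/- max_delta."""
    delta = live_value - track_value
    clamped = max(-max_delta, min(max_delta, delta))
    return track_value + clamped
```

Clamping the adjustment preserves the vocalist's expressive intent while preventing an outlier in the live signal (for example, a microphone bump) from pushing the prerecorded track outside its intended character.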