US-20260127370-A1 - TECHNIQUES FOR AUTOMATICALLY MATCHING RECORDED SPEECH TO SCRIPT DIALOGUE


Abstract

In various embodiments, a dialogue matching application performs speech recognition operations on an audio segment to generate a sequence of words. The dialogue matching application determines a first dialogue match between a first subsequence of words included in the sequence of words and a script line included in a set of script lines. The dialogue matching application determines a second dialogue match between a second subsequence of words included in the sequence of words and the script line. The dialogue matching application receives, via a graphical user interface (GUI), an event that corresponds to an interaction between a user and an interactive GUI element. The dialogue matching application extracts a portion of the audio segment from a session recording based on the event to generate an audio clip that corresponds to both the script line and either the first subsequence of words or the second subsequence of words.

Inventors

  • Julien Hoarau

Assignees

  • NETFLIX, INC.

Dates

Publication Date
20260507
Application Date
20251229

Claims (20)

  1. A computer-implemented method for generating audio clips, the method comprising: generating text representing words spoken in an audio segment; identifying a first subsequence of the text and a second subsequence of the text, wherein the first subsequence and the second subsequence each correspond to a same script line; receiving a selection of the first subsequence or the second subsequence to establish a selected subsequence; and generating an audio clip based on a portion of the audio segment that corresponds to the selected subsequence.
  2. The computer-implemented method of claim 1, further comprising displaying information associated with the first subsequence and the second subsequence.
  3. The computer-implemented method of claim 2, wherein displaying the information comprises displaying, for at least one of the first subsequence or the second subsequence, at least one of a start timestamp or an end timestamp.
  4. The computer-implemented method of claim 1, wherein generating the audio clip comprises: extracting a time interval of the audio segment associated with the selected subsequence; and extracting the audio clip from the audio segment based on the time interval.
  5. The computer-implemented method of claim 4, wherein extracting the time interval comprises setting a start time to a start timestamp associated with a first word of the selected subsequence and setting an end time to an end timestamp associated with a last word of the selected subsequence.
  6. The computer-implemented method of claim 1, wherein the selection identifies one of the first subsequence or the second subsequence as a preferred take for the script line.
  7. The computer-implemented method of claim 1, wherein identifying the first subsequence and the second subsequence comprises: generating tokens based on the text; and matching the tokens to the script line.
  8. The computer-implemented method of claim 7, wherein matching the tokens to the script line comprises: searching an index generated from a plurality of script lines to identify candidate script lines; and selecting, based on at least one relevance score, the script line from the candidate script lines.
  9. The computer-implemented method of claim 1, further comprising recording an input audio stream in real time to generate a session recording that includes the audio segment.
  10. The computer-implemented method of claim 1, wherein the audio segment comprises a continuous portion of speech bounded by pauses or silences that exceed a configurable segment pause threshold.
  11. One or more non-transitory computer readable media storing instructions that, when executed by one or more processors, cause the one or more processors to generate audio clips, by performing the operations of: generating text representing words spoken in an audio segment; identifying a first subsequence of the text and a second subsequence of the text, wherein the first subsequence and the second subsequence each correspond to a same script line; receiving a selection of the first subsequence or the second subsequence to establish a selected subsequence; and generating an audio clip based on a portion of the audio segment that corresponds to the selected subsequence.
  12. The one or more non-transitory computer readable media of claim 11, further comprising maintaining a selection flag associated with each of the first subsequence and the second subsequence, wherein receiving the selection comprises setting the selection flag for the selected subsequence.
  13. The one or more non-transitory computer readable media of claim 12, further comprising maintaining script context data that identifies a previously matched script line, wherein identifying the first subsequence and the second subsequence is performed based at least in part on the script context data.
  14. The one or more non-transitory computer readable media of claim 11, further comprising displaying information associated with the first subsequence and the second subsequence.
  15. The one or more non-transitory computer readable media of claim 14, wherein displaying the information comprises displaying, for at least one of the first subsequence or the second subsequence, at least one of a start timestamp or an end timestamp.
  16. The one or more non-transitory computer readable media of claim 11, wherein generating the audio clip comprises: extracting a time interval of the audio segment associated with the selected subsequence; and extracting the audio clip from the audio segment based on the time interval.
  17. The one or more non-transitory computer readable media of claim 16, wherein extracting the time interval comprises setting a start time to a start timestamp associated with a first word of the selected subsequence and setting an end time to an end timestamp associated with a last word of the selected subsequence.
  18. The one or more non-transitory computer readable media of claim 11, wherein the selection identifies one of the first subsequence or the second subsequence as a preferred take for the script line.
  19. The one or more non-transitory computer readable media of claim 11, wherein identifying the first subsequence and the second subsequence comprises: generating tokens based on the text; and matching the tokens to the script line.
  20. A computer system, comprising: one or more memories that include instructions; and one or more processors that are coupled to the one or more memories and, when executing the instructions, are configured to generate audio clips, by performing the operations of: generating text representing words spoken in an audio segment; identifying a first subsequence of the text and a second subsequence of the text, wherein the first subsequence and the second subsequence each correspond to a same script line; receiving a selection of the first subsequence or the second subsequence to establish a selected subsequence; and generating an audio clip based on a portion of the audio segment that corresponds to the selected subsequence.
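Claims 4 and 5 recite how the clip's time interval is derived from word-level timestamps: the start time is the start timestamp of the first word of the selected subsequence, and the end time is the end timestamp of its last word. The following is a minimal sketch of that extraction step only; the `Word` dataclass, the sample-list representation of audio, and the function names are illustrative assumptions rather than anything specified in the patent.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio segment
    end: float

def clip_interval(selected: "list[Word]") -> "tuple[float, float]":
    # Claims 4-5: start = first word's start timestamp, end = last word's end timestamp
    return selected[0].start, selected[-1].end

def extract_clip(samples: "list[int]", sample_rate: int, selected: "list[Word]") -> "list[int]":
    # Slice the segment's samples down to the selected subsequence's time interval
    start, end = clip_interval(selected)
    return samples[int(start * sample_rate):int(end * sample_rate)]
```

For example, a two-word take spanning 1.0 s to 1.5 s at a 1 kHz sample rate would yield a 500-sample clip.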

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of the co-pending U.S. patent application titled, "TECHNIQUES FOR AUTOMATICALLY MATCHING RECORDED SPEECH TO SCRIPT DIALOGUE," filed on January 23, 2023 and having Serial Number 18/158,425, which claims priority benefit of the United States Provisional Patent Application titled, "MATCHING DIALOGUE TO DETECTED SPEECH," filed on January 24, 2022, and having Serial Number 63/302,480. The subject matter of these related applications is hereby incorporated herein by reference.

BACKGROUND

Field of the Various Embodiments

The various embodiments relate generally to computer science and to audio technology and, more specifically, to techniques for automatically matching recorded speech to script dialogue.

Description of the Related Art

During a recording session for a dialogue track of an animated film, a voice actor reads dialogue for a particular character from a script, sometimes improvising, a director provides feedback to the voice actor, and a script coordinator takes written notes of the feedback. In practice, a voice actor often ends up repeatedly reading the same lines of script dialogue in different ways, and sometimes at different times, during a given recording session. Eventually, the director designates one of the recorded attempts or "takes" as a production take, and that production take is then incorporated into the dialogue track for the film.

One particular challenge associated with generating dialogue tracks for animated films is that identifying all of the different production takes included in a given session recording after the fact can be quite difficult. In particular, the feedback notes usually map each production take to specific lines of the relevant script. However, these notes typically specify only an approximate time range within the session recording when a given production take occurred.
Consequently, determining the proper portions of the session recording to incorporate into the dialogue track can be difficult.

In one approach to identifying production takes within a session recording after the fact, an editor loads the session recording into audio editing software after the recording session has completed. For each production take specified in the feedback notes, the editor interacts with the audio editing software to iteratively play back portions of the session recording within and proximate to the approximate time range mapped to the production take in the feedback notes. As the audio editing software plays back the different portions of the session recording, the editor listens for at least partial matches between the recorded spoken dialogue and the corresponding lines of script in order to locate the actual production take within the session recording. Subsequently, the editor instructs the audio editing software to extract and store the identified production take as the production audio clip for the corresponding lines of script.

One drawback of the above approach is that, because tracking each production take involves actually playing back different portions of the session recording, a substantial amount of time (e.g., 4-5 days) can be required to extract production audio clips from a session recording for a typical animated film. Another drawback is that tracking production takes based on approximate time ranges is inherently error-prone. In particular, because multiple takes corresponding to the same script lines are oftentimes recorded in quick succession during a recording session, an approximate time range may not unambiguously identify a given production take. If an inferior take is mistakenly identified as a production take, then the quality of the dialogue track is negatively impacted.
As the foregoing illustrates, what is needed in the art are more effective techniques for tracking different production takes for inclusion in a dialogue track.

SUMMARY

One embodiment sets forth a computer-implemented method for automatically generating audio clips. The method includes performing one or more speech recognition operations on a first audio segment to generate a first sequence of words; determining a first dialogue match between a first subsequence of words included in the first sequence of words and a first script line included in a set of script lines; determining a second dialogue match between a second subsequence of words included in the first sequence of words and the first script line; receiving, via a graphical user interface (GUI), a first event that corresponds to a first interaction between a user and a first interactive GUI element; extracting a first portion of the first audio segment from a session recording based on the first event, where the first portion of the first audio segment corresponds to either the first subsequence of words or the second subsequence of words; and generating a first audio clip that corresponds to the first script line.
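The dialogue-matching step described above pairs multiple subsequences of the recognized words (multiple takes) with the same script line. As a rough illustration only, and not the patent's actual algorithm (claim 8 describes searching an index of script lines and ranking candidates by relevance score), a naive token-level matcher could slide a window over the recognized words and score each window against the script line; the names `find_takes` and `match_score`, the fixed window size, and the similarity threshold are all hypothetical.

```python
import re
from difflib import SequenceMatcher

def tokenize(text: str) -> "list[str]":
    # Lowercase and strip punctuation so "To be," matches "to be"
    return re.findall(r"[a-z']+", text.lower())

def match_score(subsequence: str, script_line: str) -> float:
    # Token-level similarity between recognized words and the script line
    return SequenceMatcher(None, tokenize(subsequence), tokenize(script_line)).ratio()

def find_takes(words: "list[str]", script_line: str, window: int, threshold: float = 0.9):
    # Every window of recognized words whose similarity to the script line
    # meets the threshold is reported as a candidate take
    takes = []
    for i in range(len(words) - window + 1):
        sub = " ".join(words[i:i + window])
        if match_score(sub, script_line) >= threshold:
            takes.append((i, sub))
    return takes
```

Applied to a transcript in which the actor reads "to be or not to be" twice, this sketch reports two candidate takes for the single script line, mirroring the first and second dialogue matches in the summary above.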