
US-20260126948-A1 - Method and System for Mid-Track Cuepoint Detection and Use


Abstract

An example computer-implemented method includes a computing system selecting an audio segment from a middle section of an audio track and obtaining a frequency-component representation of a time window that spans (i) the selected audio segment and (ii) context audio before and/or after the selected audio segment. Further, the example method includes providing the frequency-component representation to a trained machine-learning model, the trained machine-learning model having been trained by training data that identifies cuepoints within frequency-component representations of audio segments within beginning and end sections of a plurality of training audio tracks, each cuepoint being a fade-in cuepoint or a fade-out cuepoint. Still further, the example method includes obtaining, from the trained machine-learning model, based on the provided frequency-component representation, a prediction that a mid-track cuepoint is present in the selected audio segment, and generating metadata for the audio track based on the prediction.
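By way of illustration only (not part of the application as filed), the inference flow the abstract recites might look like the following Python sketch. The mel-spectrogram front end via librosa, the sample rate, and the `model.predict` interface are assumptions; the application itself speaks only of a "frequency-component representation" and a trained machine-learning model.

```python
# Hypothetical sketch of the abstract's inference flow; the patent does not
# specify librosa, mel spectrograms, or this model interface.
import numpy as np
import librosa

SR = 22050             # assumed sample rate
SEG_S, CTX_S = 20, 20  # segment/per-side context durations (claims 6-7 recite 10-30 s, e.g. 20 s)

def predict_mid_track_cuepoint(audio: np.ndarray, seg_start_s: float, model) -> float:
    """Return the model's probability that a cuepoint lies in the chosen segment."""
    # The time window spans the segment plus context audio before and after it.
    start = int((seg_start_s - CTX_S) * SR)
    end = int((seg_start_s + SEG_S + CTX_S) * SR)
    window = audio[max(start, 0):min(end, len(audio))]
    # Silence-pad any part of the window extending past the track edges
    # (mirroring the padding recited for training windows in claim 5).
    window = np.pad(window, (max(-start, 0), max(end - len(audio), 0)))
    # One possible "frequency-component representation": a log-mel spectrogram.
    spec = librosa.power_to_db(
        librosa.feature.melspectrogram(y=window, sr=SR, n_mels=128))
    return model.predict(spec)  # assumed model API
```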

Inventors

  • Daniel Stoller
  • Nicola Montecchio

Assignees

  • SPOTIFY AB

Dates

Publication Date
May 7, 2026
Application Date
Apr. 28, 2025

Claims (20)

  1. A computer-implemented method comprising: selecting, from an audio track, an audio segment, wherein the audio track contains a beginning section, a middle section, and an end section, and wherein the selected audio segment is in the middle section; obtaining a frequency-component representation of a time window spanning (i) the selected audio segment and (ii) at least one of context audio before the selected audio segment or context audio after the selected audio segment; providing, to a trained machine-learning model, the frequency-component representation, wherein the trained machine-learning model has been trained by training data that identifies cuepoints within frequency-component representations of audio segments within beginning and end sections of a plurality of training audio tracks, wherein each identified cuepoint is a fade-in cuepoint or a fade-out cuepoint; obtaining, from the trained machine-learning model, based on the provided frequency-component representation, a prediction that a mid-track cuepoint is present in the selected audio segment within the middle section of the audio track; and generating metadata for the audio track based on the prediction.
  2. The computer-implemented method of claim 1, wherein the training data is devoid of any identification of mid-track cuepoints.
  3. The computer-implemented method of claim 1, wherein the training data includes, for each of the plurality of training audio tracks, a plurality of training data sets each comprising: a frequency-component representation of a respective training time window of the training audio track, wherein the respective training time window spans (i) a respective training audio segment randomly selected from the training audio track and (ii) respective context audio before and after the respective training audio segment; and an indication of whether and if so where in the respective training audio segment of the training audio track there is a fade-in or fade-out cuepoint.
  4. The computer-implemented method of claim 3, wherein the respective context audio is devoid of any indicated cuepoints.
  5. The computer-implemented method of claim 3, wherein a given training time window extends beyond a beginning or end of the training audio track, and wherein the context audio in the given training time window is silence padded.
  6. The computer-implemented method of claim 3, wherein in each respective training time window, the respective training audio segment and the context audio before and after the respective training audio segment each have a duration in a range of 10 to 30 seconds.
  7. The computer-implemented method of claim 6, wherein the duration is 20 seconds.
  8. The computer-implemented method of claim 1, further comprising providing the generated metadata to facilitate playing out audio from the predicted mid-track cuepoint in the audio track.
  9. The computer-implemented method of claim 1, wherein the audio segment is a first audio segment, the prediction is a first prediction, and the mid-track cuepoint is a first mid-track cuepoint, the method further comprising: repeating the method for a second audio segment in the middle section of the audio track, including obtaining from the trained machine-learning model a second prediction that a second mid-track cuepoint is present in the second audio segment; and extracting, based on the first prediction and the second prediction, a mid-track section of the audio track extending from the first mid-track cuepoint to the second mid-track cuepoint.
  10. The computer-implemented method of claim 9, further comprising: providing the extracted mid-track section of the audio track as a preview of the audio track.
  11. A computing system comprising: at least one processor; non-transitory data storage; and program instructions stored in the non-transitory data storage and executable by the at least one processor to cause the computing system to carry out operations comprising: selecting, from an audio track, an audio segment, wherein the audio track contains a beginning section, a middle section, and an end section, and wherein the selected audio segment is in the middle section, obtaining a frequency-component representation of a time window spanning (i) the selected audio segment and (ii) at least one of context audio before the selected audio segment or context audio after the selected audio segment, providing, to a trained machine-learning model, the frequency-component representation, wherein the trained machine-learning model has been trained by training data that identifies cuepoints within frequency-component representations of audio segments within beginning and end sections of a plurality of training audio tracks, wherein each identified cuepoint is a fade-in cuepoint or a fade-out cuepoint, obtaining, from the trained machine-learning model, based on the provided frequency-component representation, a prediction that a mid-track cuepoint is present in the selected audio segment within the middle section of the audio track, and generating metadata for the audio track based on the prediction.
  12. The computing system of claim 11, wherein the training data is devoid of any identification of mid-track cuepoints.
  13. The computing system of claim 11, wherein the training data includes, for each of the plurality of training audio tracks, a plurality of training data sets each comprising: a frequency-component representation of a respective training time window of the training audio track, wherein the respective training time window spans (i) a respective training audio segment randomly selected from the training audio track and (ii) respective context audio before and after the respective training audio segment; and an indication of whether and if so where in the respective training audio segment of the training audio track there is a fade-in or fade-out cuepoint.
  14. The computing system of claim 13, wherein the respective context audio is devoid of any indicated cuepoints.
  15. The computing system of claim 13, wherein a given training time window extends beyond a beginning or end of the training audio track, and wherein the context audio in the given training time window is silence padded.
  16. The computing system of claim 13, wherein in each respective training time window, the respective training audio segment and the context audio before and after the respective training audio segment each have a duration in a range of 10 to 30 seconds.
  17. The computing system of claim 11, wherein the operations further include providing the generated metadata to facilitate jumping to the predicted mid-track cuepoint in the audio track.
  18. The computing system of claim 11, wherein the audio segment is a first audio segment, the prediction is a first prediction, and the mid-track cuepoint is a first mid-track cuepoint, the operations further comprising: repeating the operations for a second audio segment in the middle section of the audio track, including obtaining from the trained machine-learning model a second prediction that a second mid-track cuepoint is present in the second audio segment; extracting, based on the first prediction and the second prediction, a mid-track section of the audio track extending from the first mid-track cuepoint to the second mid-track cuepoint; and providing the extracted mid-track section of the audio track as a preview of the audio track.
  19. Non-transitory data storage having stored program instructions executable by at least one processor of a computing system to cause the computing system to carry out operations comprising: selecting, from an audio track, an audio segment, wherein the audio track contains a beginning section, a middle section, and an end section, and wherein the selected audio segment is in the middle section; obtaining a frequency-component representation of a time window spanning (i) the selected audio segment and (ii) at least one of context audio before the selected audio segment or context audio after the selected audio segment; providing, to a trained machine-learning model, the frequency-component representation, wherein the trained machine-learning model has been trained by training data that identifies cuepoints within frequency-component representations of audio segments within beginning and end sections of a plurality of training audio tracks, wherein each identified cuepoint is a fade-in cuepoint or a fade-out cuepoint; obtaining, from the trained machine-learning model, based on the provided frequency-component representation, a prediction that a mid-track cuepoint is present in the selected audio segment within the middle section of the audio track; and generating metadata for the audio track based on the prediction.
  20. The non-transitory data storage of claim 19, wherein the training data is devoid of any identification of mid-track cuepoints, wherein the training data includes, for each of the plurality of training audio tracks, a plurality of training data sets each comprising (a) a frequency-component representation of a respective training time window of the training audio track, wherein the respective training time window spans (i) a respective training audio segment randomly selected from the training audio track and (ii) respective context audio before and after the respective training audio segment and (b) an indication of whether and if so where in the respective audio segment of the training audio track there is a fade-in or fade-out cuepoint, and wherein the respective context audio is devoid of any indicated cuepoints.
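Claims 9, 10, and 18 recite obtaining a second mid-track cuepoint prediction and extracting the span between two predicted cuepoints as a preview. A minimal Python sketch of that flow follows; the middle-section split, scan step, probability threshold, and `predict` callable are illustrative assumptions that do not appear in the claims.

```python
# Hypothetical sketch of the preview flow recited in claims 9, 10, and 18.
import numpy as np
from typing import Callable, List

def find_mid_track_cuepoints(duration_s: float, predict: Callable[[float], float],
                             step_s: float = 20.0, threshold: float = 0.5) -> List[float]:
    """Scan segments in the middle section, keeping those predicted to hold a cuepoint."""
    middle_start, middle_end = 0.2 * duration_s, 0.8 * duration_s  # assumed section split
    t, cues = middle_start, []
    while t + step_s <= middle_end:
        if predict(t) >= threshold:  # probability a cuepoint lies in the segment at t
            cues.append(t)
        t += step_s
    return cues

def extract_preview(audio: np.ndarray, sr: int, cue_a_s: float, cue_b_s: float) -> np.ndarray:
    """Slice the mid-track section extending from one predicted cuepoint to another."""
    start, end = sorted((cue_a_s, cue_b_s))
    return audio[int(start * sr):int(end * sr)]
```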

Description

REFERENCE TO RELATED APPLICATION

This application claims priority to U.S. Provisional Ser. No. 63/717,539, filed Nov. 7, 2024, the entirety of which is hereby incorporated by reference.

TECHNICAL FIELD

The present disclosure relates to the field of digital audio content and, more specifically, to machine-based detection and use of audio cuepoints.

BACKGROUND

Cuepoints in audio are markers that define particular moments in the audio, such as the start or end of a verse, a chorus, or another segment, for instance. These markers may serve many useful purposes. For instance, the markers may serve as reference points to allow disc jockeys (DJs) or other users to instantly jump to temporal locations in an audio track. Further, the markers may serve as transition points, allowing seamless transitions between streaming and/or playout of audio tracks, such as by beginning a transition at a designated end cuepoint (or fade-out cuepoint) near the end of one track and finishing the transition at a designated start cuepoint (or fade-in cuepoint) near the beginning of the next track.

SUMMARY

Machine-based processing can be used to predict where start and end cuepoints are located within audio tracks, in order to facilitate fading playout from one track to another. In particular, a trained machine-learning model (e.g., a neural network) could work well to predict the locations of start and end cuepoints in a given audio track if the training data used to train the model includes labeled start and end cuepoints in each of many tracks.

A machine-learning model that is trained based on audio tracks labeled as to start and end cuepoints, however, may not work well to predict the occurrence of intervening cuepoints within the audio track, such as points of transition between verses, choruses, etc. One reason for this technical issue is that such training data would teach the model about start and end cuepoints that are within the starting and ending portions of audio tracks, and those starting and ending portions may be characteristically different from intervening portions of the audio tracks. For instance, the starting or ending portions of a song may differ from the main content of the song in terms of musical theme, presence or absence of vocal content, and repetition of musical structure, among other possibilities.

One potential approach to facilitate predicting mid-track cuepoints is to train a machine-learning model based on labeled mid-track cuepoints in particular, perhaps many labeled mid-track cuepoints per track for potentially thousands of audio tracks. Given the ever-increasing extent of content available for streaming and other playout, however, it may be impractical to label all of those mid-track cuepoints. Further, training the model based on so many cuepoint labels may be computationally expensive and/or may require additional data storage, among other technical difficulties. The present disclosure provides a technical mechanism to help overcome this issue, facilitating machine-based prediction of mid-track cuepoints using a machine-learning model that is trained based on labeled start and/or end cuepoints, potentially without a need for any advance labeling of mid-track cuepoints.
In particular, the disclosure provides for training a machine-learning model based on small audio segments taken from randomly selected time positions throughout training audio tracks, along with associated context audio per audio segment and an indication per audio segment of whether, and if so where, the audio segment includes a start or end cuepoint. Even though this training data is thus based on start and end cuepoints in particular, the small size, context audio, and random time positions of the training audio segments provide technical advantages by teaching the machine-learning model more broadly what constitutes a cuepoint, thus enabling the machine-learning model to predict not just start and end cuepoints but also mid-track cuepoints.

Accordingly, in one respect, disclosed is an example computer-implemented method. The method includes selecting, from an audio track, an audio segment, the audio track containing a beginning section, a middle section, and an end section, and the selected audio segment being in the middle section. Further, the method includes obtaining a frequency-component representation of a time window that spans (i) the selected audio segment and (ii) context audio before and after the selected audio segment. Still further, the method includes providing the frequency-component representation to a trained machine-learning model that has been trained by training data that identifies start and end cuepoints within frequency-component representations of audio segments within beginning and end sections of a plurality of training audio tracks. Yet further, the method includes obtaining, from the trained machine-learning model, based on the provided frequency-component representation, a prediction that a mid-track cuepoint is present in the selected audio segment, and generating metadata for the audio track based on the prediction.
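A hypothetical Python sketch of assembling one such training example follows. The 20-second durations track claims 6 and 7 and the silence padding tracks claim 5, while the sample rate, label format, and function name are assumptions made here for illustration only.

```python
# Hypothetical sketch of building one training example as the summary describes.
import numpy as np

SR = 22050          # assumed sample rate
SEG_S = CTX_S = 20  # segment and per-side context durations in seconds (claims 6-7)

def make_training_example(audio: np.ndarray, fade_in_s: float, fade_out_s: float,
                          rng: np.random.Generator):
    """Return (window, label) for a randomly positioned training segment.

    label is (has_cuepoint, offset_s): whether a labeled fade-in/fade-out
    cuepoint falls inside the segment, and if so where within it. Assumes the
    track is longer than SEG_S seconds.
    """
    dur_s = len(audio) / SR
    seg_start = rng.uniform(0.0, dur_s - SEG_S)  # random position in the track
    start = int((seg_start - CTX_S) * SR)
    end = int((seg_start + SEG_S + CTX_S) * SR)
    window = audio[max(start, 0):min(end, len(audio))]
    # Silence-pad context extending beyond the track edges (claim 5).
    window = np.pad(window, (max(-start, 0), max(end - len(audio), 0)))
    # Only cuepoints inside the segment itself are labeled; the surrounding
    # context audio carries no cuepoint indication (cf. claim 4).
    has_cue, offset = False, -1.0
    for cue in (fade_in_s, fade_out_s):
        if seg_start <= cue < seg_start + SEG_S:
            has_cue, offset = True, cue - seg_start
    return window, (has_cue, offset)
```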