EP-4576063-B1 - AUTOMATIC GENERATION AND MULTIPLICATION OF AUDIO STEMS USING INCREMENTAL LEARNING

Inventors

  • STAVITSKII, Oleg
  • TEREKHOV, Vladimir
  • EVGRAFOV, Dmitry
  • PETRENKO, Philipp
  • BEZUGLY, Dmitry
  • GURZHIY, Evgeny
  • SKOVORODKIN, Igor
  • KUDRYAVTSEV, Andey
  • BULATSEV, Kyrylo

Dates

Publication Date
2026-05-06
Application Date
2024-01-12

Claims (15)

  1. A method for audio processing, the method comprising: obtaining a set of input audio stems, the set of input audio stems comprising a selected subset of a plurality of audio stems; obtaining configuration information corresponding to audio processing operations for the set of input audio stems; generating, using an audio stem multiplier engine, a plurality of multiplied audio stems based on the set of input audio stems and the configuration information, wherein each respective multiplied audio stem comprises a variation of a particular input audio stem included in the set of input audio stems, and wherein each respective multiplied audio stem is generated based on applying one or more audio processing operations parameterized by the configuration information; receiving information indicative of user feedback ratings for each respective multiplied audio stem of the plurality of multiplied audio stems; and generating (610), using the audio stem multiplier engine, a second plurality of multiplied audio stems based on at least the user feedback ratings and one or more of the set of input audio stems or the plurality of multiplied audio stems.
  2. The method of claim 1, wherein a quantity of audio stems included in the plurality of multiplied audio stems is greater than a quantity of audio stems included in the set of input audio stems.
  3. The method of claim 1 or 2, further comprising: outputting the plurality of multiplied audio stems for playback to a user; and receiving the information indicative of the user feedback ratings based on outputting the plurality of multiplied audio stems for playback.
  4. The method of claim 3, wherein: the user feedback ratings comprise a positive user feedback rating or a negative user feedback rating for each respective audio stem variation included in the plurality of multiplied audio stems.
  5. The method of any previous claim, wherein applying the one or more audio processing operations to generate a respective multiplied audio stem includes: receiving configuration information indicative of a desired tone-shaping adjustment; processing, using a machine learning tone-shaping model, at least one input audio stem of the set of input audio stems to generate a corresponding one or more tone-shaped audio stems as output, wherein the machine learning tone-shaping model processes the at least one audio stem based on the desired tone-shaping adjustment; and outputting the corresponding one or more tone-shaped audio stems within the plurality of multiplied audio stems.
  6. The method of any previous claim, wherein: the set of input audio stems includes one or more multiplied audio stems generated as output in a previous processing round performed by the audio stem multiplier engine.
  7. The method of claim 6, wherein: the configuration information for the set of input audio stems includes the user feedback ratings information for each of the one or more multiplied audio stems generated as output in the previous processing round.
  8. The method of any previous claim, wherein the configuration information for the set of input audio stems is indicative of a selected one or more audio processing operations or audio effects processing modules to be applied by the audio stem multiplier engine to generate the plurality of multiplied audio stems from the set of input audio stems.
  9. The method of claim 8, wherein the selected one or more audio processing operations or audio effects processing modules are selected based on one or more user inputs or based on user feedback ratings associated with a previous processing round performed by the audio stem multiplier engine.
  10. A system for audio processing, the system comprising: at least one processor; and at least one memory storing instructions, which when executed cause the at least one processor to perform actions comprising: obtaining a set of input audio stems, the set of input audio stems comprising a selected subset of a plurality of audio stems; obtaining configuration information corresponding to audio processing operations for the set of input audio stems; generating, using an audio stem multiplier engine, a plurality of multiplied audio stems based on the set of input audio stems and the configuration information, wherein each respective multiplied audio stem comprises a variation of a particular input audio stem included in the set of input audio stems, and wherein each respective multiplied audio stem is generated based on applying one or more audio processing operations parameterized by the configuration information; receiving information indicative of user feedback ratings for each respective multiplied audio stem of the plurality of multiplied audio stems; and generating, using the audio stem multiplier engine, a second plurality of multiplied audio stems based on at least the user feedback ratings and one or more of the set of input audio stems or the plurality of multiplied audio stems.
  11. The system of claim 10, wherein a quantity of audio stems included in the plurality of multiplied audio stems is greater than a quantity of audio stems included in the set of input audio stems.
  12. The system of claim 10 or 11, wherein the at least one processor is caused to further perform: outputting the plurality of multiplied audio stems for playback to a user; and receiving the information indicative of the user feedback ratings based on outputting the plurality of multiplied audio stems for playback, wherein optionally: the user feedback ratings comprise a positive user feedback rating or a negative user feedback rating for each respective audio stem variation included in the plurality of multiplied audio stems.
  13. The system of any of claims 10 to 12, wherein applying the one or more audio processing operations to generate a respective multiplied audio stem includes: receiving configuration information indicative of a desired tone-shaping adjustment; processing, using a machine learning tone-shaping model, at least one input audio stem of the set of input audio stems to generate a corresponding one or more tone-shaped audio stems as output, wherein the machine learning tone-shaping model processes the at least one audio stem based on the desired tone-shaping adjustment; and outputting the corresponding one or more tone-shaped audio stems within the plurality of multiplied audio stems.
  14. The system of any of claims 10 to 13, wherein: the set of input audio stems includes one or more multiplied audio stems generated as output in a previous processing round performed by the audio stem multiplier engine, wherein optionally: the configuration information for the set of input audio stems includes the user feedback ratings information for each of the one or more multiplied audio stems generated as output in the previous processing round.
  15. The system of any of claims 10 to 14, wherein the configuration information for the set of input audio stems is indicative of a selected one or more audio processing operations or audio effects processing modules to be applied by the audio stem multiplier engine to generate the plurality of multiplied audio stems from the set of input audio stems, wherein optionally the selected one or more audio processing operations or audio effects processing modules are selected based on one or more user inputs or based on user feedback ratings associated with a previous processing round performed by the audio stem multiplier engine.
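
The claimed feedback loop (obtain input stems, multiply them via parameterized audio processing operations, collect per-variation ratings, then regenerate a second plurality of stems from the positively rated parameters) can be sketched as follows. This is a minimal illustrative sketch only, not the patented implementation: it assumes a simple gain effect and +1/-1 ratings, and all names (apply_effect, multiply_stems, next_round) are hypothetical.

```python
import random

def apply_effect(stem, gain):
    """Stand-in for one audio processing operation parameterized by the
    configuration information (here just a scalar gain)."""
    return [sample * gain for sample in stem]

def multiply_stems(input_stems, gains):
    """Produce one variation per (stem, parameter) pair: the 'multiplied' stems.
    Each result is tagged with the parameter that produced it."""
    return [(gain, apply_effect(stem, gain))
            for stem in input_stems for gain in gains]

def next_round(input_stems, gains, ratings):
    """Incremental step: keep parameters whose variations were rated
    positively, perturb them slightly, and regenerate."""
    liked = [g for (g, _), r in ratings if r > 0]
    if not liked:          # no positive feedback: reuse the original parameters
        liked = gains
    mutated = [g * random.uniform(0.9, 1.1) for g in liked]
    return multiply_stems(input_stems, mutated)

# One feedback round over a toy single-channel stem
stems = [[0.1, 0.2, 0.3]]
round1 = multiply_stems(stems, gains=[0.5, 1.0, 2.0])
# Simulated user feedback: +1 (like) or -1 (dislike) per variation
ratings = [(item, +1 if item[0] >= 1.0 else -1) for item in round1]
round2 = next_round(stems, [0.5, 1.0, 2.0], ratings)
print(len(round1), len(round2))  # 3 variations in round 1, 2 refined in round 2
```

The key property of the loop is that the second plurality of multiplied stems is conditioned on the ratings rather than regenerated from scratch, which is what makes the process incremental.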

Description

FIELD

Aspects of the present disclosure generally relate to sound processing. In some implementations, examples are described for the generation and multiplication of audio stems for automatic composition.

BACKGROUND

Technological innovation, while improving productivity, has increasingly raised stress levels in day-to-day life. The daily demands on life have become more numerous and fast-paced while the level of daily distractions has increased, and new systems are needed to address these pressures. Individual attempts to deal with these stress-causing issues frequently involve activities such as meditation and exercise, often accompanied by music or soundscapes to augment the experience. However, these soundscapes are generally homogeneous, of limited length, and not adaptive to a user's evolving environment or state, and they cannot dynamically access information relevant to an individual's state and surroundings to present a personalized transmission of sound for various activities, such as relaxation, focus, sleep, or exercise. Music or soundscapes can additionally be used to accompany storytelling activities, which can include spoken-word storytelling and/or written-word storytelling, among various other forms. For example, audio compositions underlying storytelling can augment the experience by conveying richer information to a user, for instance by aurally conveying the mood, tone, or style of a story (or portion thereof). This contextual information can encapsulate various different elements or themes of a written work, whether it be the rapid and anxious tone of a suspenseful event, or the calm and quiet moments of a sunny day in nature. Using audio compositions to aurally convey contextual or other related information for a textual work can improve comprehension and focus for a reader or listener.
For example, aurally conveyed contextual information corresponding to a textual work may better engage a reader with the storyline, by deepening the reader's connections to the events of a particular scene, character, etc. Augmenting a first information-conveying modality (e.g., text) with contextual information presented via a second information-conveying modality (e.g., audio composition or soundscape) can provide a more immersive and captivating experience for users. US2020357371 discloses training an artificial intelligence (AI) for composing songs: the AI generates and positions new song parts or defines variations of an initial song part. An initial incomplete song selected by the user is analysed by the AI to generate a completed song for review by the user, consisting of an intro, an ending, the user-generated initial music fragment, and additional song parts generated by the AI, thus defining structure and musical content in an iterative process. Both the loop and project databases are living databases in which audio loops or projects can be added, deleted, or changed in a recurring manner. WO2015154159 defines combining stems in audio production, controlled by a graphical user interface (GUI). Selected stem criteria are received via the GUI; filtered stems that match the criteria are obtained from a stem database; song-starters are displayed for user selection. Additional stems and audio effects are added by the user on a per-loop or per-stem basis. The higher a user's feedback rating for a given stem, the higher the melodic compatibility ranking between the stem and the song-starter.

SUMMARY

The invention is as defined by the appended claims. The foregoing, together with other features and aspects, will become more apparent upon referring to the following specification, claims, and accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are presented to aid in the description of various aspects of the disclosure and are provided solely for illustration of the aspects and not limitation thereof. In order to describe the manner in which the above-recited and other advantages and features of the disclosure can be obtained, a more particular description of the principles briefly described above will be rendered by reference to specific embodiments thereof, which are illustrated in the appended drawings. In the drawings, like reference numbers can indicate identical or functionally similar elements. Understanding that these drawings depict only example embodiments of the disclosure and are not therefore to be considered to be limiting of its scope, the principles herein are described and explained with additional specificity and detail through the use of the accompanying drawings in which:

FIG. 1 illustrates an example architecture of a network for implementing a method for creating a personalized sound environment for a user, in accordance with some examples;
FIG. 2 is a flow diagram illustrating an example of a process of automatic composition that may be used to create a personalized sound environment for a user, in accordance with some examples;
FIG. 3 is a flow diagram ill