
JP-2026076329-A - Latent spatial representation of audio signals for audio content-based capture

JP 2026076329 A

Abstract

[Problem] To provide a method and system for training an artificial neural network model that extracts features indicating variations in psychoacoustic attributes from digital audio signals and generates contextual latent space representations. [Solution] The learning system comprises read/decode logic that reads a specific digital audio signal source from a set of digital audio signal sources associated with a specific sound content category, transform logic that generates a time-frequency representation based on the specific digital audio signal, and learning logic that uses an artificial neural network to learn a set of numerical codes providing a latent space representation of the time-frequency representation. The reading, generating, and learning steps are repeated to train the artificial neural network. A set of model parameters learned by the trained artificial neural network is obtained and stored in a computer storage medium in a model database. [Selected Figure] Figure 1

Inventors

  • Koretzky, Alejandro
  • Rajashekarappa, Naveen Sasalu

Assignees

  • Distributed Creation, Inc.

Dates

Publication Date
2026-05-11
Application Date
2026-02-17
Priority Date
2020-07-29

Claims (17)

  1. A method comprising: reading a specific digital audio signal source from a set of digital audio signal sources associated with a specific sound content category; generating a time-frequency representation based on the specific digital audio signal; learning, using an artificial neural network, a set of numerical codes that provides a latent space representation of the time-frequency representation, the set of numerical codes having a dimensionality less than a dimensionality of the time-frequency representation; repeating the reading, generating, and learning for each of a plurality of other digital audio signal sources in the set of digital audio signal sources of the specific sound content category, thereby training the artificial neural network; obtaining a learned set of model parameters from the trained artificial neural network; and storing the learned set of model parameters for the specific sound content category in a computer storage medium.
  2. The method according to claim 1, further comprising: generating a first time-frequency representation based on a first digital audio signal associated with the specific sound content category; using the learned set of model parameters to compute a first set of numerical codes that provides a latent space representation of the first time-frequency representation; generating a second time-frequency representation based on a second digital audio signal associated with the specific sound content category; using the learned set of model parameters to compute a second set of numerical codes that provides a latent space representation of the second time-frequency representation; and computing a distance between the first set of numerical codes and the second set of numerical codes.
  3. The method according to claim 2, further comprising causing, based on the distance, a computer graphical user interface to display a message indicating that the first digital audio signal and the second digital audio signal contain similar sounds.
  4. The method according to claim 2, further comprising receiving the first digital audio signal from a computing device via a data communication network, wherein the first digital audio signal is captured by the computing device using a microphone coupled to the computing device for recording a human audible performance.
  5. The method according to claim 2, further comprising: comparing the distance with a distance threshold; and causing, based on the distance being less than the distance threshold, a computer graphical user interface to display a message indicating that the first digital audio signal and the second digital audio signal contain sounds that substantially overlap.
  6. The method according to claim 5, further comprising selecting the distance threshold based on the specific sound content category.
  7. The method of claim 5, further comprising calculating the distance between the first set of numerical codes and the second set of numerical codes based on the cosine similarity between the first set of numerical codes and the second set of numerical codes.
  8. The method according to claim 1, wherein the specific sound content category is selected from the group consisting of loops and one-shots.
  9. The method according to claim 1, wherein the specific sound content category is selected from the group consisting of drum loops, drum one-shots, instrument loops, and instrument one-shots.
  10. The method according to claim 1, wherein the artificial neural network comprises an input layer, one or more encoder intermediate layers, a bottleneck layer, one or more decoder intermediate layers, and an output layer, and wherein the method further comprises obtaining, from the bottleneck layer, the set of numerical codes that provides the latent space representation of the time-frequency representation.
  11. The method according to claim 10, wherein the one or more encoder intermediate layers comprise one or more convolutional layers, and the one or more decoder intermediate layers comprise one or more convolutional layers.
  12. The method according to claim 1, wherein generating the time-frequency representation based on the specific digital audio signal comprises: obtaining a pre-processed signal based on the specific digital audio signal; generating a duration-normalized signal based on the pre-processed signal; selecting a time-shifted slice signal of the duration-normalized signal, the time-shifted slice signal having a specific duration; and generating the time-frequency representation based on the time-shifted slice signal.
  13. The method according to claim 12, further comprising selecting the specific duration based on the specific sound content category.
  14. The method according to claim 1, wherein generating the time-frequency representation based on the specific digital audio signal comprises: generating a duration-normalized signal based on the specific digital audio signal; generating a pre-processed signal based on the duration-normalized signal; selecting a time-shifted slice signal of the pre-processed signal, the time-shifted slice signal having a specific duration; and generating the time-frequency representation based on the time-shifted slice signal.
  15. The method according to claim 1, further comprising generating a plurality of latent space representations based on the specific digital audio signal by using the artificial neural network to learn a plurality of sets of numerical codes, each providing a respective one of the plurality of latent space representations.
  16. A computing system comprising: one or more hardware processors; and storage media storing instructions which, when executed by the computing system, cause the computing system to perform the method according to any one of claims 1 to 15.
  17. One or more non-transitory storage media storing one or more sequences of instructions which, when executed by one or more computing devices, cause the one or more computing devices to perform the method according to any one of claims 1 to 15.
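Read together, the claims describe a pipeline: duration-normalize a signal, take a fixed-duration time-shifted slice of it, convert the slice to a time-frequency representation, encode that representation into a lower-dimensional set of numerical codes at a bottleneck, and compare two sets of codes by a cosine-based distance. The following numpy sketch illustrates the data flow only; every parameter value, the toy signals, and the fixed random linear map standing in for the trained encoder are illustrative assumptions, not the patent's actual model.

```python
import numpy as np

# Illustrative parameters only; the patent does not specify these values.
TARGET_LEN, SLICE_START, SLICE_LEN = 2048, 256, 1024
N_FFT, HOP, CODE_DIM = 256, 128, 32

def duration_normalize(x, target_len=TARGET_LEN):
    # Zero-pad or truncate so every source has the same length (claims 12/14).
    return x[:target_len] if len(x) >= target_len else np.pad(x, (0, target_len - len(x)))

def time_shifted_slice(x, start=SLICE_START, length=SLICE_LEN):
    # A slice of fixed duration taken at a time offset (claim 12).
    return x[start:start + length]

def spectrogram(x, n_fft=N_FFT, hop=HOP):
    # Magnitude time-frequency representation via a windowed short-time FFT.
    frames = np.stack([x[i:i + n_fft] * np.hanning(n_fft)
                       for i in range(0, len(x) - n_fft + 1, hop)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # shape: (freq_bins, time_frames)

def encode(tf_rep, weights):
    # Bottleneck codes; a fixed random linear map plus tanh stands in for the
    # trained encoder layers of claim 10.
    v = tf_rep.ravel()
    v = v / (np.linalg.norm(v) + 1e-12)
    return np.tanh(weights @ v)

def cosine_distance(a, b):
    # Claim 7: a distance based on the cosine similarity of two sets of codes.
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(7)
t = np.arange(TARGET_LEN) / 8000.0
sig_a = np.sin(2 * np.pi * 440 * t)                      # a 440 Hz tone
sig_b = sig_a + 0.02 * rng.standard_normal(TARGET_LEN)   # same tone, light noise
sig_c = np.sin(2 * np.pi * 1760 * t)                     # a dissimilar tone

tf_a = spectrogram(time_shifted_slice(duration_normalize(sig_a)))
weights = rng.standard_normal((CODE_DIM, tf_a.size)) / np.sqrt(tf_a.size)

codes = {name: encode(spectrogram(time_shifted_slice(duration_normalize(s))), weights)
         for name, s in [("a", sig_a), ("b", sig_b), ("c", sig_c)]}

d_similar = cosine_distance(codes["a"], codes["b"])
d_different = cosine_distance(codes["a"], codes["c"])
```

A small distance would then drive the "similar sounds" message of claims 3 and 5, with the distance threshold itself chosen per sound content category (claim 6).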

Description

The present invention relates to computer-implemented methods and systems for learning and using latent space representations of digital audio signals, and more particularly to such methods and systems in connection with audio content-based retrieval.

Psychoacoustics encompasses the study of sound and the mechanisms of human perception of sound. Unlike visual information, specific attributes of sound are generally difficult for humans to describe verbally in objective terms. For example, there is no widely accepted objective terminology for describing differences in timbre, and different people may describe the same timbre in different ways: some by the instrument that produces the sound, others by its quality and tone (e.g., bright, metallic, shrill, harsh, jarring, piercing), and still others by the emotion it evokes (e.g., excited, angry, happy, sad). Other elements of sound that resist easy description, especially in music, include rhythm, melody, dynamics, and texture.

Despite this difficulty, many existing audio content retrieval computing systems are keyword-based: audio content is tagged (e.g., indexed) with keywords that describe it, and users of such systems then search or browse for the desired audio content with those keywords. Keyword tagging and indexing work well when audio content is tagged by objective attributes such as artist name, song title, music genre, chromatic pitch, beats per minute, or other objective attributes. However, keyword-based searching or browsing of audio content does not work well when users find it difficult to articulate what they are looking for, or when the attributes that make the desired audio content stand out in a psychoacoustic sense are subjective or multi-factorial.
For example, a user might be looking for a vocal sample that sounds like a particular singer singing a particular melody in a particular way, without being that exact part, melody, or singing style. Similarly, a user might be looking for a drum loop that is similar to, but not necessarily identical with, a particular rhythmic pattern.

Recognizing similar sounds has long been important, and powerful computer-implemented techniques for detecting similar sounds exist. The features of digital audio signals used in computer-based sound similarity recognition are often manually selected, such as the spectral centroid, spectral bandwidth, or spectral flatness of the digital audio signal. Manual feature selection offers complete knowledge of, and control over, how the digital audio signal is represented, and allows the configuration of selected features to be fine-tuned to the requirements of the particular implementation at hand. Unfortunately, such methods often omit useful distinguishing features, fail to recognize them, or produce largely redundant features, rendering the system ineffective. The present invention addresses this issue and other needs.

The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have previously been conceived or pursued. Therefore, unless otherwise indicated, none of the approaches described in this section qualifies as prior art merely by virtue of its inclusion in this section.

Some embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements.
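The passage above names spectral centroid, spectral bandwidth, and spectral flatness as typical manually selected features. As a point of contrast with the learned latent codes, here is a minimal numpy sketch of that manual approach; the formulas are the standard textbook definitions, and the sample rate, test signals, and epsilon value are illustrative assumptions, not details from the patent.

```python
import numpy as np

def manual_spectral_features(magnitude, freqs):
    # Hand-picked descriptors: standard definitions, not the patent's formulas.
    eps = 1e-12
    p = magnitude / (magnitude.sum() + eps)          # treat spectrum as a distribution
    centroid = float((freqs * p).sum())              # spectral "center of mass", in Hz
    bandwidth = float(np.sqrt((((freqs - centroid) ** 2) * p).sum()))
    power = magnitude ** 2 + eps
    # Flatness: geometric mean over arithmetic mean; near 0 for tones, near 1 for noise.
    flatness = float(np.exp(np.log(power).mean()) / power.mean())
    return centroid, bandwidth, flatness

sr, n = 8000, 1024
t = np.arange(n) / sr
tone = np.sin(2 * np.pi * 440 * t) * np.hanning(n)   # a pitched, tonal sound
noise = np.random.default_rng(0).standard_normal(n)  # an unpitched, noisy sound
freqs = np.fft.rfftfreq(n, 1 / sr)

centroid, bandwidth, flatness = manual_spectral_features(np.abs(np.fft.rfft(tone)), freqs)
_, _, flatness_noise = manual_spectral_features(np.abs(np.fft.rfft(noise)), freqs)
```

Descriptors like these do separate tones from noise, but, as the passage notes, a small hand-picked set can omit or blur exactly the distinctions a listener cares about, which motivates learning the representation instead.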
  • A schematic diagram of a system for learning the latent space representation of a digital audio signal, according to some embodiments of the present invention.
  • A schematic diagram of an artificial neural network in a system for learning the latent space representation of a digital audio signal, according to some embodiments.
  • A schematic diagram of the architecture of an artificial neural network in a system for learning the latent space representation of a digital audio signal, according to some embodiments.
  • A flowchart of the process performed by a system for learning the latent space representation of a digital audio signal, according to some embodiments.
  • A mockup of an exemplary graphical user interface for a similar-sound application in an audio content-based retrieval system, according to some embodiments.
  • A mockup of a state change in the exemplary graphical user interface of Figure 5 in response to end-user input, according to some embodiments.
  • A schematic diagram of an exem