CN-121986377-A - Audio signal comparison with external normalization


Abstract

An audio processing system is disclosed that compares a query audio sample with a database of multiple reference audio samples using external normalization. The system includes at least one processor and a memory storing instructions that, when executed by the processor, cause the system to determine an externally normalized bias term based on a time-frequency pattern of the query audio sample. The system also compares the query audio sample with each reference audio sample to generate a similarity score for each comparison, and combines the bias term with each similarity score to produce a normalized similarity score. The normalized similarity score is then compared with a threshold to generate a comparison result, which is output.

Inventors

  • G. Vichem
  • D. Blalios
  • F. G. Germanic
  • J. LEROUX

Assignees

  • Mitsubishi Electric Corporation (三菱电机株式会社)

Dates

Publication Date
2026-05-05
Application Date
2024-07-23
Priority Date
2023-11-06

Claims (20)

  1. An audio processing system that compares a query audio sample with a database of a plurality of reference audio samples using external normalization, the audio processing system comprising a processor and a memory having stored thereon instructions that, when executed by the processor, cause the audio processing system to: determine an externally normalized bias term based on a time-frequency pattern of the query audio sample; compare the query audio sample with each of the reference audio samples to generate a similarity score for each comparison; combine the bias term with the similarity score of each comparison to produce a normalized similarity score; compare the normalized similarity score with a threshold to produce a comparison result; and output the comparison result.
  2. The audio processing system of claim 1, wherein to determine the bias term, the processor is configured to: compare the query audio sample with a set of training audio samples to produce a set of training similarity metrics; and determine the bias term based on an average of the training similarity metrics of the K closest training audio samples.
  3. The audio processing system of claim 2, wherein to determine the bias term, the processor is configured to scale the average of the training similarity metrics by a scalar to produce the bias term.
  4. The audio processing system of claim 3, wherein the scalar is a function of the diversity of time-frequency patterns in the training audio samples.
  5. The audio processing system of claim 1, wherein to determine the bias term, the processor is configured to: extract the time-frequency pattern of the query audio sample; and process the extracted time-frequency pattern with a predetermined analysis function to produce the bias term.
  6. The audio processing system of claim 1, wherein to determine the bias term, the processor is configured to: extract the time-frequency pattern of the query audio sample; and process the extracted time-frequency pattern with a learned function trained using machine learning to produce the bias term.
  7. The audio processing system of claim 6, wherein the learned function is trained with supervised machine learning using bias terms determined from an average similarity measure of training audio samples.
  8. The audio processing system of claim 1, wherein to compare the query audio sample with a reference audio sample, the processor is configured to: calculate a mel-frequency spectrogram of each of the query audio sample and the reference audio sample; and determine the similarity score between the query audio sample and the reference audio sample based on the cosine similarity of the calculated mel-frequency spectrograms.
  9. The audio processing system of claim 8, wherein the processor is configured to normalize the mel-frequency spectrograms using internal normalization.
  10. The audio processing system of claim 8, wherein the processor is configured to calculate the mel-frequency spectrogram at a coarse resolution having fewer than 20 mel-frequency bands and a spacing between consecutive time windows of greater than 20 ms.
  11. The audio processing system of claim 1, wherein to compare the query audio sample with a reference audio sample, the processor is configured to: calculate an embedding of each of the query audio sample and the reference audio sample using a neural network; and determine the similarity score between the query audio sample and the reference audio sample based on the cosine similarity of the calculated embeddings.
  12. The audio processing system of claim 1, wherein the database of the plurality of reference audio samples includes the query audio sample such that the reference audio samples are compared with each other, and wherein the processor is further configured to condense the database of the plurality of reference audio samples upon detection of a copy indicated by the comparison result.
  13. The audio processing system of claim 12, wherein the processor is further configured to train an audio deep learning model using the condensed database of the plurality of reference audio samples.
  14. The audio processing system of claim 1, wherein the processor is further configured to: train an audio generation model to generate audio samples using the database of the plurality of reference audio samples; and compare the generated audio samples with the reference audio samples using the external normalization to detect whether at least some of the generated audio samples are copies of audio samples contained in the database of the plurality of reference audio samples.
  15. The audio processing system of claim 1, wherein the processor is further configured to: execute an audio generation model trained using the database of the plurality of reference audio samples to generate audio samples; and transmit the generated audio samples unless the external normalization indicates that the generated audio samples are copies of one or more of the reference audio samples.
  16. The audio processing system of claim 1, wherein the processor is further configured to perform anomaly detection on the query audio sample based on the comparison result.
  17. An audio processing method that compares a query audio sample with a database of a plurality of reference audio samples using external normalization, wherein the method uses a processor coupled with stored instructions implementing the method, and wherein the instructions, when executed by the processor, carry out steps of the method, the method comprising: determining an externally normalized bias term based on a time-frequency pattern of the query audio sample; comparing the query audio sample with each of the reference audio samples to generate a similarity score for each comparison; combining the bias term with each of the similarity scores to produce a normalized similarity score; comparing the normalized similarity score with a threshold to produce a comparison result; and outputting the comparison result.
  18. The audio processing method of claim 17, further comprising: comparing the query audio sample with a set of training audio samples to produce a set of training similarity metrics; and determining the bias term based on an average of the training similarity metrics of the K closest training audio samples.
  19. The audio processing method of claim 17, further comprising: calculating a mel-frequency spectrogram of each of the query audio sample and the reference audio sample, wherein the mel-frequency spectrogram is calculated at a coarse resolution having fewer than 20 mel-frequency bands and a spacing between consecutive time windows of greater than 20 ms; and determining the similarity score between the query audio sample and the reference audio sample based on the cosine similarity of the calculated mel-frequency spectrograms.
  20. A non-transitory computer-readable storage medium having embodied thereon a program executable by a processor to perform a method of comparing a query audio sample with a database of a plurality of reference audio samples using external normalization, the method comprising: determining an externally normalized bias term based on a time-frequency pattern of the query audio sample; comparing the query audio sample with each of the reference audio samples to generate a similarity score for each comparison; combining the bias term with each of the similarity scores to produce a normalized similarity score; comparing the normalized similarity score with a threshold to produce a comparison result; and outputting the comparison result.
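The comparison pipeline of claims 1-3 can be sketched as follows. This is a minimal illustration under assumed toy features: the feature vectors (plain lists standing in for time-frequency patterns), the function names, the values of K, the scaling factor, and the threshold are all hypothetical, and subtracting the bias (equivalently, adding a negative-valued bias term) is one plausible sign convention for "combining" it with the similarity score.

```python
# Hedged sketch of the claimed comparison pipeline; not the patent's
# implementation. Feature vectors are illustrative stand-ins for
# time-frequency patterns.
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def bias_term(query, training_samples, k=2, scale=1.0):
    """External-normalization bias (claims 2-3): scaled mean similarity
    of the K closest training samples to the query."""
    sims = sorted((cosine_similarity(query, t) for t in training_samples),
                  reverse=True)
    return scale * sum(sims[:k]) / k

def compare(query, references, training_samples, threshold=0.1):
    """Claim 1: per reference, combine the bias term with the similarity
    score and compare the normalized score against a single threshold."""
    b = bias_term(query, training_samples)
    results = []
    for ref in references:
        score = cosine_similarity(query, ref)
        normalized = score - b  # one plausible way to "combine" the bias
        results.append(normalized > threshold)
    return results
```

Because the bias term tracks how easily the query matches arbitrary training material, a "promiscuous" query raises its own baseline, which is what allows a single threshold across query types.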

Description

Audio signal comparison with external normalization

Technical Field

The present disclosure relates generally to audio signal processing, and more particularly to systems and methods for comparing audio signals.

Background

Previous methods for comparing audio samples to a database of reference audio samples typically rely on conventional audio processing techniques. These methods generally analyze the audio samples with various signal processing algorithms to extract relevant features such as spectral content, temporal patterns, time-frequency landmarks, and amplitude variations. However, when attempting to quantify how well two audio signals match when one is not an exact copy of the other, these methods may not adequately account for variations in total energy level, frequency balance, and signal length between the query audio sample and the reference audio sample without normalization.

In some cases, audio processing systems have attempted to solve this problem by applying normalization techniques that adjust the overall characteristics of the query audio sample itself. While these techniques may partially mitigate the effects of loudness, overall frequency balance, and variations in signal length, they may not provide a comprehensive solution that considers the particular time-frequency pattern of the query audio sample.

Other approaches compare audio samples using statistical methods such as dynamic time warping or hidden Markov models. These methods aim to align the query audio sample with the reference audio sample by taking into account the temporal relationship between the different segments of the audio signal. However, they may not provide accurate results when comparing audio samples with significant changes in loudness or energy level, or with slight frequency shifts between spectral components.
Accordingly, there remains a need in the art for a comprehensive comparison method that accurately compares a query audio sample to a database of multiple reference audio samples.

Disclosure of Invention

It is an object of some embodiments to provide a system and method for comparing audio signals or audio samples. Additionally or alternatively, it is an object of some embodiments to provide a system and method for comparing audio samples to each other and/or to a plurality of other audio samples.

Some implementations are based on the recognition that audio samples need to be normalized for fair comparison. For example, the audio samples may be modified to have the same length and loudness. Such normalization modifies the audio samples themselves and is referred to herein as internal normalization. However, internal normalization alone is not sufficient to fairly compare different audio samples. One reason is that the similarity of two audio samples depends on how their frequency content changes over time, which can be described as comparing the time-frequency patterns of the two audio samples.

It should be appreciated that some sounds have a time-frequency pattern that matches well with a wide variety of other sounds. An extreme example is white noise, which has energy at all frequencies that is constant over time, and can therefore match any sound with constant frequency characteristics over time. Similarly, a single short click surrounded by silence may match any other sound of short duration, regardless of the frequency content of the click. This problem remains even after the sounds are normalized in volume and length prior to comparison. Therefore, in addition to internal normalization of the audio samples, normalization of the audio comparison itself is also required.
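The internal normalization described above, which equalizes length and loudness by modifying the samples themselves, can be sketched as follows. The target length, the RMS loudness target, and the pad/truncate policy are illustrative assumptions, not details specified by the patent.

```python
# Hedged sketch of internal normalization: modify the audio samples
# themselves to a common length and loudness. Targets are hypothetical.
import math

def internal_normalize(samples, target_len=8, target_rms=0.1):
    # Pad with silence, or truncate, to the common length.
    if len(samples) < target_len:
        samples = samples + [0.0] * (target_len - len(samples))
    else:
        samples = samples[:target_len]
    # Rescale to a common RMS loudness.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms > 0:
        samples = [s * (target_rms / rms) for s in samples]
    return samples
```

As the passage above notes, this step alone cannot distinguish a query whose time-frequency pattern happens to match many sounds (e.g., white noise) from a genuinely similar pair, which is what motivates the external normalization of the comparison result itself.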
This normalization is referred to herein as external normalization; it does not modify the audio samples themselves, but rather modifies the result of the comparison to make it fairer across different kinds of audio samples. External normalization allows multiple audio queries to be compared to determine whether a given query matches any reference sample, without requiring a query-specific threshold. Using external normalization, a single threshold may be used to detect matches across multiple types of audio queries.

In some implementations, the external normalization determines a bias term based on a time-frequency pattern of the query audio sample and adds the bias term to the similarity score generated by comparing the query audio sample to other reference audio samples. Thus, in addition to or instead of normalizing the audio samples by internal normalization, some embodiments normalize the similarity score by external normalization.

In some aspects, the technology described herein relates to an audio processing system that compares a query audio sample with a database of multiple reference audio samples using external normalization, the audio processing system comprising at least one processor, and a