KR-20260064401-A - EFFICIENT AUDIO SPECTROGRAM TRANSFORMER LEARNING SYSTEM AND METHOD THROUGH MULTI-LEVEL LEARNING
Abstract
An efficient audio spectrogram transformer learning system and method through multi-stage learning are disclosed. An audio spectrogram transformer learning system according to one embodiment may include: a resolution adjustment unit that adjusts the time-axis resolution of Mel-spectrogram data converted from an audio signal using a time compression method; and a model learning unit that trains a transformer-based model through multi-stage (coarse-to-fine) learning using the Mel-spectrogram data whose time-axis resolution has been adjusted.
Inventors
- Joon Son Chung (정준선)
- Arda Senocak (아르다 세노착)
- Jiu Feng (펑 지우)
- Mehmet Hamza Erol (메흐메트 함자 에롤)
Assignees
- Korea Advanced Institute of Science and Technology (한국과학기술원)
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-11-14
- Priority Date: 2024-10-31
Claims (15)
- An audio spectrogram transformer learning system comprising: a resolution adjustment unit that adjusts the time-axis resolution of Mel-spectrogram data converted from an audio signal using a time compression method; and a model learning unit that trains a transformer-based model through multi-stage (coarse-to-fine) learning using the Mel-spectrogram data whose time-axis resolution has been adjusted.
- The audio spectrogram transformer learning system of claim 1, wherein the system converts the audio signal into Mel-spectrogram data, divides the converted Mel-spectrogram into patches, tokenizes the patch-divided Mel-spectrogram into a token sequence through a tokenization layer, and inputs the tokenized token sequence into the transformer model.
- The audio spectrogram transformer learning system of claim 1, wherein the time compression method includes any one of frame-shift change, max/average pooling, and flexible patchification.
- The audio spectrogram transformer learning system of claim 3, wherein the frame-shift change is a time compression method that adjusts the time-axis resolution using the frame size and frame-shift value; the max/average pooling is a time compression method that adjusts the time-axis resolution by adding a max pooling or average pooling layer with a kernel and stride of a specific size before the Mel-spectrogram is tokenized; and the flexible patchification is a time compression method that adjusts the time-axis resolution by applying rectangular patches of a specific size during the tokenization process.
- The audio spectrogram transformer learning system of claim 1, wherein the resolution adjustment unit reduces the number of tokens of the Mel-spectrogram data using the time compression method.
- The audio spectrogram transformer learning system of claim 1, wherein the model learning unit learns coarse feature information using low-resolution audio spectrogram data in an initial learning stage, and progressively learns detailed information using high-resolution audio spectrogram data in each learning stage after the initial learning stage.
- The audio spectrogram transformer learning system of claim 6, wherein the model learning unit transfers the weights learned in the initial learning stage and in each subsequent learning stage to the next learning stage.
- An audio spectrogram transformer learning method performed by an audio spectrogram transformer learning system, the method comprising: adjusting the time-axis resolution of Mel-spectrogram data converted from an audio signal using a time compression method; and training a transformer-based model through multi-stage (coarse-to-fine) learning using the Mel-spectrogram data whose time-axis resolution has been adjusted.
- The audio spectrogram transformer learning method of claim 8, further comprising: converting the audio signal into Mel-spectrogram data; dividing the converted Mel-spectrogram into patches and tokenizing the patch-divided Mel-spectrogram into a token sequence through a tokenization layer; and inputting the tokenized token sequence into the transformer model.
- The audio spectrogram transformer learning method of claim 8, wherein the time compression method includes any one of frame-shift change, max/average pooling, and flexible patchification.
- The audio spectrogram transformer learning method of claim 10, wherein the frame-shift change is a time compression method that adjusts the time-axis resolution using the frame size and frame-shift value; the max/average pooling is a time compression method that adjusts the time-axis resolution by adding a max pooling or average pooling layer with a kernel and stride of a specific size before the Mel-spectrogram is tokenized; and the flexible patchification is a time compression method that adjusts the time-axis resolution by applying rectangular patches of a specific size during the tokenization process.
- The audio spectrogram transformer learning method of claim 8, wherein the adjusting comprises reducing the number of tokens of the Mel-spectrogram data using the time compression method.
- The audio spectrogram transformer learning method of claim 9, wherein the training comprises learning coarse feature information using low-resolution audio spectrogram data, and then progressively learning detailed information using high-resolution audio spectrogram data in each learning stage after the initial learning stage.
- The audio spectrogram transformer learning method of claim 13, wherein the training comprises transferring the weights learned in the initial learning stage and in each subsequent learning stage to the next learning stage.
- A computer program stored on a computer-readable storage medium for executing, by an audio spectrogram transformer learning system, an audio spectrogram transformer learning method, the method comprising: adjusting the time-axis resolution of Mel-spectrogram data converted from an audio signal using a time compression method; and training a transformer-based model through multi-stage (coarse-to-fine) learning using the Mel-spectrogram data whose time-axis resolution has been adjusted.
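Claims 3–4 (and the parallel method claims 10–11) name three time-compression mechanisms: frame-shift change, max/average pooling before tokenization, and flexible (rectangular) patchification. The following is a minimal illustrative sketch of how each mechanism reduces the time-axis token budget; all function names and parameter values here are hypothetical choices for illustration, not taken from the patent.

```python
import numpy as np

def mel_like(n_mels: int, n_frames: int) -> np.ndarray:
    """Stand-in for a Mel-spectrogram of shape (n_mels, n_frames); values are arbitrary."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_mels, n_frames))

def frame_shift_change(num_samples: int, frame_shift: int) -> int:
    """Frame-shift change: a larger hop between analysis frames yields
    fewer time frames before tokenization."""
    return num_samples // frame_shift

def avg_pool_time(spec: np.ndarray, kernel: int, stride: int) -> np.ndarray:
    """Average pooling along the time axis before tokenization
    (max pooling would replace .mean with .max)."""
    n_mels, n_frames = spec.shape
    out_frames = (n_frames - kernel) // stride + 1
    out = np.empty((n_mels, out_frames))
    for i in range(out_frames):
        out[:, i] = spec[:, i * stride : i * stride + kernel].mean(axis=1)
    return out

def flexible_patch_tokens(spec: np.ndarray, patch_f: int, patch_t: int) -> int:
    """Flexible patchification: rectangular patch_f x patch_t patches;
    widening the patch along time yields fewer tokens."""
    n_mels, n_frames = spec.shape
    return (n_mels // patch_f) * (n_frames // patch_t)

spec = mel_like(128, 1024)
assert frame_shift_change(160_000, 160) == 1000   # e.g. 10 ms hop on 10 s of 16 kHz audio
assert frame_shift_change(160_000, 320) == 500    # doubling the hop halves the frames
assert avg_pool_time(spec, kernel=2, stride=2).shape == (128, 512)
assert flexible_patch_tokens(spec, 16, 16) == 512  # square 16x16 patches
assert flexible_patch_tokens(spec, 16, 32) == 256  # rectangular patches halve the tokens
```

All three routes arrive at the same effect named in claims 5 and 12: a reduced number of tokens along the time axis.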
Description
Efficient Audio Spectrogram Transformer Learning System and Method Through Multi-Level Learning

The following description concerns audio spectrogram transformer learning technology.

Recently, transformer-based models have played a significant role in various tasks such as audio classification. In particular, the Audio Spectrogram Transformer (AST) has greatly improved audio classification performance by dividing audio spectrograms into patches and using them as input. However, these transformer-based models require substantial resources and time for training, and their complexity grows quadratically as the size of the input audio spectrogram increases. To address this issue, previous studies have proposed methods that optimize transformer complexity by reducing the length of the input sequence. However, existing methods cannot efficiently handle the unnecessary information (temporal redundancy) that arises along the time axis of the audio spectrogram.

FIG. 1 is a diagram illustrating a multi-stage learning operation for efficient audio spectrogram transformer learning in one embodiment. FIG. 2 is a diagram illustrating the audio spectrogram transformer learning operation in one embodiment. FIG. 3 is a block diagram illustrating an audio spectrogram transformer learning system in one embodiment. FIG. 4 is a flowchart illustrating an audio spectrogram transformer learning method in one embodiment.

Hereinafter, embodiments will be described in detail with reference to the attached drawings.

Conventional audio classification systems process Mel-spectrograms at a fixed resolution and often use high-resolution data from the early stages of training. However, high-resolution data may contain unnecessary temporal redundancy, and during the initial training phase such details may not contribute significantly to the performance of transformer-based models.
This invention eliminates this temporal redundancy and enables transformer-based models to rapidly learn general features from a smaller amount of data. Consequently, transformer-based models can achieve high performance with fewer resources while significantly improving training speed.

FIG. 1 is a diagram illustrating a multi-stage learning operation for efficient audio spectrogram transformer learning in one embodiment. The audio spectrogram transformer learning system employs a learning strategy that begins training with various low-resolution audio spectrogram data in the initial stage and then gradually fine-tunes using high-resolution audio spectrogram data in the final stage. FIG. 1 illustrates the initial and final stages to explain the multi-stage learning operation. In the initial stage, the time-axis resolution of the audio spectrogram data is reduced, so that a smaller number of tokens is obtained. FIG. 1 shows, in order, the frame-shift change compression method, the max/average pooling compression method, and the flexible patchification compression method applied in the initial stage.

The audio spectrogram transformer learning system uses these time-axis compression methods (frame-shift change, max/average pooling, and flexible patchification) to adjust the resolution of the audio spectrogram data at each training stage, enabling the transformer-based model to learn complex information progressively. In this process, the learned weights are passed to the next training stage, allowing the transformer-based model to be fine-tuned so that it learns the details required at each stage more accurately. This stepwise learning strategy prevents the overfitting that can occur when a transformer-based model processes complex data, and maximizes training performance while reducing resource consumption.

FIG. 2 is a diagram illustrating the audio spectrogram transformer learning operation in one embodiment.
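The coarse-to-fine schedule with weight transfer described above can be sketched as a staged loop. This is a toy sketch under stated assumptions: the "model" is a flat weight vector, the "training step" is a placeholder update rather than gradient descent, and time-axis subsampling stands in for the patent's compression methods; none of these names come from the patent.

```python
import numpy as np

def train_stage(weights: np.ndarray, spec: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """Toy 'training step': nudge weights toward the per-mel-band mean of this
    stage's spectrogram (a placeholder for real gradient updates)."""
    target = spec.mean(axis=1)
    return weights + lr * (target - weights)

def coarse_to_fine(spec: np.ndarray,
                   stage_strides=(4, 2, 1),
                   steps_per_stage: int = 50) -> np.ndarray:
    """Multi-stage schedule: start on heavily time-compressed data and finish
    at full resolution; weights learned in each stage initialize the next."""
    n_mels = spec.shape[0]
    weights = np.zeros(n_mels)          # initial model parameters
    for stride in stage_strides:        # coarse -> fine resolution schedule
        staged = spec[:, ::stride]      # simple time-axis subsampling stands in
                                        # for frame-shift change / pooling
        for _ in range(steps_per_stage):
            weights = train_stage(weights, staged)
        # weight transfer is implicit: 'weights' carries over to the next stage
    return weights

spec = np.random.default_rng(1).standard_normal((8, 64))
final_weights = coarse_to_fine(spec)
assert final_weights.shape == (8,)
```

The key structural point matches claims 7 and 14: each stage begins from the previous stage's weights, so early stages see fewer tokens per example while later stages refine detail at full resolution.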
The audio spectrogram transformer learning system can perform audio spectrogram input (210), time compression (220), multi-stage learning (230), and weight adjustment (240) operations.

The audio spectrogram input (210) operation is described first. The audio spectrogram transformer learning system can convert an audio signal into a Mel-spectrogram. Given an audio signal x, the system can process the Mel-spectrogram S = mel(x), where mel(·) denotes the spectrogram generation module. The Mel-spectrogram is first divided into patches and then tokenized into a token sequence z = Token(S) through the tokenization layer Token(·), where f × t denotes the number of tokens (f patches along the frequency axis and t patches along the time axis). The token sequence can be used as input to the encoder layers of the transformer. When square patches are used, it is important to note that the length of the time axis strongly determines the number of tokens, and the self-attention complexity in turn grows quadratically with the number of tokens. It is assumed that transformer-based models may not initially require such a detailed representation of the Mel-spectrogram.
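The token-count arithmetic above, and the quadratic cost it drives, can be checked with a few lines. The patch size of 16 and the spectrogram dimensions below are illustrative assumptions, not values specified in the patent.

```python
def num_tokens(n_mels: int, n_frames: int, patch: int = 16) -> int:
    """Tokens from square patches: f x t = (n_mels // patch) x (n_frames // patch)."""
    return (n_mels // patch) * (n_frames // patch)

def attention_cost(tokens: int) -> int:
    """Self-attention computes pairwise interactions, so its cost scales
    with the square of the sequence length."""
    return tokens * tokens

full = num_tokens(128, 1024)   # 8 x 64 = 512 tokens at full time resolution
half = num_tokens(128, 512)    # halving the time axis -> 256 tokens
assert full == 512 and half == 256
assert attention_cost(full) // attention_cost(half) == 4  # 2x shorter -> 4x cheaper attention
```

This is why compressing only the time axis in early stages pays off disproportionately: halving t halves the token count but quarters the attention cost.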