KR-20260067828-A - ELECTRONIC DEVICE FOR COMPRESSING AND RESTORING AUDIO SIGNALS AND OPERATING METHOD THEREOF, AND A RECORDING MEDIUM

KR 20260067828 A

Abstract

An electronic device for compressing and restoring an audio signal, and a method of operating the same, are disclosed. According to one embodiment, the electronic device minimizes the signal distortion that occurs while compressing the audio signal and improves the quality of the restored audio signal, so that it stably provides high-quality audio in various bit-rate environments. It is also designed to operate at various bit rates during compression and restoration, so that any target bit rate can be achieved according to the user's situation.

Inventors

  • 장인선
  • 김병현
  • 이지현
  • 임형섭
  • 임우택
  • 박수영
  • 백승권
  • 성종모
  • 조병호
  • 강정원
  • 이태진
  • 강홍구

Assignees

  • 한국전자통신연구원
  • 연세대학교 산학협력단

Dates

Publication Date
2026-05-13
Application Date
2024-11-06

Claims (20)

  1. A method of operating an electronic device, the method comprising: converting an original audio signal in the time domain into a frequency-domain signal to obtain frequency-domain coefficients; extracting a bit-rate-independent embedding vector and a bit-rate-dependent embedding vector based on the frequency-domain coefficients and a bit-rate indicator; normalizing the frequency-domain coefficients for each frequency band based on a normalization scale predicted using the bit-rate-independent embedding vector; quantizing the frequency-domain coefficients normalized for each frequency band based on a quantization unit predicted using the bit-rate-independent embedding vector and the bit-rate-dependent embedding vector; compressing the quantized frequency-domain coefficients by converting them into a bit sequence based on a coefficient distribution predicted using the bit-rate-independent embedding vector and the bit-rate-dependent embedding vector; and compressing a result of quantizing the bit-rate-independent embedding vector and a result of quantizing the bit-rate-dependent embedding vector by converting them into a bit sequence.
  2. The method of claim 1, wherein the extracting comprises: generating a band embedding sequence based on vectors corresponding to specific frequency bands obtained by inputting the frequency-domain coefficients into a predetermined convolutional neural network; merging, into a single vector, the non-linear projection results for each band embedding constituting the band embedding sequence, obtained by inputting the band embedding sequence into a first deep neural network independently assigned to each frequency band; and obtaining the bit-rate-independent embedding vector by inputting the merged vector into a second deep neural network.
  3. The method of claim 2, wherein the extracting further comprises: merging, into a single vector sequence, the linear projection result of the band embedding sequence and a bit-rate embedding vector extracted based on the bit-rate indicator; merging, into a single vector, the latent representations for each frequency band obtained by inputting the vector sequence into a transformer; and obtaining the bit-rate-dependent embedding vector by inputting the merged vector into a third deep neural network.
  4. The method of claim 1, wherein the normalizing further comprises selectively scaling the frequency-domain coefficients using a global normalization scale.
  5. The method of claim 1, wherein the quantizing comprises converting the normalized frequency-domain coefficients into integer symbols by dividing them by the predicted quantization unit and applying a rounding function.
  6. The method of claim 5, wherein the compressing of the quantized frequency-domain coefficients into a bit sequence comprises: predicting an occurrence probability distribution of the converted integer symbols for each of the frequency bands; and compressing the quantized frequency-domain coefficients into a bit sequence by assigning variable-length codewords to the converted integer symbols according to the predicted occurrence probability distribution.
  7. The method of claim 1, wherein the compressing of the result of quantizing the bit-rate-independent embedding vector and the result of quantizing the bit-rate-dependent embedding vector into a bit sequence comprises: generating a codebook based on a probability mass function predicted from the pre-learned distributions of the bit-rate-independent embedding vector and the bit-rate-dependent embedding vector; generating a bit sequence for the quantized results by assigning codewords to them using the codebook; and compressing the generated bit sequence based on the predicted coefficient distribution.
  8. The method of claim 1, wherein the bit-rate-independent embedding vector, the bit-rate-dependent embedding vector, the normalization scale, the quantization unit, and the coefficient distribution are extracted and predicted through a deep neural network or a differentiable function.
  9. The method of claim 1, further comprising updating parameters of a deep neural network or a differentiable function that extracts and predicts the bit-rate-independent embedding vector, the bit-rate-dependent embedding vector, the normalization scale, the quantization unit, and the coefficient distribution, based on a loss function generated using the bit-rate indicator and the distortion between the original audio signal and a restored audio signal.
  10. A method of operating an electronic device, the method comprising: recovering a bit-rate-independent embedding vector and a bit-rate-dependent embedding vector from a bit sequence into which an original audio signal has been compressed and transmitted; restoring, from the bit sequence, frequency-domain coefficients normalized for each frequency band, using a quantization unit and a coefficient distribution predicted based on the bit-rate-independent embedding vector and the bit-rate-dependent embedding vector; restoring frequency-domain coefficients from the frequency-domain coefficients normalized for each frequency band, using a normalization scale predicted based on the bit-rate-independent embedding vector; and obtaining a restored audio signal by converting the frequency-domain coefficients into a time-domain signal.
  11. The method of claim 10, further comprising updating parameters of a deep neural network or a differentiable function that extracts and predicts the bit-rate-independent embedding vector, the bit-rate-dependent embedding vector, the normalization scale, the quantization unit, and the coefficient distribution, based on a loss function generated using the bit-rate indicator and the distortion between the original audio signal and the restored audio signal.
  12. An electronic device comprising: a processor; and a memory storing instructions, wherein the instructions, when executed by the processor, cause the electronic device to: convert an original audio signal in the time domain into a frequency-domain signal to obtain frequency-domain coefficients; extract a bit-rate-independent embedding vector and a bit-rate-dependent embedding vector based on the frequency-domain coefficients and a bit-rate indicator; normalize the frequency-domain coefficients for each frequency band based on a normalization scale predicted using the bit-rate-independent embedding vector; quantize the frequency-domain coefficients normalized for each frequency band based on a quantization unit predicted using the bit-rate-independent embedding vector and the bit-rate-dependent embedding vector; compress the quantized frequency-domain coefficients by converting them into a bit sequence based on a coefficient distribution predicted using the bit-rate-independent embedding vector and the bit-rate-dependent embedding vector; and compress a result of quantizing the bit-rate-independent embedding vector and a result of quantizing the bit-rate-dependent embedding vector by converting them into a bit sequence.
  13. The electronic device of claim 12, wherein the processor: generates a band embedding sequence based on vectors corresponding to specific frequency bands obtained by inputting the frequency-domain coefficients into a predetermined convolutional neural network; merges, into a single vector, the non-linear projection results for each band embedding constituting the band embedding sequence, obtained by inputting the band embedding sequence into a first deep neural network independently assigned to each frequency band; and obtains the bit-rate-independent embedding vector by inputting the merged vector into a second deep neural network.
  14. The electronic device of claim 13, wherein the processor: merges, into a single vector sequence, the linear projection result of the band embedding sequence and a bit-rate embedding vector extracted based on the bit-rate indicator; merges, into a single vector, the latent representations for each frequency band obtained by inputting the vector sequence into a transformer; and obtains the bit-rate-dependent embedding vector by inputting the merged vector into a third deep neural network.
  15. The electronic device of claim 12, wherein the processor selectively scales the frequency-domain coefficients using a global normalization scale.
  16. The electronic device of claim 12, wherein the processor converts the normalized frequency-domain coefficients into integer symbols by dividing them by the predicted quantization unit and then applying a rounding function.
  17. The electronic device of claim 16, wherein the processor: predicts an occurrence probability distribution of the converted integer symbols for each of the frequency bands; and compresses the quantized frequency-domain coefficients into a bit sequence by assigning variable-length codewords to the converted integer symbols according to the predicted occurrence probability distribution.
  18. The electronic device of claim 12, wherein the processor: generates a codebook based on a probability mass function predicted from the pre-learned distributions of the bit-rate-independent embedding vector and the bit-rate-dependent embedding vector; generates a bit sequence for the results of quantizing the bit-rate-independent embedding vector and the bit-rate-dependent embedding vector by assigning codewords to them using the codebook; and compresses the generated bit sequence based on the predicted coefficient distribution.
  19. The electronic device of claim 12, wherein the bit-rate-independent embedding vector, the bit-rate-dependent embedding vector, the normalization scale, the quantization unit, and the coefficient distribution are extracted and predicted through a deep neural network or a differentiable function.
  20. The electronic device of claim 12, wherein the processor updates parameters of a deep neural network or a differentiable function that extracts and predicts the bit-rate-independent embedding vector, the bit-rate-dependent embedding vector, the normalization scale, the quantization unit, and the coefficient distribution, based on a loss function generated using the bit-rate indicator and the distortion between the original audio signal and the restored audio signal.
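
The encoder-side steps of claims 1 and 4-6 (per-band normalization, quantization by a predicted quantization unit with rounding, and variable-length coding against a predicted symbol distribution) can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: here the normalization scale, quantization unit, and symbol probabilities are taken as given, whereas in the disclosure they are predicted by deep neural networks from the embedding vectors. All function names are hypothetical.

```python
import numpy as np

def quantize_band(coeffs, scale, step):
    # Normalize one band's frequency-domain coefficients by a (given)
    # normalization scale, then quantize to integer symbols by dividing
    # by the quantization unit and rounding (claims 4-5).
    return np.round((coeffs / scale) / step).astype(np.int64)

def dequantize_band(symbols, scale, step):
    # Decoder-side inverse: rescale integer symbols back to
    # approximate coefficients (claim 10).
    return symbols.astype(np.float64) * step * scale

def ideal_code_length_bits(symbols, pmf, floor=1e-6):
    # Ideal variable-length code size, -log2 p(s) bits per symbol, for a
    # predicted occurrence probability distribution (claim 6); symbols
    # missing from `pmf` get a small floor probability.
    return float(sum(-np.log2(pmf.get(int(s), floor)) for s in symbols))

# Toy round trip for one frequency band.
band = np.array([0.80, -0.33, 0.12, 0.02])
scale, step = 1.0, 0.05
symbols = quantize_band(band, scale, step)        # -> [16, -7, 2, 0]
restored = dequantize_band(symbols, scale, step)
# Reconstruction error is bounded by half the effective step size.
assert np.max(np.abs(band - restored)) <= step * scale / 2 + 1e-12
```

In the disclosure the codewords themselves would come from an entropy coder driven by the per-band predicted distribution; the ideal code length above only estimates the resulting bit cost.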

Description

Electronic device for compressing and restoring audio signals, method of operation thereof, and recording medium

Embodiments of the present disclosure relate to an electronic device for compressing and restoring an audio signal, a method of operating the same, and a recording medium. Audio signal compression and decompression technologies are widely used to efficiently store audio signals on devices or to rapidly transmit and receive them over communication networks. In particular, as the volume of media signals transmitted over the Internet grows, there is increasing interest in audio encoding and decoding technologies that can minimize distortion during restoration while using a smaller amount of compressed data.

The information described above may be provided as related art for the purpose of aiding understanding of the present disclosure. No claim or determination is made as to whether any of the foregoing is applicable as prior art with respect to the present disclosure.

In the description of the drawings, the same or similar reference numerals may be used for identical or similar components.

FIG. 1 is a diagram illustrating the configuration of an electronic device according to one embodiment.
FIG. 2 is a diagram illustrating the process by which an electronic device operating as an encoder encodes an original audio signal, according to one embodiment.
FIG. 3 is a diagram illustrating a method for extracting a bit-rate-independent embedding vector and a bit-rate-dependent embedding vector according to one embodiment.
FIG. 4 is a diagram illustrating the process by which an electronic device operating as a decoder obtains a restored audio signal, according to one embodiment.
FIG. 5 is a diagram illustrating a method for calculating a loss function using a trained acoustic perceptual quality predictor according to one embodiment.
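
The loss described in connection with FIG. 5 and claims 9 and 11 combines the distortion between the original and restored signals with a term derived from the bit-rate indicator. The sketch below uses plain mean squared error as a stand-in for the learned acoustic perceptual quality predictor of the disclosure, and an assumed linear over-budget penalty; the weight `lam` and the form of the rate term are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def rate_distortion_loss(original, restored, est_bits, target_bits, lam=0.01):
    # Distortion term: MSE stand-in for the learned perceptual predictor.
    diff = np.asarray(original, dtype=float) - np.asarray(restored, dtype=float)
    distortion = float(np.mean(diff ** 2))
    # Rate term: penalize only bits spent beyond the budget implied by
    # the bit-rate indicator (assumed penalty form).
    over_budget = max(0.0, est_bits - target_bits)
    return distortion + lam * over_budget
```

Because every quantity here would be produced by deep neural networks or differentiable functions in the disclosure, a loss of this shape can be minimized end to end by gradient descent, which is what the parameter-update steps of claims 9 and 11 describe; `lam` trades reconstruction quality against bit budget.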
Specific structural or functional descriptions of the embodiments are disclosed merely for illustrative purposes and may be modified and implemented in various forms. Accordingly, actual implementations are not limited to the specific embodiments disclosed, and the scope of this specification includes modifications, equivalents, or substitutions within the technical concept described by the embodiments.

Terms such as "first" or "second" may be used to describe various components, but these terms should be interpreted solely as distinguishing one component from another. For example, a first component may be named a second component, and similarly, a second component may be named a first component. When a component is said to be "connected" to another component, it may be directly connected or joined to that other component, or intervening components may be present.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this document, phrases such as "A or B," "at least one of A and B," "at least one of A or B," "A, B or C," "at least one of A, B and C," and "at least one of A, B, or C" may each include any one of the items listed together with the corresponding phrase, or all possible combinations thereof. In this specification, terms such as "comprising" or "having" designate the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should not be understood as precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meanings as those generally understood by those skilled in the art.
Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant technology, and should not be interpreted in an idealized or overly formal sense unless explicitly defined in this specification.

Hereinafter, embodiments will be described in detail with reference to the attached drawings. In the description referring to the attached drawings, identical components are given the same reference numerals regardless of the drawing number, and redundant descriptions thereof are omitted.

FIG. 1 is a diagram illustrating the configuration of an electronic device according to one embodiment. As illustrated in FIG. 1, the electronic device (100) may include one or more processors (110) and a memory (120) that loads or stores a program (130) executed by the processor (110). The components of the electronic device (100) shown in FIG. 1 are merely examples, and a person skilled in the art to which the present invention pertains will appreciate that other general-purpose components may be included in addition to those shown in FIG. 1. The processor (110) controls the overall operation of each component of the electronic device (100). Th