KR-20260064123-A - Method and Apparatus for Speech Tokenization
Abstract
A method and apparatus for speech tokenization are disclosed. According to one aspect of the present disclosure, a computer-implemented method performed by one or more computers comprises: receiving an input audio waveform including a speech signal; generating one or more feature vectors representing the input audio waveform using an encoder neural network; and generating a plurality of quantized vectors for the feature vectors using a plurality of vector quantizers, wherein one of the plurality of quantized vectors includes semantic information of the speech signal and another of the plurality of quantized vectors includes acoustic information of the speech signal.
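The abstract describes an encoder neural network that maps the input waveform to feature vectors and a plurality of vector quantizers that turn those features into a semantic token stream and an acoustic token stream. Below is a minimal PyTorch sketch of that inference path; the convolutional encoder, layer sizes, and codebook sizes are illustrative assumptions rather than the architecture actually disclosed, and the sequential (residual) arrangement of the quantizers recited in claims 2 and 3 is shown separately in a sketch following the claims.

```python
# Minimal sketch of the tokenization path described in the abstract (hedged:
# the convolutional encoder, layer sizes, and codebook sizes are illustrative
# assumptions, not the architecture actually disclosed).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Maps each frame to the nearest entry of a learned codebook."""
    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x):
        # x: (batch, frames, dim); L2 distance to every code vector
        dists = torch.cdist(x, self.codebook.weight.expand(x.size(0), -1, -1))
        indices = dists.argmin(dim=-1)           # discrete speech tokens
        return self.codebook(indices), indices   # quantized vectors, token ids

class SpeechTokenizerSketch(nn.Module):
    def __init__(self, dim: int = 256, codebook_size: int = 1024):
        super().__init__()
        # Stand-in encoder: strided 1-D convolutions over the raw waveform.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
        )
        self.semantic_vq = VectorQuantizer(codebook_size, dim)
        self.acoustic_vq = VectorQuantizer(codebook_size, dim)

    def forward(self, waveform):
        # waveform: (batch, samples) -> feature vectors of shape (batch, frames, dim)
        feats = self.encoder(waveform.unsqueeze(1)).transpose(1, 2)
        q_sem, semantic_tokens = self.semantic_vq(feats)   # semantic stream
        q_ac, acoustic_tokens = self.acoustic_vq(feats)    # acoustic stream
        return semantic_tokens, acoustic_tokens, (q_sem, q_ac)

semantic_tokens, acoustic_tokens, _ = SpeechTokenizerSketch()(torch.randn(1, 16000))
```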
Inventors
- 정원진
- 강성일
- 조동연
Assignees
- SK Telecom Co., Ltd. (에스케이텔레콤 주식회사)
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-10-31
Claims (16)
- 1. A computer-implemented method performed by one or more computers, the method comprising: receiving an input audio waveform containing a speech signal; generating one or more feature vectors representing the input audio waveform using an encoder neural network; and generating a plurality of quantized vectors for the feature vectors using a plurality of vector quantizers, wherein one of the plurality of quantized vectors includes semantic information of the speech signal and another of the plurality of quantized vectors includes acoustic information of the speech signal.
- 2. The computer-implemented method of claim 1, wherein the plurality of vector quantizers are arranged sequentially, and generating the plurality of quantized vectors comprises: generating a first quantized vector containing semantic information of the speech signal using a first vector quantizer in the sequence of vector quantizers; and generating a second quantized vector containing acoustic information of the speech signal using a second vector quantizer in the sequence of vector quantizers.
- 3. The computer-implemented method of claim 2, wherein generating the first quantized vector using the first vector quantizer comprises: receiving the feature vector; and selecting, from among the code vectors in a first codebook of the first vector quantizer, the code vector with the shortest distance to the feature vector; and wherein generating the second quantized vector using the second vector quantizer comprises: receiving a residual vector computed from the difference between the feature vector and the selected code vector; and selecting, from among the code vectors in a second codebook of the second vector quantizer, the code vector with the shortest distance to the residual vector.
- 4. A training method performed by one or more computers for jointly training an encoder neural network and a plurality of vector quantizers, the method comprising: receiving training data comprising an input audio waveform containing a speech signal; processing the input audio waveform using the encoder neural network to generate a feature vector representing the input audio waveform; processing the feature vector using the plurality of vector quantizers to generate a plurality of quantized vectors, wherein one of the plurality of quantized vectors is a first quantized vector representing semantic information of the speech signal and another of the plurality of quantized vectors is a second quantized vector representing acoustic information of the speech signal; processing the input audio waveform using a first teacher model to generate a first latent vector representing the semantic information of the speech signal; processing the input audio waveform using a second teacher model to generate a second latent vector representing the acoustic information of the speech signal; and aligning the first quantized vector to the first latent vector and the second quantized vector to the second latent vector.
- 5. The training method of claim 4, wherein the plurality of vector quantizers are arranged sequentially, and processing the feature vector using the plurality of vector quantizers comprises: quantizing the feature vector using a first vector quantizer in the sequence of vector quantizers to generate the first quantized vector; and quantizing the difference between the feature vector and the first quantized vector using a second vector quantizer in the sequence of vector quantizers to generate the second quantized vector.
- 6. The training method of claim 4, wherein the first teacher model includes a HuBERT model and the first latent vector includes semantic information generated by the HuBERT model.
- 7. The training method of claim 4, wherein the second teacher model includes an ECAPA-TDNN model and the second latent vector includes acoustic information generated by the ECAPA-TDNN model.
- 8. The training method of claim 4, wherein aligning each of the quantized vectors to its corresponding latent vector comprises: determining a first distillation loss based on the difference between the first quantized vector and the first latent vector; determining a second distillation loss based on the difference between the second quantized vector and the second latent vector; determining the gradient of a loss function that includes the first distillation loss and the second distillation loss; and updating a parameter set of the encoder neural network, a codebook of the first vector quantizer, and a codebook of the second vector quantizer based on the gradient of the loss function.
- 9. The training method of claim 8, wherein the second distillation loss is a Kullback-Leibler divergence loss.
- 10. The training method of claim 4, wherein the encoder neural network and the plurality of vector quantizers are jointly trained with a multi-head attention neural network, and aligning the second quantized vector to the second latent vector comprises: processing the second latent vector using the multi-head attention neural network; and aligning the second latent vector processed by the multi-head attention neural network to the second quantized vector.
- 11. The training method of claim 8, wherein the encoder neural network and the plurality of vector quantizers are jointly trained with a decoder neural network, the training method further comprises processing the plurality of quantized vectors using the decoder neural network to generate an output audio waveform, the loss function includes a reconstruction loss determined based on the difference between the input audio waveform and the output audio waveform, and the updating further comprises updating a parameter set of the decoder neural network based on the gradient of the loss function.
- 12. The training method of claim 11, wherein the encoder neural network and the plurality of vector quantizers are jointly trained with a discriminator neural network, the training method further comprises: receiving, by the discriminator neural network, an audio waveform; and processing the audio waveform using the discriminator neural network to generate a discriminator score indicating the likelihood that the audio waveform is an audio waveform input to the encoder neural network or an audio waveform output by the decoder neural network; the loss function includes an adversarial loss determined based on the discriminator score; and the updating further comprises updating a parameter set of the discriminator neural network based on the gradient of the loss function.
- 13. The training method of claim 12, wherein the loss function includes a feature matching loss determined based on the difference between one or more intermediate outputs generated by the discriminator neural network when processing the input audio waveform and one or more intermediate outputs generated by the discriminator neural network when processing the output audio waveform.
- 14. The training method of claim 1, wherein the input audio waveform is a speech waveform containing noise.
- 15. An apparatus comprising: at least one memory storing instructions; and at least one processor, wherein the at least one processor, by executing the instructions, receives an input audio waveform containing a speech signal, generates one or more feature vectors representing the input audio waveform using an encoder neural network, and generates a plurality of quantized vectors for the feature vectors using a plurality of vector quantizers, wherein one of the plurality of quantized vectors includes semantic information of the speech signal and another of the plurality of quantized vectors includes acoustic information of the speech signal.
- 16. A computer program stored on a computer-readable recording medium for executing each process included in the method according to any one of claims 1 through 3 and claims 4 through 14.
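The sketches below illustrate, under stated assumptions, several of the mechanisms recited in the claims. Claims 2, 3, and 5 describe a sequential (residual) arrangement: the first quantizer selects the code vector nearest to the feature vector, and the second quantizer quantizes the residual between the feature vector and that selected code vector. A minimal sketch, assuming illustrative dimensions and random codebooks in place of trained ones:

```python
# Hedged sketch of the two-stage residual quantization recited in claims 2-3
# and 5 (dimensions and codebook sizes are illustrative assumptions).
import torch

def nearest_code(x: torch.Tensor, codebook: torch.Tensor):
    """Return the codebook entry with the shortest L2 distance to each frame."""
    dists = torch.cdist(x, codebook.expand(x.size(0), -1, -1))  # (B, T, K)
    idx = dists.argmin(dim=-1)                                  # (B, T)
    return codebook[idx], idx

B, T, D, K = 2, 50, 256, 1024
feats = torch.randn(B, T, D)          # feature vectors from the encoder
codebook_1 = torch.randn(K, D)        # first codebook (semantic stream)
codebook_2 = torch.randn(K, D)        # second codebook (acoustic stream)

q1, semantic_tokens = nearest_code(feats, codebook_1)      # nearest code vector
residual = feats - q1                                       # residual vector
q2, acoustic_tokens = nearest_code(residual, codebook_2)    # quantize the residual

reconstruction_input = q1 + q2        # a decoder would consume the summed streams
```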
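Claims 4 through 9 describe joint training in which a first teacher model (a HuBERT model, claim 6) provides a semantic target and a second teacher model (an ECAPA-TDNN model, claim 7) provides an acoustic target, with distillation losses whose gradient updates the encoder parameters and both codebooks. The sketch below treats the teachers as frozen black boxes whose outputs are assumed to be projected to the shape of the quantized streams; the cosine-based loss for the semantic stream, the loss weighting, and the optimizer are illustrative choices, while the KL-divergence form of the second loss follows claim 9.

```python
# Hedged sketch of one joint distillation training step (claims 4-9).
import torch
import torch.nn.functional as F

def distillation_step(encoder, vq_semantic, vq_acoustic,
                      hubert_teacher, ecapa_teacher, optimizer, waveform):
    feats = encoder(waveform)                 # feature vectors, (B, T, D)
    q_sem, _ = vq_semantic(feats)             # first quantized vector (semantic)
    q_ac, _ = vq_acoustic(feats - q_sem)      # second quantized vector (acoustic)

    with torch.no_grad():                     # teacher models are not updated
        z_sem = hubert_teacher(waveform)      # first latent vector (semantic)
        z_ac = ecapa_teacher(waveform)        # second latent vector (acoustic)
        # Both teacher outputs are assumed projected/upsampled to (B, T, D).

    # First distillation loss: align the semantic stream to the HuBERT target
    # (cosine similarity is an assumed choice, not specified by the claims).
    loss_sem = 1.0 - F.cosine_similarity(q_sem, z_sem, dim=-1).mean()

    # Second distillation loss (claim 9): Kullback-Leibler divergence between
    # distributions derived from the acoustic stream and the ECAPA-TDNN target.
    loss_ac = F.kl_div(F.log_softmax(q_ac, dim=-1),
                       F.softmax(z_ac, dim=-1), reduction="batchmean")

    loss = loss_sem + loss_ac                 # loss function of claim 8
    optimizer.zero_grad()
    loss.backward()                           # gradient of the loss function
    optimizer.step()                          # optimizer is assumed to hold the
    return loss.item()                        # encoder parameters and both codebooks
```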
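Claim 10 introduces a jointly trained multi-head attention neural network that processes the second latent vector before it is aligned to the second quantized vector. One plausible reading, sketched below, uses cross-attention in which the quantized acoustic frames act as queries over the teacher's acoustic embedding; the query/key/value roles, dimensions, and L1 alignment loss are assumptions for illustration.

```python
# Hedged sketch of claim 10: a multi-head attention module processes the
# acoustic teacher embedding before alignment (roles and dimensions assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, heads = 256, 4
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

q_ac = torch.randn(2, 50, dim)   # second quantized vector, (B, frames, dim)
z_ac = torch.randn(2, 1, dim)    # second latent vector from the acoustic teacher

# Process the teacher's latent vector with multi-head attention so that it is
# expanded to the frame rate of the quantized stream, then align the two.
processed, _ = attn(query=q_ac, key=z_ac, value=z_ac)   # (B, frames, dim)
alignment_loss = F.l1_loss(q_ac, processed)
```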
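Claims 11 through 13 add a decoder neural network that reconstructs an output waveform from the quantized vectors, a discriminator neural network that scores whether a waveform is a real encoder input or a decoder output, and a loss function combining reconstruction, adversarial, and feature-matching terms. The sketch below uses common formulations of these terms (L1 reconstruction, least-squares adversarial loss, L1 feature matching) and assumes the discriminator returns its score together with a list of intermediate outputs; the actual loss forms and architectures in the disclosure may differ.

```python
# Hedged sketch of the reconstruction, adversarial, and feature-matching terms
# of claims 11-13 (the specific loss formulations are assumed, not disclosed).
import torch
import torch.nn.functional as F

def generator_losses(decoder, discriminator, quantized, input_waveform):
    # Claim 11: decode the quantized vectors into an output waveform and
    # compare it with the input waveform (reconstruction loss).
    output_waveform = decoder(quantized)
    loss_recon = F.l1_loss(output_waveform, input_waveform)

    # Claim 12: the discriminator scores real inputs vs. decoder outputs; it is
    # assumed to return (score, list of intermediate feature maps).
    score_fake, feats_fake = discriminator(output_waveform)
    _, feats_real = discriminator(input_waveform)

    # Adversarial loss (least-squares form): push decoder outputs toward the
    # score the discriminator assigns to real input waveforms.
    loss_adv = ((score_fake - 1.0) ** 2).mean()

    # Claim 13: feature-matching loss over the discriminator's intermediate
    # outputs for the input waveform and for the reconstructed waveform.
    loss_fm = sum(F.l1_loss(f, r.detach()) for f, r in zip(feats_fake, feats_real))

    return loss_recon, loss_adv, loss_fm
```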
Description
Method and Apparatus for Speech Tokenization

The present disclosure relates to a method and apparatus for speech tokenization. More specifically, it relates to a method and apparatus capable of tokenizing the semantic and acoustic elements contained in a speech signal.

The following description merely provides background information related to the present embodiment and does not constitute prior art. Speech tokenization is the process of converting a continuous speech signal into discrete units (e.g., tokens). By tokenizing speech signals, speech data can be represented efficiently and used in various tasks (e.g., speech recognition, speech synthesis). Recently, multimodal LLMs capable of processing various inputs such as text, speech, audio, images, and video have been gaining attention. Token-based multimodal LLMs are designed to process and integrate individually tokenized signals from diverse domains, such as speech and images, together with text tokens within an integrated learning framework. Efficient speech tokenization is therefore required for token-based multimodal LLMs to understand and process speech. Recently, a method for tokenizing speech signals by applying a speech codec has been proposed. This method extracts key features of the speech signal through an encoder and quantizes the extracted features by considering semantic features modeled after the human vocal tract. Although inputting the tokens generated through this quantization into a multimodal LLM has improved its speech understanding performance, conventional methods do not consider information about the human vocal cords, which limits the multimodal LLM's ability to identify information such as the speaker's emotion, gender, and age. To further enhance the speech understanding performance of multimodal LLMs, a speech tokenization method is required that considers not only semantic features but also acoustic features modeled on vocal cord information.

FIG. 1 is a block diagram schematically illustrating a speech tokenization system according to one embodiment of the present disclosure.
FIG. 2 is an example diagram illustrating a learning system for jointly training an encoder neural network and a residual vector quantizer included in a speech tokenization system.
FIG. 3 is a flowchart illustrating the process of tokenizing speech according to one embodiment of the present disclosure.
FIG. 4 is a flowchart illustrating the process of jointly training an encoder neural network and a residual vector quantizer included in a speech tokenization system according to one embodiment of the present disclosure.
FIG. 5 is a table of experimental results comparing the speech signal reconstruction performance of a speech tokenization model according to one embodiment of the present disclosure, an EnCodec model, and a SpeechTokenizer model.
FIG. 6 is an example diagram showing spectrograms of original speech and of speech reconstructed from the original speech, to visually demonstrate the performance of the speech codec.
FIG. 7 is a conceptual diagram illustrating a method for performing voice conversion using a speech tokenization model according to one embodiment of the present disclosure.
FIG. 8 is a table of experimental results comparing the voice conversion performance of a speech tokenization model according to one embodiment of the present disclosure, a SpeechTokenizer model, and a FreeVC model.
FIG. 9 is a table of experimental results comparing the speech emotion recognition performance of a speech tokenization model according to one embodiment of the present disclosure, an EnCodec model, and a SpeechTokenizer model.
FIG. 10 is a table of experimental results comparing the automatic speech recognition performance of a multimodal LLM using a speech tokenization model according to one embodiment of the present disclosure with that of a multimodal LLM using a SpeechTokenizer model.
FIG. 11 is a schematic block diagram illustrating an exemplary computing device that can be used to implement a method or apparatus according to the present disclosure.

Some embodiments of the present disclosure are described in detail below with reference to the exemplary drawings. In assigning reference numerals to the components in the drawings, the same components are given the same reference numeral wherever possible, even when they appear in different drawings. Furthermore, in describing the present disclosure, detailed descriptions of related known components or functions are omitted where they could obscure the essence of the present disclosure. In describing the components of the embodiments of the present disclosure, terms such as first, second, i), ii), a), and b) may be used. These terms are intended only to distinguish one component from another; the nature, order, or sequence of the components is not limited by them. Whe