KR-20260064399-A - SPEAKER IDENTIFICATION SYSTEM AND METHOD FOR SEPARATING SPEAKER INFORMATION EXPRESSION AND ENVIRONMENT INFORMATION EXPRESSION FROM VOICE DATA
Abstract
A speaker identification system and method for separating speaker information representations and environment information representations from voice data are disclosed. A speaker identification system according to one embodiment may include: a separation representation learning unit that separates speaker information and environment information from voice data through an autoencoder; and an environment information removal unit that removes residual environment information from the separated speaker information to recognize a speaker from the separated speaker information through a speaker identification network.
Inventors
- 정준선
- 남기현
Assignees
- 한국과학기술원
Dates
- Publication Date
- 20260507
- Application Date
- 20241112
- Priority Date
- 20241031
Claims (15)
- A speaker identification system comprising: a separation representation learning unit that separates speaker information and environment information from voice data through an autoencoder; and an environment information removal unit that removes residual environment information from the separated speaker information so that a speaker is recognized from the separated speaker information through a speaker identification network.
- The speaker identification system of claim 1, wherein the separation representation learning unit compresses speaker embeddings extracted from the voice data into a latent representation through an encoder of the autoencoder.
- The speaker identification system of claim 2, wherein the separation representation learning unit separates the compressed latent representation into speaker information and environment information.
- The speaker identification system of claim 2, wherein the separation representation learning unit calculates a distance between a decoder output and the speaker embedding through a reconstruction loss function so that only essential speaker information is retained while the compressed latent representation is restored through a decoder of the autoencoder.
- The speaker identification system of claim 1, wherein speaker embeddings are extracted from the voice data through a speaker embedding extractor, and the extracted speaker embeddings are input to the autoencoder.
- The speaker identification system of claim 1, wherein the separation representation learning unit learns voice data of the same speaker collected from different environments by applying a code swapping technique.
- The speaker identification system of claim 1, wherein the separation representation learning unit trains at least one speaker discriminator and a plurality of environment discriminators to learn detailed speaker data and detailed environment data within the speaker information and the environment information output through the encoder of the autoencoder.
- The speaker identification system of claim 7, wherein the at least one speaker discriminator calculates a speaker classification loss using the speaker information for speaker recognition learning.
- The speaker identification system of claim 7, wherein the plurality of environment discriminators includes a first environment discriminator and a second environment discriminator, and the first environment discriminator passes the environment information to an environment classifier to calculate an environment loss function for environment recognition learning.
- The speaker identification system of claim 9, wherein the second environment discriminator captures residual environment information from the speaker information.
- The speaker identification system of claim 1, wherein the environment information removal unit removes, using an adversarial learning technique, residual environment information that may remain within the speaker information.
- The speaker identification system of claim 11, wherein the environment information removal unit learns the speaker information so as to remove environment information that confuses the second environment discriminator, using a gradient reversal layer (GRL).
- The speaker identification system of claim 1, wherein the environment information removal unit minimizes the correlation between the speaker information and the environment information using a mean absolute Pearson correlation (MAPC) loss.
- A speaker identification method performed by a speaker identification system, the method comprising: separating speaker information and environment information from voice data through an autoencoder; and removing residual environment information from the separated speaker information so that a speaker is recognized from the separated speaker information through a speaker identification network.
- A computer program stored on a computer-readable storage medium to execute a speaker identification method performed by a speaker identification system, the speaker identification method comprising: separating speaker information and environment information from voice data through an autoencoder; and removing residual environment information from the separated speaker information so that a speaker is recognized from the separated speaker information through a speaker identification network.
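The adversarial removal mechanisms recited in claims 11 to 13, a gradient reversal layer and a mean absolute Pearson correlation (MAPC) loss, can be sketched as follows. This is an illustrative NumPy reconstruction, not the patented implementation: the function names, the pairing of embedding dimensions, and the reversal scale `lam` are assumptions.

```python
import numpy as np

def mapc_loss(spk, env):
    """Mean absolute Pearson correlation between every pair of
    speaker-embedding and environment-embedding dimensions.
    spk, env: arrays of shape (batch, dim). A value near 0 means the
    two representations are (linearly) decorrelated."""
    spk_c = spk - spk.mean(axis=0)
    env_c = env - env.mean(axis=0)
    spk_n = spk_c / (np.linalg.norm(spk_c, axis=0) + 1e-8)
    env_n = env_c / (np.linalg.norm(env_c, axis=0) + 1e-8)
    corr = spk_n.T @ env_n  # (dim_spk, dim_env) Pearson correlation matrix
    return float(np.abs(corr).mean())

def grl_backward(upstream_grad, lam=1.0):
    """Gradient reversal layer: the forward pass is the identity, and the
    backward pass multiplies the incoming gradient by -lam, so the encoder
    is pushed to *confuse* the environment discriminator it feeds."""
    return -lam * upstream_grad
```

In training, `mapc_loss` would be added to the total objective while `grl_backward` sits between the speaker representation and the second environment discriminator, which is how the claims describe removing residual environment information adversarially.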
Description
The following description concerns a technology for separating speaker information and environment information from voice data.

With the rapid increase in voice-based services, speaker recognition technology is becoming an essential element in various application fields. Speaker recognition technologies used in smart home devices, voice assistants, and security systems are required to deliver accurate performance in diverse environments. However, since not all users provide voice data under identical conditions, existing speaker recognition systems struggle to maintain accurate identification in settings such as noisy streets, heavily reverberant indoor spaces, and vehicle interiors. Existing speech recognition and speaker identification systems therefore suffer from degraded speaker identification performance in noisy environments. Because voice data contains not only information unique to the speaker but also noise and reverberation from the recording environment, speaker recognition accuracy can vary significantly with recording conditions. To solve this problem, a technology is required that separates environment information from speaker information and processes each independently.

FIG. 1 is a diagram illustrating the operation of separating voice data into speaker information and environment information in one embodiment. FIG. 2 is a diagram illustrating the operation of performing speaker recognition in one embodiment. FIG. 3 is a block diagram illustrating a speaker identification system in one embodiment. FIG. 4 is a flowchart illustrating a speaker identification method in one embodiment.

Hereinafter, embodiments will be described in detail with reference to the attached drawings.
In the embodiments, a speaker identification operation based on separation representation learning, which separates voice data into speaker information and environment information, is described. Only the unique voice features of a speaker are extracted, regardless of the environment in which the utterance occurred, enabling a speaker recognition model that is robust to environmental changes. For example, the recognition result can remain consistent whether the same speaker speaks in a quiet office or on a noisy street, which significantly improves the stability of the service. The present invention thus aims to improve the reliability and accuracy of voice-based systems in various environments through a speaker identification system that is robust to environmental variation in voice data, thereby improving the user experience, expanding the scope of application of speaker identification technology, and contributing to the construction of safe and efficient voice-based systems.

FIG. 1 is a diagram illustrating the operation of separating voice data into speaker information and environment information in one embodiment. FIG. 1 depicts a separation representation learning framework. A speaker identification system can separate voice data of the same speaker, recorded in the same environment and/or in different environments, into speaker information and environment information. Using voice data acquired through a dedicated batch sampling method, the speaker identification system trains an autoencoder to compress and reconstruct input embeddings, while four additional objective functions (L_env_env, L_env_spk, L_corr, and L_spk) allow environment noise to be separated effectively during learning while preserving important speaker characteristics.
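The compress-split-reconstruct structure described above can be sketched as a toy linear autoencoder. This is a minimal illustration only: the layer sizes, the even split of the latent code into speaker and environment halves, and the use of a plain mean-squared reconstruction loss are assumptions, not details taken from the patent.

```python
import numpy as np

class SeparationAutoencoder:
    """Toy linear autoencoder that compresses a speaker embedding into a
    latent code and splits that code into a speaker part and an
    environment part, mirroring the separation representation idea."""

    def __init__(self, emb_dim=192, latent_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.1, size=(emb_dim, latent_dim))
        self.W_dec = rng.normal(scale=0.1, size=(latent_dim, emb_dim))
        self.half = latent_dim // 2  # assumed even split

    def encode(self, e):
        # e: (batch, emb_dim) speaker embeddings from an external extractor
        z = e @ self.W_enc
        return z[:, :self.half], z[:, self.half:]  # (speaker, environment)

    def decode(self, z_spk, z_env):
        return np.concatenate([z_spk, z_env], axis=1) @ self.W_dec

    def reconstruction_loss(self, e):
        # distance between decoder output and the original embedding,
        # analogous to the reconstruction loss in claim 4
        z_spk, z_env = self.encode(e)
        return float(np.mean((self.decode(z_spk, z_env) - e) ** 2))
```

A real system would train `W_enc` and `W_dec` jointly with the discriminator losses (L_spk, L_env_env, L_env_spk) and the correlation term (L_corr); the sketch only shows the data path.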
The speaker identification system can perform batch construction with data augmentation. In the embodiments, three voice data samples are used as an example to aid understanding. Each mini-batch is composed of three voice data samples, namely first voice data, second voice data, and third voice data, where i denotes the mini-batch index. Although all three voice data samples are extracted from the same speaker, the first and second voice data may be extracted from the same video, while the third voice data may be extracted from a different video. This setup ensures that the first and second voice data reflect the same environmental conditions, while the third voice data introduces a separate environment. The novelty of the batch configuration stems from data augmentation: the same augmentation technique is applied to the first and second voice data, while a different augmentation technique is applied to the third voice data, so that the first voice data
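The batch sampling described above can be sketched as follows. The data layout (a mapping from speaker to video to utterances), the function name, and the augmentation callables are all illustrative assumptions; the patent does not specify this interface.

```python
import numpy as np

def build_batch(utterances, augment_a, augment_b, rng):
    """For each speaker, sample two utterances from the same video (same
    environment, same augmentation) and one from a different video
    (different environment, different augmentation), as in the described
    mini-batch construction. `utterances` maps speaker -> video -> list
    of waveform arrays."""
    batch = []
    for speaker, videos in utterances.items():
        vid_ids = list(videos)
        # pick one video for the shared-environment pair, one for the rest
        v_same, v_diff = rng.choice(vid_ids, size=2, replace=False)
        i1, i2 = rng.choice(len(videos[v_same]), size=2, replace=True)
        i3 = rng.choice(len(videos[v_diff]))
        batch.append((
            augment_a(videos[v_same][i1]),  # first voice data
            augment_a(videos[v_same][i2]),  # second: same env, same aug
            augment_b(videos[v_diff][i3]),  # third: new env, different aug
        ))
    return batch
```

Note that each speaker needs at least two videos for `replace=False` to succeed; a production sampler would also handle speakers with a single video.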