CN-116524907-B - End-to-end speaker recognition method based on multiscale SincNet and CGAN
Abstract
The invention belongs to the technical field of speaker recognition, and particularly discloses an end-to-end speaker recognition method based on multi-scale SincNet and CGAN. The invention introduces a multi-scale SincNet, which avoids the loss of important information incurred by hand-crafted feature conversion: the multi-scale SincNet captures low-level three-channel speech representations directly from the waveform using three custom filter banks, so that the SincGAN model better captures important narrowband speaker features. Experimental results show that the model of the invention performs better on the TIMIT and LibriSpeech corpora, and exhibits stronger robustness than the baseline method when training data is scarce.
Inventors
- Wei Guangcun
- Zhang Yanna
- Guo Boyan
- Min Hang
- Xu Yunfei
Assignees
- Shandong University of Science and Technology
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2023-04-28
Claims (6)
- 1. An end-to-end speaker recognition method based on multi-scale SincNet and CGAN, characterized in that the method comprises the following steps: Step 1, performing a voice-framing preprocessing operation on an input original voice signal to obtain voice frames, and taking the voice frames as real voice samples; Step 2, constructing a speaker recognition model SincGAN; the speaker recognition model SincGAN consists of a generator network and a discriminator network; the generator network comprises a multi-scale SincNet layer, three convolution layers, two transposed convolution layers and an adaptive average pooling layer; the multi-scale SincNet layer in the generator network is defined as the first multi-scale SincNet layer, the three convolution layers in the generator network are defined as the first, second and third convolution layers respectively, and the two transposed convolution layers as the first and second transposed convolution layers respectively; the processing flow of a real voice sample in the generator network is as follows: the real voice sample first undergoes feature extraction through the first multi-scale SincNet layer to obtain a two-dimensional feature of the voice signal, and the two-dimensional feature then passes sequentially through the first convolution layer, the second convolution layer, the first transposed convolution layer, the second transposed convolution layer, the third convolution layer and the adaptive average pooling layer to generate a false voice sample; the discriminator network comprises a multi-scale SincNet layer, five convolution layers, three bottleneck residual block stacking layers and four fully connected layers; the multi-scale SincNet layer in the discriminator network is defined as the second multi-scale SincNet layer; the five convolution layers in the discriminator network are defined as the fourth, fifth, sixth, seventh and eighth convolution layers respectively; the three bottleneck residual block stacking layers in the discriminator network are defined as the first, second and third bottleneck residual block stacking layers respectively, and the four fully connected layers as the first, second, third and fourth fully connected layers respectively; the processing flow of the real and false voice samples in the discriminator network is as follows: the real voice sample and the false voice sample first undergo feature extraction through the second multi-scale SincNet layer to obtain two-dimensional features of the voice signal, which then pass sequentially through the fourth convolution layer, the first bottleneck residual block stacking layer, the fifth convolution layer, the second bottleneck residual block stacking layer, the sixth convolution layer, the third bottleneck residual block stacking layer, the seventh convolution layer, the eighth convolution layer, the first fully connected layer and the second fully connected layer; the output of the second fully connected layer is divided into two paths: one path outputs a true/false flag through the third fully connected layer, and the other path outputs an N-dimensional vector through the fourth fully connected layer, where the N-dimensional vector corresponds to the speaker class labels of the real voice samples; the N-dimensional vector output by the discriminator network is fed into a Softmax function, which maps it onto a probability distribution, and the speaker label of the class predicted with the highest probability is taken as the prediction output; Step 3, training the speaker recognition model SincGAN constructed in Step 2 with the training samples from Step 1, optimizing the model parameters by back-propagation to minimize the objective function to obtain a trained speaker recognition model SincGAN, and testing the trained speaker recognition model SincGAN with the test samples; Step 4, predicting a given voice signal with the trained SincGAN and outputting the corresponding speaker label.
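For concreteness, the following is a minimal PyTorch sketch of the generator's layer order as recited in claim 1. The channel counts, kernel sizes, strides, activations and frame length are illustrative assumptions (the claim does not fix them), and `front_end` stands in for the multi-scale SincNet layer sketched under claim 2 below.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Layer order from claim 1: multi-scale SincNet -> conv1 -> conv2 ->
    transposed conv1 -> transposed conv2 -> conv3 -> adaptive average pooling.
    All channel counts / kernel sizes here are illustrative assumptions."""
    def __init__(self, front_end: nn.Module, frame_len: int = 3200, ch: int = 64):
        super().__init__()
        self.front_end = front_end  # e.g. the MultiScaleSinc module sketched under claim 2
        self.body = nn.Sequential(
            nn.Conv2d(3, ch, kernel_size=3, stride=2, padding=1),              # first convolution layer
            nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch * 2, kernel_size=3, stride=2, padding=1),         # second convolution layer
            nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(ch * 2, ch, kernel_size=4, stride=2, padding=1),  # first transposed conv
            nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(ch, ch, kernel_size=4, stride=2, padding=1),      # second transposed conv
            nn.LeakyReLU(0.2),
            nn.Conv2d(ch, 1, kernel_size=3, padding=1),                        # third convolution layer
            nn.AdaptiveAvgPool2d((1, frame_len)),                              # adaptive average pooling
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        feat = self.front_end(frame)           # (B, 3, C, L) two-dimensional feature
        fake = self.body(feat)                 # (B, 1, 1, frame_len)
        return fake.view(frame.size(0), -1)    # false voice frame, same length as the input frame
```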
- 2. The end-to-end SincGAN speaker identification method as defined in claim 1, wherein the first multi-scale SincNet layer and the second multi-scale SincNet layer have the same structure; each multi-scale SincNet consists of a layer normalization and three different parallel branches, each parallel branch comprising a SincNet layer, a batch normalization layer and a one-dimensional adaptive average pooling layer; the filters in the SincNet layers on the parallel branches have different kernel lengths; the feature extraction process of a speech frame in the multi-scale SincNet is as follows: the voice frame is first layer-normalized and then enters the SincNet filters of the three parallel branches, which learn speaker feature maps with different frequency resolutions respectively; the speaker feature maps output by the SincNet filters on each parallel branch are then processed by the batch normalization layer and the one-dimensional adaptive average pooling layer on that branch; finally, the one-dimensional speaker feature maps output by the three parallel branches are stacked into a two-dimensional speaker feature map.
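A minimal sketch of this front end follows, assuming a simplified sinc band-pass convolution in place of a full SincNet implementation; the kernel lengths (51, 101, 251), channel count, pooled length and frame length are assumed for illustration and are not specified in the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv1d(nn.Module):
    """Simplified sinc band-pass filter bank (after Ravanelli & Bengio's SincNet):
    each output channel is a learnable band-pass built from two windowed sinc low-passes."""
    def __init__(self, out_channels: int, kernel_size: int, sample_rate: int = 16000):
        super().__init__()
        assert kernel_size % 2 == 1, "odd kernel for a symmetric filter"
        self.kernel_size, self.sample_rate = kernel_size, sample_rate
        # learnable low cutoff and bandwidth per channel, initialised on a linear scale
        self.low_hz = nn.Parameter(torch.linspace(30, sample_rate / 2 - 200, out_channels).unsqueeze(1))
        self.band_hz = nn.Parameter(torch.full((out_channels, 1), 100.0))
        n = (kernel_size - 1) // 2
        self.register_buffer("t", torch.arange(-n, n + 1).float().view(1, -1) / sample_rate)
        self.register_buffer("window", torch.hamming_window(kernel_size).view(1, -1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 1, T)
        low = self.low_hz.abs()
        high = (low + self.band_hz.abs()).clamp(max=self.sample_rate / 2)
        lp = lambda f: 2 * f * torch.sinc(2 * f * self.t)       # ideal low-pass at cutoff f
        filters = (lp(high) - lp(low)) * self.window            # (C, K) windowed band-pass
        return F.conv1d(x, filters.unsqueeze(1), padding=self.kernel_size // 2)

class MultiScaleSinc(nn.Module):
    """Claim 2's front end: layer norm, then three parallel SincNet branches with
    different kernel lengths, each followed by batch norm and 1-D adaptive average
    pooling; the branch outputs are stacked into a 3-channel 2-D feature map."""
    def __init__(self, frame_len: int = 3200, channels: int = 64,
                 kernel_sizes=(51, 101, 251), pooled_len: int = 400):
        super().__init__()
        self.norm = nn.LayerNorm(frame_len)
        self.branches = nn.ModuleList(
            nn.Sequential(SincConv1d(channels, k), nn.BatchNorm1d(channels),
                          nn.AdaptiveAvgPool1d(pooled_len))
            for k in kernel_sizes
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:       # frame: (B, frame_len)
        x = self.norm(frame).unsqueeze(1)                         # (B, 1, frame_len)
        return torch.stack([b(x) for b in self.branches], dim=1)  # (B, 3, channels, pooled_len)
```

Stacking the three pooled branch outputs along a new dimension yields the three-channel two-dimensional feature consumed by the generator and discriminator of claim 1, matching the abstract's "three-channel low-level representation" reading.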
- 3. The end-to-end SincGAN speaker identification method as defined in claim 1, wherein N speakers are defined, each speaker has m utterances, and the utterance set of the kth speaker is $S_k=\{h_1,h_2,\dots,h_m\}$, $k=1,2,\dots,N$, where $h_1,h_2,\dots,h_m$ denote the m different utterances; the speaker recognition model SincGAN takes speech frames as input data, and an utterance x input into the speaker recognition model SincGAN is divided into n fixed-length speech frames $\{x_1,x_2,\dots,x_n\}$; each speech frame is fed sequentially into SincGAN, and the last layer of the discriminator network uses a Softmax layer to output $P(S_k \mid x_i)$, i.e., the probability that the ith speech frame comes from the kth speaker $S_k$ of the N speakers; the estimated speaker identity of the whole utterance x is the speaker label with the highest accumulated probability: $\hat{k} = \arg\max_{k\in\{1,\dots,N\}} \sum_{i=1}^{n} P(S_k \mid x_i)$.
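A short sketch of this frame-level aggregation (the function name and tensor shapes are illustrative, not from the patent):

```python
import torch

def predict_speaker(frame_logits: torch.Tensor) -> int:
    """frame_logits: (n, N) discriminator class outputs for the n frames of one utterance.
    Returns the index k maximizing the summed frame posteriors P(S_k | x_i)."""
    frame_probs = torch.softmax(frame_logits, dim=-1)    # Softmax per frame
    return int(frame_probs.sum(dim=0).argmax().item())   # arg max over the N speakers
```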
- 4. The end-to-end SincGAN speaker identification method as defined in claim 1, wherein in Step 3, during training, the Wasserstein distance is used to measure the distance between the features of the real voice samples and the false voice samples, and the least-squares loss is used to further correct the objective function of SincGAN; the objective function $L_G$ of the generator G is given by formula (1): $L_G = \mathbb{E}_{x\sim p_r(x)}\big[(D(G(x))-1)^2\big] + \omega\,\lVert x - G(x)\rVert_1$ (1), where x denotes a true sample, i.e., a real voice sample, G(x) denotes the sample generated by the generator, i.e., a false voice sample, $p_r(x)$ denotes the distribution of the real samples, $\mathbb{E}_{x\sim p_r(x)}[(D(G(x))-1)^2]$ is the expected squared deviation from 1 of the probability with which the discriminator judges a false voice sample to be real, and $\mathbb{E}(\cdot)$ denotes the expectation; by adding an L1 norm to the generator G, the distance between the generated sample and the real sample is minimized so that the generated sample is more realistic, and the weight of the L1 norm is controlled by the hyperparameter ω; the objective function $L_D$ of the discriminator D is given by formula (2): $L_D = \mathbb{E}_{x\sim p_r(x)}\big[(D(G(x)))^2\big] + \mathbb{E}_{x\sim p_r(x)}\big[(D(x)-1)^2\big] + Loss_D$ (2), where D(x) denotes the probability that x is a true sample, the first term is the expected square of the probability with which the discriminator judges a false voice sample to be real, and the second term is the expected squared deviation from 1 of the discriminator's output on real voice samples; $Loss_D$ is the class-wise cross-entropy penalty over all speaker classes, as shown in formula (3): $Loss_D = -\mathbb{E}_{(x,y)\sim P_r(x,y)}\big[\textstyle\sum_{i=1}^{N} y_i \log k_i\big]$ (3), where $\mathbb{E}_{(x,y)\sim P_r(x,y)}[\cdot]$ denotes the expectation over (x, y) drawn from the distribution $P_r(x,y)$; x and y denote a real voice sample and its label respectively, $y_i$ is the ith component of the true label vector $y\in\mathbb{R}^N$ formed for each speech frame, $k_i$ is the probability with which the model predicts that y belongs to the ith class, and N is the total number of speakers.
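A hedged PyTorch sketch of these objectives as reconstructed above; the default ω and the exact way the cross-entropy term enters the discriminator objective are assumptions:

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake: torch.Tensor, real: torch.Tensor, fake: torch.Tensor,
                   omega: float = 10.0) -> torch.Tensor:
    """Formula (1): least-squares adversarial term plus an omega-weighted L1 term."""
    adv = ((d_fake - 1.0) ** 2).mean()              # E[(D(G(x)) - 1)^2]
    return adv + omega * F.l1_loss(fake, real)      # omega * ||x - G(x)||_1

def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor,
                       class_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Formulas (2)-(3): least-squares adversarial terms plus speaker cross-entropy."""
    adv = (d_fake ** 2).mean() + ((d_real - 1.0) ** 2).mean()
    return adv + F.cross_entropy(class_logits, labels)   # Loss_D over N speaker classes
```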
- 5. A computer device comprising a memory and one or more processors, the memory having executable code stored therein, wherein the processor, when executing the executable code, performs the steps of the multi-scale SincNet and CGAN based end-to-end speaker recognition method of any one of claims 1 to 4.
- 6. A computer readable storage medium having stored thereon a program, wherein the program when executed by a processor implements the steps of the multi-scale SincNet and CGAN based end-to-end speaker recognition method according to any one of claims 1 to 4.
Description
End-to-end speaker recognition method based on multiscale SincNet and CGAN

Technical Field

The invention belongs to the technical field of speaker recognition, and particularly relates to an end-to-end speaker recognition method based on multi-scale SincNet and CGAN.

Background

Speaker recognition (Speaker Identification, SI) is a biometric technique that determines, from a piece of speech, which of a number of known speakers is speaking, corresponding to a one-to-many selection relationship. This task is very challenging for humans. Speaker recognition is widely used in many fields, mainly because voiceprints have advantages such as easy acquisition, non-contact collection and stable characteristics. With the development of deep learning, deep neural networks have excelled at feature extraction and model classification, pointing out a new direction for the further development of speaker recognition technology. Recently, various recognition models have been proposed that achieve fairly high accuracy, but challenges remain in practical industrial applications, for example short-utterance recognition, dialect recognition and disordered-speech recognition. Short speech results in poor robustness of SI systems because sufficient discriminative information cannot be extracted. In addition, under the constraint of limited training data, overfitting occurs because enough effective speaker characteristic parameters cannot be extracted. In particular, data-driven modeling methods based on deep neural networks require massive amounts of training data; however, owing to the limitations of the practical environment, large amounts of user voice data cannot easily be obtained, and sufficient information representing the speaker's characteristics cannot be extracted. The traditional speaker recognition pipeline is complex and has a low recognition rate; it comprises voice signal preprocessing, acoustic feature extraction, classification model construction and learning model evaluation. Generally, building and applying a speaker recognition system requires two phases, training and testing, and in both phases the input raw signal must first be preprocessed and features extracted. In terms of feature extraction, most attempts are designed around hand-crafted features such as Mel-frequency cepstral coefficients (MFCCs) and Mel filter-bank coefficients (FBank). Reynolds et al. trained a Gaussian mixture model-universal background model (Gaussian Mixture Model-Universal Background Model, GMM-UBM) on extracted acoustic features to alleviate the data sparseness problem. To address the drop in recognition performance caused by channel interference, Campbell et al. in 2006 added a Support Vector Machine (SVM) to the GMM-UBM, effectively improving recognition performance, and Kenny et al. in the following year studied joint factor analysis in depth, extracting only speaker-related characteristics to overcome the influence of channel variability. In 2010, Dehak et al. proposed mapping speech onto a fixed low-dimensional vector, i.e., representing a given utterance with an i-vector, which improved the robustness and generalization capability of SI systems.
With the development of Deep Neural Networks (DNNs), researchers have begun to favor DNN-based methods over traditional ones. However, DNN-based data-driven modeling relies on large-scale training data and is easily limited by the real-world environment, where a large amount of user voice data cannot be acquired. In addition, the hand-crafted features often used for deep learning may lose important information during the conversion process, reducing recognition performance.

Disclosure of Invention

The invention aims to provide an end-to-end speaker recognition method based on multi-scale SincNet and CGAN, characterized in that the input original waveform is recognized directly by introducing a multi-scale SincNet, so that important information is not lost during hand-crafted feature conversion, while end-to-end recognition is performed with a conditional generative adversarial network (CGAN), so that a speaker can be recognized from only a few training sentences. In order to achieve the above purpose, the invention adopts the following technical scheme: an end-to-end speaker recognition method based on multi-scale SincNet and CGAN, comprising the steps of: Step 1, performing a voice-framing preprocessing operation on an input original voice signal to obtain voice frames, and taking the voice frames as real voice samples; Step 2, constructing a speaker recognition model SincGAN; the speaker recognition model SincGAN consists of a generator network and a discriminator network