CN-121983066-A - Speaker identity authentication method based on dynamic feature confusion and decoupling attention
Abstract
The invention discloses a speaker identity authentication method based on dynamic feature confusion and decoupling attention. The method comprises: first extracting features of an original voice signal to generate an original feature map F; generating a dynamic confusion mask M and fusing it with F element by element to obtain a confusion feature map F'; inputting the confusion feature map F' or the original feature map F into a deep neural network encoder to extract a high-dimensional speaker feature vector V; in the verification stage, inputting the feature vector of the voice to be verified and the feature vector of the registered voice into a decoupling attention matching module and calculating a decoupling similarity score; and judging whether the voice to be verified and the registered voice come from the same speaker according to whether the decoupling similarity score exceeds a preset threshold. By separating the confusion effect through the decoupling attention module in the authentication stage, the invention realizes accurate identity matching and improves the recognition performance and anti-interference capability of the system in complex environments.
Inventors
- WANG QINGYUN
- Lv Shichun
- LIANG RUIYU
- XIE YUE
- Wang Sunyi
Assignees
- Nanjing Institute of Technology (南京工程学院)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-02-26
Claims (10)
- 1. A speaker identity authentication method based on dynamic feature confusion and decoupling attention, characterized by comprising the following steps: S1, extracting features of an original voice signal to generate an original feature map F; S2, generating a dynamic confusion mask M of the same size as the original feature map F and fusing it with F element by element to obtain a confusion feature map F'; S3, inputting the confusion feature map F' or the original feature map F into a deep neural network encoder and extracting a high-dimensional speaker feature vector V; S4, in the verification stage, inputting the feature vector V_verify of the voice to be verified and the feature vector V_enroll of the registered voice into a decoupling attention matching module, wherein the module first computes an attention weight matrix A, which focuses on the channel dimensions most critical for distinguishing speakers while suppressing the random interference introduced by dynamic confusion during training, and then computes a decoupling similarity score S between V_verify and V_enroll according to the attention weight matrix A; and S5, judging whether the voice to be verified and the registered voice come from the same speaker according to whether the decoupling similarity score S exceeds a preset threshold.
- 2. The speaker identity authentication method based on dynamic feature confusion and decoupling attention according to claim 1, wherein step S2 comprises: randomly selecting K rectangular areas in the time-frequency two-dimensional space of the acoustic feature map; assigning to the mask values inside each rectangular area a random constant uniformly distributed in [β, 1], where 0 < β < 1, and setting the mask values outside the areas to 1, so as to simulate local random loss or disturbance of the acoustic features in the time-frequency domain; thereby generating a dynamic confusion mask M of the same size as the original feature map F, with mask values distributed between 0 and 1; and fusing the mask M with F element by element to obtain the confusion feature map F' = F ⊙ (αM + (1 − α)), where ⊙ denotes element-wise multiplication and α is a controllable confusion intensity coefficient.
- 3. The speaker identity authentication method based on dynamic feature confusion and decoupling attention according to claim 1, wherein the number K, size, and location of the rectangular areas change dynamically in each training batch, and the range of variation is adaptively adjusted according to the average energy distribution of the speech, so that high-energy time-frequency regions are selected for confusion with higher probability.
- 4. The speaker identity authentication method based on dynamic feature confusion and decoupling attention according to claim 1, wherein in step S3 the deep neural network encoder is a time delay neural network (TDNN) or a convolutional neural network (CNN), and in the training phase the dynamic confusion processing step is inserted before the first convolutional or fully-connected layer of the encoder.
- 5. The speaker identity authentication method based on dynamic feature confusion and decoupling attention according to claim 1, wherein step S4 specifically comprises the following steps: S4.1, given the feature vector to be verified V_verify ∈ R^(d×1) and the registered feature vector V_enroll ∈ R^(d×1), where d is the feature vector dimension, concatenating the two feature vectors along the dimension axis to obtain a fused feature vector V_cat = Concat(V_verify, V_enroll) ∈ R^(2d×1), and inputting V_cat into a lightweight fully-connected network to generate a channel attention weight vector W matched to the number of feature channels; S4.2, inputting the weight vector W into a Sigmoid activation function for normalization to obtain a normalized attention weight W_norm with values in [0, 1], the normalization formula being W_norm = σ(W) = 1 / (1 + e^(−W)), where σ(·) is the Sigmoid activation function and e is the natural constant; W_norm has the same dimension as the original weight vector W, and each of its element values represents the importance of the corresponding feature channel for speaker identity discrimination; S4.3, calculating the decoupled feature vectors V_verify′ = V_verify ⊙ W_norm and V_enroll′ = V_enroll ⊙ W_norm, where ⊙ denotes element-wise multiplication; S4.4, calculating the decoupling similarity score S = cosine_similarity(V_verify′, V_enroll′), where S is the decoupling similarity score and cosine similarity, the core metric for measuring the directional similarity of two same-dimensional vectors, is used to compute the similarity of the decoupled feature vectors; the result directly reflects the identity matching degree of the voice to be verified and the registered voice, with a value range of [−1, 1]: the closer the value is to 1, the more consistent the directions of the two feature vectors and the higher the speaker identity matching degree; the closer the value is to 0, the lower the matching degree.
- 6. The speaker identity authentication method based on dynamic feature confusion and decoupling attention according to claim 5, wherein in step S4.1 the lightweight fully-connected network is optimized with a contrastive learning loss during training, the optimization objective being to maximize the decoupling similarity between feature vectors of different confusion samples of the same speaker and to minimize the decoupling similarity between feature vectors of different speakers, the contrastive loss formula being: L = −log( exp(sim(V_p′, V_q′)/τ) / ( exp(sim(V_p′, V_q′)/τ) + Σ_{n=1}^{N} exp(sim(V_p′, V_n′)/τ) ) ), where sim(·,·) is the decoupled cosine similarity; the contrastive learning loss value L is a non-negative number, and the smaller the loss, the stronger the feature discrimination of the model; V_p′ is the decoupled feature of the anchor sample, V_q′ is the decoupled feature of the positive sample, V_n′ is the decoupled feature of the n-th negative sample, and N is a positive integer giving the number of negative samples corresponding to the target sample in a single batch; τ is the temperature coefficient of contrastive learning, an adjustable hyperparameter that controls the sharpness of the similarity distribution.
- 7. A speaker identity authentication system based on dynamic feature confusion and decoupling attention for implementing the method of claim 1, comprising: a voice acquisition module for acquiring voice signals of a user; a memory for storing voiceprint feature vectors of registered speakers and a computer program implementing the method; a processor for executing the computer program and performing processing and authentication decisions on the acquired voice; and an output module for outputting the authentication result; wherein the system is integrated in any one of an intelligent door lock, a mobile payment device, a vehicle-mounted voice system, or a conference system identity verification terminal.
- 8. A computer device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method of claim 1.
- 9. A computer readable storage medium having stored thereon a computer program/instruction which when executed by a processor performs the steps of the method of claim 1.
- 10. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of claim 1.
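The decoupling attention matching of claim 5 (steps S4.1 to S4.4) and the contrastive loss of claim 6 can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the patented implementation: the patent does not fix the architecture of the "lightweight fully-connected network", so a two-layer ReLU network with hypothetical parameters `w1, b1, w2, b2` stands in for it, and the contrastive loss follows the standard InfoNCE form implied by the anchor/positive/negative and temperature definitions.

```python
import numpy as np

def sigmoid(x):
    # Elementwise logistic function: sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def decoupled_match(v_verify, v_enroll, w1, b1, w2, b2):
    """Sketch of the decoupling attention matching module (steps S4.1-S4.4).

    v_verify, v_enroll : (d,) speaker embeddings
    w1, b1, w2, b2     : parameters of a hypothetical two-layer network
                         standing in for the 'lightweight fully-connected
                         network' of claim 5 (its real shape is unspecified).
    Returns (decoupling similarity score, normalized channel weights).
    """
    # S4.1: concatenate the two embeddings and generate raw channel weights
    v_cat = np.concatenate([v_verify, v_enroll])      # V_cat in R^(2d)
    hidden = np.maximum(0.0, w1 @ v_cat + b1)         # ReLU hidden layer
    w = w2 @ hidden + b2                              # (d,) raw weight vector W
    # S4.2: Sigmoid normalization to [0, 1]
    w_norm = sigmoid(w)
    # S4.3: decouple both embeddings with the shared channel weights
    v_ver_d = v_verify * w_norm
    v_enr_d = v_enroll * w_norm
    # S4.4: cosine similarity of the decoupled vectors, range [-1, 1]
    score = v_ver_d @ v_enr_d / (
        np.linalg.norm(v_ver_d) * np.linalg.norm(v_enr_d) + 1e-12)
    return score, w_norm

def contrastive_loss(v_p, v_q, negatives, tau=0.07):
    """InfoNCE-style contrastive loss of claim 6 (tau = temperature)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    pos = np.exp(cos(v_p, v_q) / tau)
    neg = sum(np.exp(cos(v_p, v_n) / tau) for v_n in negatives)
    return -np.log(pos / (pos + neg))
```

Because the same weight vector W_norm gates both embeddings, channels that the network deems unreliable are attenuated in both vectors before the cosine comparison; two identical embeddings therefore still score 1 regardless of the gating.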
Description
Speaker identity authentication method based on dynamic feature confusion and decoupling attention
Technical Field
The invention belongs to the technical field of voice signal processing and biometric recognition, and particularly relates to a speaker identity authentication method based on dynamic feature confusion and decoupling attention, suitable for scenarios such as voiceprint recognition, identity verification, and security access control.
Background
Speaker recognition is a biometric technology that performs identity authentication using the acoustic features of an individual's voice, with advantages such as contactless operation, ease of acquisition, and high user acceptance. In recent years, speaker recognition systems based on deep learning (e.g., x-vector, ECAPA-TDNN) have made significant progress. However, existing systems still face the following challenges: environmental and channel interference, where noise, reverberation, transmission channel changes, and other factors seriously affect the stability of voiceprint features; speaker state changes, such as the common cold, emotion, or aging, which cause acoustic feature drift; spoofing attacks, in which recording playback, speech synthesis, and similar attack means threaten system security; and insufficient feature robustness, where the model easily overfits during training to irrelevant features in the training data (such as background noise and device characteristics), reducing generalization capability.
In order to improve model robustness, existing methods often adopt data augmentation (such as adding noise, speed perturbation, and adding reverberation) or feature regularization techniques, but these methods can only simulate limited environmental variation and find it difficult to preserve the identity discrimination of the features while enhancing robustness. Therefore, an identity authentication method is needed that enhances the model's robustness to noise and interference while accurately recovering the speaker's essential characteristics in the authentication phase.
Disclosure of Invention
The invention aims to provide a speaker identity authentication method and terminal based on dynamic feature confusion and decoupling attention, which force the model to learn more robust essential speaker features by introducing a controllable dynamic feature confusion mechanism in the training stage, and realize accurate identity matching by separating the confusion effect through a decoupling attention module in the authentication stage, thereby improving the recognition performance and anti-interference capability of the system in complex environments.
The invention discloses a speaker identity authentication method based on dynamic feature confusion and decoupling attention, comprising the following steps. S1, extract features of the original voice signal to generate an original feature map F. S2, generate a dynamic confusion mask M of the same size as the original feature map F and fuse it with F element by element to obtain a confusion feature map F'. S3, input the confusion feature map F' or the original feature map F into a deep neural network encoder to extract a high-dimensional speaker feature vector V; the complete logic of S3 is that the training stage inputs only the confusion feature map F' (the dynamic confusion processing of step S2 is mandatory, being the core step of the training stage, aimed at making the model learn robust features), while the authentication stage inputs only the original feature map F (encoding uses the original features directly, without step S2, to avoid additional confusion affecting authentication accuracy). S4, in the verification stage, input the feature vector V_verify of the voice to be verified and the feature vector V_enroll of the registered voice into a decoupling attention matching module; the module first computes an attention weight matrix A, which focuses on the channel dimensions most critical for distinguishing speakers while suppressing the random interference introduced by dynamic confusion during training, and then computes a decoupling similarity score S between V_verify and V_enroll according to the attention weight matrix A. S5, judge whether the voice to be verified and the registered voice come from the same speaker according to whether the decoupling similarity score S exceeds a preset threshold.
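The dynamic feature confusion of step S2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the rectangle-size bounds are heuristic choices (the patent leaves the sampling scheme open, and claim 3 further adapts it to the speech energy distribution), while the [β, 1] constants and the fusion F' = F ⊙ (αM + (1 − α)) follow claim 2.

```python
import numpy as np

def dynamic_confusion(feat, k=3, beta=0.3, alpha=0.5, rng=None):
    """Sketch of the dynamic confusion mask of step S2 (not the patented code).

    feat  : (T, F) time-frequency acoustic feature map
    k     : number of randomly placed rectangular areas
    beta  : lower bound of the random mask constants, 0 < beta < 1
    alpha : controllable confusion intensity coefficient
    Returns (confused feature map F', mask M).
    """
    rng = rng or np.random.default_rng()
    t, f = feat.shape
    mask = np.ones((t, f))                 # mask value is 1 outside the areas
    for _ in range(k):
        # Heuristic rectangle bounds (assumption): up to a quarter of each axis.
        h = rng.integers(1, max(2, t // 4))
        w = rng.integers(1, max(2, f // 4))
        y = rng.integers(0, t - h + 1)
        x = rng.integers(0, f - w + 1)
        # One random constant per rectangle, uniform in [beta, 1] (claim 2).
        mask[y:y + h, x:x + w] = rng.uniform(beta, 1.0)
    # Element-wise fusion: F' = F * (alpha * M + (1 - alpha))
    return feat * (alpha * mask + (1.0 - alpha)), mask
```

Since αM + (1 − α) lies in [αβ + 1 − α, 1], the fusion only attenuates the selected time-frequency regions rather than zeroing them, which is what makes the perturbation "controllable" via α.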
Further, in step S2, K rectangular areas are selected randomly on a time-frequency two-dimensional space of the acoustic feature map, a random constant uniformly distrib