CN-120412591-B - Voiceprint recognition method based on double-feature branch structure

CN120412591B

Abstract

The invention relates to the technical field of deep-learning voiceprint recognition, and in particular to a voiceprint recognition method based on a double-feature branch structure. The method extracts Mel cepstrum features and wavelet transform features from the original speech signal to form two branches; the two feature sets are fed into a self-attention network and a convolutional TDNN network, respectively, for multi-scale feature modeling, and the two outputs are fused. A multi-level discrimination loss is then computed on the fused voiceprint representation to enhance speaker discrimination in noisy or mismatched environments; the fused output is decoded or up-sampled and used as the input to the next processing stage, and a cascaded codec structure finally produces multi-resolution, more robust voiceprint features. The method aims to overcome the shortcomings of single-path feature extraction in complex environments, and markedly improves the capture and recognition of multi-resolution speech features by combining the strengths of self-attention and the convolutional TDNN.
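The fusion and pooling stages summarized above can be sketched in a few lines. This is an illustrative numpy sketch only; the fusion weights and feature shapes are assumptions, not values from the patent:

```python
import numpy as np

def fuse_and_pool(Y_mfcc, Y_wav, w1=0.5, w2=0.5):
    """Weighted fusion of the two branch outputs (T frames x C channels),
    followed by mean/std statistics pooling over the time axis."""
    Y = w1 * Y_mfcc + w2 * Y_wav           # fused feature map, shape (T, C)
    mu = Y.mean(axis=0)                    # per-channel mean
    sigma = Y.std(axis=0)                  # per-channel standard deviation
    return np.concatenate([mu, sigma])     # utterance-level embedding, shape (2C,)

# Two toy branch outputs: 100 frames, 64 channels each
rng = np.random.default_rng(0)
emb = fuse_and_pool(rng.normal(size=(100, 64)), rng.normal(size=(100, 64)))
```

Concatenating the per-channel mean and standard deviation is what turns a variable-length frame sequence into a fixed-size utterance-level vector.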

Inventors

  • DING JIANRUI
  • WANG XIN
  • DING ZHUO

Assignees

  • Harbin Institute of Technology (Weihai)
  • Nanjing Longyuan Information Technology Co., Ltd.

Dates

Publication Date
2026-05-12
Application Date
2025-04-03

Claims (9)

  1. A voiceprint recognition method based on a dual-feature branch structure, characterized by comprising the following steps: preprocessing an original speech signal and extracting Mel cepstrum features and wavelet transform features, respectively, to form two feature branches; inputting the Mel cepstrum features into a self-attention module for sequence modeling of multi-scale time-frequency features; inputting the wavelet transform features into a convolutional TDNN network and applying time-delay convolution to the wavelet coefficients after segmentation or framing; fusing the sequence-modeling result of the self-attention module with the time-delay convolution result of the convolutional TDNN network to obtain a fused voiceprint feature representation; inputting the fused voiceprint feature representation into an SE module and an average statistics pooling layer connected to a linear layer, and finally performing speaker classification or similarity measurement with an AAM-Softmax loss. The method further comprises multi-scale feature extraction and downsampling, in which the input layer of each encoder stage is padded by boundary reflection, features are activated by sliding-window convolutions with stride 1, and the output layer is combined with separable convolution to increase the number of feature channels incrementally.
  2. The voiceprint recognition method based on the dual-feature branch structure of claim 1, wherein, in the step of preprocessing the original speech signal and extracting Mel cepstrum features and wavelet transform features to form two feature branches, the extraction of the Mel cepstrum features satisfies the following formulas: S[m] = Σ_{k=0}^{K−1} |X[k]|²·B_m[k]; c[n] = Σ_{m=0}^{M−1} log(S[m])·cos(πn(m + 1/2)/M); where X[k] is the Fourier transform of the signal, and the mask of the Mel-frequency cepstrum B_m[k] is the triangular filter computed as: B_m[k] = (k − f(m−1))/(f(m) − f(m−1)) for f(m−1) ≤ k ≤ f(m), B_m[k] = (f(m+1) − k)/(f(m+1) − f(m)) for f(m) ≤ k ≤ f(m+1), and B_m[k] = 0 otherwise; K is the number of frequency-domain sampling points and M is the number of Mel filters.
  3. The voiceprint recognition method based on the dual-feature branch structure of claim 2, wherein, in the step of preprocessing the original speech signal and extracting Mel cepstrum features and wavelet transform features to form two feature branches, the extraction of the wavelet transform features satisfies the following formula: W(a, b) = (1/√a)·∫ x(t)·ψ*((t − b)/a) dt; where x(t) is the input signal, ψ is the mother wavelet function, a is the scale factor, and b is the translation factor.
  4. The voiceprint recognition method based on the dual-feature branch structure of claim 3, wherein the step of inputting the Mel cepstrum features into the self-attention module and performing sequence modeling of the multi-scale time-frequency features specifically comprises: linearly mapping the Mel cepstrum feature matrix X by multiplying it by the learnable parameter matrices W_h^(Q), W_h^(K) and W_h^(V) to obtain Q_h, K_h and V_h: Q_h = X·W_h^(Q), K_h = X·W_h^(K), V_h = X·W_h^(V); taking the dot product of Q_h and K_h, dividing by √d_k for normalization, obtaining the attention weight matrix with softmax, and finally multiplying by V_h: head_h = softmax(Q_h·K_h^T/√d_k)·V_h; the outputs of all attention heads h = 1, …, H are concatenated and multiplied by a learnable output map W_O: MultiHead(X) = [head_1|head_2|…|head_H]·W_O.
  5. The voiceprint recognition method based on the dual-feature branch structure of claim 4, wherein, in the step of inputting the wavelet transform features into the convolutional TDNN network and applying time-delay convolution to the wavelet coefficients after segmentation or framing, the convolutional TDNN network performs time-delay convolution on the segmented or framed wavelet transform features, captures local features through layer normalization and a convolutional spatial gating unit, and then projects them back to the original dimension.
  6. The voiceprint recognition method based on the dual-feature branch structure of claim 5, wherein, in the step of fusing the two branch outputs, a covariance transformation or channel normalization mechanism is introduced to align the feature distributions, which are restored to a fused feature map through linear mapping, expressed as: Y = ω1·Y_mfcc + ω2·Y_wav.
  7. The voiceprint recognition method based on the dual-feature branch structure of claim 6, wherein, in the step of inputting the fused voiceprint feature representation into the SE module and the average statistics pooling layer connected to the linear layer, the SE module applies channel weighting to the fused feature map, and the average statistics pooling layer aggregates global temporal features and is connected to the linear layer to output the voiceprint embedding; the average statistics pooling layer computes a channel-dependent soft attention mechanism, in which the self-attention weight α_t represents the importance of each frame of a given channel and is computed from the per-frame activation h_t as: e_t = v^T·f(W·h_t + b) + k; α_t = exp(e_t)/Σ_τ exp(e_τ); the weighted mean of each channel component c for a given utterance is: μ_c = Σ_t α_t·h_{t,c}; and the weighted standard deviation of each channel component c is: σ_c = √(Σ_t α_t·h²_{t,c} − μ_c²).
  8. The voiceprint recognition method based on the dual-feature branch structure of claim 7, wherein, in the step of finally performing speaker classification or similarity measurement with the AAM-Softmax loss, the AAM-Softmax loss function is calculated by the following formula: L = −(1/N)·Σ_i log( e^{s·cos(θ_{y_i} + m)} / ( e^{s·cos(θ_{y_i} + m)} + Σ_{j≠y_i} e^{s·cos θ_j} ) ); where θ_j is the angle between the embedding and the weight of class j for all classes, an additional angular margin m is added to the cosine of the correct class y_i, and s is a scale factor.
  9. The voiceprint recognition method based on the dual-feature branch structure of claim 8, further comprising a voiceprint database updating step, which specifically refers to updating the corresponding matched voiceprint data in the database with the model output to ensure timeliness, expressed by the following formula: v' = (1 − λ)·v + λ·x; where x is the recognition result and λ is a hyperparameter.
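As a concrete illustration of the loss in claim 8, the numpy sketch below computes an AAM-Softmax loss for a toy batch of embeddings. The scale s and margin m values, and the batch shapes, are illustrative assumptions rather than values taken from the patent:

```python
import numpy as np

def aam_softmax_loss(emb, W, labels, s=30.0, m=0.2):
    """AAM-Softmax: add angular margin m to the target-class angle, scale by s."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # L2-normalize embeddings
    W = W / np.linalg.norm(W, axis=0, keepdims=True)         # L2-normalize class weights
    cos = np.clip(emb @ W, -1.0, 1.0)                        # (N, C) cosines cos(theta_j)
    theta = np.arccos(cos)
    idx = np.arange(len(labels))
    logits = cos.copy()
    logits[idx, labels] = np.cos(theta[idx, labels] + m)     # margin on the correct class
    logits *= s
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[idx, labels].mean()                     # mean cross-entropy

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))        # 8 utterance embeddings, dim 16
W = rng.normal(size=(16, 4))          # weights for 4 speaker classes
labels = rng.integers(0, 4, size=8)
loss = aam_softmax_loss(emb, W, labels)
```

Because both the embeddings and the class weights are L2-normalized, the logits depend only on the angles θ_j, which is what makes the additive angular margin meaningful.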

Description

Voiceprint recognition method based on double-feature branch structure

Technical Field

The invention relates to the technical field of deep-learning voiceprint recognition, and in particular to a voiceprint recognition method based on a double-feature branch structure.

Background

Voiceprint recognition has developed rapidly, from statistical approaches based on Gaussian mixture models and i-vectors to end-to-end approaches such as x-vectors combined with deep neural networks. Current research focuses mainly on better feature extraction and modeling strategies, such as incorporating attention mechanisms and fusing multi-scale features, to achieve more robust speaker identification in complex environments. Existing methods mostly extract Mel cepstrum or other single features along a single path, ignoring the multi-scale information of the speech signal, and lack a means of combining global and local feature information in the deep model. In summary, existing voiceprint recognition methods extract features poorly from low-quality speech signals and easily lose voiceprint feature information.

Disclosure of Invention

The invention aims to provide a voiceprint recognition method based on a double-feature branch structure, which solves the problems that existing voiceprint recognition methods extract features poorly from low-quality speech signals and easily lose voiceprint feature information.
In order to achieve the above object, the present invention provides a voiceprint recognition method based on a dual-feature branch structure, comprising the steps of: preprocessing an original speech signal and extracting Mel cepstrum features and wavelet transform features, respectively, to form two feature branches; inputting the Mel cepstrum features into a self-attention module for sequence modeling of multi-scale time-frequency features; inputting the wavelet transform features into a convolutional TDNN network and applying time-delay convolution to the wavelet coefficients after segmentation or framing; fusing the two branch outputs to obtain a fused voiceprint feature representation; and inputting the fused voiceprint feature representation into an SE module and an average statistics pooling layer connected to a linear layer, finally performing speaker classification or similarity measurement with an AAM-Softmax loss.

In the step of preprocessing the original speech signal and extracting Mel cepstrum features and wavelet transform features to form two feature branches, the extraction of the Mel cepstrum features satisfies: S[m] = Σ_{k=0}^{K−1} |X[k]|²·B_m[k], where X[k] is the Fourier transform of the signal and B_m[k] is the mask of the Mel-frequency cepstrum, computed as a triangular filter spanning adjacent Mel-spaced bin boundaries; K is the number of frequency-domain sampling points and M is the number of Mel filters. In the same step, the extraction of the wavelet transform features satisfies: W(a, b) = (1/√a)·∫ x(t)·ψ*((t − b)/a) dt, where x(t) is the input signal, ψ is the mother wavelet function, a is the scale factor, and b is the translation factor.
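A runnable sketch of both branch front-ends follows, assuming a standard triangular mel filterbank and a real Morlet-style mother wavelet; the sample rate, filter count, and wavelet choice are illustrative assumptions, not specifics from the patent:

```python
import numpy as np

def mel_filterbank(K, M, sr=16000):
    """Triangular mel masks B_m[k] over K one-sided frequency bins."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # M + 2 boundary frequencies f(0..M+1), equally spaced on the mel scale
    pts = inv_mel(np.linspace(0.0, mel(sr / 2.0), M + 2))
    bins = np.floor((K - 1) * pts / (sr / 2.0)).astype(int)
    B = np.zeros((M, K))
    for m in range(1, M + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):                    # rising edge of triangle m
            B[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):                    # falling edge of triangle m
            B[m - 1, k] = (hi - k) / max(hi - c, 1)
    return B

def cwt(x, scales, t):
    """W(a, b) = (1/sqrt(a)) * integral of x(t) * psi((t - b)/a) dt, real Morlet."""
    psi = lambda u: np.exp(-u ** 2 / 2.0) * np.cos(5.0 * u)
    dt = t[1] - t[0]
    W = np.zeros((len(scales), len(t)))
    for i, a in enumerate(scales):
        for j, b in enumerate(t):
            W[i, j] = np.sum(x * psi((t - b) / a)) * dt / np.sqrt(a)
    return W

B = mel_filterbank(K=257, M=24)                   # masks for a 512-point FFT
t = np.linspace(0.0, 1.0, 200)
W = cwt(np.sin(2 * np.pi * 10 * t), scales=[0.02, 0.05], t=t)
```

The two outputs play complementary roles: the mel masks summarize spectral energy on a perceptual frequency scale, while the wavelet transform keeps time-localized multi-scale detail for the second branch.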
The step of inputting the Mel cepstrum features into the self-attention module and performing sequence modeling of the multi-scale time-frequency features specifically comprises: linearly mapping the Mel cepstrum feature matrix X by the learnable parameter matrices W_h^(Q), W_h^(K) and W_h^(V) to obtain Q_h, K_h and V_h: Q_h = X·W_h^(Q), K_h = X·W_h^(K), V_h = X·W_h^(V); taking the dot product of Q_h and K_h, dividing by √d_k for normalization, obtaining the attention weight matrix with softmax, and multiplying it by V_h: head_h = softmax(Q_h·K_h^T/√d_k)·V_h; the outputs of all attention heads h = 1, …, H are concatenated and multiplied by a learnable output map W_O: MultiHead(X) = [head_1|head_2|…|head_H]·W_O.

In the step of inputting the wavelet transform features into the convolutional TDNN network and applying time-delay convolution to the wavelet coefficients after segmentation or framing, the network performs time-delay convolution on the segmented or framed wavelet transform features, captures local features through layer normalization and a convolutional spatial gating unit, and then projects them back to the original dimension.

In the step of fusing the two branch outputs, a covariance transformation or channel normalization mechanism is optionally introduced to align the feature distributions, which are restored to a fused feature map through linear mapping, expressed as: Y = ω1·Y_mfcc + ω2·Y_wav. In
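The multi-head self-attention step can be written directly from the formulas above. This numpy sketch assumes a toy model dimension and H = 2 heads; all shapes are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(X, Wq, Wk, Wv, Wo):
    """MultiHead(X) = [head_1 | ... | head_H] . W_O with scaled dot-product heads."""
    H, dk = Wq.shape[0], Wq.shape[2]
    heads = []
    for h in range(H):
        Qh, Kh, Vh = X @ Wq[h], X @ Wk[h], X @ Wv[h]   # (T, dk) projections
        A = softmax(Qh @ Kh.T / np.sqrt(dk))           # attention weight matrix
        heads.append(A @ Vh)                           # head_h = A . V_h
    return np.concatenate(heads, axis=1) @ Wo          # back to (T, d_model)

rng = np.random.default_rng(0)
T, d, H = 5, 8, 2
dk = d // H
Wq, Wk, Wv = (rng.normal(size=(H, d, dk)) for _ in range(3))
Wo = rng.normal(size=(H * dk, d))
Y = multi_head(rng.normal(size=(T, d)), Wq, Wk, Wv, Wo)
```

Each row of the attention weight matrix sums to 1, so every output frame is a convex combination of the value vectors of all frames, which is what gives the branch its global, multi-scale temporal context.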