
CN-116416997-B - Intelligent voice fake attack detection method based on attention mechanism

CN 116416997 B

Abstract

The invention discloses an intelligent voice fake attack detection method based on an attention mechanism. A voice sample is converted from the time domain to the frequency domain for analysis; considering the influence of different filters on feature performance, the spectrum is filtered with a plurality of filters and three different voiceprint features of the voice sample are extracted: the logarithmic power spectrum, the Mel-frequency cepstral coefficients, and the linear-frequency cepstral coefficients. A fake voice attack detection model based on the attention mechanism and a residual network is trained: the attention mechanism performs adaptive feature selection, enhancing discriminative effective features and suppressing noise and redundant features, and the residual network then performs advanced feature extraction and learning. The trained model performs validity detection on received voice samples and scores each sample; if the score exceeds a threshold, the sample is judged to be genuine human voice, otherwise it is judged to be fake voice. The method offers high accuracy, high efficiency, and strong generalization capability.

Inventors

  • DENG XIANJUN
  • HUANG YONGLING
  • YI LINGZHI
  • XIA YUNZHI
  • LIU SHENGHAO
  • YANG TIANRUO
  • ZHOU XINLEI

Assignees

  • 华中科技大学 (Huazhong University of Science and Technology)

Dates

Publication Date
2026-05-12
Application Date
2023-03-10

Claims (9)

  1. An intelligent voice fake attack detection method based on an attention mechanism, characterized by comprising the following steps: (1) converting a voice sample from the time domain to the frequency domain for analysis; considering the influence of different filters on feature expression, filtering the spectrum with a plurality of filters and extracting three different voiceprint features of the voice sample: the logarithmic power spectrum, the Mel-frequency cepstral coefficients, and the linear-frequency cepstral coefficients; (2) training a fake voice attack detection model based on an attention mechanism and a residual network: performing adaptive feature selection with the attention mechanism, enhancing discriminative effective features and suppressing noise and redundant features, then performing deep feature extraction and learning with the residual network, specifically comprising: (2.1) defining a fake voice attack detection model based on an attention mechanism and a residual network, the model consisting of an attention feature enhancement module, a depth feature extraction module, and a fully connected classification module; designing a loss function over the voiceprint features extracted in step (1), selecting a parameter optimizer, and training the network by gradient descent and back propagation to obtain the fake voice attack detection model; (2.2) taking the voiceprint features extracted in step (1) as the input of the fake voice detection model, and using the attention feature enhancement module to dynamically select and process the important information in the high-dimensional features; (2.3) inputting the enhanced feature map, with C feature channels, into the depth feature extraction module, which enlarges the model's receptive field through 6 stacked residual blocks, captures the correlations between voice frames and between frequency components, and extracts deeper features; (2.4) inputting the features obtained by the depth feature extraction module into a linear classifier composed of two fully connected layers fc1 and fc2 for logical inference, applying a LeakyReLU activation function between the two fully connected layers for nonlinear transformation, and finally outputting, through a Softmax activation function, a 2-dimensional vector V representing a probability distribution; (2.5) calculating the loss function from the prediction result vector V, back-propagating to compute gradients, and updating the model parameters with an Adam optimizer; (3) performing validity detection on a received voice sample with the trained fake voice attack detection model and scoring the sample; if the score exceeds a threshold, the sample is judged to be genuine human voice, otherwise it is judged to be fake voice.
  2. The intelligent voice forgery attack detection method based on the attention mechanism as set forth in claim 1, wherein the step (1) specifically includes: (1.1) unifying the lengths of all voice samples in the original data set, adjusting the original voice samples to the same length by padding and truncation; (1.2) dividing a voice sample into a plurality of voice frames according to the number of sampling points, and extracting voiceprint features from each individual frame; (1.3) setting N as the number of sampling points per frame, multiplying each speech frame by a Hamming window function to obtain the windowed signal; (1.4) performing a fast Fourier transform on the windowed signal and computing its spectrum; (1.5) considering the influence of different filters on feature expression, extracting the voiceprint features of the voice sample based on three different filters, comprising: computing power-spectrum features from the speech spectrum and eliminating convolutional noise by a logarithmic transformation, computing the Mel-frequency cepstral coefficients, and computing the linear-frequency cepstral coefficients.
  3. The intelligent voice forgery attack detection method based on the attention mechanism as claimed in claim 2, wherein said step (1.5) specifically comprises the following sub-steps: (1.5.1) computing power-spectrum features from the speech spectrum and removing convolutional noise by logarithmic transformation; (1.5.2) computing the Mel-frequency cepstral coefficients from the speech spectrum: (1.5.2.1) filtering the spectrum with a Mel filter bank, where M is the number of Mel filters and the Mel-frequency transform maps linear frequency to the Mel scale; (1.5.2.2) applying a discrete cosine transform to the Mel spectrogram, where L is the MFCC feature order; the first 30 coefficients are selected as the static MFCC features in this step; (1.5.2.3) taking first-order and second-order derivatives of the static features to obtain the dynamic voice features; (1.5.3) computing the linear-frequency cepstral coefficients from the speech spectrum: (1.5.3.1) filtering the spectrum with linear triangular filters, where M is the number of triangular filters; 60 linear triangular filters are used to filter the spectrum in this step; (1.5.3.2) applying a discrete cosine transform to the linear spectrogram to eliminate the correlation between signal values of different orders, where L is the LFCC feature order; (1.5.3.3) taking first-order and second-order derivatives of the static features to obtain the dynamic voice feature representation.
  4. The intelligent voice falsification attack detection method based on the attention mechanism of claim 1, wherein in the step (2.1), the attention feature enhancement module consists of two parts, channel attention and spatial attention, used to perform adaptive feature selection from the high-dimensional, multi-channel input, enhancing important features and reducing redundant ones; the enhanced feature map is input into the depth feature extraction network, which consists of 6 groups of residual blocks, captures the local correlations among voice features through convolution operations, and, using multi-layer convolutions to expand the receptive field, obtains a comprehensive feature descriptor of the voice sample based on the higher-dimensional features; finally, a fully connected classification module consisting of the two fully connected layers fc1 and fc2 performs probability prediction on the voice sample and calculates a score; if the score exceeds a threshold, the sample is judged to be real human voice, otherwise it is judged to be falsified voice.
  5. The intelligent speech fake attack detection method based on the attention mechanism of claim 1, wherein in the step (2.2), the weights and biases of the fake speech detection model are initialized, and the voiceprint features extracted in step (1) are used as the input of the attention feature enhancement module of step (2.1), where C is the number of input channels; the attention feature enhancement module generates attention maps through global max pooling and global average pooling operations along the channel dimension and the spatial dimension, and computes a weight matrix for the input feature map; it does not change the size of the input feature map, but only adaptively assigns weights to the original features, providing enhanced features for the subsequent depth feature extraction module.
  6. The intelligent voice forgery attack detection method based on the attention mechanism as set forth in claim 1, wherein the step (2.2) specifically includes: (2.2.1) treating each channel of the input feature map as a feature detector, the attention feature enhancement module weights the different channel features with channel attention: global max pooling and global average pooling are applied to the high-dimensional speech feature map to obtain two per-channel feature descriptors that globally summarize the condition of each channel's features; the two descriptors are fed into a shared multi-layer perceptron network built from two convolutions to obtain the corresponding attention weights; the sum of the two weight vectors is mapped into the interval [0,1] by a Sigmoid function to obtain the final channel attention weights, which are used to assign weights to the different channel features for a first round of feature enhancement; (2.2.2) computing a spatial attention map to locate effective information in the high-dimensional feature space: the channel-enhanced feature map is sliced along the channel direction into one-dimensional vectors, whose maximum and average are computed respectively to obtain two spatial attention maps; the two maps are concatenated and fused through a 7×7 convolution to obtain the final spatial attention weights, and a second round of feature enhancement uses the acquired spatial attention to assign different weights to different positions in the high-dimensional features.
  7. The intelligent voice forgery attack detection method based on the attention mechanism as set forth in claim 1, wherein the step (2.3) specifically includes: each residual block consists of a convolution branch and a residual branch, the block input feeding both branches; the convolution branch comprises two convolutions, the first with stride 1, followed by batch normalization and a Dropout operation with deactivation probability 0.5, then a second convolution with stride 3; the residual branch comprises one convolution used to adapt to the change of feature-map size caused by the convolution operations; the outputs of the convolution branch and the residual branch are summed, batch normalization is applied again, a LeakyReLU activation function prevents the gradient from vanishing, and the final high-level representation is obtained through the stacked learning of six residual blocks.
  8. The intelligent speech forgery attack detection method based on the attention mechanism as set forth in claim 1, wherein said step (2.5) specifically includes: taking the labels in the original data set as the expected output of the network and the probabilities predicted in step (2.4) as the actual prediction of the network; according to the network model constructed in step (2.1), designing a target loss function between the expected output and the actual predicted output, whose terms are the probability values predicted by the model; according to the designed loss function, iteratively training the model with the back-propagation algorithm, minimizing the classification loss function to obtain the optimal network model.
  9. The intelligent speech forgery attack detection method based on the attention mechanism as claimed in claim 1 or 2, wherein said step (3) specifically comprises: (3.1) extracting the voiceprint features of the voice audio to be detected and inputting them into the attention feature enhancement module to enhance effective features and suppress ineffective ones: the voiceprint features of the k-th voice sample I k in the data set I to be detected are input into the model trained in step (2); the channel attention weights are generated through global pooling, average pooling, and the MLP convolution over the feature space, and the original feature map is multiplied by the channel attention weights; the spatial attention weights are then obtained through global pooling and average pooling in the channel direction and a stride-1 convolution, and feature enhancement is performed accordingly; (3.2) obtaining the probability distribution over the sample's classes after the enhanced feature map passes through the depth feature extraction module consisting of six residual blocks and the two fully connected layers; (3.3) calculating the model score of the voice sample; if the score is larger than a set threshold, the sample is considered real human voice, otherwise it is considered fake voice, where one term of the score expression represents the probability that the input speech is real human voice and the other represents the probability that it is spurious speech.
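The framing, Hamming windowing, and FFT steps of claim 2, steps (1.3)-(1.4), together with the log power spectrum of step (1.5.1), can be sketched in NumPy. The frame length, hop size, and FFT size below are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (the tail is zero-padded)."""
    n_frames = 1 + max(0, int(np.ceil((len(x) - frame_len) / hop)))
    padded = np.zeros(frame_len + (n_frames - 1) * hop)
    padded[:len(x)] = x
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return padded[idx]

def log_power_spectrum(x, frame_len=512, hop=256, n_fft=512):
    """Window each frame with a Hamming window, FFT it, and take the log power."""
    frames = frame_signal(x, frame_len, hop) * np.hamming(frame_len)
    spec = np.fft.rfft(frames, n=n_fft, axis=1)   # per-frame spectrum, step (1.4)
    power = (np.abs(spec) ** 2) / n_fft           # power spectrum
    return np.log(power + 1e-10)                  # log transform removes convolutive noise

rng = np.random.default_rng(0)
lps = log_power_spectrum(rng.standard_normal(16000))  # 1 s of 16 kHz noise
print(lps.shape)
```

Each row of the result is one frame's log power spectrum; the feature maps fed to the model stack such rows over time.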
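Claim 3's Mel filtering and DCT steps, (1.5.2.1)-(1.5.2.2), follow the standard Mel transform m = 2595 · log10(1 + f/700). A minimal sketch, with the sample rate, FFT size, and filter-edge placement chosen for illustration:

```python
import numpy as np

def hz_to_mel(f):
    """Standard Mel transform: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=30, n_fft=512, sr=16000):
    """Triangular filters whose centres are equally spaced on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        f_l, f_c, f_r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(f_l, f_c):                  # rising edge of the triangle
            fb[m - 1, k] = (k - f_l) / max(f_c - f_l, 1)
        for k in range(f_c, f_r):                  # falling edge
            fb[m - 1, k] = (f_r - k) / max(f_r - f_c, 1)
    return fb

def dct2(x, n_out):
    """Type-II DCT along the last axis: the decorrelation step in (1.5.2.2)."""
    N = x.shape[-1]
    k = np.arange(n_out)[:, None]
    n = np.arange(N)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * N))
    return x @ basis.T

fb = mel_filterbank()
rng = np.random.default_rng(0)
power = rng.random((62, 257))                      # stand-in per-frame power spectra
mel_energies = np.log(power @ fb.T + 1e-10)        # filter, then log
mfcc = dct2(mel_energies, n_out=30)                # first 30 static MFCC coefficients
print(mfcc.shape)
```

The LFCC of step (1.5.3) differs only in the filter placement: the 60 triangle centres are spaced linearly in Hz rather than on the Mel scale.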
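The two-round channel/spatial attention of claim 6 can be outlined as follows. This is a simplified sketch: the shared MLP is a plain two-layer perceptron, and the 7×7 fusion convolution of step (2.2.2) is replaced by a simple average of the two spatial maps, so the shapes and reweighting behaviour match the claim but the exact operators do not:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """x: (C, H, W). Global max/avg pool -> shared 2-layer MLP -> sigmoid weights."""
    gmp = x.max(axis=(1, 2))                      # per-channel max descriptor
    gap = x.mean(axis=(1, 2))                     # per-channel average descriptor
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0)    # shared perceptron, ReLU hidden layer
    a = sigmoid(mlp(gmp) + mlp(gap))              # weights in [0, 1], step (2.2.1)
    return x * a[:, None, None]                   # first round of feature enhancement

def spatial_attention(x):
    """Max and mean along the channel axis give an (H, W) attention map."""
    m = x.max(axis=0)
    a = x.mean(axis=0)
    s = sigmoid((m + a) / 2.0)                    # stand-in for the claimed 7x7 conv fusion
    return x * s[None, :, :]                      # second round of feature enhancement

rng = np.random.default_rng(1)
x = rng.standard_normal((16, 8, 8))               # toy feature map: C=16, 8x8
C, r = 16, 4                                      # r: assumed MLP reduction ratio
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y = spatial_attention(channel_attention(x, w1, w2))
print(y.shape)   # attention reweights the features but never changes the map size
```

As claim 5 states, the module only assigns weights; the output shape equals the input shape.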
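The residual block of claim 7 (a convolution branch plus a residual branch, summed and passed through batch normalization and LeakyReLU) can be sketched as below. To stay compact, both branch convolutions are modelled as 1×1 channel-mixing convolutions and Dropout is omitted, which is a simplification of the claimed convolution/stride arrangement:

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def batch_norm(x, eps=1e-5):
    """Per-channel normalisation over spatial positions (inference-style sketch)."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conv1x1(x, w):
    """A 1x1 convolution is per-pixel channel mixing: w has shape (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def residual_block(x, w_a, w_b, w_skip):
    """Convolution branch (conv -> BN -> conv) + 1x1 residual branch, then BN + LeakyReLU."""
    branch = conv1x1(batch_norm(conv1x1(x, w_a)), w_b)
    skip = conv1x1(x, w_skip)        # adapts the shortcut's channel count, as claimed
    return leaky_relu(batch_norm(branch + skip))

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 6, 6))               # toy input: 8 channels, 6x6
w_a = rng.standard_normal((16, 8)) * 0.1         # 8 -> 16 channels
w_b = rng.standard_normal((16, 16)) * 0.1
w_skip = rng.standard_normal((16, 8)) * 0.1      # shortcut matches the new width
out = residual_block(x, w_a, w_b, w_skip)
print(out.shape)
```

Stacking six such blocks, as in step (2.3), yields the high-level representation fed to the classifier.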
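The loss of claim 8 measures the gap between the predicted probabilities and the data-set labels. Since the exact expression is not reproduced in the text, a binary cross-entropy, the common choice for a two-class Softmax output, is assumed here:

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-12):
    """p: predicted probability of 'genuine'; y: 1 for genuine, 0 for fake."""
    p = np.clip(p, eps, 1 - eps)        # guard against log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

p = np.array([0.9, 0.2, 0.8, 0.1])      # model outputs for four toy samples
y = np.array([1, 0, 1, 0])              # their labels (expected network output)
loss = binary_cross_entropy(p, y)
print(loss)
```

Minimizing this loss with back-propagation and Adam, as steps (2.5) and claim 8 describe, drives the predicted probabilities toward the labels.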
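The scoring rule of claim 9, step (3.3), compares a score built from the genuine-voice and fake-voice probabilities against a threshold. The exact score formula is not reproduced in the text; the log-likelihood ratio log P(genuine) − log P(fake), a standard choice in spoofing detection, is assumed below:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())             # shift for numerical stability
    return e / e.sum()

def score(logits):
    """Assumed log-likelihood-ratio score: log P(genuine) - log P(fake)."""
    p = softmax(np.asarray(logits, dtype=float))
    return float(np.log(p[0]) - np.log(p[1]))   # index 0 = genuine, 1 = fake (assumed)

def decide(logits, threshold=0.0):
    """Score above the threshold -> real human voice, otherwise fake (claim 9, (3.3))."""
    return 'genuine' if score(logits) > threshold else 'fake'

print(decide([2.0, -1.0]))    # strongly genuine-leaning 2-D output vector
print(decide([-1.5, 1.0]))    # fake-leaning output vector
```

For a Softmax pair, this score reduces to the difference of the two logits, so the threshold can be tuned directly on a validation set.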

Description

Intelligent voice fake attack detection method based on attention mechanism

Technical Field

The invention belongs to the technical field of Internet of Things security, and particularly relates to an intelligent voice counterfeiting attack detection method based on an attention mechanism.

Background

The task of fake speech detection is to identify whether a given speech sample is a sound made by a real human or fake speech synthesized by an electronic device. In recent years, with the continuous development of intelligent voice technology, intelligent voice assistants have been widely integrated into mobile and Internet of Things devices. However, while intelligent voice assistants bring great convenience, they also pose a significant challenge: an attacker can bypass the authentication of an intelligent voice system by maliciously recording and replaying voice, or by synthesizing it, which greatly threatens the privacy and property security of users. Many researchers have therefore studied spoofed-voice detection. Because real scenes are complex and varied, environmental noise is strong, and attack technologies such as voice synthesis are numerous and constantly changing, the task of detecting fake voice remains difficult and challenging.
To enhance model robustness and generalization, some existing methods perform dual identity verification using physical quantities such as wearable-device sensor data, ultrasonic waves, millimeter waves, or ambient wireless signals; these methods require additional equipment or impose strong position constraints, which affects usability. In addition, based on the observation that the voiceprint characteristics of sound produced by the human voice and by electronic loudspeakers differ, many studies take voiceprint features as input and classify voice samples with machine learning algorithms such as the Gaussian Mixture Model (GMM) and the x-vector, but the detection accuracy of these machine learning methods is low. To learn feature information better, deep-learning-based methods have been widely applied; however, because voice features are high-dimensional and contain much redundant information, such models overfit easily and handle unknown attacks poorly, while methods based on sequential models such as the recurrent neural network (RNN) and the long short-term memory network (LSTM) suffer from heavy computation, long running time, and low model efficiency.
Disclosure of Invention

Aiming at the defects of the prior art and the demand for improvement, the invention provides an intelligent voice counterfeiting attack detection method based on an attention mechanism. Its aims are: to enhance discriminative feature expression with the attention mechanism during voice-legitimacy identification and improve the separability of the human-voice and machine-voice feature matrices, solving the poor generalization and weak unknown-attack capability of existing methods; to identify counterfeit voice with a classifier model based on a residual network, extracting local feature correlations with convolution kernels; and to improve detection accuracy while keeping the model efficient.

To achieve the above purpose, the present invention provides an intelligent voice falsification attack detection method based on an attention mechanism, comprising the following steps: (1) converting a voice sample from the time domain to the frequency domain for analysis; considering the influence of different filters on feature expression, filtering the spectrum with a plurality of filters and extracting three different voiceprint features of the voice sample: the logarithmic power spectrum, the Mel-frequency cepstral coefficients, and the linear-frequency cepstral coefficients; (2) training a fake voice attack detection model based on an attention mechanism and a residual network: performing adaptive feature selection with the attention mechanism, enhancing discriminative effective features and suppressing noise and redundant features, then performing advanced feature extraction and learning through the residual network; (3) performing validity detection on a received voice sample with the trained fake voice attack detection model and scoring the sample; if the score exceeds a threshold, the sample is judged to be genuine human voice, otherwise it is judged to be fake voice.

In one embodiment of the present invention, the step (1) specifically includes: (1.1) unifying the lengths of all voice samples in the original data set, adjusting the original voice samples to the same length by padding and truncation.