CN-116861327-B - Speech recognition adversarial sample defense method and system based on spectral features


Abstract

The invention discloses a speech recognition adversarial sample defense method and system based on spectral features. Known clean samples are attacked with white-box and black-box attacks to generate adversarial samples. The clean samples and the generated adversarial samples are each passed through several speech perturbation-removal methods to obtain simulated clean samples. All samples are converted into the frequency domain, the similarity of each spectrogram frame is used to build a feature vector, and the similarity feature vectors are labeled and used to train a classifier. When an unknown external sample arrives, simulated clean samples are first generated from it with the same perturbation-removal methods; the sample and its simulated clean samples are then preprocessed and converted into the frequency domain, and the resulting spectral similarity feature vector is input to the classifier. If the classifier detects an adversarial sample, the sample is discarded directly; if it detects a normal sample, the sample is passed on to the speech recognition system, which completes the defense. The invention is applicable to a variety of speech adversarial attack methods and improves the defense success rate.
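The inference-time flow described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `defend` and all of its parameters (`denoisers`, `featurize`, `is_adversarial`, `recognize`) are hypothetical placeholders for the perturbation-removal methods, the spectral similarity feature extractor, the trained classifier and the speech recognition system.

```python
from typing import Callable, Optional, Sequence

import numpy as np

# Hypothetical type aliases; none of these names come from the patent.
Denoiser = Callable[[np.ndarray], np.ndarray]
Featurizer = Callable[[np.ndarray, Sequence[np.ndarray]], np.ndarray]


def defend(
    sample: np.ndarray,
    denoisers: Sequence[Denoiser],
    featurize: Featurizer,
    is_adversarial: Callable[[np.ndarray], bool],
    recognize: Callable[[np.ndarray], str],
) -> Optional[str]:
    """Return the transcription of `sample`, or None if it is discarded."""
    # Generate one simulated clean sample per perturbation-removal method.
    simulated_clean = [denoise(sample) for denoise in denoisers]
    # Build the spectral similarity feature vector from the sample and
    # its simulated clean versions.
    features = featurize(sample, simulated_clean)
    # Discard samples flagged as adversarial; forward normal ones to ASR.
    if is_adversarial(features):
        return None
    return recognize(sample)
```

Passing the components in as callables keeps the gate independent of any particular denoiser, classifier or recognizer, which matches the abstract's claim that the scheme is meant to work across attack types.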

Inventors

JI SHUNHUI
GE ZHICHENG
LI XINYU
ZHANG PENGCHENG

Assignees

Hohai University (河海大学)

Dates

Publication Date: 20260508
Application Date: 20230705

Claims (10)

1. A speech recognition adversarial sample defense method based on spectral features, comprising the steps of: Step 1, denoising an unknown external original sample using speech perturbation-removal methods to obtain simulated clean sample 1, wherein the speech perturbation-removal methods comprise several of the Wiener filtering method, the spectral subtraction method, the ACC compression method and the Speex compression method; Step 2, preprocessing the original sample and the obtained simulated clean sample 1, and converting them into the frequency domain to obtain their spectrogram features; Step 3, training a classifier, namely: obtaining adversarial samples of known clean speech samples using a white-box attack and a black-box attack respectively; denoising the clean speech samples and the generated adversarial samples using the speech perturbation-removal methods to obtain simulated clean sample 2; preprocessing the clean speech samples, the adversarial samples and the simulated clean samples 2 obtained from them, and converting them into the frequency domain to obtain their spectrogram features; comparing the frame-by-frame similarity of the spectrograms of the two types of samples to obtain similarity feature vectors; labeling the vectors to indicate whether the sample is an adversarial sample; and training the classifier with the result; and Step 4, inputting the similarity feature vector corresponding to the spectrograms of Step 2 into the trained classifier to detect whether the original sample is an adversarial sample, thereby completing the defense framework.
2. The speech recognition adversarial sample defense method based on spectral features according to claim 1, wherein four speech perturbation-removal methods are selected, namely the Wiener filtering method, the spectral subtraction method, the ACC compression method and the Speex compression method, and wherein, when the similarity of each spectrogram frame is calculated, the similarities between the sample before perturbation removal and the four simulated clean samples obtained by the four methods are averaged.
3. The speech recognition adversarial sample defense method based on spectral features according to claim 1, wherein the spectral subtraction used in Step 1 is a modified spectral subtraction whose noise estimation comprises the steps of: first, removing silent segments by energy-threshold segmentation using voice activity detection; then, estimating the noise segments in the sample by distinguishing noise frames from speech frames through the zero-crossing-rate feature of voice activity detection, thereby locating the segments where the noise lies; and finally, selecting the noise of several frames in the high-frequency speech segment according to where the perturbation is distributed, and averaging it to estimate the noise of the whole segment.
4. The speech recognition adversarial sample defense method based on spectral features according to claim 1, wherein Step 2 comprises the steps of: Step 21, performing a series of preprocessing operations on the original sample and the obtained simulated clean sample 1, including pre-emphasis, framing and windowing; Step 22, converting the speech signals of the original sample and the obtained simulated clean sample 1 from the time domain to the frequency domain to obtain spectrograms; Step 23, matching the spectrogram features of the simulated clean sample 1 obtained by each perturbation-removal method one-to-one with those of the original sample, and calculating the similarity between the corresponding spectrograms in preparation for input to the classifier.
5. The speech recognition adversarial sample defense method based on spectral features according to claim 1, wherein Step 3 comprises the steps of: Step 31, attacking the clean samples with a white-box attack method and a black-box attack method to form the corresponding adversarial samples; Step 32, denoising the clean samples and the generated adversarial samples using the same speech perturbation-removal methods as in Step 1 to obtain the corresponding simulated clean samples 2; Step 33, performing a series of preprocessing operations on the clean samples, the adversarial samples and the simulated clean samples 2 obtained from them, including pre-emphasis, framing and windowing; Step 34, converting the speech signals of the clean samples, the adversarial samples and the simulated clean samples 2 obtained from them from the time domain to the frequency domain to obtain spectrograms; Step 35, matching the spectrogram features of the simulated clean samples 2 obtained by each perturbation-removal method one-to-one with those of the clean samples and of the generated adversarial samples, respectively; Step 36, calculating, for each frame, the average of the four correlations between the clean sample data set or the adversarial sample data set and each simulated clean sample data set, labeling the vectors obtained from clean samples as clean and the vectors obtained from adversarial samples as adversarial, and training the classifier; and Step 37, performing feature extraction on the correlation coefficient vectors using a convolutional neural network, and training the parameters of the convolutional neural network using a cross-entropy loss function.
6. The speech recognition adversarial sample defense method based on spectral features according to claim 1, wherein the similarity of the spectrograms is calculated using Pearson correlation coefficients.
7. The speech recognition adversarial sample defense method based on spectral features according to claim 1, wherein, in Step 4, if the classifier identifies the sample as an adversarial sample, the sample is discarded directly, and if the classifier identifies the sample as a normal sample, the sample is input to the speech recognition system for the next step, thereby completing the defense system.
8. A speech recognition adversarial sample defense system based on spectral features, comprising: an original sample denoising module, configured to denoise an unknown external original sample using speech perturbation-removal methods to obtain simulated clean sample 1, wherein the speech perturbation-removal methods comprise several of the Wiener filtering method, the spectral subtraction method, the ACC compression method and the Speex compression method; an original sample frequency domain conversion module, configured to preprocess the original sample and the obtained simulated clean sample 1 and convert them into the frequency domain to obtain their spectrogram features; a classifier training module, configured to obtain adversarial samples of known clean speech samples using a white-box attack and a black-box attack respectively, to denoise the clean speech samples and the generated adversarial samples using the speech perturbation-removal methods to obtain simulated clean sample 2, and to preprocess the clean speech samples, the adversarial samples and the simulated clean samples 2 obtained from them and convert them into the frequency domain to obtain their spectrogram features; and a defense detection module, configured to input the similarity feature vector corresponding to the spectrograms obtained by the original sample frequency domain conversion module into the trained classifier to detect whether the original sample is an adversarial sample, thereby completing the defense framework.
9. A computer system comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the computer program, when loaded into the processor, implements the steps of the speech recognition adversarial sample defense method based on spectral features according to any one of claims 1-7.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the speech recognition adversarial sample defense method based on spectral features according to any one of claims 1-7.
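The feature construction recited in claims 2, 4 and 6 (pre-emphasis, framing, windowing, conversion to the frequency domain, per-frame Pearson correlation, averaging over the perturbation-removal methods) can be sketched in NumPy as below. This is an illustrative sketch only. The function names, the 400-sample frame length, the 160-sample hop, the 0.97 pre-emphasis coefficient and the Hamming window are all assumptions; the claims do not specify them.

```python
import numpy as np


def spectrogram(signal, frame_len=400, hop=160, pre_emphasis=0.97):
    """Magnitude spectrogram; all parameter defaults are assumed values."""
    signal = np.asarray(signal, dtype=np.float64)
    # Pre-emphasis: y[t] = x[t] - a * x[t-1]
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # Framing: overlapping frames of frame_len samples, hop samples apart.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    # Windowing, then time domain -> frequency domain per frame.
    frames = emphasized[idx] * np.hamming(frame_len)
    return np.abs(np.fft.rfft(frames, axis=1))


def frame_similarity(spec_a, spec_b, eps=1e-12):
    """Pearson correlation between corresponding frames of two spectrograms."""
    n = min(len(spec_a), len(spec_b))
    a = spec_a[:n] - spec_a[:n].mean(axis=1, keepdims=True)
    b = spec_b[:n] - spec_b[:n].mean(axis=1, keepdims=True)
    num = (a * b).sum(axis=1)
    den = np.sqrt((a * a).sum(axis=1) * (b * b).sum(axis=1)) + eps
    return num / den


def similarity_features(original, simulated_clean):
    """Average the per-frame similarity over all simulated clean samples."""
    spec = spectrogram(original)
    sims = [frame_similarity(spec, spectrogram(c)) for c in simulated_clean]
    n = min(len(s) for s in sims)
    return np.mean([s[:n] for s in sims], axis=0)
```

With four perturbation-removal methods, `simulated_clean` would hold four signals and the result is one averaged correlation value per frame. Intuitively, removing a perturbation should change an adversarial sample's spectrum more than a clean sample's, so the correlation vector is the input from which the classifier learns to separate the two.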

Description

Speech recognition adversarial sample defense method and system based on spectral features

Technical Field

The invention relates to a speech recognition adversarial sample defense method and system based on spectral features, and belongs to the field of artificial intelligence testing.

Background

With the rapid development of computers and the Internet, speech recognition is becoming a key intelligent technology for efficient human-machine interaction. Speech recognition technology, also known as automatic speech recognition (ASR), aims to convert the lexical content of human speech into computer-readable input. Its application has greatly improved the efficiency of human-machine interaction and is spreading into every aspect of daily life. Speech recognition is mainly implemented with deep learning techniques, but adversarial samples can attack deep learning models and cause them to make wrong decisions when recognizing speech. An attacker only needs to add a carefully designed perturbation, hard for humans to perceive, to a speech clip, and the speech recognition model will transcribe the speech as garbled text or treat it as a silent clip containing no speech. Adversarial sample attacks on speech recognition therefore pose a security threat to related applications. Most speech recognition models are vulnerable to adversarial sample attacks, and as such attacks emerge, research on adversarial sample defense becomes particularly important. Adversarial sample defense has been studied more extensively for image recognition and text; defense against speech adversarial samples is a newer field, and its methods differ to some extent from those used for image and text adversarial samples.
For speech adversarial samples, the defense schemes studied by current researchers are broadly similar, and because adversarial samples come in many varied forms, the effectiveness of many defense methods drops sharply once the attack is changed slightly. Recently, Qiang Zeng et al. proposed a novel method for detecting adversarial samples. It builds on the established fact that different ASR systems use different architectures, parameters and training data sets and therefore transcribe the same audio differently, and it combines this with the idea of multi-version programming. However, the method has shortcomings: its defense success rate is not high, and the classifier's defense performance varies widely across different samples, so there is room for improvement.

Disclosure of Invention

Purpose of the invention: the characteristics of different speech audio vary and existing defense schemes are not universal; most are designed to defend against a single attack, and some limit the improvement of speech recognition accuracy. In view of this, the invention aims to provide a speech recognition adversarial sample defense method and system based on spectral features that is applicable to a variety of speech adversarial attack methods and improves the defense success rate.
To achieve the above aim, the speech recognition adversarial sample defense method based on spectral features comprises the following steps: Step 1, denoising an unknown external original sample using speech perturbation-removal methods to obtain simulated clean sample 1, wherein the speech perturbation-removal methods comprise several of the Wiener filtering method, the spectral subtraction method, the ACC compression method and the Speex compression method; Step 2, preprocessing the original sample and the obtained simulated clean sample 1, and converting them into the frequency domain to obtain their spectrogram features; Step 3, training a classifier, namely: obtaining adversarial samples of known clean speech samples using a white-box attack and a black-box attack respectively; denoising the clean speech samples and the generated adversarial samples using the speech perturbation-removal methods to obtain simulated clean sample 2; preprocessing the clean speech samples, the adversarial samples and the simulated clean samples 2 obtained from them, and converting them into the frequency domain to obtain their spectrogram features; comparing the frame-by-frame similarity of the spectrograms of the two types of samples to obtain a feature vector, labeling the vector to indicate whether the sample is an