CN-116504251-B - Speech analysis identity authentication method based on sound field reconstruction
Abstract
The invention discloses a voice analysis identity authentication method based on sound field reconstruction, comprising distance sensing, sound field reconstruction, sound field extraction, and model training and inference. In distance sensing, a chirp signal is emitted by the loudspeaker and its echo is received, and the distance from the user to the mobile phone is obtained by cross-correlation. In sound field reconstruction, an impulse response database indexed by distance is built, the transfer function corresponding to the measured distance is looked up, and the sound field at the verification position is reconstructed into the sound field at the registration position. In sound field extraction, the two channels of the reconstructed voice signal are processed separately and field lines are extracted. In model training and inference, a voice authentication model is built from the registration field lines and the reconstructed verification-stage field lines.
Inventors
- XU WENYUAN
- JI XIAOYU
- YAN CHEN
- LI XINFENG
- LV ZHI
- CHEN YIRAN
- ZHENG ZHICONG
- ZHONG JINGJING
Assignees
- Zhejiang University (浙江大学)
Dates
- Publication Date
- 20260508
- Application Date
- 20230418
Claims (6)
- 1. A voice analysis identity authentication method based on sound field reconstruction, characterized by comprising the following steps:
Step 1, distance sensing.
Step 1.1, send a ranging signal from the mobile phone loudspeaker, and receive with the microphone at the bottom of the phone the echo signal returned after the ranging signal is reflected by the user;
Step 1.2, remove background noise from the echo signal with a Butterworth band-pass filter;
Step 1.3, find the direct path and the echo path in the time-domain image of the echo signal; with Δt the time difference between the echo path and the direct path, the distance between the user and the phone is obtained as d = c·Δt/2, where c is the speed of sound.
Step 2, sound field reconstruction.
Step 2.1, generate a constant excitation s(t) with a stable sound source and receive it with the top and bottom microphones of the phone at different distances from the user; record the received signals as s_k(d, t), k ∈ {1, 2}, and let h_k(d, t), k ∈ {1, 2}, be the impulse responses of the top and bottom microphones, obtained by deconvolving s_k(d, t) = h_k(d, t) * s(t), where * denotes convolution, d is distance, and t is time; receive signals at distances spaced 1 cm apart, and store the h_k(d, t) at the different distances to form a distance-indexed impulse response database;
Step 2.2, when a user registers, obtain the distance de between the user and the phone by distance sensing and query the impulse response database for the registration-stage impulse response h_k(de, t); likewise, when the user verifies, obtain the distance dv and query the database for the verification-stage impulse response h_k(dv, t);
Step 2.3, convolve the sound signal x(t) produced by the user and received in the verification stage with the inverse of the verification-stage impulse response and then with the registration-stage impulse response to obtain the reconstructed signal x′(t), that is, x′(t) = x(t) * h_k^{-1}(dv, t) * h_k(de, t).
Step 3, sound field extraction.
Step 3.1, perform noise removal and silent-segment removal on the reconstructed signal;
Step 3.2, apply the short-time Fourier transform to each channel of the reconstructed signal, and compute the sound pressure ratio Sr(p1, p2, f) between the two channels of the reconstructed signal to represent the sound field at frequency f; the sound field at a certain frame is then expressed as S(p1,p2)=[Sr(p1,p2,f1),Sr(p1,p2,f2),...,Sr(p1,p2,fn)], wherein p1 and p2 are the positions of the two microphones, f is frequency, and n is the total number of frequency dimensions; for the sound field of the reconstructed signal, the n-dimensional sound-field feature vector SFF(p1, p2) is obtained by long-time average normalization, SFF(p1, p2) = (1/m) Σ_{i=1}^{m} S_i(p1, p2), where S_i(p1, p2) is the sound field in the i-th frame and m is the number of frames.
Step 4, model training and identity inference.
Step 4.1, user registration: the registering user utters several segments of speech, n-dimensional sound-field feature vectors SFF(p1, p2) are extracted from them, and a Gaussian mixture model is fitted to form the corresponding speaker model; the mixture density of registered user s for SFF(p1, p2) is defined as p(SFF | λ_s) = Σ_{j=1}^{M} w_j b_j(SFF), wherein λ_s denotes the speaker model obtained by modeling registered user s with a Gaussian mixture model and w_j is the weight of the j-th Gaussian component of registered user s; the j-th Gaussian component b_j of registered user s is computed as b_j(SFF) = (1 / ((2π)^{n/2} |Σ_j|^{1/2})) · exp(−(1/2)(SFF − μ_j)^T Σ_j^{−1} (SFF − μ_j)), wherein μ_j is the mean vector and Σ_j the covariance matrix of the n-dimensional sound-field feature vectors; the model parameters are estimated to convergence with the iterative expectation-maximization algorithm;
Step 4.2, in the verification stage, the n-dimensional sound-field feature vector of the speaker's reconstructed signal is scored with the trained Gaussian mixture model to obtain the similarity between the speaker and the registered user; if the similarity exceeds a preset threshold, the speaker's identity authentication passes, otherwise it fails and the speaker is rejected.
- 2. The voice analysis identity authentication method based on sound field reconstruction according to claim 1, wherein in step 1.1 the ranging signal consists of 5 mono chirp signals, each with a duration of 0.25 ms and a frequency of 12 kHz, with adjacent chirp signals spaced 10 ms apart.
- 3. The voice analysis identity authentication method based on sound field reconstruction according to claim 1, wherein in step 1.3 the cross-correlation peak following the direct-path peak is taken as the echo path; the points corresponding to the maximum and minimum values of the direct path in the image are selected, two further points are randomly selected within about 0.05 ms of each of these two points, and the mean of the times corresponding to the resulting 6 points is taken as the time of the direct path; the echo path is determined in the same way, with the mean time of its 6 points taken as the time of the echo path.
- 4. The voice analysis identity authentication method based on sound field reconstruction according to claim 1, wherein in step 2.1 the constant excitation s(t) is an exponential sine sweep, s(t) = sin[(ω1·T / ln(ω2/ω1)) · (e^{(t/T)·ln(ω2/ω1)} − 1)], to which aperiodic deconvolution is applied, wherein ω1 and ω2 are the start and end frequencies of s(t), T is the sweep period, and e is the base of the natural logarithm.
- 5. The voice analysis identity authentication method based on sound field reconstruction according to claim 1, wherein in step 3.1 noise reduction is performed on each channel separately, that is, the two microphones each sample the ambient noise for 0.5 s before the user speaks; silent segments are removed using an existing voice activity detection algorithm.
- 6. The voice analysis identity authentication method based on sound field reconstruction according to claim 1, wherein the user is required to face the microphone of the mobile phone while recording.
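Step 2.1's impulse-response measurement rests on deconvolving s_k(d, t) = h_k(d, t) * s(t). The sketch below shows that deconvolution in the frequency domain with numpy; the white-noise excitation (a stand-in for the patent's exponential sine sweep), the toy 6-tap impulse response, and the regularization constant are illustrative assumptions, not the patent's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def deconvolve(measured, excitation, n_taps, eps=1e-12):
    """Estimate h such that measured ≈ h * excitation (linear convolution),
    by regularised spectral division M(f)·conj(S(f)) / (|S(f)|^2 + eps)."""
    n = len(measured) + len(excitation)   # FFT size covering the linear convolution
    S = np.fft.rfft(excitation, n)
    M = np.fft.rfft(measured, n)
    H = M * np.conj(S) / (np.abs(S) ** 2 + eps)
    return np.fft.irfft(H, n)[:n_taps]

excitation = rng.standard_normal(4096)                # broadband probe signal
h_true = np.array([1.0, 0.0, 0.5, 0.0, -0.25, 0.1])   # toy 6-tap "room" response
measured = np.convolve(h_true, excitation)            # s_k(d, t) = h_k(d, t) * s(t)
h_est = deconvolve(measured, excitation, n_taps=len(h_true))
print(np.max(np.abs(h_est - h_true)))                 # recovery error (near zero)
```

In the patent's setting this is repeated at 1 cm distance steps, storing each recovered h_k(d, t) in the distance-indexed database.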
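The per-frame sound pressure ratio and its long-time average from Step 3.2 can be sketched as follows. The frame length, hop, window, and dB scaling are illustrative assumptions (the claims do not fix them), and the test signal is synthetic noise rather than reconstructed speech.

```python
import numpy as np

def sff(ch1, ch2, frame=512, hop=256, eps=1e-12):
    """n-dimensional sound-field feature vector SFF(p1, p2): the per-frame
    magnitude-spectrum ratio between the two channels, in dB, averaged over
    all frames (long-time average)."""
    win = np.hanning(frame)
    frames = []
    for start in range(0, len(ch1) - frame + 1, hop):
        X1 = np.fft.rfft(win * ch1[start:start + frame])
        X2 = np.fft.rfft(win * ch2[start:start + frame])
        # sound pressure ratio Sr(p1, p2, f) at every frequency bin f
        frames.append(20 * np.log10((np.abs(X1) + eps) / (np.abs(X2) + eps)))
    return np.mean(frames, axis=0)

rng = np.random.default_rng(1)
sig = rng.standard_normal(48_000)
feat = sff(sig, 0.5 * sig)   # channel at p2 is uniformly 6 dB quieter
print(feat.shape, round(float(np.median(feat)), 2))
```

With one channel scaled by 0.5, every bin of the feature sits at 20·log10(2) ≈ 6.02 dB, which makes the long-time average easy to sanity-check.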
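Step 4's enroll-and-verify flow is sketched below with the Gaussian mixture model simplified to a single diagonal-covariance Gaussian (a one-component GMM, fitted in closed form instead of by EM) for brevity. The feature dimension, the synthetic SFF vectors, and the threshold value are all illustrative assumptions.

```python
import numpy as np

def enroll(features):
    """features: (num_utterances, n) SFF vectors of the registering user.
    Returns a single diagonal Gaussian speaker model (mean, variance)."""
    mu = features.mean(axis=0)
    var = features.var(axis=0) + 1e-6   # diagonal covariance, regularised
    return mu, var

def log_likelihood(x, model):
    """Log-density of one SFF vector under the diagonal Gaussian model."""
    mu, var = model
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var))

def verify(x, model, threshold):
    """Accept iff the similarity (log-likelihood) exceeds the threshold."""
    return log_likelihood(x, model) >= threshold

rng = np.random.default_rng(2)
n = 32                                                     # feature dimension
genuine = rng.normal(loc=1.0, scale=0.1, size=(50, n))     # enrolled user's SFFs
impostor = rng.normal(loc=0.0, scale=0.1, size=n)          # another speaker
model = enroll(genuine)
thr = -100.0   # illustrative; in practice tuned on held-out data
print(verify(genuine[0], model, thr), verify(impostor, model, thr))
```

A full implementation would fit M > 1 components with expectation-maximization, as claim 1 specifies, but the accept/reject logic against a preset threshold is the same.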
Description
Speech analysis identity authentication method based on sound field reconstruction
Technical Field
The invention belongs to the technical field of artificial-intelligence voice assistant security, and in particular relates to a voice analysis identity authentication method based on sound field reconstruction.
Background
A voice authentication system authenticates a user by extracting the user's voice features and performing feature training and matching, and is a widely applied identity authentication method. It places low demands on hardware, is low in cost, and is user friendly; it has gradually become a mainstream mode of personal authentication and is widely applied in various intelligent devices. Voice authentication based on the sound field and field lines is one such system: the sound field features are independent of the spoken content, preserve the voice information of the speech signal, and introduce physiological identity information related to the user's vocal tract, mouth, head, torso, and so on. Such a system can therefore detect voice spoofing attacks from the sound field and field-line information, and is effectively applied to speaker authentication and sounding-body recognition. However, voice authentication systems based on sound fields and field lines are very sensitive to the distance between the user and the microphone: the sound field characteristics differ widely between a microphone placed near the mouth and one slightly farther away. In practical applications, the system therefore requires a fixed distance between the user and the microphone on every use, which greatly reduces its convenience and user friendliness in practice.
Disclosure of Invention
To remedy the defects in the background art, the invention provides a voice analysis identity authentication method based on sound field reconstruction, which improves the original sound-field and field-line voice authentication system, effectively solves the problem that the sound field is sensitive to distance, and improves the robustness and user friendliness of the system. The invention adopts the following technical scheme:
Step 1, distance sensing.
Step 1.1, send a ranging signal from the mobile phone loudspeaker, and receive with the microphone at the bottom of the phone the echo signal (namely, the signal returned after the ranging signal is reflected by the person);
Step 1.2, remove background noise from the echo signal with a Butterworth band-pass filter;
Step 1.3, find the direct path and the echo path in the time-domain image of the echo signal; with Δt the time difference between the echo path and the direct path, the distance between the user and the phone is obtained as d = c·Δt/2, where c is the speed of sound.
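The ranging of Step 1 can be sketched as a numpy simulation: emit a chirp, cross-correlate the microphone signal with it, take the lag difference between the direct-path and echo peaks as Δt, and convert to distance with d = c·Δt/2. The sample rate, chirp band, echo attenuation, and peak-picking guard interval below are illustrative assumptions, and the six-point averaging of claim 3 is omitted for brevity.

```python
import numpy as np

FS = 48_000      # sample rate (Hz), illustrative
C_SOUND = 343.0  # speed of sound (m/s)

def chirp(duration, f0, f1, fs=FS):
    """Linear chirp sweeping from f0 to f1 Hz over `duration` seconds."""
    t = np.arange(int(duration * fs)) / fs
    return np.sin(2 * np.pi * (f0 * t + (f1 - f0) * t ** 2 / (2 * duration)))

def estimate_distance(mic_signal, probe, fs=FS, c=C_SOUND, guard=20):
    """Cross-correlate the microphone signal with the probe chirp; the global
    maximum is the direct path, the strongest peak after a small guard
    interval is the echo, and d = c * dt / 2."""
    corr = np.abs(np.correlate(mic_signal, probe, mode="full"))
    direct = int(np.argmax(corr))
    # guard skips the compressed main lobe of the direct-path peak
    # (only a few samples wide for a broadband chirp)
    echo = direct + guard + int(np.argmax(corr[direct + guard:]))
    dt = (echo - direct) / fs
    return c * dt / 2

# Simulate: direct path at lag 0 plus an attenuated echo from 0.5 m away.
probe = chirp(0.005, 8_000, 16_000)
true_dist = 0.5
delay = int(round(2 * true_dist / C_SOUND * FS))
mic = np.zeros(delay + len(probe) + 1_000)
mic[:len(probe)] += probe                      # direct path
mic[delay:delay + len(probe)] += 0.4 * probe   # echo reflected by the user
print(round(estimate_distance(mic, probe), 3))
```

In the patented method the same Δt is obtained from the time-domain peaks of the band-pass-filtered echo; cross-correlation is the standard way to localize those peaks robustly.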
Step 2, sound field reconstruction.
Step 2.1, generate a constant excitation s(t) with a stable sound source such as a high-fidelity loudspeaker, and record the signals received by the top and bottom microphones of the smartphone at different distances from the user as s_k(d, t), k ∈ {1, 2}; let h_k(d, t), k ∈ {1, 2}, be the impulse responses of the top and bottom microphones, obtained by deconvolving s_k(d, t) = h_k(d, t) * s(t), where * denotes convolution, d is distance, and t is time; receive signals at distances spaced 1 cm apart, and store the h_k(d, t) at the different distances to form a distance-indexed impulse response database;
Step 2.2, when a user registers a sound field, obtain the distance de between the sound source (user) and the phone by distance sensing and query the impulse response database for the registration-stage impulse response h_k(de, t); likewise, when the user verifies, obtain the distance dv and query the database for the verification-stage impulse response h_k(dv, t);
Step 2.3, convolve the user voice signal x(t) received in the verification stage with the inverse of the verification-stage impulse response and then with the registration-stage impulse response to obtain the reconstructed signal x′(t), that is, x′(t) = x(t) * h_k^{-1}(dv, t) * h_k(de, t).
Step 3, sound field extraction.
Step 3.1, perform noise removal and silent-segment removal on the reconstructed signal;
Step 3.2, apply the short-time Fourier transform to each of the two channel signals, and compute the sound pressure ratio between the two channels of the reconstructed signal to represent the sound field at frequency f; the sound field at a certain frame is then expressed as S(p1,p2)=[Sr(p1,p2,f1),Sr(p1,p2,f2),...,Sr(p1,p2,fn)], wherein p1 and p