CN-121999784-A - Fusion sound source positioning and voiceprint recognition method, chip and electronic equipment

CN121999784A

Abstract

The invention relates to acoustic signal processing and biometric recognition technology, in particular to a fusion sound source positioning and voiceprint recognition method, a chip, and an electronic device. The method comprises the following steps. Environmental audio signals are synchronously collected by an audio collection module array and processed to obtain audio data. The audio data are analyzed to obtain deep fusion voiceprint features. Whether a registered user feature library contains the deep fusion voiceprint features is judged; if so, the speaker's identity is output, and if not, the speaker is marked and assigned an identification number. The sound source is localized on the basis of the audio data to obtain its spatial coordinates. An identity-position association model is established, binding the identity information output by voiceprint recognition to the spatial coordinates obtained by sound source positioning, and generating 'identity-position' association data. The invention integrates the two technologies, achieves accurate 'identity-position' linkage, offers high real-time performance and strong anti-interference capability, and is suitable for complex scenes.

Inventors

  • GU YUCONG
  • ZHUANG YETAO
  • PENG BO

Assignees

  • 杭州智芯科微电子科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2026-01-28

Claims (10)

  1. The fusion sound source positioning and voiceprint recognition method, characterized by comprising the following steps: S1, synchronously acquiring environmental audio signals through an audio acquisition module array and processing them to obtain audio data; S2, analyzing the audio data, extracting acoustic features, and performing feature fusion to obtain deep fusion voiceprint features; S3, comparing the deep fusion voiceprint features against a registered user feature library and judging whether the library contains them; if so, outputting the speaker's identity, and if not, marking the speaker and assigning an identification number; S4, localizing the sound source on the basis of the audio data to obtain its spatial coordinates; and S5, binding the identity information output by voiceprint recognition to the spatial coordinates obtained by sound source positioning to generate 'identity-position' association data.
  2. The fusion sound source positioning and voiceprint recognition method according to claim 1, characterized in that the processing comprises: performing timing calibration on each audio acquisition module, calculating the deviation value caused by clock drift and calibrating each module's time with that deviation value; uniformly issuing a synchronous sampling trigger command so that all audio acquisition modules start acquisition at the same instant, and verifying by calculation that the time difference corresponding to the cross-correlation peak of any two channel signals is less than or equal to 1 μs, otherwise re-executing the clock calibration; removing background noise and power supply interference by filtering out low-frequency noise, high-frequency noise, and mains interference in the environment while retaining the effective frequency band of the speech signal; compensating the filtered audio signal for high-frequency attenuation; converting the continuous signal into short-time stationary frames through a framing operation, with windowing to suppress FFT spectral leakage; and identifying the start and end points of the speech signal, eliminating silent frames containing no speech, and outputting the effective speech frame sequence as the audio data.
  3. The fusion sound source positioning and voiceprint recognition method according to claim 2, wherein the method for calculating the deviation value comprises: analyzing the standard time and a 1PPS (one pulse per second) synchronization signal, the time precision of the 1PPS signal being less than or equal to 1 μs; calibrating the local clock of each audio acquisition module with the 1PPS signal as a reference; and calculating the deviation value between the local clock and the standard time as Δt = t' − t0, where t' is the local sampling trigger time and t0 is the standard time.
  4. The fusion sound source positioning and voiceprint recognition method according to claim 1, wherein the audio signal is compensated by pre-emphasis: P_n^i = P'_n^i − α·P'_(n−1)^i, where α is the pre-emphasis coefficient, i is the number of an audio acquisition module, n is the index of a discrete sampling point, f_s is the sampling frequency of the audio signal, P_n^i is the pre-emphasized signal value of the n-th discrete sampling point of the i-th audio acquisition module, P'_n^i is the pre-filtered input signal value of the n-th discrete sampling point of the i-th audio acquisition module, and P'_(n−1)^i is the pre-filtered input signal value of the (n−1)-th discrete sampling point of the i-th audio acquisition module.
  5. The fusion sound source positioning and voiceprint recognition method according to claim 1, characterized in that step S2 comprises: S21, performing acoustic feature preprocessing on the audio data; and S22, performing deep feature extraction on the preprocessed audio data to obtain the deep fusion voiceprint features.
  6. The fusion sound source positioning and voiceprint recognition method according to claim 1, wherein the registered user feature library is constructed from a training data set, a model structure, and loss function training.
  7. The fusion sound source positioning and voiceprint recognition method according to claim 1, characterized by comprising: inputting the deep fusion voiceprint feature sequence of the speech to be recognized into the registered user feature library; outputting a final judgment probability distribution vector P = [p_1, p_2, …, p_S]; taking the user label corresponding to the maximum judgment probability as the recognition result, or judging the identity as unknown if the maximum judgment probability is smaller than a preset threshold; and then calculating the coefficient of variation (CV) of the fundamental frequency, judging the speech to be live if CV ≥ 0.05, and otherwise judging it to be spoofed speech and directly rejecting the identity matching result.
  8. The fusion sound source positioning and voiceprint recognition method according to claim 1, wherein the process of calculating the spatial coordinates of the sound source comprises: taking the audio acquisition module numbered 0 as a reference and constructing a spatial coordinate system with it as the origin; obtaining the spatial coordinates of each audio acquisition module; calculating the cross-correlation function K_0j between a preset number of audio acquisition modules and the reference module, the peak position of K_0j corresponding to maximum signal similarity and yielding the time difference of arrival t_0j of the sound wave at the target module and the reference module; obtaining an initial position by fitting the t_0j of all target modules with a least squares method; dividing the beam-scanned space into a grid and calculating a steerable response power value for each candidate position (θ', r'); and searching for the maximum steerable response power value within the search area to serve as the spatial coordinates of the sound source.
  9. A chip for performing the fusion sound source positioning and voiceprint recognition method of any one of claims 1-8.
  10. An electronic device, characterized in that the electronic device comprises the chip of claim 9.
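The identity decision and liveness check of claim 7 can be sketched as follows. This is a minimal illustration assuming NumPy; the function names and the example threshold of 0.5 are assumptions not given in the claims, while the CV ≥ 0.05 liveness rule comes from claim 7.

```python
import numpy as np

def identify_speaker(probs, labels, threshold=0.5):
    """Pick the registered user with the highest judgment probability.

    Returns the matching label, or None (unknown identity) when the
    maximum probability falls below the preset threshold.
    The threshold value is illustrative, not specified in the claims.
    """
    probs = np.asarray(probs, dtype=float)
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return None
    return labels[best]

def is_live_voice(f0_track):
    """Liveness check from claim 7: the coefficient of variation (CV)
    of the fundamental frequency must be >= 0.05 for live speech."""
    f0 = np.asarray(f0_track, dtype=float)
    cv = f0.std() / f0.mean()
    return bool(cv >= 0.05)
```

For example, probabilities [0.1, 0.7, 0.2] over three enrolled users select the second user, while a flat distribution whose maximum is below the threshold yields an unknown identity.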

Description

Fusion sound source positioning and voiceprint recognition method, chip and electronic equipment. Technical Field. The invention relates to acoustic signal processing and biometric recognition technology, in particular to a fusion sound source positioning and voiceprint recognition method, a chip, and an electronic device. Background. With the development of the Internet of Things and artificial intelligence, demand for perception and recognition based on acoustic signals is growing, and voiceprint recognition is widely applied across many fields. Voiceprint recognition relies on the unique physiological and behavioral characteristics of an individual's speech (such as vocal cord structure and pronunciation habits) to realize 1:1 identity verification or 1:N identity recognition; it is contactless and highly convenient, and plays an important role in scenarios such as financial authentication and access control attendance. In existing schemes, the voiceprint recognition module focuses only on identity verification; under multi-source and complex-environment conditions, problems such as identity-position mismatch and misjudgment of unknown sound sources readily occur. For example, in an intelligent security scenario, if interference is too strong, a person's identity cannot be confirmed and the person's position cannot be monitored, so monitoring and judgment efficiency is low and it is difficult to form a complete security early-warning closed loop.
Disclosure of Invention. In order to solve the above technical problems, the invention provides a fusion sound source positioning and voiceprint recognition method which, on the basis of multichannel audio signals collected by an audio acquisition module array, executes speaker identification and sound source position monitoring cooperatively, and comprises the following steps. S1, synchronously acquiring environmental audio signals through the audio acquisition module array and processing them to obtain audio data. S2, analyzing the audio data, extracting acoustic features, and performing feature fusion to obtain deep fusion voiceprint features. S3, comparing the deep fusion voiceprint features against a registered user feature library and judging whether the library contains them; if so, outputting the speaker's identity, and if not, marking the speaker and assigning an identification number. S4, localizing the sound source on the basis of the audio data to obtain its spatial coordinates. S5, establishing an identity-position association model, binding the identity information output by voiceprint recognition to the spatial coordinates obtained by sound source positioning, and generating 'identity-position' association data.
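The binding performed in step S5 can be illustrated with a minimal data record; the record fields, function name, and the (x, y, z) coordinate convention are illustrative assumptions, not specified by the invention.

```python
from dataclasses import dataclass

@dataclass
class IdentityPosition:
    """One 'identity-position' association record (step S5)."""
    identity: str       # registered user ID, or an assigned number for unknown speakers
    coordinates: tuple  # (x, y, z) spatial coordinates of the sound source

def bind(identity, coords):
    """Bind a voiceprint identity to the localized source position,
    producing one 'identity-position' record (a hypothetical sketch)."""
    return IdentityPosition(identity=identity, coordinates=tuple(coords))
```

A record is created per detected speaker, pairing the output of step S3 with that of step S4.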
The processing comprises the following steps: performing timing calibration on each audio acquisition module, calculating the deviation value caused by clock drift and calibrating each module's time with that deviation value; uniformly issuing a synchronous sampling trigger command so that all audio acquisition modules start acquisition at the same instant; calculating the time difference corresponding to the cross-correlation peak of any two channel signals and requiring it to be less than or equal to 1 μs, otherwise performing clock calibration again; removing background noise and power supply interference by filtering out low-frequency noise, high-frequency noise, and mains interference in the environment while retaining the effective frequency band of the speech signal; compensating the filtered audio signal for high-frequency attenuation; converting the continuous signal into short-time stationary frames through a framing operation, with windowing to suppress FFT spectral leakage; and identifying the start and end points of the speech signal, eliminating silent frames containing no speech, and outputting the effective speech frame sequence as the audio data. Preferably, the method for calculating the deviation value comprises: analyzing the standard time and a 1PPS (one pulse per second) synchronization signal, the time precision of the 1PPS signal being less than or equal to 1 μs; calibrating the local clock of each audio acquisition module with the 1PPS signal as a reference; and calculating the deviation value between the local clock and the standard time as Δt = t' − t0, where t' is the local sampling trigger time and t0 is the standard time.
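The synchronization check and clock-deviation calculation above can be sketched as follows. This is a simplified illustration assuming NumPy; the 1 μs tolerance comes from the description, while the function names and the use of the raw cross-correlation peak (rather than a generalized cross-correlation) are assumptions.

```python
import numpy as np

def sync_error_us(sig_a, sig_b, fs):
    """Estimate the time difference (in microseconds) between two channel
    signals from the position of their cross-correlation peak. The
    description requires this difference to be <= 1 us after calibration."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)  # lag in samples
    return abs(lag) / fs * 1e6

def clock_deviation(t_local, t_standard):
    """Deviation between a module's local sampling trigger time t' and
    the 1PPS-referenced standard time t0:  dt = t' - t0."""
    return t_local - t_standard
```

If `sync_error_us` exceeds 1.0 for any channel pair, the clock calibration would be re-executed before acquisition proceeds.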
Preferably, the audio signal compensation (pre-emphasis) is performed as P_n^i = P'_n^i − α·P'_(n−1)^i. In the embodiment, α = 0.97 is taken, i is the number of the audio acquisition module, n is the index of the discrete sampling point, f_s is the sampling frequency of the audio signal, P_n^i is the pre-emphasized signal value of the n-th discrete sampling point of the i-th audio acquisition module, P'_n^i is the pre-filtered input signal value of the n-th discrete sampling point of the i-th audio acquisition module, and P'_(n−1)^i is the pre-filtered input signal value of the (n−1)-th discrete sampling point of the i-th audio acquisition module.
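The compensation step, with α = 0.97 as in the embodiment, is the standard first-order pre-emphasis filter and can be sketched as follows (a minimal NumPy illustration; the function name and the handling of the first sample are assumptions):

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1],
    with alpha = 0.97 as in the embodiment. Boosts high frequencies
    to compensate for the spectral tilt of speech."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]               # first sample passed through (assumption)
    y[1:] = x[1:] - alpha * x[:-1]
    return y
```

On a constant input the filter outputs roughly (1 − α) of the signal level after the first sample, reflecting its high-pass character.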