CN-121995309-A - Sound source positioning method based on NV-SRP neural network model


Abstract

The invention discloses a sound source localization method based on an NV-SRP neural network model, relating to the technical field of sound source localization. The method mainly comprises the following steps: constructing a training and testing data set according to an acoustic vector sensor signal receiving model and the SRP-PHAT algorithm; training a sound source localization network to obtain a trained sound source localization network; and predicting the target sound source location with the trained network to obtain a prediction result. Implementing the method improves the accuracy and stability of sound source localization.

Inventors

HAO GUOCHENG
LIU CONG
GUO JUAN
Wang Zhekang
LI XIANGBO

Assignees

China University of Geosciences (Wuhan)

Dates

Publication Date: 20260508
Application Date: 20251203

Claims (10)

1. A sound source localization method based on an NV-SRP neural network model, characterized by comprising the following steps: S1, constructing a training and testing data set according to an acoustic vector sensor signal receiving model and the SRP-PHAT algorithm; S2, constructing a sound source localization network, and training the sound source localization network with the training and testing data set to obtain a trained sound source localization network; and S3, predicting the target sound source location with the trained sound source localization network to obtain a prediction result.
2. The sound source localization method based on the NV-SRP neural network model of claim 1, wherein the acoustic vector sensor signal receiving model is defined by formulas (not reproduced in this text) in which: t is a time index; the output of the acoustic vector sensor is a 4-dimensional vector comprising a 1-dimensional sound pressure signal and a 3-dimensional particle velocity signal, the acoustic vector sensor comprising an omnidirectional sound pressure sensor and three orthogonal dipole sensors; I is the number of sound sources in the acoustic environment; the i-th sound source has a source signal and a 4-dimensional room impulse response; additive noise is included; the received signal and the i-th source signal are represented in the short-time Fourier transform (STFT) domain, where k is a time frame index and n is a frequency bin index; an acoustic transfer function (ATF) is modeled between the i-th source and each sensor of the acoustic vector sensor (AVS), together with a term representing the modeling error and the STFT-domain representation of the noise signal; the direct sound path response of the i-th sound source is taken under the far-field assumption; the discrete angular frequency is defined with K as the FFT length; a sampling delay describes the direct-path impulse response relative to the first sample of frame 0; and the direction vector of the i-th sound source is determined by its azimuth angle and pitch angle.
3. The method of claim 1, wherein the sound source localization network comprises a pair network and a global decoder, related by a formula (not reproduced in this text) in which: the output of the sound source localization network is produced by the global decoder; m and l denote the m-th and l-th array elements of the microphone array; and, for each microphone pair (m, l), the inputs are the generalized cross-correlation features of the pair and the sensor position coordinates corresponding to the pair.
4. The sound source localization method based on the NV-SRP neural network model of claim 3, characterized in that the pair network is configured to: process the generalized cross-correlation features of each microphone pair with three two-dimensional convolution blocks, wherein the convolution kernel of each block has a fixed size of 1 in the time dimension, no pooling is performed along the time dimension, and the convolution stride is 1; and flatten the processed features, extract temporal features with a bidirectional gated recurrent unit, fuse the sensor position coordinates corresponding to the microphone pair (m, l) with the temporal features, and convert the fused result into spatial likelihood codes of uniform dimension with a multi-layer perceptron.
5. The sound source localization method based on the NV-SRP neural network model of claim 3, characterized in that the global decoder is configured to: compute a weighted sum of the spatial likelihood codes of all microphone pairs to obtain a weighted summation result; obtain fused features from the original multi-channel signals and the weighted summation result using an acoustic vector fusion module; and obtain the activity probability and the three-dimensional rectangular coordinates of each sound source from the fused features using an activity detection branch and a position regression branch.
6. The sound source localization method based on the NV-SRP neural network model of claim 5, wherein the acoustic vector fusion module is configured to: separate the three-dimensional velocity components from the original multi-channel signal; compute the spatial average of the three-dimensional velocity components over the sensor dimension, together with the velocity norm, to obtain 4-dimensional statistical features; extract a deep velocity representation from the 4-dimensional statistical features using cascaded 1D convolution layers, align it to the time dimension T by linear interpolation, and project it into the feature space to obtain velocity features; using a dual-stream gating mechanism, apply layer normalization separately to the audio feature tensor and the velocity features and concatenate them to obtain concatenated features; compute a gating coefficient from the concatenated features; and obtain the final fused features according to the gating coefficient.
7. The sound source localization method based on the NV-SRP neural network model of claim 1, wherein the loss function of the sound source localization network is given by a formula (not reproduced in this text) whose terms are: the loss; a coefficient for the Euclidean localization error weighted by the sound source activity state; the target sound source activity state; the target sound source direction-of-arrival matrix; the network output matrix; a coefficient for the binary cross-entropy activity detection loss; the binary cross-entropy; and the sound source activity state.
8. The method for sound source localization based on the NV-SRP neural network model as recited in claim 1, further comprising evaluating the prediction using spatial deviation.
9. The sound source localization method based on the NV-SRP neural network model of claim 8, wherein the spatial deviation is calculated by a formula (not reproduced in this text) from the true azimuth and elevation and the predicted azimuth and elevation.
10. A computer program product comprising a computer program, characterized in that the computer program when executed by a processor implements the steps of the sound source localization method based on NV-SRP neural network model as claimed in any of claims 1-9.
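For illustration only, the loss of claim 7 and the spatial deviation of claim 9 can be sketched in Python as follows. The formulas of those claims are not preserved in this text, so the function names, the normalization of the localization term, the default coefficients `alpha` and `beta`, and the great-circle form of the spatial deviation are all assumptions, not the patented definitions.

```python
import numpy as np

def activity_weighted_loss(y_true, y_pred, act_true, act_prob, alpha=1.0, beta=1.0):
    """Assumed form: alpha * activity-masked Euclidean error + beta * binary cross-entropy.

    y_true, y_pred: (num_sources, 3) target and predicted source coordinates.
    act_true: (num_sources,) target activity in {0, 1}.
    act_prob: (num_sources,) predicted activity probability in (0, 1).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    act_true = np.asarray(act_true, dtype=float)
    # Clip probabilities so that log() stays finite.
    act_prob = np.clip(np.asarray(act_prob, dtype=float), 1e-7, 1.0 - 1e-7)

    # Per-source Euclidean error, counted only for active sources.
    dist = np.linalg.norm(y_true - y_pred, axis=-1)
    loc = np.sum(act_true * dist) / max(np.sum(act_true), 1.0)

    # Binary cross-entropy between target activity and predicted probability.
    bce = -np.mean(act_true * np.log(act_prob)
                   + (1.0 - act_true) * np.log(1.0 - act_prob))
    return alpha * loc + beta * bce

def spatial_deviation_deg(az_true, el_true, az_pred, el_pred):
    """Assumed form: great-circle angle in degrees between two directions.

    All four arguments are scalar angles in degrees.
    """
    az_t, el_t, az_p, el_p = np.radians([az_true, el_true, az_pred, el_pred])
    cos_angle = (np.sin(el_t) * np.sin(el_p)
                 + np.cos(el_t) * np.cos(el_p) * np.cos(az_t - az_p))
    # Clip guards against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
```

In this sketch an inactive source contributes nothing to the localization term, so its coordinates cannot penalize the network; only the activity branch is trained on it.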

Description

Sound source positioning method based on NV-SRP neural network model

Technical Field

The invention relates to the technical field of sound source localization, in particular to a sound source localization method based on an NV-SRP neural network model.

Background

Traditional sound source localization methods, such as Generalized Cross Correlation (GCC) algorithms based on time difference of arrival (TDOA), beamforming methods (SRP-PHAT), and subspace-based high-resolution algorithms (MUSIC, ESPRIT), are theoretically well established and perform well in ideal acoustic environments. However, their performance degrades sharply and their robustness is insufficient in harsh acoustic environments such as low signal-to-noise ratio (SNR) and strong reverberation.

Deep learning techniques have been introduced to improve the robustness of sound source localization, and progress has been made by combining neural networks with conventional frameworks (e.g., Neural-SRP). Most of these approaches, however, still rely on conventional microphone arrays that can only collect scalar sound pressure information. This physical limitation fundamentally restricts the dimensions of sound field information that can be obtained. When the sound pressure signal is contaminated by severe reverberation and noise, the room for improving algorithm performance remains limited.

How to improve localization accuracy and stability in environments with strong reverberation and high background noise is therefore a technical problem to be solved.

Disclosure of Invention

The invention aims to provide a sound source localization method based on an NV-SRP neural network model, which can improve the accuracy and stability of sound source localization.
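The GCC-based TDOA estimation referred to in the Background can be sketched as follows, using the PHAT weighting that SRP-PHAT also relies on. This is a generic textbook-style estimator written with NumPy, not code from the invention; the function name `gcc_phat` and its parameters are hypothetical.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Estimate the delay of y relative to x with GCC-PHAT.

    Returns (delay in seconds, cross-correlation over the searched lags).
    A positive delay means y lags behind x.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x) + len(y)  # zero-pad so circular correlation acts like linear

    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)

    # Cross-power spectrum, whitened so that only phase information remains.
    cross = Y * np.conj(X)
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if max_tau is given.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)

    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs, cc
```

For a microphone pair with spacing d and sound speed c, `max_tau` would typically be set to d / c so that lags no real source could produce are ignored.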
The invention provides a sound source localization method based on an NV-SRP neural network model, comprising the following steps:

S1, constructing a training and testing data set according to an acoustic vector sensor signal receiving model and the SRP-PHAT algorithm;
S2, constructing a sound source localization network, and training the sound source localization network with the training and testing data set to obtain a trained sound source localization network;
S3, predicting the target sound source location with the trained sound source localization network to obtain a prediction result.

The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the above-described sound source localization method based on the NV-SRP neural network model.

The sound source localization method based on the NV-SRP neural network model has the following beneficial effects. To address the insufficient localization accuracy and robustness of the prior art in complex acoustic environments, a training and testing data set is constructed according to an acoustic vector sensor signal receiving model and the SRP-PHAT algorithm, a sound source localization network is trained, and the trained network predicts the target sound source location to obtain a prediction result. The method comprises a flexible multi-stage fusion framework: the physical geometric information of the sensor array is explicitly integrated into the feature learning process, and the raw acoustic velocity vector is adaptively combined with deep audio features. By jointly using three kinds of information (sound pressure, particle velocity and array geometry), the localization accuracy and robustness of the model are significantly improved in challenging scenarios such as strong reverberation and high noise, providing a more effective and complete solution for acoustic perception based on acoustic vector sensors. The invention can be applied to fields that require accurate spatial localization of sound sources, such as robot audition, speech recognition, intelligent conference systems, acoustic monitoring, immersive communication and UAV-mounted sensing.

Drawings

The invention will be further described with reference to the accompanying drawings and examples, in which:

FIG. 1 is a flow chart of a sound source localization method based on an NV-SRP neural network model provided by the invention;
FIG. 2 is a flow chart of an NV-SRP algorithm provided by the invention;
FIG. 3 is a diagram of an NV-SRP network architecture provided by the invention;
FIG. 4 is a flow chart of a vector fusion module structure provided by the invention;
FIG. 5 is a diagram of an NV-SRP module architecture provided by the invention;
FIG. 6 is a schematic diagram showing the effect of signal-to-noise ratio on the localization errors of different SRP variants at a reverberation time RT60 = 0.4 s provided by the invention;
FIG. 7 is a schematic diagram showing the effect of reverberation time and channel configuration on positioning e