
CN-117612557-B - Multi-sound source positioning and detecting method based on global-local characteristic recalibration

CN117612557B

Abstract

The invention discloses a multi-sound source localization and detection method based on global-local feature recalibration. The method comprises: calculating the short-time Fourier transform of a multi-channel spatial audio signal in first-order Ambisonics (FOA) format and obtaining the log linear spectrum and the normalized sound intensity vector as input features; performing data augmentation on the features of the training set; concatenating the augmented spectrum and sound intensity vector as the input of a neural network model; training the model and storing the optimal network parameters; preprocessing a sample to be tested and feeding it into the trained model, whose output gives the predicted sound event categories and position information; and drawing a sound event detection diagram and direction and azimuth angle trajectory graphs from the prediction result, comparing them with the visualization of the real labels of the test sample, and analyzing the performance of the model. The invention achieves higher sound source localization and detection performance, and the model generalizes better on both real and synthetic datasets.
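The input features described above (four log linear spectra plus three normalized sound intensity channels, seven in total) can be sketched as follows. This is a minimal illustration assuming standard FOA intensity-vector definitions from the sound event localization literature; the function name `foa_features` and the omission of the constant 1/(ρ0·c), which cancels under normalization, are choices made here, not taken from the patent.

```python
import numpy as np

def foa_features(stft):
    """Sketch of the patent's input features from an FOA STFT.

    stft: complex array of shape (4, T, F) holding the (W, X, Y, Z)
    channels. Returns the log linear spectra (4, T, F) and the
    normalized sound intensity vector (3, T, F); stacking them gives
    the seven-channel network input.
    """
    eps = 1e-12
    log_spec = np.log(np.abs(stft) ** 2 + eps)        # log magnitude spectrum
    w, xyz = stft[0], stft[1:]                        # W and (X, Y, Z)
    intensity = np.real(np.conj(w)[None] * xyz)       # I ∝ Re{W* · [X, Y, Z]}
    norm = np.linalg.norm(intensity, axis=0, keepdims=True) + eps
    return log_spec, intensity / norm                 # unit-norm per TF bin
```

Normalizing per time-frequency bin keeps only the direction of the intensity vector, whose reverse is interpreted as the direction of arrival.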

Inventors

  • Hu Ying
  • Ma Mengqin
  • Huang Hao
  • He Liang

Assignees

  • Xinjiang University (新疆大学)

Dates

Publication Date
2026-05-12
Application Date
2023-11-24

Claims (7)

  1. A multi-sound source localization and detection method based on global-local feature recalibration, comprising: calculating the short-time Fourier transform of a multi-channel spatial audio signal in first-order Ambisonics (FOA) format, obtaining the log linear spectrum and the normalized sound intensity vector as input features, and performing data augmentation on the features of the training set; concatenating the augmented spectrum and sound intensity vector as the input of a neural network model, training the model, and obtaining and storing the optimal network parameters; preprocessing a sample to be tested, feeding it into the trained model, outputting a prediction result, drawing a sound event detection diagram and direction and azimuth angle trajectory graphs from the prediction result, comparing them with the visualization of the real labels of the test sample, and analyzing the performance of the model; wherein the training process of the neural network model comprises the following steps: taking the augmented features, obtained by concatenating the log linear spectra and the normalized sound intensity vectors into seven channels, as the input of the model; after preliminary processing of the input features by the Encoder module, extracting advanced features containing global and local information in parallel with a global-local feature extractor; based on the advanced features, emphasizing key components of the feature map along multiple dimensions with a feature recalibration module to obtain recalibrated fine features; finally outputting the predicted sound event categories and the position information of the sound sources along the sound event detection and sound source localization branches respectively; and adopting a joint optimization strategy in which the losses of the sound event detection branch and the sound source localization branch, a binary cross-entropy loss and a mean squared error respectively, are weighted and linearly combined into the final loss function used to update the network parameters; wherein the process of acquiring the advanced features after the preliminary processing by the Encoder module comprises: feeding the augmented input features into the network, where an encoder structure first performs shallow feature extraction, the encoder structure consisting of two convolutional layers followed by an average pooling operation with a kernel size of 2x2, each convolutional layer comprising a convolutional neural network with a 3x3 kernel, a batch normalization layer and a Gaussian error linear unit (GELU) activation function, without residual connections between the convolutional layers of the encoder; and wherein the global-local feature extractor comprises a trunk branch, consisting of an omni-dimensional dynamic convolution and a multi-scale feature extraction module, and a local feature extraction unit; global features are extracted by the trunk branch, local features are extracted by the local feature extraction unit, and finally the global and local features are selectively fused by an attention feature fusion unit.
  2. The multi-sound source localization and detection method based on global-local feature recalibration according to claim 1, wherein each first-order Ambisonics signal of the multi-channel spatial audio signal comprises four channels (W, X, Y, Z), where W is the 0th-order spherical harmonic capturing omnidirectional information and X, Y, Z are the 1st-order spherical harmonics conveying spatial information along the Cartesian axes of the sound field; the log linear spectrum of the spatial audio signal is obtained from the complex spectrum X(t, f) as LS(t, f) = log|X(t, f)|; the normalized sound intensity vector is I_norm(t, f) = I(t, f) / ||I(t, f)||_2, wherein the sound intensity vector conveys valuable information along the direction of sound propagation, its reverse direction is interpreted as the direction of arrival, and it is given by I(t, f) = (1/(ρ0·c)) ℜ{X_W*(t, f) [X_X(t, f), X_Y(t, f), X_Z(t, f)]^T}, where ρ0 and c denote the air density and the speed of sound respectively, ℜ{·} denotes the real part of a complex number, and * denotes complex conjugation.
  3. The multi-sound source localization and detection method based on global-local feature recalibration according to claim 1, wherein the data augmentation of the features of the training set comprises applying audio channel swapping, random cropping and frequency shifting in sequence to each training sample without increasing the amount of data; the audio channel swapping is a spatial augmentation method designed for datasets collected with a spherical microphone: the directional response of a first-order Ambisonics signal is expressed by a cosine function, the direction of arrival represents the spatial position of a sound event, angular transformations of the direction of arrival are expressed on the basis of this cosine function, and a rotation matrix yields 16 direction-of-arrival combinations for each direction of arrival, comprising the original direction and 15 new ones, one of which is randomly selected as the new direction of arrival of each sample; the random cropping method randomly selects a rectangular region on the spectrogram, sets the values inside the region to random values within the value range of each channel of the linear spectrum, sets the values inside the region to 0 for each channel of the sound intensity vector, and shares one mask across all channels, similar to a masking operation; the frequency shifting randomly shifts a particular frequency band upward or downward along the frequency dimension of all channels of the input features.
  4. The multi-sound source localization and detection method based on global-local feature recalibration according to claim 1, wherein the process by which the feature recalibration module emphasizes key components of the feature map along multiple dimensions to obtain recalibrated fine features comprises: computing attention along the channel, time and frequency dimensions to emphasize the key channels, time frames and frequency bands of the features related to the sound sources, and feeding the resulting fine features into the sound event detection branch and the direction-of-arrival estimation branch respectively.
  5. The multi-sound source localization and detection method based on global-local feature recalibration according to claim 1, wherein the global-local feature extractor comprises an omni-dimensional dynamic convolution, a multi-scale feature extraction module, a local feature extraction unit and an attention feature fusion unit; a cross-scale shuffling unit in the multi-scale feature extraction module increases the exchange of information among the multi-scale features; asymmetric convolutions in the local feature extraction unit extract fine-grained features; and, in the multi-scale feature extraction module, ordinary convolutions are replaced by group convolutions and a residual structure is added to alleviate overfitting.
  6. The multi-sound source localization and detection method based on global-local feature recalibration according to claim 5, wherein the omni-dimensional dynamic convolution introduces a multi-dimensional attention mechanism and, with a parallel strategy, learns different attentions for a convolution kernel along the spatial dimension, the input channel dimension, the output channel dimension and the kernel dimension of the kernel space, multiplying them with the convolution kernel W_i in the order position, channel, filter, kernel; the omni-dimensional dynamic convolution is expressed as y = (α_w1 ⊙ α_f1 ⊙ α_c1 ⊙ α_s1 ⊙ W_1 + ... + α_wn ⊙ α_fn ⊙ α_cn ⊙ α_sn ⊙ W_n) * x, wherein α_wi is a scalar attention for kernel W_i, and α_si, α_ci and α_fi are the three attention weights computed for W_i along the spatial dimension of the kernel space, the input channel dimension and the output channel dimension respectively, all computed by a squeeze-and-excitation module.
  7. The multi-sound source localization and detection method based on global-local feature recalibration according to claim 5, wherein the process by which the cross-scale shuffling unit increases the exchange of information between multi-scale features comprises: a channel shuffling operation, modeled as a "reshape-transpose-reshape" process, facilitates information flow between the multi-scale feature maps; given an input feature with n channels, the channel dimension is first reshaped to (g_s, n/g_s), then transposed to (n/g_s, g_s), and finally reshaped back to its original form, where n and g_s denote the number of channels and the group size respectively; the shuffled features are then aggregated with the original features by an aggregation block comprising two CNNs with a 1x1 kernel, the first reducing the number of channels and the second further fusing the feature maps of the different channels while preserving the information at the original channel positions.
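The joint optimization strategy of claim 1 linearly combines a binary cross-entropy loss for the detection branch with a mean squared error for the localization branch. A minimal sketch, with hypothetical weights `w_sed`/`w_doa` (the patent does not state their values):

```python
import numpy as np

def joint_seld_loss(sed_pred, sed_true, doa_pred, doa_true, w_sed=1.0, w_doa=1.0):
    """Weighted sum of BCE (event detection) and MSE (localization).

    w_sed and w_doa are illustrative placeholders for the patent's
    branch weights.
    """
    eps = 1e-7
    p = np.clip(sed_pred, eps, 1 - eps)               # avoid log(0)
    bce = -np.mean(sed_true * np.log(p) + (1 - sed_true) * np.log(1 - p))
    mse = np.mean((doa_pred - doa_true) ** 2)
    return w_sed * bce + w_doa * mse                  # final loss
```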
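Of the three augmentations in claim 3, frequency shifting is the simplest to illustrate: all channels are shifted together along the frequency axis. A sketch under the assumption that vacated bins are zero-filled and that the maximum shift `max_shift` is a free parameter (neither detail is specified in the patent):

```python
import numpy as np

def frequency_shift(feat, max_shift=10, rng=None):
    """Randomly shift features up or down along the frequency axis.

    feat: array of shape (C, T, F); the same shift is applied to all
    channels, and vacated bins are zero-filled (an assumption).
    """
    rng = rng or np.random.default_rng()
    s = int(rng.integers(-max_shift, max_shift + 1))  # signed shift in bins
    out = np.zeros_like(feat)
    if s > 0:
        out[..., s:] = feat[..., :-s]                 # shift upward
    elif s < 0:
        out[..., :s] = feat[..., -s:]                 # shift downward
    else:
        out[...] = feat                               # no shift
    return out
```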
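The recalibration of claim 4 scales the feature map by attention weights along the channel, time and frequency dimensions. In the patent these weights are learned; the sketch below simply applies given weights through a sigmoid and broadcasting, to show the dimension-wise emphasis:

```python
import numpy as np

def recalibrate(feat, w_c, w_t, w_f):
    """Emphasize key channels, frames and bands of feat (C, T, F).

    w_c, w_t, w_f: 1-D attention logits per channel, time frame and
    frequency band (learned in the patent; passed in here).
    """
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))          # squash to (0, 1)
    return (feat
            * sig(w_c)[:, None, None]                 # channel attention
            * sig(w_t)[None, :, None]                 # temporal attention
            * sig(w_f)[None, None, :])                # frequency attention
```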
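The kernel combination of claim 6 can be sketched as follows: four attentions (spatial, input-channel, output-channel, kernel-wise) multiply a bank of candidate kernels, which are then aggregated into one dynamic kernel. The shapes and broadcasting below are illustrative of omni-dimensional dynamic convolution, not taken verbatim from the patent:

```python
import numpy as np

def odconv_kernel(kernels, a_spatial, a_in, a_out, a_kernel):
    """Aggregate n candidate kernels into one dynamic kernel.

    kernels: (n, Cout, Cin, k, k); a_spatial: (k, k); a_in: (Cin,);
    a_out: (Cout,); a_kernel: (n,). Attentions are applied in the
    order position, channel, filter, kernel, then summed over n.
    """
    w = kernels * a_spatial[None, None, None, :, :]   # spatial (position) attention
    w = w * a_in[None, None, :, None, None]           # input-channel attention
    w = w * a_out[None, :, None, None, None]          # output-channel (filter) attention
    w = w * a_kernel[:, None, None, None, None]       # kernel-wise scalar attention
    return w.sum(axis=0)                              # aggregated (Cout, Cin, k, k) kernel
```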
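The "reshape-transpose-reshape" shuffle of claim 7 can be sketched directly. For a (C, T, F) feature map the channel dimension is reshaped to (g_s, C/g_s), transposed, and reshaped back, interleaving channels from different groups:

```python
import numpy as np

def channel_shuffle(feat, group_size):
    """Reshape-transpose-reshape channel shuffle for (C, T, F) features."""
    c, t, f = feat.shape
    assert c % group_size == 0
    x = feat.reshape(group_size, c // group_size, t, f)  # split channels into groups
    x = x.transpose(1, 0, 2, 3)                          # swap group and within-group axes
    return x.reshape(c, t, f)                            # flatten back: groups interleaved
```

With 4 channels and a group size of 2, the channel order becomes [0, 2, 1, 3], so the subsequent 1x1 aggregation block sees channels from both scale groups side by side.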

Description

Multi-sound source positioning and detecting method based on global-local characteristic recalibration

Technical Field

The invention belongs to the field of multi-sound-source localization and detection, and in particular relates to a multi-sound-source localization and detection method based on global-local feature recalibration.

Background

Sound source localization and detection may be regarded as a joint task of sound source localization and sound event detection. Specifically, a sound source localization and detection system needs to predict the boundaries of active sound events, identify their categories, and at the same time provide the spatial trajectory of each sound source. In recent years this task has become increasingly popular and is helpful in many everyday applications. For example, a robot can perform human-machine interaction better with the aid of a sound source localization and detection system; in a smart conference room, the task can cooperate with a speech enhancement task to denoise the voice of a specific speaker by capturing that speaker's position; and it can also be applied to real-time environmental sound monitoring in smart cities. Research methods for sound source localization and detection are mainly divided into traditional parametric methods and deep-neural-network-based methods. Popular traditional parametric methods include localization based on time difference of arrival (TDOA), multiple signal classification (MUSIC), steered response power (SRP) and estimation of signal parameters via rotational invariance techniques (ESPRIT). In recent years, with the development of deep learning methods such as neural networks, structures such as convolutional neural networks and recurrent neural networks have achieved remarkable results in the field of sound source localization and detection.
The classical CRNN-based (convolutional recurrent neural network) model SELDnet is widely accepted. As the proportion of overlapping sound events in datasets increases, improving localization and detection accuracy in complex acoustic environments has become an urgent need for multi-sound-source localization and detection.

Disclosure of Invention

In order to achieve multi-sound-source detection and localization when several overlapping sound events are present in an audio clip, the invention provides a multi-sound-source localization and detection method based on global-local feature recalibration, comprising the following steps: calculating the short-time Fourier transform of a multi-channel spatial audio signal in first-order Ambisonics (FOA) format, obtaining the log linear spectrum and the normalized sound intensity vector as input features, and performing data augmentation on the features of the training set; concatenating the augmented spectrum and sound intensity vector as the input of a neural network model, training the model, and obtaining and storing the optimal network parameters; and preprocessing a sample to be tested, feeding it into the trained model, outputting a prediction result, drawing a sound event detection diagram and direction and azimuth angle trajectory graphs from the prediction result, comparing them with the visualization of the real labels of the test sample, and analyzing the performance of the model.
Preferably, each first-order Ambisonics signal of the multi-channel spatial audio signal comprises four channels (W, X, Y, Z), wherein W is the 0th-order spherical harmonic capturing omnidirectional information and X, Y, Z are the 1st-order spherical harmonics conveying spatial information along the Cartesian axes of the sound field; the log linear spectrum of the spatial audio signal is obtained from the complex spectrum X(t, f) as LS(t, f) = log|X(t, f)|; the normalized sound intensity vector is I_norm(t, f) = I(t, f) / ||I(t, f)||_2, wherein the sound intensity vector conveys valuable information along the direction of sound propagation, its reverse direction is interpreted as the direction of arrival, and it is given by I(t, f) = (1/(ρ0·c)) ℜ{X_W*(t, f) [X_X(t, f), X_Y(t, f), X_Z(t, f)]^T}, where ρ0 and c denote the air density and the speed of sound respectively, ℜ{·} denotes the real part of a complex number and * denotes complex conjugation. Preferably, the data augmentation of the features of the training set includes augmenting each training sample by applying audio channel swapping, random cropping and frequency shifting in sequence without increasing the amount of data; the audio channel swapping is a spatial augmentation method designed for a dataset collected with a sp