
CN-120544597-B - Multi-sound event detection positioning method and device based on neural network model

CN120544597B

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a method and a device for detecting and localizing multiple sound events based on a neural network model. The method designs a time-frequency multi-scale residual convolution block and combines it with Conformer modules and cross-stitch unit modules to form the network model, so that features are extracted at multiple scales, long-sequence modeling is strengthened, and the detection and localization tasks are jointly optimized, improving performance and accuracy. In data processing, pre-emphasis, framing and windowing improve feature quality; audio channel exchange and spectral augmentation increase data diversity and reduce overfitting; and SALSA-Lite features enhance feature expression. In the training strategy, a multi-term loss function accounts for the requirements of both tasks to accelerate convergence, and hyperparameters are flexibly adjusted on the verification set. The method trains efficiently, performs well in practice, generalizes strongly to unknown data, copes accurately with complex and changeable real scenes, and effectively overcomes the shortcomings of traditional methods.
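For orientation only, the feature step summarized above can be sketched in code. This is a minimal sketch, not the patent's implementation: it assumes the standard pre-emphasis filter recited in claim 3 and the published SALSA-Lite recipe (per-channel log-power spectrograms stacked with frequency-normalized inter-channel phase differences); the sample rate, FFT size, hop length and the `salsa_lite_features` name are illustrative.

```python
import numpy as np
from scipy.signal import stft

def salsa_lite_features(audio, sr=24000, n_fft=512, hop=300, alpha=0.97, c=343.0):
    """Hedged sketch of the feature step: pre-emphasis, framing/windowing via the
    STFT, then log-power spectrograms stacked with frequency-normalized
    inter-channel phase differences (the usual SALSA-Lite recipe).
    audio: (channels, samples); all parameter values are illustrative."""
    # Pre-emphasis y(n) = x(n) - alpha * x(n-1), as in claim 3.
    emphasized = np.concatenate([audio[:, :1], audio[:, 1:] - alpha * audio[:, :-1]], axis=1)
    # Framing and windowing are handled by the STFT (Hann window by default).
    freqs, _, spec = stft(emphasized, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    log_mag = np.log(np.abs(spec) ** 2 + 1e-10)            # (channels, freqs, frames)
    # Frequency-normalized inter-channel phase differences w.r.t. channel 0.
    ipd = np.angle(spec[1:] * np.conj(spec[0:1]))           # (channels-1, freqs, frames)
    nipd = -c / (2.0 * np.pi * np.maximum(freqs, 1.0))[None, :, None] * ipd
    return np.concatenate([log_mag, nipd], axis=0)          # stacked feature channels
```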

Inventors

  • SONG RUIZHUO
  • WEN KEXIN
  • XIA LINA

Assignees

  • University of Science and Technology Beijing (北京科技大学)

Dates

Publication Date
2026-05-05
Application Date
2025-06-27

Claims (9)

  1. A method for detecting and locating multiple sound events based on a neural network model, the method comprising: S1, constructing an audio data set and splitting it into a training set, a verification set and a test set; S2, constructing a multi-sound-source detection and localization neural network model, inputting the SALSA-Lite features of the training set into the model, and training sound detection and azimuth estimation, the model comprising a time-frequency multi-scale residual convolution block, Conformer modules, and a cross-stitch unit module for mutual learning between the detection and localization tasks, wherein the time-frequency multi-scale residual convolution block and the Conformer modules are designed for multi-sound-source detection and localization; the time-frequency multi-scale residual convolution block comprises a multi-scale convolution layer, a ReLU activation function, a plain convolution layer and a batch normalization layer, wherein the multi-scale convolution layer combines the time axis and the frequency axis, a downsampling module is provided to handle the case where the numbers of input and output channels or the strides are inconsistent, and finally the output is added to the residual branch and passed through the ReLU activation function; S3, training the model with a binary cross-entropy loss function, a mean-square-error loss function and a joint loss function, the joint loss function being a weighted combination of the binary cross-entropy loss function and the mean-square-error loss function; S4, predicting the SALSA-Lite features of the verification set with the trained model to obtain a sound detection result and an azimuth estimation result, and adjusting the hyperparameters of the model according to the evaluation result on the verification set to obtain a tuned multi-sound-source detection and localization neural network model; S5, predicting the SALSA-Lite features of the test set with the tuned model to obtain the final sound detection result and the final azimuth estimation result.
  2. The method of claim 1, wherein preprocessing the audio data in the audio data set to obtain the SALSA-Lite features of the audio data comprises: expanding the training set by applying audio channel exchange to the audio data of the training set; and performing pre-emphasis, framing and windowing on the audio data in the audio data set to generate the SALSA-Lite features.
  3. The method of claim 2, wherein pre-emphasis is applied to all audio data in the data set according to y(n) = x(n) − α·x(n−1), where y(n) is the signal after pre-emphasis, x(n) is the original input signal, x(n−1) is the value of the input signal at the previous instant, and α is the pre-emphasis coefficient.
  4. The method according to claim 3, wherein during training the SALSA-Lite features of the input training set are augmented using time masking and frequency masking.
  5. The method of claim 4, wherein training the multi-sound-source detection and localization neural network model with a binary cross-entropy loss function, a mean-square-error loss function and a joint loss function, the joint loss function being a weighted combination of the two, comprises defining the joint loss function L = λ·L_BCE + (1 − λ)·L_MSE, where L_BCE denotes the binary cross-entropy loss function used for sound event detection, L_MSE denotes the mean-square-error loss function used for sound source position estimation, and λ is the weight.
  6. A multi-sound event detection and localization device based on a neural network model, for implementing the multi-sound event detection and localization method based on a neural network model according to any one of claims 1 to 5, the device comprising: a feature extraction module for constructing an audio data set, splitting the audio data set into a training set, a verification set and a test set, and preprocessing the audio data in the audio data set to obtain the SALSA-Lite features of the audio data; a training module for constructing a multi-sound-source detection and localization neural network model, inputting the SALSA-Lite features of the training set into the model, and training sound detection and azimuth estimation; a weighting module for training the model with a binary cross-entropy loss function, a mean-square-error loss function and a joint loss function, the joint loss function being a weighted combination of the binary cross-entropy loss function and the mean-square-error loss function; a model tuning module for predicting the SALSA-Lite features of the verification set with the trained model to obtain a sound detection result and an azimuth estimation result; and a prediction module for predicting the SALSA-Lite features of the test set with the tuned model to obtain the final sound detection result and the final azimuth estimation result.
  7. The device of claim 6, wherein the feature extraction module is configured to expand the training set by applying audio channel exchange to the audio data of the training set, and to perform pre-emphasis, framing and windowing on the audio data in the audio data set to generate the SALSA-Lite features.
  8. A multi-sound event detection and localization apparatus based on a neural network model, comprising a processor and a memory having stored thereon computer-readable instructions which, when executed by the processor, implement the multi-sound event detection and localization method based on a neural network model according to any one of claims 1 to 5.
  9. A computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the multi-sound event detection and localization method based on a neural network model according to any one of claims 1 to 5.
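The time-frequency multi-scale residual convolution block of claim 1 is only described structurally, so the following PyTorch sketch is an illustration rather than the patented implementation: the specific kernel sizes (time-only, frequency-only and joint 3×3 branches), the channel split and the layer ordering are assumptions consistent with the claim's wording (a multi-scale convolution over the time and frequency axes, ReLU, a plain convolution, batch normalization, a downsampling branch for mismatched channels or stride, and a residual addition followed by ReLU).

```python
import torch
import torch.nn as nn

class TFMultiScaleResBlock(nn.Module):
    """Hedged sketch of the time-frequency multi-scale residual block (claim 1)."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        branch_ch = out_ch // 3
        rest = out_ch - 2 * branch_ch
        # Multi-scale convolution: separate kernels along time, frequency, and both axes.
        self.branch_t = nn.Conv2d(in_ch, branch_ch, kernel_size=(3, 1), stride=stride, padding=(1, 0))
        self.branch_f = nn.Conv2d(in_ch, branch_ch, kernel_size=(1, 3), stride=stride, padding=(0, 1))
        self.branch_tf = nn.Conv2d(in_ch, rest, kernel_size=3, stride=stride, padding=1)
        self.relu = nn.ReLU(inplace=True)
        # Plain convolution followed by batch normalization.
        self.conv = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        # Downsampling branch when channel count or stride differs on the residual path.
        if stride != 1 or in_ch != out_ch:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.downsample = nn.Identity()

    def forward(self, x):
        # Concatenate the multi-scale branches along the channel axis.
        y = torch.cat([self.branch_t(x), self.branch_f(x), self.branch_tf(x)], dim=1)
        y = self.relu(y)
        y = self.bn(self.conv(y))
        # Residual addition followed by the final ReLU, as recited in claim 1.
        return self.relu(y + self.downsample(x))
```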
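Claims 2 and 4 describe the two augmentations (audio channel exchange on the training waveforms, and time and frequency masking on the SALSA-Lite features) without fixing their parameters. The sketch below is a hypothetical minimal version with illustrative mask widths; a real channel swap would also have to remap the direction-of-arrival labels, which is omitted here.

```python
import numpy as np

def channel_swap(audio, rng):
    """audio: (channels, samples); returns a copy with channels randomly permuted."""
    return audio[rng.permutation(audio.shape[0])]

def time_freq_mask(feat, rng, max_t=16, max_f=8):
    """feat: (channels, freqs, frames); zeroes one random frequency band and one time band."""
    feat = feat.copy()
    f0 = rng.integers(0, feat.shape[1] - max_f)
    t0 = rng.integers(0, feat.shape[2] - max_t)
    feat[:, f0:f0 + rng.integers(1, max_f + 1), :] = 0.0
    feat[:, :, t0:t0 + rng.integers(1, max_t + 1)] = 0.0
    return feat
```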
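The joint loss of claims 1 and 5 weights a binary cross-entropy term for sound event detection against a mean-square-error term for position estimation. The convex split between `lam` and `1 - lam` and the use of logits for the detection head are assumptions; the claims only state that the two terms are weighted and combined.

```python
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # assumes the SED head outputs logits
mse = nn.MSELoss()

def joint_loss(sed_logits, sed_targets, doa_pred, doa_targets, lam=0.5):
    # Weighted combination of the detection and localization losses (claim 5).
    return lam * bce(sed_logits, sed_targets) + (1.0 - lam) * mse(doa_pred, doa_targets)
```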

Description

Multi-sound event detection positioning method and device based on neural network model

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a method and a device for detecting and positioning multiple sound events based on a neural network model.

Background

For intelligence-related industries, sound event localization and detection (SELD) is one of the core technologies. Whether an intelligent conference system is to achieve accurate speech recognition and sound source tracking, or in fields such as smart home and intelligent security, effective detection and accurate localization of sound events are indispensable. Sound source localization methods can be divided into parametric methods and methods based on deep neural networks. Traditional parametric methods such as beamforming differ in algorithmic complexity, array geometry constraints and acoustic scene modeling assumptions, and suffer from high computational complexity, high signal-to-noise-ratio requirements and poor real-time performance. Methods based on deep neural networks are robust to reverberation and to scenes with low signal-to-noise ratio; by virtue of strong feature learning capability they can automatically extract more representative features from the raw audio, accurately capture the essence of the sound, and improve detection and localization accuracy. At the same time, through training on large amounts of data, such systems can handle complex scenes such as multiple sound sources and noise interference.

To solve the problem of multi-sound-source detection and localization, researchers have proposed a series of methods based on deep neural networks, in which convolutional neural networks are used to process sound features and convolutional recurrent networks are most commonly used to capture temporal features. Document [1] proposes a joint-learning method for the joint localization and detection of multiple overlapping sound events. First, the method takes multi-channel audio as input and extracts phase and magnitude spectrograms from each audio channel as features. Second, a shared convolutional recurrent neural network maps the feature sequence to two parallel outputs: sound event detection is treated as a multi-label classification task that judges the class of the sound event in each frame, and the three-dimensional Cartesian coordinates of the sound event are estimated by a multi-output regression task to achieve localization. Finally, the network output is thresholded to obtain the final result. The ACCDOA method proposed in document [2] uses a single loss function, avoiding the problem of balancing objectives. First, the sound event is represented by associating its active state with a Cartesian DOA vector: the active state is represented by the vector length and the DOA by the vector direction. Next, a CRNN architecture is used as the embedding network to extract features from the input audio, followed by a fully connected layer that estimates the ACCDOA vectors. Finally, the mean square error is used as the loss function, and in the absence of events only the activity loss is calculated.
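For reference, the ACCDOA representation from document [2] can be written down directly: the per-class activity scales a unit Cartesian DOA vector, so the vector length encodes whether the event is active and the vector direction encodes its direction of arrival. The array shapes and the `accdoa_targets` name below are illustrative.

```python
import numpy as np

def accdoa_targets(activity, azimuth, elevation):
    """activity: (frames, classes) in {0, 1}; angles in radians, same shape.
    Returns (frames, classes, 3): a unit DOA vector for active events, zero otherwise."""
    x = np.cos(elevation) * np.cos(azimuth)
    y = np.cos(elevation) * np.sin(azimuth)
    z = np.sin(elevation)
    return activity[..., None] * np.stack([x, y, z], axis=-1)
```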
However, in document [1] all tasks share one set of parameters; when the optimal settings differ greatly from task to task, it is difficult to satisfy the requirements of all tasks at the same time, which leads to performance degradation. Document [2] trains only on position information and uses the ACCDOA vector magnitude as the SED activation; because key features of the sound events are neglected, accuracy in event type identification drops sharply, and it is difficult to distinguish events at similar positions, or different events occurring successively at the same position, in a complex scene, which affects localization and detection accuracy. Therefore, how to construct a neural network algorithm for detecting and localizing multiple sound events is one of the problems to be urgently solved by those skilled in the art.

Disclosure of Invention

In order to solve the technical problem in the prior art of how to construct a neural network algorithm for detecting and positioning multiple sound events, the embodiments of the invention provide a method and a device for detecting and positioning multiple sound events based on a neural network model. The technical scheme is as follows: in one aspect, a method for detecting and positioning multiple sound events based on a neural network model is provided, the method comprising: S1, constructing an audio data set, and splitting the audio data set into a training set, a verification set and a test set;