CN-122024764-A - Sound event detection and positioning method and system based on sound scene conditioning

Abstract

The invention discloses a sound event detection and positioning method and system based on sound scene conditioning. The method comprises: collecting multichannel audio signals of an environment in real time; classifying the sound scene; obtaining context information comprising the sound scene classification result and real-time context features; dynamically determining the detection sensitivity for different sound event categories based on the context information; inputting the audio signals into a detection and positioning model to obtain the activation intensity of each event; and judging the activation intensity against the dynamic sensitivity to output the category and direction information of active events. By introducing a dual-conditioning mechanism of sound scene and real-time context, the system adjusts its detection strategy according to the environment. This significantly improves the accuracy, reliability and environmental adaptability of key sound event detection in complex and changeable environments while keeping the computational cost low, making the method suitable for edge computing scenarios such as smart wearables and home monitoring.

Inventors

LI SHIYUE
RUI XIANYI
KANG JIAN

Assignees

Soochow University (苏州大学)

Dates

Publication Date: 2026-05-12
Application Date: 2026-02-10

Claims (10)

1. A sound event detection and positioning method based on sound scene conditioning, characterized by comprising the following steps: S1, acquiring multichannel audio signals of a current environment in real time; S2, performing sound scene classification on the audio signals to obtain a sound scene classification result of the current environment, wherein the sound scene classification result comprises at least the three scenes of home, nature and city; S3, obtaining context information, wherein the context information comprises the sound scene classification result and/or real-time context features; S4, dynamically determining detection sensitivity for different categories of sound events based on the context information; S5, inputting the audio signals into a sound event detection and positioning model to obtain the activation intensity of each sound event category; and S6, judging the activation intensity based on the detection sensitivity so as to output the category of each active sound event and its direction-of-arrival information.
2. The sound event detection and positioning method based on sound scene conditioning according to claim 1, wherein dynamically determining detection sensitivity for different categories of sound events based on the context information comprises: determining, based on the sound scene classification result, a basic detection threshold corresponding to the current sound scene and the target sound event category; calculating a dynamic sensitivity adjustment amount based on the real-time context features; and determining a dynamic detection threshold for each sound event category according to the basic detection threshold and the dynamic sensitivity adjustment amount, wherein the dynamic detection threshold represents the detection sensitivity.
3. The sound event detection and positioning method based on sound scene conditioning according to claim 2, wherein the parameters involved in determining the dynamic detection threshold are obtained by a parameter optimization process, the parameter optimization process comprising: defining the set of parameters used to determine the basic detection threshold and the dynamic sensitivity adjustment amount as the parameter set to be optimized; taking as the objective function the comprehensive performance index, on a validation set, of the integrated system comprising sound scene classification, sound event detection and positioning, and dynamic detection threshold determination; searching for the parameter set to be optimized within a preset parameter space using an automatic parameter search algorithm; and selecting the parameter search result that maximizes the objective function as the determined parameter set; wherein the determined parameter set is used to generate the dynamic detection threshold.
4. The sound event detection and positioning method based on sound scene conditioning according to claim 3, wherein the parameter optimization process further comprises adaptively adjusting the determined parameter set through an online learning algorithm based on user feedback data after system deployment.
5. The sound event detection and positioning method based on sound scene conditioning according to claim 2, wherein the real-time context features comprise at least one of temporal context features, acoustic context features, and historical event context features.
6. The sound event detection and positioning method based on sound scene conditioning according to claim 2, wherein calculating the dynamic sensitivity adjustment amount based on the real-time context features comprises: quantifying the at least one real-time context feature into a corresponding numerical factor; and fusing the numerical factors through a preset fusion function to obtain the dynamic sensitivity adjustment amount.
7. The sound event detection and positioning method based on sound scene conditioning according to claim 2, wherein judging the activation intensity based on the detection sensitivity comprises: comparing the activation intensity of the target sound event category with the corresponding dynamic detection threshold; and if the activation intensity is greater than the dynamic detection threshold, judging that the target sound event category is active.
8. The sound event detection and positioning method based on sound scene conditioning according to claim 1, wherein performing sound scene classification on the audio signals to obtain a sound scene classification result comprises: performing feature extraction on the audio signals to obtain spectral features; and inputting the spectral features into a pre-trained sound scene classification network model to obtain the sound scene classification result, wherein the sound scene classification network model is a lightweight neural network model.
9. The sound event detection and positioning method based on sound scene conditioning according to claim 1, wherein the sound event detection and positioning model is a dual-branch neural network model, and inputting the audio signals into the sound event detection and positioning model to obtain the activation intensity of each sound event category comprises: performing spatial acoustic feature extraction on the audio signals to obtain a combined feature that fuses amplitude and phase information; and inputting the combined feature into the sound event detection and positioning model for processing to obtain the activation intensity and the corresponding spatial direction information of each sound event category.
10. A sound event detection and positioning system based on sound scene conditioning, comprising the following modules: an acquisition module, used for acquiring multichannel audio signals of the current environment in real time; a classification module, used for performing sound scene classification on the audio signals to obtain a sound scene classification result of the current environment, wherein the sound scene classification result comprises at least the three scenes of home, nature and city; a context module, used for obtaining context information, wherein the context information comprises the sound scene classification result and/or real-time context features; a decision module, used for dynamically determining detection sensitivity for different categories of sound events based on the context information; a detection and positioning module, used for inputting the audio signals into the sound event detection and positioning model to obtain the activation intensity of each sound event category; and an output module, used for judging the activation intensity based on the detection sensitivity so as to output the category of each active sound event and its direction-of-arrival information.
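
The thresholding logic recited in claims 2, 6 and 7 can be illustrated with a minimal sketch. It is not the patented implementation: the claims do not state how the basic threshold and the adjustment amount are combined, nor what the fusion function is, so the additive combination, the weighted-sum fusion, the clipping range, the event names (`alarm`, `speech`), the factor name (`night`) and all numeric values below are assumptions chosen only for illustration.

```python
# Hypothetical basic detection thresholds per (scene, event category).
# The scenes come from claim 1; event names and values are illustrative only.
BASE_THRESHOLDS = {
    ("home", "alarm"): 0.30,
    ("home", "speech"): 0.50,
    ("nature", "alarm"): 0.35,
    ("nature", "speech"): 0.55,
    ("city", "alarm"): 0.40,
    ("city", "speech"): 0.60,
}


def fuse_factors(factors, weights):
    """Assumed fusion function: weighted sum of numeric context factors."""
    return sum(weights[name] * value for name, value in factors.items())


def dynamic_threshold(scene, event, factors, weights, lo=0.05, hi=0.95):
    """Basic threshold plus fused adjustment (additive form is an assumption).

    The result is clipped to [lo, hi] so it stays a usable activation threshold.
    """
    base = BASE_THRESHOLDS[(scene, event)]
    delta = fuse_factors(factors, weights)
    return min(hi, max(lo, base + delta))


def detect(scene, activations, doa, factors, weights):
    """Return (event, direction) pairs whose activation exceeds the threshold."""
    active = []
    for event, strength in activations.items():
        if strength > dynamic_threshold(scene, event, factors, weights):
            active.append((event, doa[event]))
    return active


# Example inputs: activation intensities and (azimuth, elevation) in degrees.
activations = {"alarm": 0.28, "speech": 0.40}
doa = {"alarm": (30.0, 0.0), "speech": (-90.0, 10.0)}
weights = {"night": -0.05}  # negative weight lowers thresholds at night

night_result = detect("home", activations, doa, {"night": 1.0}, weights)
day_result = detect("home", activations, doa, {"night": 0.0}, weights)
```

With these illustrative numbers, the same weak `alarm` activation is reported at night, when its threshold is lowered, and suppressed during the day. That is the kind of context-dependent sensitivity the claims describe.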

Description

Sound event detection and positioning method and system based on sound scene conditioning

Technical Field

The invention relates to the technical fields of audio signal processing, artificial intelligence and edge computing, and in particular to a sound event detection and positioning method and system based on sound scene conditioning.

Background

The core objective of sound event detection and localization (SELD) technology is to give machines the ability to perceive and understand acoustic activity in the environment. Because it can accurately recognize environmental sounds and determine their positions, the technology has shown broad application prospects and practical value in fields such as intelligent monitoring, human-computer interaction and smart wearable devices. Conventional SELD systems generally adopt a uniform sensing strategy: regardless of how the acoustic characteristics of the user's environment change, the system recognizes and localizes sound events using a fixed detection standard. Typically, a single global activation threshold is configured for all sound event categories, and the detection flow is executed indiscriminately.
However, such a solution, lacking environmental context awareness, has a number of inherent drawbacks, specifically as follows:
First, environmental adaptability is poor and detection reliability is insufficient. In a noisy environment, the uniform, fixed detection standard leaves the system insufficiently sensitive, so weak target sound events that are critical to scenarios such as safety protection are hard to capture and key events are easily missed. In a quiet environment, the system is over-sensitive and frequently raises false alarms for non-critical, irrelevant sounds, which seriously disturbs normal use and greatly reduces the overall reliability of the system.
Second, the system has no dynamic attention allocation capability and its resource utilization is low. Existing systems cannot realize an adaptive adjustment mechanism similar to human hearing and cannot dynamically allocate auditory attention according to the current acoustic state of the environment, so a large amount of computing resources is consumed on the useless detection and processing of non-critical sound events, which wastes resources.
Third, practical requirements are difficult to meet. In scenarios with strict real-time and power constraints, such as smart wearable devices, building a context-aware SELD system faces a core engineering contradiction: the acoustic scene classifier must be lightweight and responsive enough to meet the device's real-time operation and low-power constraints, yet it must also offer high accuracy and strong environmental robustness so that it provides a reliable basis for subsequent scene-based differentiated detection decisions. Existing schemes cannot satisfy both of these mutually constraining core objectives.
Fourth, a complete cooperative scheme is lacking and module coupling is poor. The prior art offers no technical scheme that systematically and jointly optimizes these mutually constraining objectives, and it also lacks a complete technical path for efficient, low-latency coupling with the downstream SELD core module, so the practical deployment of a context-aware SELD system cannot be supported.
In summary, existing SELD technology lacks intelligent perception of the environmental context and a mechanism for dynamically optimizing the detection strategy, which makes it difficult to achieve both reliable and accurate detection in complex and changeable real environments. The field urgently needs a SELD technology that intelligently senses the environmental context and dynamically optimizes the detection strategy based on environmental features, so as to achieve highly reliable, high-precision environmental sound perception in complex environments.

Disclosure of Invention

Therefore, the invention aims to overcome the problems that existing sound event detection and positioning (SELD) systems adopt fixed detection standards, lack environmental context awareness, adapt poorly to the environment, miss key events or raise false alarms for non-key events, and allocate resources inefficiently; that it is difficult to reconcile the requirements of a lightweight, fast acoustic scene classifier with those of high precision and robustness on resource-constrained embedded devices; and that highly reliable, high-precision sound perception in complex environments cannot be achieved, thereby providing the sound event detection and p