CN-121656968-B - All-weather sound source positioning system and method based on multi-sensor fusion
Abstract
The invention discloses an all-weather sound source positioning system and method based on multi-sensor fusion. The method first acquires acoustic sensing data, millimeter-wave radar data and infrared thermal imaging data, each carrying an acquisition time stamp. Cross-modal time alignment is then performed on the multi-source data to map it to a unified time reference, yielding multi-modal fusion features on a unified time axis. At each time point of the time-aligned multi-modal fusion features, the data of the different modalities are adaptively weighted and fused according to the confidence of each sensor's data, generating a fused common feature representation. Finally, temporal modeling and joint reasoning are performed on the common feature representations of a plurality of consecutive time points to regress the continuous position of the sound source in three-dimensional space. The method achieves high-precision positioning of the three-dimensional positions of multiple sound sources, improving the accuracy and stability of sound source positioning in complex environments and under all-weather conditions.
Inventors
- HUANG JIYE
- ZHANG YUZHE
- WANG ZHEDONG
- GAO MINGYU
- CAO ZUYANG
- CHEN PING
- HOU PEIPEI
- ZHANG ZEXIN
- ZHANG XIN
Assignees
- Hangzhou Dianzi University (杭州电子科技大学)
Dates
- Publication Date: 20260508
- Application Date: 20260206
Claims (12)
- 1. An all-weather sound source positioning method based on multi-sensor fusion, characterized by comprising the following steps: step 1, acquiring acoustic sensing data from a sound sensor, millimeter-wave radar data from a millimeter-wave radar, and infrared thermal imaging data from an infrared thermal imaging sensor, wherein each data item carries an acquisition time stamp; step 2, performing cross-modal time alignment processing on the acoustic sensing data, the millimeter-wave radar data and the infrared thermal imaging data to map them to a unified time reference, obtaining multi-modal fusion features on a unified time axis, wherein the cross-modal time alignment processing comprises: normalizing the features of the data of each modality; combining the normalized features with the corresponding time codes and modality identification codes to form modality feature vectors; defining a unified time axis, taking each time point on the unified time axis as a query, and computing soft alignment weights from each modality feature vector to the query time point using a cross-attention mechanism; and performing weighted aggregation of the modality feature vectors based on the soft alignment weights to obtain the multi-modal fusion features on the unified time axis; step 3, for each time point in the time-aligned multi-modal fusion features, performing adaptive weighted fusion of the data of the different modalities according to the confidence of each sensor's data, generating a fused common feature representation; and step 4, performing temporal modeling and joint reasoning based on the common feature representations of a plurality of consecutive time points, and regressing the continuous position of the sound source in three-dimensional space.
- 2. The method of claim 1, wherein the acoustic sensing data comprises sound source direction information, the millimeter wave radar data comprises distance or angle information, and the infrared thermal imaging data comprises spatial region information.
- 3. The method of claim 1, wherein the sound sensor is a microphone array of 6 microphones, the microphone array having a regular octahedral array structure.
- 4. The method according to claim 1, wherein in step 2, the multi-modal fusion features are obtained using a cross-modal time alignment Transformer model, and the cross-modal time alignment Transformer model is trained as follows: inputting the multi-modal fusion features into a time alignment prediction network to obtain predicted times on the unified time axis; constructing a time alignment loss function based on the time alignment prediction result; and updating the network parameters by back-propagation to obtain the trained cross-modal time alignment Transformer model.
- 5. The method of claim 1, wherein adaptively weighted fusing the data of the different modalities according to the confidence of each sensor's data to generate a fused common feature representation comprises: at each time point, computing a confidence scalar for the acoustic direction information, the radar distance or angle information, and the infrared spatial region information, respectively, wherein the confidence of the acoustic direction information is computed from its signal-to-noise ratio and spatial spectrum characteristics, the confidence of the millimeter-wave radar data is computed from the peak power and the existence probability of the detected target, and the confidence of the infrared thermal imaging data is computed from the temperature contrast between the target region and the background and the size of the target region; mapping each confidence scalar into a confidence vector and fusing it with the time-aligned features of the corresponding modality; inputting the modality features fused with the confidence information, together with a learnable fusion token, into a Transformer encoder for processing; and extracting the vector corresponding to the fusion token from the output of the Transformer encoder as the common feature representation for that time point.
- 6. The method according to claim 5, wherein in step 3, generating the common feature representation further comprises constructing a fusion loss function to supervise the fusion process, the fusion loss function comprising at least a cross-modal consistency loss term constraining the consistency between the common feature representation and each modal fusion feature, a confidence-weighted consistency loss term, and an event continuity loss term; and updating the parameters of the Transformer encoder by back-propagation to obtain the trained Transformer encoder.
- 7. The method of claim 1, wherein performing temporal modeling and joint reasoning based on the common feature representations at a plurality of consecutive time points and regressing the continuous position of the sound source in three-dimensional space comprises: superimposing sequential position codes on the sequence formed by the common feature representations at the plurality of time points; modeling the temporal dependencies of the position-encoded feature sequence with a Transformer encoder to obtain a time-dimension attention weight matrix; and mapping the temporally modeled features through a regression network into three-dimensional spatial coordinates to obtain a sound source position estimate for each time point.
- 8. The method of claim 7, wherein, in the regression of the sound source position information, the constructed loss function comprises at least a spatial position regression loss term measuring the position estimation error and a trajectory smoothing loss term constraining the smoothness of the position estimates at adjacent time points.
- 9. An all-weather sound source positioning device based on multi-sensor fusion, characterized by comprising: a data acquisition module for acquiring sound source direction information from a sound sensor, target distance or angle information from a millimeter-wave radar, and target spatial region information from an infrared thermal imaging sensor, wherein each item of information carries an acquisition time stamp; a time alignment module for performing cross-modal time alignment processing on the direction information, the distance or angle information and the spatial region information from the different sensors to map them to a unified time reference and obtain time-aligned multi-modal fusion features, the time alignment module being specifically configured to: normalize the features of the data of each modality; combine the normalized features with the corresponding time codes and modality identification codes to form modality feature vectors; define a unified time axis, take each time point on the unified time axis as a query, and compute soft alignment weights from each modality feature vector to the query time point using a cross-attention mechanism; and perform weighted aggregation of the modality feature vectors based on the soft alignment weights to obtain multi-modal fusion features at each query time point; a fusion module for performing, at each time point in the time-aligned multi-modal fusion features, adaptive weighted fusion of the data of the different modalities according to the confidence of each sensor's data to generate a fused common feature representation; and a positioning reasoning module for performing temporal modeling and joint reasoning based on the common feature representations of a plurality of consecutive time points and regressing the continuous position of the sound source in three-dimensional space.
- 10. The apparatus of claim 9, wherein the fusion module is specifically configured to: at each time point, compute a confidence scalar for the acoustic direction information, the radar distance or angle information and the infrared spatial region information, respectively, wherein the confidence of the acoustic direction information is computed from its signal-to-noise ratio and spatial spectrum characteristics, the confidence of the radar information is computed from the peak power and the existence probability of the detected target, and the confidence of the infrared information is computed from the temperature contrast between the target region and the background and the size of the target region; map each confidence scalar into a confidence vector and fuse it with the time-aligned features of the corresponding modality; input the modality features fused with the confidence information, together with a learnable fusion token, into a Transformer encoder for processing; and extract the vector corresponding to the fusion token from the output of the Transformer encoder as the common feature representation for that time point.
- 11. The apparatus of claim 9, wherein the positioning reasoning module is specifically configured to: superimpose sequential position codes on the sequence formed by the common feature representations at a plurality of time points; model the temporal dependencies of the position-encoded feature sequence with a Transformer encoder; and map the temporally modeled features through a regression network into three-dimensional spatial coordinates to obtain a sound source position estimate for each time point.
- 12. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the all-weather sound source positioning method based on multi-sensor fusion according to any one of claims 1 to 8.
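The cross-modal time alignment of claim 1 (step 2) can be pictured as cross-attention from points on a unified time axis (queries) to timestamped per-modality features (keys/values). The sketch below is a minimal single-modality illustration under stated assumptions, not the patented model: the sinusoidal time encoding and the names `time_code` and `align_to_timeline` are inventions of this example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def time_code(t, d_model=16):
    # Illustrative sinusoidal time encoding (assumed; the patent does not fix a form).
    freqs = np.arange(1, d_model + 1)
    return np.sin(np.outer(np.atleast_1d(t), freqs))

def align_to_timeline(query_times, feats, stamps, d_model=16):
    """Soft-align one modality's timestamped features onto a unified time axis.

    query_times: (T,) points on the unified time axis (the attention queries)
    feats:       (N, F) modality feature vectors
    stamps:      (N,) acquisition time stamps of those features
    Returns (T, F) features resampled on the unified axis.
    """
    q = time_code(query_times, d_model)               # (T, d) query encodings
    k = time_code(stamps, d_model)                    # (N, d) key encodings
    # Scaled dot-product cross-attention yields the soft alignment weights.
    w = softmax(q @ k.T / np.sqrt(d_model), axis=-1)  # (T, N), each row sums to 1
    return w @ feats                                  # weighted aggregation
```

In the full method this would be repeated per modality, with modality identification codes appended to the normalized features, and the per-modality results combined into the multi-modal fusion features.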
Description
All-weather sound source positioning system and method based on multi-sensor fusion

Technical Field

The invention relates to the technical field of acoustic signal processing, and in particular to an all-weather sound source positioning system and method based on multi-sensor fusion.

Background

Conventional sound source localization techniques typically rely on a single acoustic sensor (e.g., a microphone array) to acquire sound source information. However, in environments with low signal-to-noise ratio, complex background noise and non-line-of-sight conditions, acoustic sensor performance is strongly affected by multipath effects, environmental interference and noise pollution, so the accuracy and stability of sound source positioning fall short of practical demands. To improve positioning accuracy and system robustness, the prior art has adopted multi-sensor fusion: studies have shown that complementing acoustic sensors (microphone arrays) with visual sensors (cameras, etc.) and fusing their information with conventional fusion techniques, so that the sound source is jointly estimated from visual and acoustic information, can markedly improve positioning accuracy. However, most multi-modal methods still face the following problems: (1) limited application scenarios: visual sensor performance degrades severely under low-light or strong-light conditions, and conventional visual sensors struggle to provide effective target information in dark environments or under extreme lighting; (2) multi-sensor time synchronization and spatial alignment: because the sensors have different sampling frequencies and inconsistent data time stamps, an effective time alignment mechanism is lacking when fusing data from different sensors; and (3) lack of adaptability in signal fusion: conventional fusion is usually realized by simple weighting or rule-based superposition, which cannot fully account for the differing signal quality and reliability of each sensor, so the fusion result is degraded by low-quality sensor data. An efficient and robust all-weather sound source positioning method based on multi-sensor fusion is therefore needed to effectively address these challenges and guarantee high-precision sound source positioning, especially in complex, dynamic and low signal-to-noise-ratio environments.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides an all-weather sound source positioning system and method based on multi-sensor fusion, which constructs a unified temporal and spatial representation by fusing the multi-modal information of a sound sensor, a millimeter-wave radar and an infrared thermal imaging sensor, achieving high-precision positioning of the three-dimensional positions of multiple sound sources. Sensor confidences are used for adaptive weighting, and temporal modeling is combined with joint reasoning over data from multiple time points, improving the accuracy and stability of sound source positioning in complex environments and under all-weather conditions.
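The adaptive confidence-weighted fusion described above can be illustrated with a deliberately simplified stand-in. The patent feeds confidence-augmented modality features and a learnable fusion token into a Transformer encoder; this sketch replaces that encoder with a plain softmax over per-modality confidence scalars. The function name `confidence_fuse` and the example feature and confidence values are assumptions of this example, not from the patent.

```python
import numpy as np

def confidence_fuse(modal_feats, confidences):
    """Fuse per-modality feature vectors into one common representation,
    weighting each modality by a softmax over its confidence scalar.

    modal_feats: list of M feature vectors, each of shape (d,)
    confidences: M confidence scalars (e.g. an acoustic SNR score, a radar
                 peak-power score, an infrared temperature-contrast score)
    """
    c = np.asarray(confidences, dtype=float)
    w = np.exp(c - c.max())
    w /= w.sum()                       # weights sum to 1; higher confidence dominates
    return w @ np.stack(modal_feats)   # (d,) fused common feature

# Hypothetical per-modality features at one time-aligned point:
acoustic = np.array([0.9, 0.1, 0.0])
radar    = np.array([0.7, 0.2, 0.1])
infrared = np.array([0.2, 0.2, 0.6])
common = confidence_fuse([acoustic, radar, infrared], confidences=[3.0, 1.0, 0.2])
```

With these confidences the acoustic modality receives most of the weight, so `common` stays close to the acoustic feature vector, which is the intended behavior of the adaptive weighting.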
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows. An all-weather sound source positioning method based on multi-sensor fusion comprises the following steps: step 1, acquiring acoustic sensing data from a sound sensor, millimeter-wave radar data from a millimeter-wave radar, and infrared thermal imaging data from an infrared thermal imaging sensor, wherein each data item carries an acquisition time stamp; step 2, performing cross-modal time alignment processing on the acoustic sensing data, the millimeter-wave radar data and the infrared thermal imaging data to map them to a unified time reference, obtaining multi-modal fusion features on a unified time axis; step 3, for each time point in the time-aligned multi-modal fusion features, performing adaptive weighted fusion of the data of the different modalities according to the confidence of each sensor's data, generating a fused common feature representation; and step 4, performing temporal modeling and joint reasoning based on the common feature representations of a plurality of consecutive time points, and regressing the continuous position of the sound source in three-dimensional space. Preferably, the acoustic sensing data include sound source direction information, the millimeter-wave radar data include distance or angle information, and the infrared thermal imaging data include spatial region information. Preferably, the sound sensor is a microphone array formed by 6 microphones, and the microphone array has a regular octahedral array structure.
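The temporal modeling and regression of step 4 could be prototyped as follows. This is a hedged sketch under stated assumptions: a fixed linear head and a moving-average smoother stand in for the patented Transformer temporal encoder, regression network and trajectory-smoothing loss, and the names `regress_positions`, `W`, `b` and `smooth` are illustrative, not from the patent.

```python
import numpy as np

def regress_positions(common_feats, W, b, smooth=3):
    """Map a sequence of common feature representations (T, d) to 3-D
    sound-source coordinates (T, 3).

    A linear head (W: (d, 3), b: (3,)) replaces the temporal Transformer
    plus regression network; a causal-ish moving average crudely mimics the
    smoothness constraint between position estimates at adjacent time points.
    """
    coords = common_feats @ W + b          # (T, 3) per-time-point estimates
    if smooth > 1:
        kernel = np.ones(smooth) / smooth  # uniform averaging window
        coords = np.stack(
            [np.convolve(coords[:, i], kernel, mode="same") for i in range(3)],
            axis=1,
        )
    return coords
```

With `smooth=1` the function reduces to the raw per-time-point linear estimate; larger windows trade responsiveness for trajectory smoothness, analogous to weighting the trajectory-smoothing loss term more heavily.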