CN-121999331-A - Multi-mode data acquisition and fusion method and device, electronic equipment and medium
Abstract
The application relates to the technical field of multi-modal data acquisition and fusion, and discloses a method, a device, electronic equipment and a medium for acquiring and fusing multi-modal data. The method comprises the following steps: timestamp synchronization ensures accurate alignment of the acoustic and visual data in the time domain, laying the foundation for feature-level fusion; a sharpness (definition) detection model dynamically evaluates the visual signal quality, and the fusion weights of the acoustic and visual features are adaptively allocated according to that quality; this dynamic weight-adjustment strategy makes maximal use of the most reliable data source in the current environment, yielding the fused features. The beneficial effects are that the robustness and accuracy of multi-modal feature fusion are markedly improved, providing a higher-quality data basis for subsequent behavior recognition of the large yellow croaker, and that by introducing an adaptive weighted-fusion mechanism based on visual sharpness, the inherent weakness of unreliable single-modal perception in an underwater environment is effectively overcome.
Inventors
- LI JUNQI
- ZHU HANHAO
- TANG YUNFENG
- ZHOU ZIHAO
- WANG ZHUO
- SU ZHENG
- HOU XIAOBO
- ZHANG ZHIYAO
- ZHOU HUIJUN
- FU CHENGYU
Assignees
- Zhejiang Ocean University (浙江海洋大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-25
Claims (10)
- 1. A method for acquiring and fusing multi-modal data, the method comprising: collecting an original acoustic signal and an original visual signal of a designated large yellow croaker in a target water area; performing timestamp synchronization on the original acoustic signal and the original visual signal to obtain a synchronized target acoustic signal and target visual signal; extracting target acoustic features of the target acoustic signal and target visual features of the target visual signal; detecting the sharpness (definition) of the original visual signal through a preset sharpness detection model, and calculating the acoustic signal-to-noise ratio of the original acoustic signal through a preset calculation mode; dynamically calculating the visual weight of the target visual features and the acoustic weight of the target acoustic features using an attention mechanism, wherein the attention mechanism takes the sharpness and the acoustic signal-to-noise ratio as inputs; and performing weighted fusion of the target visual features and the target acoustic features according to the visual weight and the acoustic weight to obtain fused features of the designated large yellow croaker.
- 2. The method of claim 1, wherein, in the step of extracting the target acoustic features of the target acoustic signal and the target visual features of the target visual signal, extracting the target visual features of the target visual signal includes: transmitting the target visual signal to a preset key-point detection model, which outputs the pixel coordinates of the key points in the image; acquiring motion trajectory data of the key points through a tracking algorithm based on those pixel coordinates; and constructing the target visual features based on the motion trajectory data.
- 3. The method of claim 2, wherein the step of constructing the target visual features based on the motion trajectory data comprises: calculating a velocity vector, an acceleration vector and a motion direction angle of each key point using an optical flow method, based on the motion trajectory data of the key points; calculating the relative distance and the relative angle change between different key points; and combining the velocity vectors, acceleration vectors, motion direction angles, relative distances and relative angle changes to construct the target visual features.
- 4. The method for acquiring and fusing multi-modal data of claim 1, wherein the step of performing timestamp synchronization on the original acoustic signal and the original visual signal to obtain a synchronized target acoustic signal and target visual signal includes: acquiring original timestamp information of the original acoustic signal and the original visual signal, the original timestamp information being the system time at which each signal was acquired; calculating the time difference between the original acoustic signal and the original visual signal to determine time deviation information; and correcting the timestamps of the original acoustic signal and the original visual signal based on the time deviation information to obtain the synchronized target acoustic signal and target visual signal, which share a corresponding time reference on the same time axis.
- 5. The method of claim 1, wherein, in the step of extracting the target acoustic features of the target acoustic signal and the target visual features of the target visual signal, extracting the target acoustic features of the target acoustic signal includes: extracting sound-source position features, spectral features and energy features from the target acoustic signal; identifying a composite sound signal in the target acoustic signal based on the sound-source position, spectral and energy features; filtering environmental noise out of the composite sound signal using an adaptive filtering algorithm to obtain a clean acoustic signal of the designated large yellow croaker; and performing a wavelet transform on the clean acoustic signal to extract time-frequency-domain features, thereby obtaining the target acoustic features.
- 6. The method for acquiring and fusing multi-modal data of claim 1, further comprising, before the step of collecting the original acoustic signal and the original visual signal of the designated large yellow croaker in the target water area: obtaining the individual swimming range of each large yellow croaker in the target water area; spatially planning over the individual swimming ranges of all the large yellow croakers to determine overall swimming-range information covering every individual range; and, according to the overall swimming range and the effective coverage of each acoustic sensor, determining the number, installation positions and spacing of the acoustic sensors, so that the acoustic sensors acquire the original acoustic signal of the designated large yellow croaker in the target water area.
- 7. The method for acquiring and fusing multi-modal data of claim 1, wherein the step of performing weighted fusion of the target visual features and the target acoustic features according to the visual weight and the acoustic weight to obtain the fused features of the designated large yellow croaker is further followed by: inputting the fused features into a preset behavior recognition model, which outputs the behavior type of the designated large yellow croaker; determining the corresponding behavior grade from the identified behavior type, the behavior grades comprising at least a normal grade, an early-warning grade and an alarm grade; and, based on the determined behavior grade, controlling preset equipment in the aquaculture water area in linkage to execute an intervention operation matching that grade.
- 8. A multi-modal data acquisition and fusion device, the device comprising: an acquisition module, configured to collect an original acoustic signal and an original visual signal of a designated large yellow croaker in a target water area; a synchronization module, configured to perform timestamp synchronization on the original acoustic signal and the original visual signal to obtain a synchronized target acoustic signal and target visual signal; an extraction module, configured to extract target acoustic features of the target acoustic signal and target visual features of the target visual signal; a detection module, configured to detect the sharpness of the original visual signal through a preset sharpness detection model and to calculate the acoustic signal-to-noise ratio of the original acoustic signal through a preset calculation mode; a setting module, configured to dynamically calculate the visual weight of the target visual features and the acoustic weight of the target acoustic features using an attention mechanism that takes the sharpness and the acoustic signal-to-noise ratio as inputs; and a fusion module, configured to perform weighted fusion of the target visual features and the target acoustic features according to the visual weight and the acoustic weight to obtain fused features of the designated large yellow croaker.
- 9. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, causes the processor to perform the steps of the multi-modal data acquisition and fusion method of any one of claims 1 to 7.
- 10. An electronic device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the multi-modal data acquisition and fusion method of any one of claims 1 to 7.
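The adaptive weighting at the heart of claims 1 and 8 can be sketched as follows. The patent does not specify the sharpness detection model, the SNR calculation mode, or the attention architecture, so this is a minimal assumed instance: sharpness as the variance of a Laplacian response, SNR in decibels, and a softmax over the two quality scores producing the visual and acoustic weights used for weighted concatenation.

```python
import numpy as np

def sharpness_score(gray):
    """Variance of a 4-neighbour Laplacian response -- an assumed
    stand-in for the patent's preset 'definition detection model'."""
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    return float(lap.var())

def snr_db(signal_power, noise_power):
    """Acoustic signal-to-noise ratio in dB from power estimates."""
    return 10.0 * np.log10(signal_power / noise_power)

def adaptive_fuse(visual_feat, acoustic_feat, sharpness, snr, tau=10.0):
    """Softmax attention over the two quality scores yields the visual
    and acoustic weights; each feature vector is scaled by its weight
    and the results are concatenated into the fused feature."""
    q = np.array([sharpness, snr], dtype=float) / tau
    w = np.exp(q - q.max())
    w /= w.sum()
    fused = np.concatenate([w[0] * np.asarray(visual_feat),
                            w[1] * np.asarray(acoustic_feat)])
    return fused, (float(w[0]), float(w[1]))
```

With this scheme a blurred frame drives the sharpness score down, so the visual weight shrinks and the acoustic branch dominates the fused vector, which is the adaptive behaviour the abstract describes.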
Description
Multi-mode data acquisition and fusion method and device, electronic equipment and medium

Technical Field

The present invention relates to the field of multi-modal data acquisition and fusion technologies, and in particular to a method, an apparatus, an electronic device, and a medium for multi-modal data acquisition and fusion.

Background

The large yellow croaker is an important economic fish species in China; its refined culture and behavior monitoring are important for improving farming returns, giving early warning of disease, and reducing losses. At present, the monitoring of underwater fish behavior relies mainly on a single visual or acoustic analysis technique. Visual analysis can directly capture the postures and motion trajectories of the fish, but in the target water area, factors such as low visibility, changing light and algae occlusion severely degrade image quality, so that feature extraction becomes incomplete or fails outright. Acoustic analysis judges behavior from collected vocalizations or reflected sound waves and is little affected by the optical conditions of the water, but the fish signals are easily disturbed by environmental noise and overlapping vocalizations, limiting its accuracy in recognizing complex behavior. Existing multi-modal techniques attempt to combine visual and acoustic data, but usually fuse them by simple feature concatenation or with fixed weights, without fully accounting for the dynamically fluctuating quality of the two signals in an underwater environment. Such a rigid fusion scheme compromises the performance of the overall recognition model when environmental interference suddenly degrades the reliability of one signal source, and it cannot achieve true complementarity between the two modalities.
Disclosure of Invention

Based on this, it is necessary to provide a method, a device, an electronic device and a medium for acquiring and fusing multi-modal data, aimed at the existing problems of multi-modal data acquisition and fusion. A method of multi-modal data acquisition and fusion comprises: collecting an original acoustic signal and an original visual signal of a designated large yellow croaker in a target water area; performing timestamp synchronization on the original acoustic signal and the original visual signal to obtain a synchronized target acoustic signal and target visual signal; extracting target acoustic features of the target acoustic signal and target visual features of the target visual signal; detecting the sharpness of the original visual signal through a preset sharpness detection model, and calculating the acoustic signal-to-noise ratio of the original acoustic signal through a preset calculation mode; dynamically calculating the visual weight of the target visual features and the acoustic weight of the target acoustic features using an attention mechanism that takes the sharpness and the acoustic signal-to-noise ratio as inputs; and performing weighted fusion of the target visual features and the target acoustic features according to the visual weight and the acoustic weight to obtain fused features of the designated large yellow croaker.
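The timestamp-synchronization step above can be sketched as follows. This is a minimal sketch under assumptions the patent does not state: each stream carries capture timestamps in seconds, the clock offset between the two acquisition systems is approximately constant, and the time deviation is estimated as the median pairwise difference over the overlapping samples (the patent only says the time difference is calculated and the timestamps corrected).

```python
import numpy as np

def synchronize_streams(acoustic_ts, visual_ts):
    """Estimate the time deviation between the two streams and shift
    the visual timestamps onto the acoustic time axis, so that both
    signals share one time reference on the same time axis."""
    a = np.asarray(acoustic_ts, dtype=float)
    v = np.asarray(visual_ts, dtype=float)
    n = min(len(a), len(v))
    # Time deviation information: robust constant-offset estimate
    offset = float(np.median(v[:n] - a[:n]))
    # Correct the visual timestamps by the estimated deviation
    return offset, v - offset
```

For example, visual timestamps lagging the acoustic ones by a constant 50 ms yield an estimated offset of 0.05 s, and the corrected visual timestamps coincide with the acoustic time axis.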
Further, in the step of extracting the target acoustic features of the target acoustic signal and the target visual features of the target visual signal, extracting the target visual features of the target visual signal includes: transmitting the target visual signal to a preset key-point detection model, which outputs the pixel coordinates of the key points in the image; acquiring motion trajectory data of the key points through a tracking algorithm based on those pixel coordinates; and constructing the target visual features based on the motion trajectory data.

Further, the step of constructing the target visual features based on the motion trajectory data includes: calculating a velocity vector, an acceleration vector and a motion direction angle of each key point using an optical flow method, based on the motion trajectory data of the key points; calculating the relative distance and the relative angle change between different key points; and combining the velocity vectors, acceleration vectors, motion direction angles, relative distances and relative angle changes to construct the target visual features.

Further, the step of performing timestamp synchronization on the original acoustic signal and the original visual signal to obtain a synchronized target acoustic signal and target visual signal includes: acquiring original timestamp information of the original acoustic signal and the original visual signal, the original timestamp information being the system time at which each signal was acquired; calculating the time difference between the original acoustic signal and the original visual signal to determine time deviation information
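The trajectory-based visual features described above can be sketched with simple finite differences. This is a hypothetical minimal instance: the patent prescribes an optical-flow method and a key-point detection model, neither specified further, so here the velocity vector, acceleration vector and motion direction angle are computed directly from an already-tracked (T, 2) pixel trajectory, together with the relative distance and relative angle between two key points.

```python
import numpy as np

def kinematic_features(track, dt=1.0):
    """track: (T, 2) pixel coordinates of one key point over T frames.
    Returns per-step velocity vectors, acceleration vectors and motion
    direction angles via finite differences (an assumed stand-in for
    the optical-flow estimates)."""
    track = np.asarray(track, dtype=float)
    vel = np.diff(track, axis=0) / dt          # (T-1, 2) velocity vectors
    acc = np.diff(vel, axis=0) / dt            # (T-2, 2) acceleration vectors
    angle = np.arctan2(vel[:, 1], vel[:, 0])   # motion direction angle per step
    return vel, acc, angle

def relative_features(track_a, track_b):
    """Per-frame relative distance and relative angle between two key points."""
    d = np.asarray(track_b, dtype=float) - np.asarray(track_a, dtype=float)
    return np.linalg.norm(d, axis=1), np.arctan2(d[:, 1], d[:, 0])
```

Concatenating these quantities over a window gives the combined target visual feature described in claim 3.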