CN-122015838-A - Cross-modal audio-visual navigation method with auditory attention guidance and spatial alignment
Abstract
The invention discloses a cross-modal audio-visual navigation method, system, and electronic device with auditory attention guidance and spatial alignment, belonging to the technical field of artificial intelligence and robotics. To address the loss of spatial information and the inconsistent inter-modal representations caused by late feature fusion in existing audio-visual navigation methods, the invention provides 1) an auditory-attention-guided spatial fusion network, which extracts visual and auditory two-dimensional feature maps that preserve spatial structure from intermediate layers of the encoders, converts the spatially aligned auditory feature map into a spatial attention map, and applies this attention map to the visual feature map via element-wise multiplication and a residual connection, so that sound cues spatially guide and deeply fuse with visual perception; and 2) a cross-modal spatial-distribution alignment loss function, which adopts an asymmetric teacher-student supervision strategy: the visual spatial distribution serves as a static teacher signal that unidirectionally guides the learning of the auditory (student) spatial distribution via KL divergence, forcing the two modalities toward a consistent spatial understanding while avoiding gradient conflicts during training. By carrying out deep interaction and alignment in the spatial dimension, the invention significantly improves the localization accuracy, path efficiency, and robustness of navigation.
Inventors
- YU YINFENG
- WU SHAOHANG
Assignees
- 新疆大学 (Xinjiang University)
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-12-03
Claims (4)
- 1. A cross-modal audio-visual navigation method with auditory attention guidance, characterized by comprising the steps of: obtaining visual observation data and auditory observation data provided by the sensors of an agent; processing the visual observation data with a visual encoder and extracting, at an intermediate layer of the visual encoder, a two-dimensional visual feature map that preserves spatial structure; processing the auditory observation data with an auditory encoder and extracting a two-dimensional auditory feature map at an intermediate layer of the auditory encoder; generating a single-channel auditory spatial attention map from the auditory feature map; spatially weighting the visual feature map with the auditory spatial attention map to obtain a weighted visual feature map; fusing the weighted visual feature map with the visual feature map to obtain a fused spatial feature map; generating a comprehensive state representation from the fused spatial feature map together with high-level semantic features extracted by the visual encoder and the auditory encoder; and inputting the comprehensive state representation into a temporal model that outputs actions for controlling the navigation of the agent.
- 2. The method of claim 1, wherein said step of fusing said weighted visual feature map with said visual feature map is performed by adding said visual feature map to said weighted visual feature map via a residual connection.
- 3. The method of claim 1, further comprising the step of optimizing the visual encoder and the auditory encoder during model training using a cross-modal spatial distribution alignment loss function.
- 4. The method of claim 3, wherein the step of optimizing with the cross-modal spatial distribution alignment loss function comprises: processing the visual feature map into a first spatial probability distribution and processing the auditory feature map into a second spatial probability distribution; adopting an asymmetric supervision strategy in which the first spatial probability distribution serves as a static teacher signal; constructing the loss function by computing the KL divergence between the second spatial probability distribution and the first spatial probability distribution; and, during back-propagation, propagating gradients only to the auditory encoder.
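The two core components recited in claims 1–4 can be sketched in PyTorch as follows. This is a minimal, illustrative reading of the claims, not the patented implementation: the 1×1-convolution attention head, the bilinear resizing of the auditory map, the channel-averaged softmax used to build the spatial distributions, and the direction of the KL term are all assumptions.

```python
# Illustrative sketch only; layer choices and distribution construction are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuditorySpatialAttentionFusion(nn.Module):
    """Claims 1-2: weight the visual feature map with a single-channel attention
    map derived from the auditory feature map, then fuse via a residual connection."""
    def __init__(self, audio_channels: int):
        super().__init__()
        # Collapse auditory channels into one attention channel (assumed 1x1 conv).
        self.attn_head = nn.Conv2d(audio_channels, 1, kernel_size=1)

    def forward(self, f_v: torch.Tensor, f_a: torch.Tensor) -> torch.Tensor:
        # f_v: (B, C, H, W) visual map; f_a: (B, C', H', W') auditory map.
        # Resize the auditory map to the visual spatial grid (alignment step, assumed bilinear).
        f_a = F.interpolate(f_a, size=f_v.shape[-2:], mode="bilinear", align_corners=False)
        attn = torch.sigmoid(self.attn_head(f_a))   # (B, 1, H, W) spatial attention map
        weighted = f_v * attn                        # element-wise spatial weighting
        return f_v + weighted                        # residual fusion (claim 2)


def spatial_alignment_loss(f_v: torch.Tensor, f_a: torch.Tensor) -> torch.Tensor:
    """Claims 3-4: asymmetric teacher-student KL loss between the visual (teacher,
    detached) and auditory (student) spatial probability distributions."""
    f_a = F.interpolate(f_a, size=f_v.shape[-2:], mode="bilinear", align_corners=False)

    def to_distribution(fmap: torch.Tensor) -> torch.Tensor:
        # Average over channels, flatten H*W, normalise into a probability map.
        energy = fmap.mean(dim=1).flatten(1)         # (B, H*W)
        return F.softmax(energy, dim=-1)

    teacher = to_distribution(f_v).detach()          # static teacher signal
    student_log = torch.log(to_distribution(f_a) + 1e-8)
    # KL(teacher || student); only the auditory branch receives gradients.
    return F.kl_div(student_log, teacher, reduction="batchmean")
```

Detaching the visual (teacher) distribution is what makes the supervision asymmetric: during back-propagation, only the auditory encoder receives gradients from this loss, matching claim 4.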
Description
Cross-modal audio-visual navigation method with auditory attention guidance and spatial alignment
Technical Field
The invention belongs to the technical field of artificial intelligence and robotics, and particularly relates to a method and system for the autonomous navigation of an embodied agent in a three-dimensional (3D) environment. More specifically, the invention discloses a cross-modal navigation technique that performs target sound-source localization and path planning by fusing auditory and visual information, and in particular a navigation solution that uses a deep learning model, guides visual perception through auditory spatial attention, and enforces consistency of the multi-modal spatial representations.
Background
Navigating an agent to a specific target in an unknown environment is a core task. When humans perform such tasks, they integrate multiple senses: vision is used to understand the environment layout and identify obstacles, while hearing can perceive and localize targets outside the field of view and is particularly important for sound-source localization. Developing an agent with audio-visual navigation capability is therefore important for improving its autonomy and efficiency in complex real-world environments.
Mainstream audio-visual navigation methods generally adopt a deep reinforcement learning framework. The typical paradigm processes visual input (such as RGB images or depth maps) with a visual encoder (such as a convolutional neural network, CNN) and auditory input (such as a binaural spectrogram) with an auditory encoder. The two encoders each extract high-level semantic features, the feature vectors are then fused (e.g., concatenated), and the fused single feature vector is finally fed into a recurrent neural network (RNN) to integrate temporal information and generate navigation actions.
However, such feature-level late-fusion strategies largely ignore the inherent correlation of the visual and auditory modalities in the spatial dimension. Auditory signals, particularly binaural audio, naturally contain rich spatial directional cues (e.g., interaural time differences and intensity differences), which are the key physical basis for determining the direction of a sound source. In a conventional independent encoding process, this valuable spatial information is heavily compressed, and even completely lost, through multiple layers of convolution and pooling, especially when the features are finally "flattened" into one-dimensional vectors. The loss of spatial information prevents the model from performing effective cross-modal spatial interaction and calibration during the intermediate stage of feature encoding, which limits localization accuracy, path efficiency, and robustness in scenes where sound cues are sparse or vanish. A technical solution is therefore urgently needed that achieves deeper audio-visual fusion and collaboration while preserving and exploiting multi-modal spatial information.
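For contrast, the late-fusion paradigm criticised above can be summarised by the following minimal sketch; the encoder layouts, feature sizes, and GRU policy head are illustrative assumptions, not taken from any specific prior system.

```python
# Minimal sketch of the conventional late-fusion baseline (all sizes illustrative).
import torch
import torch.nn as nn

class LateFusionBaseline(nn.Module):
    """Encode each modality into a flat vector, concatenate, and feed a recurrent
    policy. The 2-D spatial structure is discarded before any cross-modal interaction."""
    def __init__(self, visual_dim=512, audio_dim=512, hidden=512, num_actions=4):
        super().__init__()
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # spatial map collapsed here
            nn.Linear(64, visual_dim), nn.ReLU(),
        )
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(2, 32, 5, stride=2), nn.ReLU(),   # 2-channel binaural spectrogram
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # spatial map collapsed here
            nn.Linear(32, audio_dim), nn.ReLU(),
        )
        self.rnn = nn.GRU(visual_dim + audio_dim, hidden, batch_first=True)
        self.policy = nn.Linear(hidden, num_actions)

    def forward(self, rgb, spectrogram, state=None):
        v = self.visual_encoder(rgb)                    # (B, visual_dim)
        a = self.audio_encoder(spectrogram)             # (B, audio_dim)
        fused = torch.cat([v, a], dim=-1).unsqueeze(1)  # late feature-level fusion
        out, state = self.rnn(fused, state)
        return self.policy(out.squeeze(1)), state
```

The two AdaptiveAvgPool2d/Flatten steps are exactly where the spatial structure is lost in this paradigm, which is the deficiency the invention targets.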
Disclosure of Invention
The invention aims to overcome the performance bottleneck caused by spatial information loss and inconsistent modal representations in existing audio-visual navigation techniques, and provides a cross-modal audio-visual navigation method and system with auditory attention guidance and spatial alignment that achieves more accurate, efficient, and robust navigation. To achieve the above object, the invention discloses a technical solution whose core consists of an innovative auditory-attention-guided spatial fusion network and a specially designed cross-modal spatial-distribution alignment loss function. The method is implemented by a deep learning model that receives the agent's visual and auditory sensor inputs and outputs navigation actions. Referring to figs. 1 and 2, the method and system framework proposed by the invention comprises the following main modules and steps:
Cross-modal spatial feature extraction and alignment. Visual feature extraction: the visual observation data acquired by the agent (such as RGB images or depth maps) are input into a visual encoder, and a two-dimensional visual feature map F_v is extracted after the last convolutional layer and before the flattening layer of the encoder; its dimensions are (B, C, H, W), where B is the batch size, C is the number of channels, and H and W are the spatial height and width. Auditory feature extraction: the auditory observation data (e.g., a binaural spectrogram) are input into a separate auditory encoder, and a two-dimensional auditory feature map F_a with dimensions (B, C′, H′, W′) is likewise extracted before flattening. Spatial alignment: since the spatial dimensions (
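The description breaks off above before specifying how the two feature grids are brought to a common size. As a hedged sketch only, the following shows one common way to capture the pre-flattening maps F_v and F_a (via forward hooks) and to resize F_a to the visual grid; both the hook mechanism and the bilinear interpolation are assumptions, not details given in the text.

```python
# Sketch: capture the 2-D maps just before flattening, then align F_a to F_v's grid.
# The interpolation choice is an assumption; the patent text above is truncated here.
import torch
import torch.nn.functional as F

def grab_pre_flatten_map(encoder: torch.nn.Module, last_conv: torch.nn.Module,
                         x: torch.Tensor) -> torch.Tensor:
    """Run `encoder` on `x` and return the output of `last_conv` (shape B, C, H, W)."""
    captured = {}
    handle = last_conv.register_forward_hook(
        lambda module, inputs, output: captured.__setitem__("fmap", output))
    encoder(x)                      # forward pass triggers the hook
    handle.remove()
    return captured["fmap"]

def align_spatial(f_a: torch.Tensor, f_v: torch.Tensor) -> torch.Tensor:
    """Resize the auditory map (B, C', H', W') to the visual grid (H, W)."""
    return F.interpolate(f_a, size=f_v.shape[-2:], mode="bilinear", align_corners=False)
```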