
CN-121995412-A - Multi-source fusion deep learning visual navigation method, system, equipment and storage medium

CN121995412A

Abstract

The invention discloses a multi-source fusion deep learning visual navigation method, system, device, and storage medium. The method collects a visual image sequence and Beidou positioning signals acquired by a robot during operation and preprocesses them to generate a fused input sequence with quality scores; performs dual-channel feature encoding on the visual image sequence and the Beidou positioning signals and maps the features into a unified joint representation space; dynamically models the reliability of each modality in the current environment; generates dual-branch dynamic masks for vision and Beidou based on each modality's confidence and the cross-modal correlation; fuses the two feature types, combined with confidence modulation vectors, to obtain a fused feature sequence; performs context modeling on the fused feature sequence; constructs a joint optimization objective function for collaborative training; and outputs the robot's navigation trajectory. The invention improves the navigation accuracy, reliability, and autonomy of robots in complex environments.

Inventors

  • CHEN ZERUI
  • GAO ZHENGHAO
  • HE PEILIN
  • CHEN JIMENG
  • LI HANGFENG
  • HU HOUPENG
  • OU JIAXIANG
  • WU XIN
  • YANG SHANG
  • XIAO YANHONG
  • XIAO JIAN
  • WANG NAN

Assignees

  • Guizhou Power Grid Co., Ltd. (贵州电网有限责任公司)

Dates

Publication Date
2026-05-08
Application Date
2025-12-18

Claims (10)

  1. A multi-source fusion deep learning visual navigation method, comprising: acquiring a visual image sequence and Beidou positioning signals collected by a robot during operation, and preprocessing them to generate a fused input sequence with quality scores; performing dual-channel feature encoding on the visual image sequence and the Beidou positioning signals, mapping the visual features and Beidou features into a unified joint representation space, and dynamically modeling the reliability of each modality in the current environment through a confidence scoring network; generating dual-branch dynamic masks for vision and Beidou based on each modality's confidence and the cross-modal correlation, and, combined with confidence modulation vectors, fusing the two feature types through a dynamic reconstruction operator to obtain a fused feature sequence; and performing context modeling on the fused feature sequence, applying a global consistency constraint weighted by the Beidou confidence, constructing a joint optimization objective function that combines trajectory smoothness and modal consistency losses for collaborative training, and outputting the robot navigation trajectory.
  2. The multi-source fusion deep learning visual navigation method according to claim 1, wherein preprocessing the visual image sequence and the Beidou positioning signals to generate the fused input sequence with quality scores comprises: performing field standardization on the visual image sequence and the Beidou positioning signals respectively; computing an optimal time delay from the cross-correlation function between the visual signal and the Beidou positioning signal, resampling and interpolating the Beidou positioning signal according to that delay, and aligning it with the visual image sequence on a unified time axis (a sketch of this alignment step follows the claims); and performing feature extraction and denoising on the standardized, synchronized visual image sequence and Beidou positioning signals to obtain visual features and Beidou features, wherein the visual features are extracted from consecutive image frames by a convolutional neural network and a temporal modeling operator to capture spatial geometry and temporal dynamics, and the Beidou features are obtained by denoising the Beidou three-dimensional coordinate sequence and concatenating it with quality indices characterizing signal quality.
  3. The multi-source fusion deep learning visual navigation method of claim 2, further comprising: performing anomaly detection and missing-value completion on the visual features and Beidou features respectively, rejecting a Beidou coordinate as anomalous when its residual against a smoothed estimate exceeds a preset threshold, and, when visual features are missing, interpolating them from similarity-weighted neighboring frames within a temporal neighborhood, with weights decaying exponentially with temporal distance; and concatenating the anomaly-processed visual features with the Beidou features to form a fused feature vector, constructing a quality scoring function from the visual confidence, the Beidou confidence, and the time-alignment error, computing the quality score of the fused feature vector at each time step, normalizing it to the (0, 1) interval by a Sigmoid function, and outputting a fused input sequence consisting of the fused feature vectors and their quality scores (see the quality-scoring sketch after the claims).
  4. The multi-source fusion deep learning visual navigation method of claim 3, wherein generating the dual-branch dynamic masks for vision and Beidou based on each modality's confidence and the cross-modal correlation, combining the confidence modulation vectors, and fusing the two feature types through a dynamic reconstruction operator to obtain a fused feature sequence comprises: extracting spatial features from each frame of the visual image sequence through a convolutional neural network, and feeding the spatial features within a continuous time window into a temporal modeling operator to generate visual features; smoothing and filtering the three-dimensional coordinate sequence in the Beidou positioning signal, concatenating it with quality indices characterizing signal quality, and feeding the result into a multi-layer perceptron encoder to generate Beidou features; dynamically computing the visual confidence and the Beidou confidence under the current environment from the visual features and Beidou features through a first and a second confidence scoring network respectively; generating an overall quality score for the fused feature from the visual and Beidou confidences by weighted combination and Sigmoid normalization, characterizing the reliability of each modality at the current moment; and obtaining the fused feature input sequence and the corresponding overall quality scores (see the encoder sketch after the claims).
  5. The multi-source fusion deep learning visual navigation method of claim 4, wherein generating the dual-branch dynamic masks for vision and Beidou based on each modality's confidence and the cross-modal correlation comprises: enhancing the stability of the encoded visual features by introducing position-constraint residuals between consecutive frames; and computing a visual mask vector and a Beidou mask vector from the enhanced visual features and the Beidou features respectively, wherein the enhanced visual features and the Beidou features are jointly input into a first learned mapping and passed through Sigmoid activation to generate the visual mask vector, and the Beidou features and the enhanced visual features are jointly input into a second learned mapping and passed through Sigmoid activation to generate the Beidou mask vector (the fusion sketch after the claims covers claims 5 and 6).
  6. The multi-source fusion deep learning visual navigation method of claim 5, wherein combining the confidence modulation vectors and fusing the two feature types through a dynamic reconstruction operator to obtain the fused feature sequence comprises: obtaining the visual confidence and the Beidou confidence at the current moment, and converting each scalar confidence into a visual confidence modulation vector and a Beidou confidence modulation vector through corresponding nonlinear mapping functions; multiplying each confidence modulation vector element-wise with the corresponding mask vector to obtain modulated visual features and modulated Beidou features; feeding the two modulated feature types into the dynamic reconstruction operator for nonlinear fusion, the operator comprising a linear transformation term for the visual features, a linear transformation term for the Beidou features, and an interaction term formed by the element-wise product of the two, and outputting a fused intermediate representation; applying dimensionality reduction and regularized mapping to the fused intermediate representation to generate the final fused feature vector; and traversing the data at every time step to obtain the fused feature sequence.
  7. The multi-source fusion deep learning visual navigation method of claim 6, wherein performing context modeling on the fused feature sequence, applying the global consistency constraint weighted by the Beidou confidence, constructing the joint optimization objective function combining trajectory smoothness and modal consistency losses for collaborative training, and outputting the robot navigation trajectory comprises: obtaining the fused feature sequence; applying a temporal modeling operator based on multi-head self-attention to the fused feature sequence to obtain a context-enhanced representation at each moment; feeding the context-enhanced representations into a trajectory decoder to predict the robot's position increments in a local coordinate system, and recovering the navigation trajectory by recursive integration; and jointly optimizing the whole temporal modeling and trajectory inference process with a multi-term loss function, expressed as L = L_rec + λ1·L_smooth + λ2·L_global + λ3·L_conf, where L_rec is the reconstruction loss of the trajectory increments, L_smooth is the trajectory smoothness constraint, L_global is the Beidou-assisted global consistency constraint, L_conf is the confidence consistency constraint, and λ1, λ2, λ3 are hyperparameters that adjust the relative weight of each loss term (see the trajectory-head sketch after the claims).
  8. A multi-source fusion deep learning visual navigation system for use in the method of any one of claims 1 to 7, comprising: a multi-source data acquisition and preprocessing module for collecting a visual image sequence and Beidou positioning signals acquired by the robot during operation and preprocessing them to generate a fused input sequence with quality scores; a feature encoding and joint representation module for performing dual-channel feature encoding on the visual image sequence and the Beidou positioning signals, mapping the visual and Beidou features into a unified joint representation space, and dynamically modeling the reliability of each modality in the current environment through a confidence scoring network; a fusion module for generating dual-branch dynamic masks for vision and Beidou based on each modality's confidence and the cross-modal correlation, combining the confidence modulation vectors, and fusing the two feature types through a dynamic reconstruction operator to obtain a fused feature sequence; and a temporal modeling and joint optimization module for performing context modeling on the fused feature sequence, applying a global consistency constraint weighted by the Beidou confidence, constructing a joint optimization objective function combining trajectory smoothness and modal consistency losses for collaborative training, and outputting the robot navigation trajectory.
  9. An electronic device, comprising a memory and a processor, the memory storing computer-executable instructions that, when executed by the processor, perform the steps of the multi-source fusion deep learning visual navigation method of any one of claims 1 to 7.
  10. A computer-readable storage medium storing computer-executable instructions that, when executed by a processor, perform the steps of the multi-source fusion deep learning visual navigation method of any one of claims 1 to 7.
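
The claims describe the pipeline abstractly and the patent text gives no reference implementation. A minimal NumPy sketch of the claim-2 time-alignment step follows, assuming a discrete lag search over a normalized cross-correlation and linear interpolation for the resampling; all function and parameter names are illustrative, not drawn from the patent.

```python
import numpy as np

def optimal_delay(visual_sig, beidou_sig, max_lag):
    """Return the lag (in samples) that maximizes the normalized
    cross-correlation between a visual motion proxy and the Beidou signal."""
    v = (visual_sig - visual_sig.mean()) / (visual_sig.std() + 1e-8)
    b = (beidou_sig - beidou_sig.mean()) / (beidou_sig.std() + 1e-8)
    best_lag, best_score = 0, -np.inf
    for k in range(-max_lag, max_lag + 1):
        if k >= 0:
            score = np.dot(v[k:], b[:len(b) - k]) / (len(v) - k)
        else:
            score = np.dot(v[:k], b[-k:]) / (len(v) + k)
        if score > best_score:
            best_lag, best_score = k, score
    return best_lag

def align_beidou(beidou_t, beidou_xyz, image_t, delay_s):
    """Shift the Beidou time axis by the estimated delay and linearly
    interpolate each coordinate channel onto the image timestamps."""
    shifted = beidou_t + delay_s  # beidou_t assumed sorted ascending
    return np.stack([np.interp(image_t, shifted, beidou_xyz[:, i])
                     for i in range(3)], axis=1)
```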
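A similar sketch for the claim-3 quality pipeline: residual-based rejection of anomalous Beidou fixes against a smoothed estimate, exponential-decay interpolation of missing visual features, and a Sigmoid-normalized quality score. The moving-average smoother and the weighting constants a, b, c are assumptions; the patent specifies only the threshold test, the exponential weight decay, and the Sigmoid normalization.

```python
import numpy as np

def reject_beidou_outliers(xyz, win=5, thresh=3.0):
    """Flag Beidou fixes whose residual to a moving-average smooth
    exceeds a preset threshold (claim 3's anomaly test)."""
    kernel = np.ones(win) / win
    smooth = np.stack([np.convolve(xyz[:, i], kernel, mode="same")
                       for i in range(3)], axis=1)
    resid = np.linalg.norm(xyz - smooth, axis=1)
    return resid > thresh  # True -> anomalous, to be rejected

def fill_missing(feats, missing, tau=2.0):
    """Complete missing visual features from temporal neighbours,
    with weights decaying exponentially in temporal distance."""
    t = np.arange(len(feats))
    valid = ~missing
    out = feats.copy()
    for i in np.where(missing)[0]:
        w = np.exp(-np.abs(t[valid] - i) / tau)
        out[i] = (w[:, None] * feats[valid]).sum(0) / w.sum()
    return out

def quality_score(conf_vis, conf_bd, align_err, a=1.0, b=1.0, c=1.0):
    """Sigmoid-normalized quality score in (0, 1) built from the two
    confidences and the time-alignment error."""
    z = a * conf_vis + b * conf_bd - c * align_err
    return 1.0 / (1.0 + np.exp(-z))
```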
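For the dual-channel encoding and confidence scoring of claim 4, one plausible PyTorch reading is a per-frame CNN followed by a GRU as the "temporal modeling operator", an MLP over the denoised coordinates concatenated with a quality index, and a small Sigmoid-headed scoring network per modality. The layer sizes and the choice of a GRU are assumptions, not claim language.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Per-frame CNN + GRU over the time window (claim 4, visual branch)."""
    def __init__(self, d=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d),
        )
        self.gru = nn.GRU(d, d, batch_first=True)

    def forward(self, frames):                 # (B, T, 3, H, W)
        B, T = frames.shape[:2]
        f = self.cnn(frames.flatten(0, 1)).view(B, T, -1)
        out, _ = self.gru(f)
        return out                             # (B, T, d)

class BeidouEncoder(nn.Module):
    """MLP over denoised xyz concatenated with a signal-quality index."""
    def __init__(self, d=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, d))

    def forward(self, xyz_q):                  # (B, T, 4) = xyz + quality
        return self.mlp(xyz_q)

class ConfidenceNet(nn.Module):
    """Scores a modality's reliability in (0, 1); one instance per modality."""
    def __init__(self, d=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, 32), nn.ReLU(),
                                 nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, feat):                   # (B, T, d) -> (B, T, 1)
        return self.net(feat)
```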
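Claims 5 and 6 specify Sigmoid mask vectors computed from jointly mapped features, confidence modulation vectors derived from the scalar confidences, and a dynamic reconstruction operator with two linear terms plus their element-wise product. The sketch below follows those constraints; the LayerNorm standing in for the "regularized mapping" is an assumption.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Sketch of claims 5-6: dual-branch masks, confidence modulation,
    and the interaction-term reconstruction operator."""
    def __init__(self, d=128):
        super().__init__()
        self.mask_v = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())  # first learned mapping
        self.mask_b = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())  # second learned mapping
        self.mod_v = nn.Sequential(nn.Linear(1, d), nn.Sigmoid())       # scalar conf -> vector
        self.mod_b = nn.Sequential(nn.Linear(1, d), nn.Sigmoid())
        self.Wv = nn.Linear(d, d)                                       # visual linear term
        self.Wb = nn.Linear(d, d)                                       # Beidou linear term
        self.out = nn.Sequential(nn.Linear(d, d), nn.LayerNorm(d))      # reduce + regularize

    def forward(self, fv, fb, cv, cb):
        # fv, fb: (B, T, d) enhanced visual / Beidou features
        # cv, cb: (B, T, 1) visual / Beidou confidences
        joint = torch.cat([fv, fb], dim=-1)
        mv, mb = self.mask_v(joint), self.mask_b(joint)   # dynamic mask vectors
        zv = self.mod_v(cv) * mv * fv                     # modulated visual features
        zb = self.mod_b(cb) * mb * fb                     # modulated Beidou features
        av, ab = self.Wv(zv), self.Wb(zb)
        h = av + ab + av * ab                             # linear terms + interaction term
        return self.out(h)                                # fused feature sequence
```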
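Finally, a sketch of the claim-7 trajectory head and the reconstructed joint loss L = L_rec + λ1·L_smooth + λ2·L_global + λ3·L_conf, using a standard Transformer encoder as the multi-head self-attention operator and cumulative summation as the recursive integration; the default λ values and the squared-error forms of each term are assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryHead(nn.Module):
    """Self-attention context modeling, increment decoding, and
    recursive integration into a trajectory (claim 7)."""
    def __init__(self, d=128, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, heads, dim_feedforward=2 * d,
                                           batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(d, 3)           # local-frame position increments

    def forward(self, fused):                    # (B, T, d)
        h = self.context(fused)
        delta = self.decoder(h)                  # (B, T, 3) increments
        traj = torch.cumsum(delta, dim=1)        # recursive integration
        return delta, traj

def joint_loss(delta, traj, delta_gt, beidou_xyz, cv, cb, lam=(1.0, 0.5, 0.5)):
    """L = L_rec + λ1·L_smooth + λ2·L_global + λ3·L_conf (assumed forms)."""
    l_rec = (delta - delta_gt).pow(2).mean()                    # increment reconstruction
    l_smooth = (delta[:, 1:] - delta[:, :-1]).pow(2).mean()     # trajectory smoothness
    l_global = (cb * (traj - beidou_xyz).pow(2)).mean()         # Beidou-confidence-weighted
    l_conf = (cv - cb).pow(2).mean()                            # confidence/modal consistency
    return l_rec + lam[0] * l_smooth + lam[1] * l_global + lam[2] * l_conf
```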

Description

Multi-source fusion deep learning visual navigation method, system, equipment and storage medium

Technical Field

The invention relates to the technical field of visual navigation, and in particular to a multi-source fusion deep learning visual navigation method, system, device, and storage medium.

Background

In the development of autonomous navigation for intelligent robots, visual navigation combined with deep learning has become an important direction of research and application. Existing schemes mainly realize environment perception, path planning, and obstacle avoidance by building end-to-end deep neural networks, or by combining convolutional and recurrent neural networks with reinforcement learning, reducing to some extent the dependence on traditional SLAM algorithms and external positioning systems. Some research also introduces positioning signals, such as inertial measurement units and GPS/Beidou, to enhance navigation robustness in complex environments.

However, visual navigation applications incorporating deep learning still have four shortcomings. First, adaptability to complex environments is insufficient. Most existing deep-learning visual navigation methods are trained and tested at the level of indoor simulation, and the models generalize poorly to real complex conditions such as illumination change, occlusion, and bad weather, so navigation accuracy drops markedly in outdoor or high-interference scenes. Second, high-precision positioning information such as Beidou is underused. Although some studies and patents have attempted to combine GNSS/Beidou positioning with visual navigation, most approaches remain predominantly single-modality and lack a deep fusion strategy; in the prior art, Beidou positioning often serves only as a coarse global coordinate reference and is not jointly optimized with visual features, depth estimation, and semantic information, so the continuity and reliability of navigation are hard to guarantee under weak satellite signals or multipath interference. Third, multi-source data fusion remains shallow. The prior art generally adopts a vision-dominant, sensor-assisted design, so inertial information, Beidou positioning data, and environmental semantics play a limited role in navigation decisions; fusion stops at the data or result level and lacks a deep fusion mechanism at the feature and decision levels, making the system prone to bias accumulation and navigation failure during critical tasks. Fourth, system-level coordination is limited. Existing deep-learning visual navigation operates as an independent module and cooperates poorly with task planning, path optimization, and safety strategies; for example, when the Beidou positioning and visual estimation results disagree, the system lacks a rapid conflict detection and self-recovery mechanism, often requiring manual intervention, which reduces autonomy and robustness.
In summary, although the prior art has advanced visual perception and deep-learning navigation to some extent, it generally suffers from insufficient adaptability to complex environments, underuse of Beidou high-precision information, a shallow level of multi-source fusion, and limited system coordination, and therefore cannot meet a robot's requirements for highly reliable, accurate, and autonomous navigation in real complex environments.

Disclosure of Invention

The present invention has been made in view of the above problems in the prior art. Accordingly, the present invention provides a multi-source fusion deep learning visual navigation method, system, device, and storage medium that solve the problems mentioned in the background. To solve these technical problems, the invention provides the following technical solution. In a first aspect, an embodiment of the invention provides a multi-source fusion deep learning visual navigation method, including: collecting a visual image sequence and Beidou positioning signals acquired by a robot during operation, and preprocessing them to generate a fused input sequence with quality scores; performing dual-channel feature encoding on the visual image sequence and the Beidou positioning signals, mapping the visual and Beidou features into a unified joint representation space, and dynamically modeling the reliability of each modality in the current environment through a confidence scoring network; and, based on the confidence of each modality and the cross-modal correlation, generating dual-branch dynamic masks for vision and Beidou.