CN-120333439-B - Visual navigation method and system based on denoising diffusion bridge model

Abstract

The application discloses a visual navigation method and system based on a denoising diffusion bridge model. The method comprises: obtaining initial image data; performing feature coding processing on the initial image data to generate a context vector; performing linear modulation processing on the context vector to generate a condition variable; generating a prior action according to the context vector and a preset motion rule; and inputting the prior action and the condition variable into a denoising diffusion bridge model for diffusion training to generate a target navigation action, wherein the prior action is used as the initial state of the denoising diffusion bridge model during reverse denoising processing. The application can effectively reduce redundant iterations in the diffusion process, reduce accumulated error, and greatly improve the efficiency and stability of motion generation, and it can be widely applied in the technical field of computers.

Inventors

REN HAO
ZENG YIMING
BI ZETONG
WAN ZHAOLIANG
HUANG JUNLONG
CHENG HUI

Assignees

Sun Yat-sen University (中山大学)

Dates

Publication Date: 20260508
Application Date: 20250403

Claims (10)

1. A visual navigation method based on a denoising diffusion bridge model, characterized by comprising the following steps: acquiring initial image data; performing feature coding processing on the initial image data to generate a context vector; performing linear modulation processing on the context vector to generate a condition variable; generating a prior action according to the context vector and a preset motion rule; and inputting the prior action and the condition variable into a denoising diffusion bridge model for diffusion training to generate a target navigation action, wherein the prior action is used as the initial state of the denoising diffusion bridge model when reverse denoising is performed.
2. The method of claim 1, wherein the initial image data comprises an initial observation image and an initial target image, and wherein performing feature coding processing on the initial image data to generate a context vector comprises: performing feature extraction processing on the initial observation image to obtain observation image features; performing feature extraction processing on the initial target image to obtain target image features; and performing feature fusion processing on the observation image features and the target image features to generate the context vector.
3. The method of claim 1, wherein performing linear modulation processing on the context vector to generate a condition variable comprises: inputting the context vector into a feature-wise linear modulation module; generating, through the feature-wise linear modulation module, modulation parameters according to the context vector; and determining, through the feature-wise linear modulation module, the condition variable according to the modulation parameters.
4. The method of claim 1, wherein generating a prior action according to the context vector and a preset motion rule comprises: mapping the context vector to an action space through a fully connected layer of a neural network model to generate a low-dimensional action feature corresponding to the context vector; generating an initial candidate path according to the low-dimensional action feature and the preset motion rule; and determining the prior action from the initial candidate path.
5. The method according to claim 1, wherein the method further comprises: sampling from a Gaussian white noise distribution based on a Gaussian prior rule to obtain the prior action; or alternatively, inputting the initial observation image in the initial image data into a conditional variational autoencoder model for training to obtain the prior action.
6. The method of claim 1, wherein inputting the prior action and the condition variable into a denoising diffusion bridge model for diffusion training to generate a target navigation action comprises: inputting the prior action and the condition variable into the denoising diffusion bridge model; performing, through the denoising diffusion bridge model, forward noising training according to a real waypoint sequence to generate a noisy action; and performing, through the denoising diffusion bridge model, reverse denoising processing on the noisy action according to the prior action and the condition variable to generate the target navigation action.
7. The method of claim 1, wherein after inputting the prior action and the condition variable into the denoising diffusion bridge model for diffusion training to generate the target navigation action, the method further comprises: generating a navigation action instruction according to the target navigation action; and sending the target navigation action and the navigation action instruction to a motion control module of a robot, and controlling, through the motion control module, the robot to execute the target navigation action according to the navigation action instruction.
8. A visual navigation system based on a denoising diffusion bridge model, the system comprising the following modules: an initial image data acquisition module, used for acquiring initial image data; a feature coding processing module, used for performing feature coding processing on the initial image data to generate a context vector; a linear modulation processing module, used for performing linear modulation processing on the context vector to generate a condition variable; a prior action generating module, used for generating a prior action according to the context vector and a preset motion rule; and a target navigation action generating module, used for inputting the prior action and the condition variable into a denoising diffusion bridge model for diffusion training to generate a target navigation action, wherein the prior action is used as the initial state of the denoising diffusion bridge model when reverse denoising is performed.
9. An electronic device comprising a memory storing a computer program and a processor implementing the method of any of claims 1 to 7 when the computer program is executed by the processor.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.
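As an illustration of how a prior action can serve as the initial state of reverse denoising (claims 1 and 6), the following minimal Python sketch uses a plain Brownian bridge between a real waypoint vector and the prior action. The claims do not specify the bridge type, noise schedule, or network, so the Brownian bridge itself, the function names `bridge_forward` and `bridge_reverse`, the `denoiser` callable, and the parameters `steps`, `T`, and `sigma` are all assumptions made for illustration only.

```python
import math
import random


def bridge_forward(x0, xT, t, T, rng, sigma=1.0):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and xT (t=T).

    x0 stands in for a real waypoint vector and xT for the prior action
    (both are plain lists of floats in this sketch).
    """
    a = t / T
    std = sigma * math.sqrt(t * (T - t) / T)  # zero at both endpoints
    return [(1 - a) * p + a * q + std * rng.gauss(0.0, 1.0)
            for p, q in zip(x0, xT)]


def bridge_reverse(prior, cond, denoiser, steps, rng, T=1.0, sigma=1.0):
    """Walk from t=T down to t=0, starting at the prior action.

    denoiser(x, t, cond) is a hypothetical stand-in for the trained network;
    it is assumed to return an estimate of the clean action x0.
    """
    x = list(prior)  # the prior action is the initial state, not Gaussian noise
    times = [T * (steps - k) / steps for k in range(steps + 1)]  # T, ..., 0
    for t, s in zip(times[:-1], times[1:]):
        x0_hat = denoiser(x, t, cond)
        # Conditional of a Brownian bridge at time s < t, given x0_hat and x_t.
        std = sigma * math.sqrt(s * (t - s) / t)
        x = [h + (s / t) * (xi - h) + std * rng.gauss(0.0, 1.0)
             for xi, h in zip(x, x0_hat)]
    return x
```

In this sketch the final step has zero variance, so the output equals the denoiser's last estimate of the clean action. Starting from an informative prior rather than pure noise is what would let such a sampler use fewer steps, provided the prior is close to the target action.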

Description

Visual navigation method and system based on denoising diffusion bridge model

Technical Field

The application relates to the technical field of computers, in particular to a visual navigation method and system based on a denoising diffusion bridge model.

Background

Visual navigation is an important technology that allows autonomous mobile robots to achieve target-directed movement in complex environments. Current visual navigation technology relies mainly on methods based on mapping and path planning or on methods based on deep learning. Traditional path planning-based methods are stable, but they depend heavily on accurate maps and environment perception, adapt poorly to unknown or dynamic environments, and have difficulty adjusting navigation strategies in real time. Deep learning-based methods can adapt to complex environments, but they have high training cost, low sample efficiency, and limited generalization capability, and they need a large amount of training data to adapt to new environments. Imitation learning methods learn navigation strategies from expert demonstration data, but their final navigation performance is strongly affected by data quality, and the learned strategies are difficult to generalize to unknown environments.

In recent years, visual navigation based on diffusion models has become a research hotspot. However, such methods rely on Gaussian noise as the initial input, so the distribution of target actions deviates considerably from actual navigation requirements and the computational complexity of the denoising steps increases. In addition, the target actions generated by the diffusion model are sparse and are difficult to apply directly to complex navigation tasks, especially in application scenarios that require accurate control and immediate response.

In summary, the technical problems in the related art remain to be addressed.

Disclosure of Invention

Embodiments of the present application aim to solve, at least to some extent, one of the technical problems in the related art. To this end, the main object of the embodiments of the application is to provide a visual navigation method and system based on a denoising diffusion bridge model, which can effectively reduce redundant iterations in the diffusion process, reduce accumulated error, greatly improve the efficiency and stability of motion generation, and adapt to complex dynamic environments.

To achieve the above object, one aspect of the embodiments of the present application provides a visual navigation method based on a denoising diffusion bridge model, the method comprising the following steps: acquiring initial image data; performing feature coding processing on the initial image data to generate a context vector; performing linear modulation processing on the context vector to generate a condition variable; generating a prior action according to the context vector and a preset motion rule; and inputting the prior action and the condition variable into a denoising diffusion bridge model for diffusion training to generate a target navigation action, wherein the prior action is used as the initial state of the denoising diffusion bridge model when reverse denoising is performed.

In some embodiments, the initial image data includes an initial observation image and an initial target image, and performing feature coding processing on the initial image data to generate a context vector includes: performing feature extraction processing on the initial observation image to obtain observation image features; performing feature extraction processing on the initial target image to obtain target image features; and performing feature fusion processing on the observation image features and the target image features to generate the context vector.

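The encoding and conditioning steps above can be sketched as follows. This is a minimal pure-Python illustration, assuming fusion by concatenation and the usual feature-wise linear modulation form, gamma(context) * hidden + beta(context). The helper names `linear`, `fuse`, and `film`, the `hidden` vector being modulated, and the weight layout are hypothetical and are not taken from the application, which does not state the fusion operator or the layer sizes in this section.

```python
def linear(weights, bias, x):
    """Dense layer y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def fuse(obs_feat, goal_feat):
    """Fuse observation and target features by concatenation (an assumption)."""
    return list(obs_feat) + list(goal_feat)


def film(context, hidden, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation of `hidden` by `context`.

    Two dense layers map the context vector to the modulation parameters
    gamma and beta; the output is gamma * hidden + beta, element by element.
    """
    gamma = linear(w_gamma, b_gamma, context)
    beta = linear(w_beta, b_beta, context)
    return [g * h + b for g, h, b in zip(gamma, hidden, beta)]
```

For example, with `context = fuse([1.0, 2.0], [3.0])`, a gamma layer that copies the first two context entries, and a beta layer that reads the third entry plus biases `[0.5, -0.5]`, modulating `hidden = [10.0, 20.0]` gives `[13.5, 42.5]`.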
In some embodiments, performing linear modulation processing on the context vector to generate a condition variable includes: inputting the context vector into a feature-wise linear modulation module; generating, through the feature-wise linear modulation module, modulation parameters according to the context vector; and determining, through the feature-wise linear modulation module, the condition variable according to the modulation parameters.

In some embodiments, generating a prior action according to the context vector and a preset motion rule includes: mapping the context vector to an action space through a fully connected layer of a neural network model to generate a low-dimensional action feature corresponding to the context vector; generating an initial candidate path according to the low-dimensional action feature and the preset motion rule; and determining the prior action from the initial candidate path.

In some embodiments, the method further comprises: sampling from a Gaussian white noise distribution based on a Gaussian prior rule to obtain the prior action; or alternatively, training the initial observed image input condition variation in the initial image