CN-122015803-A - Vision-semantic map construction method in cross-scene generalization

Abstract

The invention discloses a visual-semantic map construction method in cross-scene generalization, relating to the technical field of robot vision. The method comprises three steps: adaptive visual feature extraction, semantic alignment and fusion, and hierarchical path planning. It addresses the problem that, when an agent enters a previously unseen scene, the shift in visual feature distribution makes the semantic topological map difficult to construct and thereby impairs hierarchical path planning based on language instructions. Through the cooperation of multiple modules, the invention significantly improves the accuracy of visual-semantic map construction under cross-scene conditions and the efficiency and robustness of navigation based on language instructions, and has broad application prospects in smart homes, service robots, autonomous driving, and similar fields.

Inventors

LIU RUONAN

Assignees

Shanghai Jiao Tong University (上海交通大学)

Dates

Publication Date: 20260512
Application Date: 20260123

Claims (8)

1. A visual-semantic map construction method in cross-scene generalization, characterized by comprising the following steps: (1) adaptive visual feature extraction: performing meta-training in a plurality of different source scenes to learn a general, scene-independent visual feature extraction capability, and rapidly fine-tuning the network with a small amount of data from a target scene so as to extract visual features adapted to the target scene; (2) semantic alignment and fusion: establishing a multi-modal semantic space, mapping visual features, text semantic information, and spatial topology information into the unified semantic space, modeling nodes and edges in the semantic space with a graph neural network, computing the similarity between the visual features and the text semantics in the semantic space to achieve semantic alignment, and fusing the semantic information into the visual map in combination with the spatial topology information to build a semantic topological map; (3) hierarchical path planning: performing path planning with a hierarchical reinforcement learning algorithm on the constructed semantic topological map, wherein the navigation task is decomposed into high-level semantic decision-making and low-level action execution, the high-level semantic decision selects a suitable semantic target node in the semantic topological map according to a language instruction, and the low-level action execution generates specific action commands according to the relative position between the current position and the target node to guide the movement of the agent.
2. The visual-semantic map construction method in cross-scene generalization according to claim 1, wherein the meta-training specifically comprises: collecting source scene data of different types and dividing it into corresponding meta-learning tasks, each task comprising a support set and a query set; constructing a meta-learning feature extraction network based on a convolutional neural network; for each task, computing feature representations of the support set by forward propagation, performing classification prediction from the features and computing the loss on the support set, and updating the network parameters by gradient descent to obtain task-specific parameters; performing prediction on the query set with the updated parameters; and iteratively training the network with the objective of minimizing the query-set loss over all tasks.
3. The visual-semantic map construction method in cross-scene generalization according to claim 2, wherein, in the target-scene fine-tuning stage, a small amount of image data is collected in the target scene to form a fine-tuning data set, the parameters of the lower convolutional layers of the meta-trained network are frozen, only the parameters of the higher layers are adjusted, and a feature extraction network adapted to the target scene is obtained by optimizing a fine-tuning loss function.
4. The visual-semantic map construction method in cross-scene generalization according to claim 1, wherein the multi-modal semantic space is constructed as follows: the visual feature vectors output by the visual feature extraction network are mapped into the semantic space by a fully connected layer; the text semantic information is encoded by a pre-trained language model and mapped into the semantic space by a fully connected layer; a topological graph is constructed from the agent's motion odometry data and sensor data; and the node and edge features of the topological graph are mapped into the semantic space by an encoding function.
5. The visual-semantic map construction method in cross-scene generalization according to claim 4, wherein the semantic alignment is performed as follows: similarity scores between each visual-feature semantic node and all text semantic nodes are computed; message passing and feature updating are carried out on the nodes in the semantic space with a graph neural network; by comparing the similarity scores, the text semantic node with the highest score is selected and aligned with the visual feature node; and if the highest score is below a set threshold, alignment is deferred.
6. The visual-semantic map construction method in cross-scene generalization according to claim 5, wherein the semantic fusion is performed as follows: the aligned semantic information is added as an attribute to the corresponding visual map node in the spatial topological graph; the weights of the edges in the topological graph are updated according to the relations among the semantic nodes; and the semantic topological map is obtained through repeated iterative updating.
7. The visual-semantic map construction method in cross-scene generalization according to claim 1, wherein the high-level semantic decision specifically comprises: inputting a natural language instruction into a natural language processing module and extracting a semantic keyword set and target-position semantic information; computing the similarity between the keyword embedding vectors and the semantic representation vectors of the semantic nodes in the semantic topological map; selecting the nodes whose similarity is higher than a threshold to form a target node set; defining the state space, action space, and reward function of the high-level semantic decision; and training with a reinforcement learning algorithm to obtain an optimal semantic path strategy from the current semantic node to the target semantic node.
8. The visual-semantic map construction method in cross-scene generalization according to claim 1, wherein the low-level action execution specifically comprises: defining the state space of the low-level action execution, which includes the agent's current position coordinates, sensor data, and the next semantic node designated by the high level; defining the action space as the set of discrete actions executable by the agent; defining the reward function of the low-level action execution; and training with a reinforcement learning algorithm to obtain an optimal action execution strategy that executes a suitable action according to the current state; and wherein, during navigation, if the distance between the agent and an obstacle is detected to be smaller than a safety threshold, the high-level semantic decision module is triggered to re-plan the path.
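
For illustration only, the threshold-based matching recited in claims 5 and 7 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: cosine similarity is assumed as the similarity measure (the claims do not name one), the graph-neural-network message passing of claim 5 is omitted, and the names `cosine_similarity`, `align_nodes`, and `select_target_nodes`, as well as the example vectors and the threshold value 0.8, are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity (an assumed measure); 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def align_nodes(
    visual_nodes: Dict[str, Vector],
    text_nodes: Dict[str, Vector],
    threshold: float,
) -> Dict[str, Optional[str]]:
    """Claim 5 sketch: pick the best-scoring text node for each visual node.

    A visual node maps to None (alignment deferred) when its best score
    is below the threshold. GNN feature updating is not modeled here.
    """
    alignment: Dict[str, Optional[str]] = {}
    for node_id, visual_vec in visual_nodes.items():
        best_label: Optional[str] = None
        best_score = -1.0
        for label, text_vec in text_nodes.items():
            score = cosine_similarity(visual_vec, text_vec)
            if score > best_score:
                best_label, best_score = label, score
        alignment[node_id] = best_label if best_score >= threshold else None
    return alignment


def select_target_nodes(
    keyword_vec: Vector,
    node_vecs: Dict[str, Vector],
    threshold: float,
) -> List[str]:
    """Claim 7 sketch: nodes whose similarity to the keyword exceeds the threshold."""
    return [
        node_id
        for node_id, vec in node_vecs.items()
        if cosine_similarity(keyword_vec, vec) > threshold
    ]


# Hypothetical 2-D embeddings already mapped into the shared semantic space.
text_nodes = {"door": [1.0, 0.0], "table": [0.0, 1.0]}
visual_nodes = {"n1": [0.9, 0.1], "n2": [0.5, 0.5]}

alignment = align_nodes(visual_nodes, text_nodes, threshold=0.8)
targets = select_target_nodes([1.0, 0.0], visual_nodes, threshold=0.8)
```

In this example node `n1` aligns with "door", while `n2` is equally similar to both labels, falls below the threshold, and is left unaligned until a later update.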

Description

Vision-semantic map construction method in cross-scene generalization

Technical Field

The invention relates to the technical field of robot vision, and in particular to a visual-semantic map construction method in cross-scene generalization.

Background

With the rapid development of artificial intelligence and robotics, visual language navigation has shown great application potential in smart homes, service robots, autonomous driving, and other fields. The visual-semantic map is the core foundation of visual language navigation: it combines the scene information a robot perceives visually with semantic information to build a map that contains spatial topological relations and supports semantic understanding, thereby enabling intelligent navigation based on language instructions. In practical applications, however, cross-scene generalization has become a key bottleneck that restricts the efficient construction of visual-semantic maps and the improvement of visual language navigation performance.

Conventional visual-semantic map construction methods mostly rely on training data from specific scenes and build the map by extracting visual features and associating semantic information. Within the training scenes, these methods complete map construction and navigation tasks well. However, when an agent enters a building structure or spatial layout that it did not encounter during the training phase, the visual feature distribution shifts significantly and problems follow. Illumination conditions, object placement, spatial scale, architectural style, and the like differ greatly between scenes, so the visual features extracted by conventional methods are no longer representative or stable. For example, visual features used to identify a door in a training scene may fail to identify doors in a new scene because the door's material, color, or decoration has changed.

Furthermore, conventional methods are limited in semantic association. They often rely on fixed semantic templates and predefined scene models, and struggle to accommodate the diversity and variability of semantic concepts in new scenes. When an object category or spatial semantic relation specific to a new scene is encountered, visual information and semantic information cannot be matched accurately, and an effective semantic topological map cannot be established. The semantic topological map is the basis for hierarchical path planning based on language instructions; once its construction fails, the agent can neither understand the semantic information in the language instruction nor plan a reasonable path, and the navigation task fails.

Some existing improved methods attempt to alleviate the cross-scene generalization problem by increasing the diversity of training data, but this approach is costly and can hardly cover all possible scenes. Other studies adopt a transfer learning strategy to transfer the knowledge learned in source scenes to the target scene, but the transfer effect is unsatisfactory when the scene difference is too large.

Disclosure of Invention

The invention aims to provide a visual-semantic map construction method in cross-scene generalization that effectively addresses the shift in visual feature distribution in cross-scene generalization, achieves accurate semantic topological map construction, and improves hierarchical path planning based on language instructions.

The technical aim of the invention is achieved by the following technical scheme: a visual-semantic map construction method in cross-scene generalization, comprising the following steps: (1) adaptive visual feature extraction: performing meta-training in a plurality of different source scenes to learn a general, scene-independent visual feature extraction capability, and rapidly fine-tuning the network with a small amount of data from a target scene so as to extract visual features adapted to the target scene; (2) semantic alignment and fusion: establishing a multi-modal semantic space, mapping visual features, text semantic information, and spatial topology information into the unified semantic space, modeling nodes and edges in the semantic space with a graph neural network, computing the similarity between the visual features and the text semantics in the semantic space to achieve semantic alignment, and fusing the semantic information into the visual map in combination with the spatial topology information to build a semantic topological map; (3) hierarchical path planning: performing path planning with a hierarchical reinforcement learning algorithm on the constructed semantic topological map, wherein the navigation task is decomposed into high-level semantic decision-making and low-level action execution