CN-121982698-A - Multi-modal semantic-guided three-dimensional target localization method, medium and device
Abstract
The invention provides a multi-modal semantic-guided three-dimensional target localization method, medium and device. The method is realized based on a multi-modal semantic-guided three-dimensional target localization model comprising a multi-view semantic prior module, a text encoding module, a dual-branch point cloud encoding and multi-source contrastive supervision module, a sparse scene graph construction and graph relation learning module, and a localization decoding module. The multi-view semantic prior module segments the 3D scene point cloud into 3D object point clouds, generates multi-view 2D visual representations for them, and encodes the semantics of those representations; the text encoding module encodes the natural language instruction; the dual-branch point cloud encoding and multi-source contrastive supervision module injects the semantic features into the 3D object point clouds to obtain 3D fusion features and realizes multi-source feature alignment through multi-source contrastive supervision; the sparse scene graph construction and graph relation learning module constructs a sparse scene graph and optimizes it through a graph attention network; and the localization decoding module decodes and outputs the localization result. The method improves the model's target localization capability in complex scenes.
Inventors
- KANG WENXIONG
- XIAO FENG
Assignees
- South China University of Technology (华南理工大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-09
Claims (10)
- 1. A multi-modal semantic-guided three-dimensional target localization method, characterized in that it is realized based on a multi-modal semantic-guided three-dimensional target localization model, wherein the model comprises a multi-view semantic prior module, a text encoding module, a dual-branch point cloud encoding and multi-source contrastive supervision module, a sparse scene graph construction and graph relation learning module, and a localization decoding module; the multi-view semantic prior module is used for segmenting an input original 3D scene point cloud into a plurality of 3D object point clouds, generating a multi-view 2D visual representation for each 3D object point cloud, and encoding the semantics of the multi-view 2D visual representations with a pre-trained visual model to obtain CLIP semantic features; the text encoding module is used for parsing category prompt words out of the natural language instruction input by a user and generating language features comprising CLIP text features, BERT description text features, and BERT prompt-word text features; the dual-branch point cloud encoding and multi-source contrastive supervision module is used for injecting the CLIP semantic features into the 3D object point clouds to obtain 3D fusion features, and for realizing multi-source feature alignment through multi-source contrastive supervision; the sparse scene graph construction and graph relation learning module is used for constructing a task-driven sparse scene graph from the matching relation between the 3D fusion features and the BERT prompt-word text features, and for optimizing the spatial relations among nodes in the sparse scene graph through a graph attention network; and the localization decoding module decodes the optimized sparse scene graph nodes and outputs the final localization result.
- 2. The multi-modal semantic-guided three-dimensional target localization method of claim 1, wherein the multi-view semantic prior module performs the following steps (an illustrative sketch follows the claims): A1, receiving an original 3D scene point cloud and posed multi-view 2D images; A2, clustering the original 3D scene point cloud by geometric density and dividing the scene into a plurality of mutually non-overlapping three-dimensional object candidate regions, thereby obtaining a plurality of 3D object point clouds, each representing an independent point cloud subset; A3, projecting the three-dimensional bounding box of each 3D object point cloud onto the 2D image of every view and cropping out the corresponding 2D image region, thereby obtaining a group of multi-view 2D visual representations for that 3D object point cloud; A4, encoding the 2D visual representation of each view with a pre-trained, parameter-frozen CLIP visual encoder and extracting high-level semantic feature vectors to obtain the CLIP semantic features.
- 3. The multi-modal semantic-guided three-dimensional target localization method of claim 1, wherein the text encoding module performs the following steps (sketched after the claims): B1, receiving a natural language instruction input by a user; B2, lemmatizing the instruction text and extracting its core noun components as category prompt words; B3, encoding the category prompt words with a pre-trained CLIP text encoder to generate the CLIP text features; B4, encoding the original instruction text and the category prompt words with a BERT model to obtain the BERT description text features and the BERT prompt-word text features.
- 4. The multi-modal semantic-guided three-dimensional target localization method of claim 1, wherein the dual-branch point cloud encoding and multi-source contrastive supervision module performs the following steps (a sketch of the fusion and the contrastive loss follows the claims): C1, constructing a dual-branch multi-modal encoding network comprising a geometric branch and an enhancement branch; C2, using the geometric branch to perform stage-by-stage downsampling and feature extraction on the 3D object point cloud through three Set Abstraction layers, outputting geometry-dominated object-level geometric features; C3, using the enhancement branch to aggregate the CLIP semantic features of the same 3D object point cloud across views, generating 2D enhanced semantic features; C4, linearly projecting the 2D enhanced semantic features to the dimension of the geometric features through a learnable linear projection layer and fusing them by residual connection to obtain the 3D fusion features: $F_{fuse}^{l} = F_{geo}^{l} + \phi(F_{2D})$, wherein $F_{fuse}^{l}$ denotes the fused geometric feature at layer $l$, $F_{2D}$ denotes the 2D enhanced semantic features, and $\phi(\cdot)$ denotes the projection mapping features to the same dimension; C5, multi-source contrastive supervision: constructing a composite contrastive loss function that pulls together, in the semantic space, the following five positive sample pairs: 3D object point cloud and BERT prompt-word text features; 3D object point cloud and CLIP semantic features; BERT prompt-word text features and CLIP text features; 3D object point cloud and CLIP text features; CLIP semantic features and BERT prompt-word text features.
- 5. The multi-modal semantic-guided three-dimensional target localization method of claim 4, wherein the sparse scene graph construction and graph relation learning module performs the following steps (a sketch of the graph construction follows the claims): D1, computing the semantic similarity between objects based on the 3D fusion features of all objects and the category prompt words obtained by the text encoding module, and preliminarily screening candidate object pairs with high semantic similarity; then establishing edges only between object pairs for which the BERT prompt-word text features contain an explicit semantic co-occurrence relation, forming a sparsely connected sparse scene graph; D2, performing two rounds of iterative optimization on the node features of the sparse scene graph, wherein each round takes the 3D fusion features, the geometric features, and the BERT description text features as input and comprises three core operations: a vision-language semantic alignment operation, which contextually queries and updates the node features with the BERT description text features through a cross-modal cross-attention mechanism, enhancing the semantic directivity of the node features; a graph attention relation aggregation operation, which feeds the updated node features into a graph attention network that adaptively computes the contribution weights of neighbouring nodes, so that a target node aggregates information from the most relevant reference object nodes; and a geometric relation fusion operation, which fuses the topological relation features produced by the graph attention relation aggregation operation with the original geometric features.
- 6. The multi-modal semantic-guided three-dimensional target localization method of claim 5, wherein in the graph attention relation aggregation operation, the attention mechanism of the graph attention network computes the contribution of neighbouring nodes to the current node $i$ according to: $h_i' = \sigma\big(\frac{1}{K}\sum_{k=1}^{K}\sum_{j \in \mathcal{N}_i} a_{ij}\,\alpha_{ij}^{k}\,W^{k} h_j\big)$; wherein $\mathcal{N}_i$ represents the set of neighbouring nodes of node $i$; $a_{ij} \in \{0,1\}$ represents the original adjacency between nodes $i$ and $j$; $\alpha_{ij}^{k} \in (0,1)$ is the normalized attention weight of node $j$ to node $i$ captured by the $k$-th attention head; $W^{k}$ is the learnable projection matrix of the $k$-th attention head; $K$ is the number of attention heads; and $\sigma(\cdot)$ represents a nonlinear activation function (an executable rendering follows the claims).
- 7. The multi-modal semantic-guided three-dimensional target localization method of claim 1, wherein the localization decoding module operates as follows (sketched after the claims): the optimized sparse scene graph nodes are input into a lightweight Transformer decoder, the Transformer decoder outputs a localization score for each node through a multi-layer perceptron, and the node with the highest localization score is output as the final localization result.
- 8. The multi-modal semantic-guided three-dimensional target localization method of claim 1, wherein the training loss function $L$ of the multi-modal semantic-guided three-dimensional target localization model is: $L = \lambda_1 L_{align} + \lambda_2 L_{loc} + \lambda_3 L_{cls} + \lambda_4 L_{aux}$; wherein $L_{align}$ is the alignment loss of the multi-source contrastive supervision; $L_{loc}$ is the localization regression loss; $L_{cls}$ is the text-guided target classification loss, a multi-class cross entropy computed from the BERT description text features and the 3D fusion features; $L_{aux}$ is the auxiliary classification loss over object categories, likewise computed with a cross-entropy function; and $\lambda_1$ to $\lambda_4$ are the loss weights (a combination sketch follows the claims).
- 9. A readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, causes the processor to perform the multi-modal semantic-guided three-dimensional target localization method of any one of claims 1-8.
- 10. A computer device comprising a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the multi-modal semantic-guided three-dimensional target localization method of any one of claims 1-8.
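The following is a minimal sketch of steps A3-A4 of claim 2, assuming a pinhole camera model with intrinsics `K` and a 3x4 world-to-camera pose; the function name and the dummy data are illustrative, and the frozen CLIP visual encoder is referenced only in a comment rather than loaded.

```python
import numpy as np

def project_box_to_view(corners_3d, K, world_to_cam):
    """Project the 8 corners of a 3D bounding box into one posed view and
    return the enclosing 2D crop rectangle (x0, y0, x1, y1), or None if the
    box lies entirely behind the camera."""
    n = corners_3d.shape[0]
    homo = np.hstack([corners_3d, np.ones((n, 1))])   # homogeneous coords (8, 4)
    cam = (world_to_cam @ homo.T).T                   # camera frame (8, 3)
    cam = cam[cam[:, 2] > 1e-6]                       # keep corners in front
    if cam.shape[0] == 0:
        return None
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                       # perspective divide
    x0, y0 = uv.min(axis=0)
    x1, y1 = uv.max(axis=0)
    return x0, y0, x1, y1

# Usage with dummy data: a unit box 4-5 m in front of an identity pose.
corners = np.array([[x, y, z] for x in (0., 1.) for y in (0., 1.)
                    for z in (4., 5.)])
K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
world_to_cam = np.eye(4)[:3, :]                       # world frame == camera frame
print(project_box_to_view(corners, K, world_to_cam))
# Each crop is then resized and passed through the frozen CLIP visual
# encoder to yield one semantic feature vector per view (step A4).
```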
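Next, a sketch of steps B1-B4 of claim 3. spaCy stands in for the unspecified lemmatizer, and the `bert-base-uncased` and `openai/clip-vit-base-patch32` checkpoints are illustrative assumptions, not models named by the patent.

```python
import spacy
import torch
from transformers import BertModel, BertTokenizer, CLIPTextModel, CLIPTokenizer

nlp = spacy.load("en_core_web_sm")                    # stand-in lemmatizer
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

instruction = "pass me the red cup on the tea table"  # B1: user instruction

# B2: lemmatize and keep core noun components as category prompt words.
prompts = [t.lemma_ for t in nlp(instruction) if t.pos_ == "NOUN"]

with torch.no_grad():
    # B3: CLIP text features for the category prompt words.
    clip_feats = clip_text(**clip_tok(prompts, padding=True,
                                      return_tensors="pt")).pooler_output
    # B4: BERT features for the full description and for the prompt words.
    desc_feats = bert(**bert_tok(instruction,
                                 return_tensors="pt")).last_hidden_state
    prompt_feats = bert(**bert_tok(prompts, padding=True,
                                   return_tensors="pt")).pooler_output

print(prompts, clip_feats.shape, desc_feats.shape, prompt_feats.shape)
```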
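A minimal sketch of steps C4-C5 of claim 4: residual semantic injection and a symmetric InfoNCE term applied to each of the five positive pairings. The feature dimensions, the temperature `tau`, and the equal weighting of the five terms are assumptions.

```python
import torch
import torch.nn.functional as F

proj = torch.nn.Linear(512, 256)   # phi: map 2D semantics to the geometry dim

def fuse(f_geo, f_2d):
    """F_fuse = F_geo + phi(F_2D): residual injection of CLIP semantics."""
    return f_geo + proj(f_2d)

def info_nce(a, b, tau=0.07):
    """Symmetric contrastive loss pulling matched rows of a and b together."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    labels = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def multi_source_alignment(f3d, bert_prompt, clip_img, clip_txt):
    """Composite alignment loss over the five positive pairs of claim 4."""
    pairs = [(f3d, bert_prompt), (f3d, clip_img), (bert_prompt, clip_txt),
             (f3d, clip_txt), (clip_img, bert_prompt)]
    return sum(info_nce(x, y) for x, y in pairs)

B = 8                                                  # dummy batch of objects
f3d = fuse(torch.randn(B, 256), torch.randn(B, 512))
loss = multi_source_alignment(f3d, torch.randn(B, 256),
                              torch.randn(B, 256), torch.randn(B, 256))
print(f3d.shape, loss.item())
```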
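A sketch of step D1 of claim 5, assuming cosine similarity for the semantic screening and a precomputed boolean co-occurrence matrix distilled from the BERT prompt-word text features; the threshold and the random inputs are placeholders.

```python
import torch
import torch.nn.functional as F

def build_sparse_graph(obj_feats, prompt_feat, cooccur, sim_thresh=0.3):
    """Return a sparse {0,1} adjacency matrix over the N scene objects."""
    sim = F.normalize(obj_feats, dim=-1) @ F.normalize(prompt_feat, dim=-1)
    keep = sim > sim_thresh                 # objects relevant to the prompt
    cand = keep[:, None] & keep[None, :]    # candidate object pairs
    adj = cand & cooccur                    # gate by explicit co-occurrence
    adj.fill_diagonal_(False)               # no self-loops
    return adj.float()

N = 6
adj = build_sparse_graph(torch.randn(N, 256), torch.randn(256),
                         torch.rand(N, N) > 0.5)
print(adj)
```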
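An executable rendering of the claim-6 aggregation, averaging the $K$ attention heads over the sparse adjacency $a_{ij}$; the head count, feature dimension, and LeakyReLU scoring follow the common GAT recipe and are assumptions where the claim is silent.

```python
import torch
import torch.nn.functional as F

class SparseGATLayer(torch.nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.W = torch.nn.ModuleList(torch.nn.Linear(dim, dim, bias=False)
                                     for _ in range(heads))
        self.attn = torch.nn.ModuleList(torch.nn.Linear(2 * dim, 1, bias=False)
                                        for _ in range(heads))

    def forward(self, h, adj):
        outs = []
        for k in range(self.heads):
            z = self.W[k](h)                                  # W^k h_j, (N, dim)
            pair = torch.cat([z.unsqueeze(1).expand(-1, z.size(0), -1),
                              z.unsqueeze(0).expand(z.size(0), -1, -1)], dim=-1)
            e = F.leaky_relu(self.attn[k](pair).squeeze(-1))  # raw scores
            e = e.masked_fill(adj == 0, float("-inf"))        # keep a_ij = 1 only
            alpha = torch.softmax(e, dim=-1)                  # alpha_ij^k in (0,1)
            alpha = torch.nan_to_num(alpha)                   # isolated nodes -> 0
            outs.append(alpha @ z)                            # neighbour aggregation
        return F.elu(torch.stack(outs).mean(0))               # sigma(mean over K heads)

N, D = 5, 256
layer = SparseGATLayer(D)
adj = (torch.rand(N, N) > 0.5).float().fill_diagonal_(0)
print(layer(torch.randn(N, D), adj).shape)
```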
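A sketch of the claim-7 decoder, assuming the optimized graph nodes attend to the BERT description features as decoder memory inside a standard `nn.TransformerDecoder`; the layer sizes are illustrative.

```python
import torch

dim = 256
decoder = torch.nn.TransformerDecoder(
    torch.nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2)                              # "lightweight" decoder
score_head = torch.nn.Sequential(              # MLP producing per-node scores
    torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, 1))

nodes = torch.randn(1, 6, dim)                 # optimized sparse graph nodes
text = torch.randn(1, 12, dim)                 # BERT description features
scores = score_head(decoder(tgt=nodes, memory=text)).squeeze(-1)
best = scores.argmax(dim=-1)                   # highest-scoring node wins
print(scores, best)
```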
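Finally, the claim-8 objective as reconstructed above; the weights $\lambda_1$ to $\lambda_4$ are hyperparameters the patent leaves unspecified, so the values below are placeholders.

```python
import torch

def total_loss(l_align, l_loc, l_cls, l_aux, lambdas=(1.0, 1.0, 0.5, 0.5)):
    """L = sum_i lambda_i * L_i over the four loss terms of claim 8."""
    terms = torch.stack([l_align, l_loc, l_cls, l_aux])
    return (torch.tensor(lambdas) * terms).sum()

print(total_loss(torch.tensor(0.8), torch.tensor(1.2),
                 torch.tensor(0.4), torch.tensor(0.3)))
```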
Description
Multi-modal semantic-guided three-dimensional target localization method, medium and device

Technical Field

The invention relates to the intersection of computer vision and artificial intelligence, and in particular to a multi-modal semantic-guided three-dimensional target localization method, medium and device.

Background

Three-dimensional visual grounding is widely applied in service robotics, autonomous driving, augmented reality, industrial digital twins and similar fields: by understanding a natural language instruction, a system localizes the referred target object in a three-dimensional scene. For example, a service robot executes a grasping task given the instruction "pass me the red cup on the tea table", and an autonomous driving system understands "the second traffic sign ahead". Traditional three-dimensional localization techniques mainly rely on detection over predefined categories, outputting only the positions of objects in fixed classes such as "chair" or "table", or on manually annotated anchor points; their applicable scenes are limited, their generalization is weak, they lack semantic understanding, and they struggle with arbitrary referring expressions in the open world. With the development of artificial intelligence, the latest three-dimensional visual grounding techniques can understand complex spatial relationships and reason across scenes, supporting open vocabularies, efficient single-view inference, structured relational reasoning and the like, comprehensively surpassing traditional methods in accuracy, generalization and practicality. However, the prior art still has clear deficiencies: (1) most methods adopt a target-centric learning mechanism, neglect explicit modeling of reference objects, and have difficulty handling complex scenes containing multiple similar distractors; (2) current scene graph construction strategies are coarse: for example, Feng et al. build a fully connected scene graph over all objects in the scene, so every edge participates in computation and a large number of redundant relations exist, while Huang et al. connect nodes by a KNN relation with a fixed neighbourhood distance, which cannot capture long-range semantic relations; (3) the inherent sparsity and lack of texture detail of three-dimensional point clouds limit the recognition of object attributes; (4) the 2D-3D fusion scheme proposed by Zhang et al. simply concatenates features of the two modalities, breaking the cross-modal alignment established by pre-trained vision-language models, which is detrimental to three-dimensional semantic understanding. These limitations cause a marked drop in localization accuracy in complex three-dimensional scenes, particularly when handling descriptions with fine-grained spatial relationships.

Disclosure of Invention

The invention aims to overcome the defects and shortcomings of existing three-dimensional visual grounding techniques, in particular insufficient spatial relationship reasoning, weak cross-modal alignment and the lack of reference object modeling, and provides a multi-modal semantic-guided three-dimensional target localization method, medium and device.
The multi-modal semantic-guided three-dimensional target localization method is realized based on a multi-modal semantic-guided three-dimensional target localization model, which comprises a multi-view semantic prior module, a text encoding module, a dual-branch point cloud encoding and multi-source contrastive supervision module, a sparse scene graph construction and graph relation learning module, and a localization decoding module. The multi-view semantic prior module is used for segmenting an input original 3D scene point cloud into a plurality of 3D object point clouds, generating a multi-view 2D visual representation for each 3D object point cloud, and encoding the semantics of the multi-view 2D visual representations with a pre-trained visual model to obtain CLIP semantic features. The text encoding module is used for parsing category prompt words out of the natural language instruction input by a user and generating language features comprising CLIP text features, BERT description text features, and BERT prompt-word text features. The dual-branch point cloud encoding and multi-source contrastive supervision module is used for injecting the CLIP semantic features into the 3D object point clouds to obtain 3D fusion features, and for realizing multi-source feature alignment through multi-source contrastive supervision. The sparse scene graph construction and graph relation learning module is used for constructing a task-driven sparse scene graph from the 3D fusion features and the BERT prompt-word text features, and for optimizing the spatial relations among nodes in the sparse scene graph through a graph attention network.