
CN-121996993-A - Multimodal large-model spatial perception method driven by a 3D scene graph

CN 121996993 A

Abstract

The invention discloses a multimodal large-model spatial perception method driven by a 3D scene graph. After multimodal raw information is acquired, a 3D semantic scene graph algorithm serves as an additional visual encoder for the multimodal large model, extracting 3D object instance features and inter-object relation features from the scene to obtain 3D perception capability, and generating a structured scene-topology description text so that explicit and implicit spatial-perception representations coexist. A point cloud punctuation mechanism based on geometric primitives is introduced: standard geometric-primitive features are extracted and frozen through a shared 3D visual encoder to serve as anchors that delimit different semantic units. A point cloud projection layer then maps all features uniformly into semantic Tokens, which are combined with the structured text prompt to form a composite input sequence. Through this dual-context guidance mechanism, the large model achieves deep spatial perception of 3D scenes, and its question-answering accuracy and logical reasoning capability in complex three-dimensional environments are markedly improved.
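
For concreteness, the "structured scene-topology description text" can be pictured as a small JSON document listing object instances and their pairwise spatial relations. Below is a minimal Python sketch; the field names (id, label, bbox, subject, relation, object) are illustrative assumptions, not the patent's actual schema.

```python
import json

# Hypothetical parser output: instance nodes with class labels and 3D bounding
# boxes, plus directed edges for pairwise spatial relations (schema assumed).
nodes = [
    {"id": 0, "label": "table", "bbox": [0.0, 0.0, 0.0, 1.2, 0.8, 0.75]},
    {"id": 1, "label": "chair", "bbox": [1.3, 0.0, 0.0, 0.5, 0.5, 0.9]},
]
edges = [{"subject": 1, "relation": "next_to", "object": 0}]

# Traverse all nodes and edges and emit the JSON-format prior context that is
# later spliced into the language model's text input.
scene_graph_text = json.dumps({"instances": nodes, "relations": edges}, indent=2)
print(scene_graph_text)
```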

Inventors

  • HUANG KAIXIANG
  • WANG JIN
  • LU GUODONG
  • CHEN YONGHANG

Assignees

  • Zhejiang University

Dates

Publication Date
2026-05-08
Application Date
2026-04-10

Claims (10)

  1. A multimodal large-model spatial perception method driven by a 3D scene graph, characterized by comprising the following steps: Step 1, acquiring multimodal raw information of the scene to be analyzed, the multimodal raw information comprising a scene image, a user instruction, and three-dimensional point cloud data; Step 2, inputting the preprocessed point cloud data into a pretrained 3D scene graph visual encoder, parsing the object instances in the scene and the spatial topological relations among them, and extracting instance feature vectors and relation feature vectors respectively; Step 3, converting the 3D scene graph into structured JSON-format text according to the parsing result; Step 4, extracting features of a standard cube point cloud and a standard sphere point cloud from a predefined geometric primitive library, to serve as instance punctuation features and relation punctuation features respectively; Step 5, constructing a point cloud projection layer and uniformly mapping the instance features, relation features, and punctuation features into the embedding space of a large language model; Step 6, serializing and assembling the mapped features, and defining a 3D spatial-feature start token and a 3D spatial-feature end token to form a structured point cloud modality input sequence; Step 7, splicing the JSON-format text, the user text instruction, and the 2D picture information with the structured point cloud modality input sequence to construct a multimodal instruction input pool; and Step 8, feeding the data of the multimodal instruction input pool into the large language model and generating, via its self-attention mechanism, a reply containing spatial semantic understanding.
  2. The 3D-scene-graph-driven multimodal large-model spatial perception method according to claim 1, wherein Step 1 comprises: preprocessing the three-dimensional point cloud scene data to remove sensor noise points; and synchronously converting the scene image and the user instruction, extracting features of the scene image through a 2D visual encoder to obtain 2D scene picture features, and converting the user instruction into text embeddings.
  3. The 3D-scene-graph-driven multimodal large-model spatial perception method according to claim 1, wherein Step 2 comprises: for the input preprocessed three-dimensional point cloud scene, using the 3D scene graph perception algorithm as the point cloud visual encoder of the multimodal large model, and extracting features from the point cloud instance branch and the inter-instance relation branch respectively, to obtain the 3D scene graph representation and the prediction results for scene graph nodes and edges.
  4. The 3D-scene-graph-driven multimodal large-model spatial perception method according to claim 1, wherein Step 3 comprises: constructing a mapping function based on the prediction results generated in Step 2, traversing all nodes and edges, and generating JSON-format text that contains the class labels, spatial bounding-box attributes, and topological connection relations of the instances, used to build the prior context on the text side.
  5. The 3D-scene-graph-driven multimodal large-model spatial perception method according to claim 1, wherein Step 4 comprises: defining standard geometric primitives as punctuation elements for the 3D point cloud features, the geometric primitives comprising a unit cube point cloud and a unit sphere point cloud used to represent instance punctuation and relation punctuation respectively; extracting the punctuation features with the 3D scene graph visual encoder; and, after extraction, freezing and solidifying the punctuation features so that the solidified feature vectors are invoked directly in subsequent inference steps (a sketch of this mechanism follows the claims).
  6. The 3D-scene-graph-driven multimodal large-model spatial perception method according to claim 1, wherein in Step 5 the point cloud projection layer consists of a multilayer perceptron whose parameters are updated during the instruction fine-tuning stage of the multimodal large model, so as to align the 3D geometric features of the scene with the text semantic space of the large model while establishing the logical association between punctuation and the subsequent semantic features; specifically, the point cloud projection layer is a two-layer fully connected network that aligns the distribution of the 3D geometric space to the text semantic space through a nonlinear mapping, allowing the large model to process geometric representations the same way it processes word Tokens (sketched after the claims).
  7. The 3D-scene-graph-driven multimodal large-model spatial perception method according to claim 1, wherein Step 6 comprises: serializing and assembling the mapped features to construct an input sequence with clear boundaries; defining special delimiters <point_start> and <point_end> that wrap the sequence; and, following a punctuation-content pairing principle, prepending the cube punctuation feature to each instance Token and the sphere punctuation feature to each relation Token, to form the structured 3D modality sequence (see the assembly sketch after the claims).
  8. The 3D-scene-graph-driven multimodal large-model spatial perception method according to claim 1, wherein Step 7 comprises: splicing the JSON-format text generated in Step 3, the user text instruction, the 2D scene picture features processed in Step 1, and the structured 3D modality sequence generated in Step 6, and pressing all modality information uniformly into the input buffer of the large model to form a multimodal instruction pool containing pixels, geometric topology, and logical text.
  9. The 3D-scene-graph-driven multimodal large-model spatial perception method according to claim 1, wherein Step 8 comprises: inputting the data of the multimodal instruction pool into a multimodal large-language-model decoder based on Qwen2.5-VL; keeping the 3D scene graph visual encoder frozen during training while fine-tuning the large-model parameters and the projection layer with a low-rank adaptation technique; and having the model generate accurate replies containing spatial semantic understanding under the logical guidance of the JSON text and the 3D information boundaries delimited by the "geometric punctuation".
  10. The 3D-scene-graph-driven multimodal large-model spatial perception method according to claim 1, wherein during training the multimodal large model adopts the next-token prediction task as its loss function, simultaneously optimizing the accuracy of spatial descriptions in text replies and the semantic alignment of the point cloud modality (a generic sketch of this objective follows the claims).
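
The sketches below illustrate, under stated assumptions, how individual claims might be realized; none of them is the patent's actual implementation. First, claim 5's geometric-primitive punctuation: sample a unit cube and a unit sphere point cloud, encode each once with the shared (frozen) 3D encoder, and cache the resulting vectors. The PointEncoder below is a hypothetical PointNet-style stand-in, since the patent does not specify the encoder architecture.

```python
import torch
import torch.nn as nn

def unit_cube(n=256):
    # Uniform samples inside the axis-aligned unit cube centered at the origin.
    return torch.rand(n, 3) - 0.5

def unit_sphere(n=256):
    # Uniform directions scaled to lie inside the unit ball.
    v = torch.randn(n, 3)
    v = v / v.norm(dim=1, keepdim=True)
    r = torch.rand(n, 1) ** (1.0 / 3.0)  # radius law for uniform volume sampling
    return v * r

class PointEncoder(nn.Module):
    """Stand-in for the shared, pretrained 3D visual encoder (PointNet-style)."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, pts):                      # pts: (N, 3)
        return self.mlp(pts).max(dim=0).values   # (dim,) global feature

encoder = PointEncoder().eval()
for p in encoder.parameters():                   # encoder stays frozen (claim 9)
    p.requires_grad_(False)

with torch.no_grad():
    # Extract once, then "freeze and solidify": the cached vectors are reused
    # verbatim as instance/relation punctuation at inference time (claim 5).
    instance_punct = encoder(unit_cube())    # cube -> instance punctuation
    relation_punct = encoder(unit_sphere())  # sphere -> relation punctuation
```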
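
Claim 6 specifies the point cloud projection layer as a two-layer fully connected network with a nonlinear mapping, trained during instruction fine-tuning. A minimal sketch follows; the dimensions (512 for the 3D features, 3584 for the LLM hidden size) are assumptions, not values given in the patent.

```python
import torch
import torch.nn as nn

class PointCloudProjector(nn.Module):
    """Two-layer MLP mapping 3D features into the LLM's embedding space.

    Its parameters are the ones updated during instruction fine-tuning
    (claim 6); everything upstream of it stays frozen.
    """
    def __init__(self, in_dim=512, llm_dim=3584):  # llm_dim assumed, not specified
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, llm_dim),
            nn.GELU(),                 # nonlinear mapping aligning distributions
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats):          # feats: (..., in_dim)
        return self.net(feats)

projector = PointCloudProjector()
token = projector(torch.randn(512))    # one 3D feature -> one semantic Token
```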
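
Claim 7's punctuation-content paired assembly can be sketched as follows: the whole 3D block is wrapped in <point_start>/<point_end> delimiters, a cube punctuation embedding precedes every instance token, and a sphere punctuation embedding precedes every relation token. All embeddings here are random placeholders.

```python
import torch

def assemble_3d_sequence(inst_tokens, rel_tokens, cube_tok, sphere_tok,
                         point_start, point_end):
    """Punctuation-content paired assembly (claim 7).

    inst_tokens / rel_tokens: lists of (llm_dim,) projected feature vectors.
    cube_tok / sphere_tok: frozen punctuation embeddings after projection.
    point_start / point_end: embeddings of the <point_start>/<point_end>
    delimiters that bound the 3D modality block.
    """
    seq = [point_start]
    for t in inst_tokens:
        seq += [cube_tok, t]       # cube punctuation in front of each instance
    for t in rel_tokens:
        seq += [sphere_tok, t]     # sphere punctuation in front of each relation
    seq.append(point_end)
    return torch.stack(seq)        # (L, llm_dim) structured 3D modal sequence

d = 3584
seq = assemble_3d_sequence(
    [torch.randn(d) for _ in range(3)], [torch.randn(d) for _ in range(2)],
    torch.randn(d), torch.randn(d), torch.randn(d), torch.randn(d))
print(seq.shape)  # torch.Size([12, 3584]): 1 + 2*3 + 2*2 + 1 tokens
```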
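
Finally, claim 10's training objective is standard next-token prediction. A generic sketch, not Qwen2.5-VL-specific: labels at prompt and 3D-modality positions would be masked so that only the textual reply contributes to the loss.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, labels, ignore_index=-100):
    """Causal LM loss: predict token t+1 from positions <= t.

    logits: (B, L, V) model outputs; labels: (B, L) token ids, with prompt
    and 3D-modality positions set to ignore_index so only the textual
    reply is optimized.
    """
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1), ignore_index=ignore_index)

B, L, V = 2, 16, 32000
loss = next_token_loss(torch.randn(B, L, V), torch.randint(0, V, (B, L)))
```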

Description

Multimodal large-model spatial perception method driven by a 3D scene graph

Technical Field

The invention relates to the technical field of artificial intelligence and computer vision, in particular to the construction and optimization of multimodal large language models (Multimodal Large Language Model, MLLM), and specifically to a multimodal large-model technique that achieves native multimodal perception and alignment of 3D scenes by introducing a 3D semantic scene graph visual encoder.

Background

With the rapid development of deep learning, large language models (LLMs) exhibit strong general reasoning capabilities. By introducing a visual encoder, multimodal large models (MLLMs) can understand images and text simultaneously and have made remarkable progress in tasks such as visual question answering and image description. However, in application scenarios facing the real physical world, such as embodied intelligence (Embodied AI) and robot navigation, existing multimodal large models still face serious challenges, mainly in the following aspects:

1. Lack of native 3D spatial perception. Existing mainstream multimodal large models are based primarily on 2D image input and attempt to infer three-dimensional spatial information through multi-view image projection. This dimension-reducing processing loses depth information, blurs occlusion relations, and yields inaccurate understanding of spatial topology (such as relative positions and distances among objects), so a model struggles to construct a true and coherent 3D world model.

2. Semantic confusion in long point cloud feature sequences. Although some research attempts to introduce 3D point cloud data directly, the raw point cloud is typically converted into a long, unordered Token sequence. This approach cannot capture the dependence between object instance geometry and inter-object relations; the point cloud features are intermixed and lack clear semantic boundaries, so models are prone to "getting lost" in long-context reasoning.

3. Lack of explicit spatial-logic guidance. Purely data-driven training makes it difficult for a model to capture complex spatial logic. Relying on implicit learning of point cloud features alone, models tend to hallucinate, and alignment with text expressions is hard to achieve; structured prior knowledge to assist the large model in analyzing and anchoring the 3D representation is missing.

Therefore, a new technical scheme is needed that introduces a native 3D scene graph representation into the multimodal large model and solves the large model's 3D spatial perception problem through an effective multimodal alignment and delimitation mechanism.
Disclosure of Invention

To solve the problems in the background art, the invention provides a multimodal large-model spatial perception method driven by a 3D scene graph, adopting the following technical scheme.

A multimodal large-model spatial perception method driven by a 3D scene graph comprises the following steps:

Step 1, acquiring multimodal raw information of the scene to be analyzed, the multimodal raw information comprising a scene image, a user instruction, and three-dimensional point cloud data;

Step 2, inputting the preprocessed point cloud data into a pretrained 3D scene graph visual encoder, parsing the object instances in the scene and the spatial topological relations among them, and extracting instance feature vectors and relation feature vectors respectively;

Step 3, converting the 3D scene graph into structured JSON-format text according to the parsing result;

Step 4, extracting features of a standard cube point cloud and a standard sphere point cloud from a predefined geometric primitive library, to serve as instance punctuation features and relation punctuation features respectively;

Step 5, constructing a point cloud projection layer and uniformly mapping the instance features, relation features, and punctuation features into the embedding space of a large language model;

Step 6, serializing and assembling the mapped features, and defining a 3D spatial-feature start token and a 3D spatial-feature end token to form a structured point cloud modality input sequence;

Step 7, splicing the JSON-format text, the user text instruction, and the 2D picture information with the structured point cloud modality input sequence to construct a multimodal instruction input pool;

Step 8, feeding the data of the multimodal instruction input pool into the large language model and generating, via its self-attention mechanism, a reply containing spatial semantic understanding.

Further, Step 1 includes the following steps: preprocessing the three-dimensional point cloud scene data, removing sensor noise points