
CN-121982719-A - Multi-modal fusion indoor scene semantic occupancy training method


Abstract

A multi-modal fusion indoor scene semantic occupancy training method relates to the technical field of 3D scene semantic occupancy. The method uses a large vision model to extract visual features as its basis and optimizes the spatial and semantic feature dimensions, providing basic spatial and semantic information; it adds a text-modality input to strengthen the model's understanding of the semantic space; and it introduces a hard-voxel mining strategy so that hard voxels are learned in a targeted manner, alleviating model overfitting. Specifically, the method combines an attention mechanism to optimize the spatial and semantic feature dimensions and improve the expressiveness of visual features, introduces the text modality to realize multi-modal feature fusion and enhance the model's understanding of the semantic space, and allocates training resources more effectively through the hard-voxel mining strategy, improving the model's learning of key regions. Indoor scene semantic occupancy prediction performance is markedly improved, providing more reliable 3D scene semantic support.

Inventors

  • YUAN LICHEN
  • ZHANG ZHIZHONG
  • TAN XIN
  • XIE YUAN

Assignees

  • East China Normal University

Dates

Publication Date
2026-05-05
Application Date
2026-01-23

Claims (3)

  1. A multi-modal fusion indoor scene semantic occupancy training method, comprising the following steps:

     Step 1, optimizing the visual encoder. Select the large vision model Dinov as the visual encoder, adjust the input image size by bilinear interpolation, remove the encoder's output layer, and obtain all encoded feature dimensions directly. Select a low-level visual feature layer, a transition layer, and a high-level semantic feature layer as the visual features for the subsequent model: the low-level layer captures basic visual features such as edges, texture and color; the middle layer captures the transition from visual features to semantic features; and the high-level layer captures highly abstract global semantic features.

     Step 2, optimizing the spatial and semantic visual feature dimensions. From the three feature layers obtained in Step 1, take the high-level features as input and feed them simultaneously into two sub-modules, a position attention module and a channel attention module, obtaining features with a global context view and features with channel-association information. Concatenate and fuse these two outputs, then perform cross-attention with the low-level features to obtain optimized high-level visual features that replace the originals.

     Step 3, adding text-modality input and fusing image and text features. For each input image, first obtain the corresponding accurate text description and tokenize it against the vocabulary of a large language model to obtain text token vectors; then encode the text so that the discrete token vectors become computable vector features, applying positional encoding to preserve the semantic-logical relations of the text sequence. Apply self-attention to the encoded text features, perform cross-attention fusion with the optimized visual features from Step 2 to obtain text features cross-aligned with the image visual features, and finally pass them through one feed-forward layer to obtain the final text features, markedly improving the model's understanding of the semantic space (see the fusion sketch after the claims).

     Step 4, mining hard voxels for focused learning. After the visual and text features are passed through a decoder to obtain 3D spatial features, randomly sample coordinates of the 3D voxel space and sample the per-class probabilities at those coordinates, extracting the largest and second-largest semantic probabilities. From these, compute the global difficulty of the sampled coordinates, predict the semantic categories of hard voxels in 3D space with a one-dimensional convolution, and return the hard-voxel indices, recording the positions of the hard voxels in the ground truth. In the loss-computation stage, introduce a cross-entropy loss between the predictions of the hard-voxel branch and the corresponding hard-voxel ground truth (see the mining sketch after the claims).
  2. The multi-modal fusion indoor scene semantic occupancy training method of claim 1, wherein the position attention module first passes the input high-level image features through a two-dimensional convolution and two linear layers to obtain a position query feature, a position matching feature, and a position content feature; the query and matching features are dimension-reordered and multiplied to obtain a spatial position relation map between image features; the content feature is then multiplied by the spatial position relation map, and the result is weighted and summed with the original features, so that the weighted features are features with a global context view (see the attention sketch after the claims).
  3. The multi-modal fusion indoor scene semantic occupancy training method of claim 1, wherein the channel attention module first passes the input high-level image features through a two-dimensional convolution and two linear layers to obtain a channel query feature, a channel matching feature, and a channel content feature; the query and matching features are dimension-reordered and multiplied to obtain a relation map among feature channels; the content feature is then multiplied by the channel relation map, and the result is weighted and summed with the original features to obtain features with channel-association information (see the attention sketch after the claims).
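
The dual-attention design of claims 2 and 3 follows the familiar query/matching/content (Q/K/V) pattern. Below is a minimal PyTorch sketch, assuming 1x1 convolutions for the three projections, a channel reduction of c // 8 in the position branch, and a zero-initialized learnable fusion weight; none of these hyper-parameters are given in this record.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Position attention (claim 2): an (HW x HW) spatial relation map
    re-weights the content features, and the result is summed with the
    original features. Q/K/V stand in for the claim's position query /
    matching / content features."""
    def __init__(self, c: int):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 3, padding=1)   # shared 2D convolution
        self.q = nn.Conv2d(c, c // 8, 1)            # position query feature
        self.k = nn.Conv2d(c, c // 8, 1)            # position matching feature
        self.v = nn.Conv2d(c, c, 1)                 # position content feature
        self.gamma = nn.Parameter(torch.zeros(1))   # learned fusion weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        f = self.conv(x)
        q = self.q(f).flatten(2).transpose(1, 2)    # (B, HW, C/8)
        k = self.k(f).flatten(2)                    # (B, C/8, HW)
        v = self.v(f).flatten(2)                    # (B, C, HW)
        rel = torch.softmax(q @ k, dim=-1)          # (B, HW, HW) spatial relation map
        out = (v @ rel.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                 # weighted sum with the input

class ChannelAttention(nn.Module):
    """Channel attention (claim 3): a (C x C) relation map among feature
    channels re-weights channel content, again fused with the input."""
    def __init__(self, c: int):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 3, padding=1)
        self.q = nn.Conv2d(c, c, 1)                 # channel query feature
        self.k = nn.Conv2d(c, c, 1)                 # channel matching feature
        self.v = nn.Conv2d(c, c, 1)                 # channel content feature
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        f = self.conv(x)
        q = self.q(f).flatten(2)                    # (B, C, HW)
        k = self.k(f).flatten(2)                    # (B, C, HW)
        v = self.v(f).flatten(2)                    # (B, C, HW)
        rel = torch.softmax(q @ k.transpose(1, 2), dim=-1)  # (B, C, C) channel relation map
        out = (rel @ v).view(b, c, h, w)
        return self.gamma * out + x
```

In the position branch the relation map relates every spatial location to every other, giving the global context view; in the channel branch it relates feature channels, matching the claim's relation map among feature channels.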
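Step 3 of claim 1 describes a standard attention pipeline over the encoded text: self-attention, cross-attention against the optimized visual features, then one feed-forward layer. A minimal sketch, assuming a shared model width d, post-attention residual connections with LayerNorm, and PyTorch's nn.MultiheadAttention; the actual tokenizer, text encoder and dimensions are not specified in this record.

```python
import torch
import torch.nn as nn

class TextVisualFusion(nn.Module):
    """Step 3 sketch: text self-attention, then cross-attention with the
    optimized visual features, then a feed-forward layer."""
    def __init__(self, d: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(),
                                 nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(d) for _ in range(3))

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text:   (B, T, d)  encoded tokens with positional encoding
        # visual: (B, HW, d) optimized visual features from Step 2
        t = self.n1(text + self.self_attn(text, text, text)[0])
        # Cross-attention aligns the text features with the image features.
        t = self.n2(t + self.cross_attn(t, visual, visual)[0])
        return self.n3(t + self.ffn(t))   # final text features
```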
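Step 4's global difficulty is driven by the two largest class probabilities at each randomly sampled voxel coordinate: when the top-1 and top-2 probabilities are close, the model is ambiguous and the voxel is hard. The sketch below ranks sampled voxels by that margin and applies the extra cross-entropy term only to the hardest ones. The sample count, top ratio and margin-based difficulty are illustrative assumptions (the record does not give the exact difficulty formula), and the patent's separate one-dimensional-convolution prediction branch is stood in for here by the main occupancy logits.

```python
import torch
import torch.nn.functional as F

def hard_voxel_loss(occ_logits, occ_labels, num_samples=8192, top_ratio=0.25):
    """occ_logits: (V, K) class scores over the V voxels of the decoded
    3D volume (flattened); occ_labels: (V,) ground-truth classes."""
    # Randomly sample coordinates of the 3D voxel space.
    coords = torch.randint(occ_logits.shape[0], (num_samples,),
                           device=occ_logits.device)
    probs = torch.softmax(occ_logits[coords], dim=-1)   # (N, K)
    top2 = probs.topk(2, dim=-1).values                 # largest / second largest
    difficulty = -(top2[:, 0] - top2[:, 1])             # small margin => hard voxel
    k = max(1, int(top_ratio * num_samples))
    hard_idx = coords[difficulty.topk(k).indices]       # positions in the ground truth
    # Extra cross-entropy between the hard-voxel branch and its ground truth.
    return F.cross_entropy(occ_logits[hard_idx], occ_labels[hard_idx]), hard_idx
```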

Description

Multi-modal fusion indoor scene semantic occupancy training method

Technical Field

The invention relates to the technical field of 3D scene semantic occupancy and is suitable for applications that need accurate indoor 3D spatial semantic information, such as robot navigation, smart-home environment sensing, and virtual- or augmented-reality scene construction. It particularly relates to a multi-modal fusion indoor scene semantic occupancy training method.

Background

Semantic occupancy prediction is a core task of 3D scene understanding. Its aim is to judge the semantic category of each voxel in 3D space, such as empty voxel, wall, floor, sofa or chair, providing low-level semantic support for downstream intelligent-interaction tasks. With the rapid development of the robotics and VR/AR industries, the demand for indoor scene semantic occupancy prediction is increasingly prominent, and its performance directly determines the interaction precision and user experience of downstream applications. However, existing indoor scene semantic occupancy prediction techniques still have several shortcomings that urgently need to be solved.

Reliance on a single image modality. Mainstream methods extract features from images alone, and mining spatial and semantic associations from visual information only makes it difficult to capture the hidden semantic logic in an image. The ability to distinguish objects with similar appearance but different semantics is therefore insufficient, and prediction accuracy is low.

Insufficient visual feature expression. Existing visual encoders lack optimization specific to the semantic occupancy task; low-dimensional detail features and high-dimensional semantic features are insufficiently fused, so spatial position accuracy and global semantic consistency cannot both be maintained, which further restricts feature expressiveness.

Unreasonable allocation of training resources. An indoor dense 3D space contains a large number of empty voxels without actual semantics. Existing methods apply a uniform training strategy to all voxels, so a large amount of computation is spent on ineffective learning of empty voxels while hard voxels in edge-object and occluded regions receive too few learning resources, reducing training efficiency and easily causing model overfitting.

Under-use of cross-modal information. The text modality contains rich semantic descriptions, such as "a gray sofa and a wooden tea table in the living room", which can supplement scene semantic understanding; but the prior art does not effectively fuse the text and visual modalities, so cross-modal complementarity cannot improve the model's semantic understanding of complex scenes.

These problems compound one another, leaving the adaptability and prediction accuracy of existing semantic occupancy prediction methods in complex indoor environments short of practical application requirements, and forming a key bottleneck restricting the deployment of 3D scene understanding technology.
Therefore, developing a semantic occupancy training method that can fuse multi-modal features, optimize feature expression, and allocate training resources efficiently has important technical value and practical significance.

Disclosure of Invention

The invention aims to overcome the defects of low semantic occupancy prediction accuracy, insufficient feature expression and low training efficiency in the prior art, and provides a multi-modal fusion indoor scene semantic occupancy training method. The method selects the large vision model Dinov to extract three layers of visual features covering low-dimensional and high-dimensional information, designs a fusion of position and channel attention to optimize the spatial and semantic feature dimensions, introduces a text modality that is encoded and then fused with the visual features by cross-attention to supplement deep semantic information, and adds a hard-voxel prediction branch trained with a cross-entropy loss that guides the model to focus on key regions such as edge objects, thereby markedly improving indoor scene semantic occupancy prediction performance.

The technical scheme for realizing the aim of the invention is as follows. A multi-modal fusion indoor scene semantic occupancy training method comprises the following steps:

Step 1, optimizing the visual encoder. Select the large vision model Dinov as the visual encoder, adjust the input image size by bilinear interpolation, remove the encoder's output layer, and obtain all encoded feature dimensions directly (a hedged extraction sketch follows).
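
A minimal sketch of the Step 1 feature extraction, assuming a DINOv2-style get_intermediate_layers API and illustrative layer indices for the low, transition and high layers; the record names the model only as "Dinov" and does not give the indices or the target resolution.

```python
import torch
import torch.nn.functional as F

def extract_three_layer_features(encoder, image, layers=(3, 12, 23),
                                 size=(518, 518)):
    # Resize the input image with bilinear interpolation (Step 1).
    image = F.interpolate(image, size=size, mode="bilinear",
                          align_corners=False)
    # With the output head removed, take the token features of a low,
    # a transition and a high transformer block directly.
    low, mid, high = encoder.get_intermediate_layers(image, n=layers)
    return low, mid, high   # each: (B, num_tokens, C)
```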