
CN-122024236-A - Vocabulary retrieval method based on dynamic granularity adjustment and multi-modal context fusion

CN 122024236 A

Abstract

The invention discloses a vocabulary retrieval method based on dynamic granularity adjustment and multi-modal context fusion. Its core steps are: converting multi-frame RGB-D raw data into a structured 3D point cloud; generating high-precision 2D target masks through a density evaluation, quality feedback, and filtering optimization pipeline; generating highly discriminative semantic feature vectors by integrating multi-scale visual and language prior information; lifting the 2D masks and semantic features into 3D space to generate a coherent, accurate 3D semantic mask set; and parsing the user's free text, screening related candidates, matching them precisely under view-angle and spatial constraints, and selecting the highest-scoring result. The invention realizes dynamic granularity adjustment and multi-modal fusion, improves vocabulary retrieval accuracy and efficiency, provides a reliable basis for 3D semantic matching and target retrieval, and supports practical applications such as robot grasping and AR anchoring.

Inventors

  • Fang Chunxin
  • Zhao Tiancheng
  • Liao Jiajia
  • Liu Peng
  • Zhang Qianqian

Assignees

  • 杭州联汇科技股份有限公司 (Hangzhou Lianhui Technology Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2025-12-26

Claims (10)

  1. A vocabulary retrieval method based on dynamic granularity adjustment and multi-modal context fusion, characterized by comprising the following steps: step one, converting multi-frame RGB-D raw data into a structured 3D point cloud to provide a spatial carrier for subsequent semantic analysis; step two, generating high-precision 2D target masks through a density evaluation, quality feedback, and filtering optimization pipeline, providing a reliable 2D region basis for subsequent semantic encoding; step three, integrating the multi-modal information of multi-scale vision plus language priors to generate highly discriminative semantic feature vectors as the core basis for 3D semantic matching; step four, lifting the multi-view 2D masks of step two and the semantic features of step three into 3D space, and generating a coherent, accurate 3D semantic mask set through a voxelization, consistency merging, and noise filtering process, providing a 3D spatial semantic carrier for target retrieval; and step five, structurally parsing the user's free-text query, screening target and reference-object candidates from the 3D semantic mask set of step four, refining the candidate set according to the view-angle and spatial constraints in the query, computing a multi-modal matching score between each candidate target and the free text, and selecting the highest-scoring candidate as the final retrieval result.
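The five steps of claim 1 can be read as a single pipeline. A minimal Python skeleton follows; every function name is a hypothetical placeholder for a stage elaborated in claims 2-10, not an identifier from the patent:

```python
# Minimal sketch of the five-step pipeline of claim 1.
# All callee names are hypothetical placeholders, not from the patent.

def retrieve_object(rgbd_frames, poses, intrinsics, query_text):
    # Step 1: fuse multi-frame RGB-D data into a global 3D point cloud.
    cloud = build_point_cloud(rgbd_frames, poses, intrinsics)
    # Step 2: density-adaptive, quality-fed-back 2D mask generation.
    masks_2d = generate_2d_masks(rgbd_frames)
    # Step 3: multi-scale visual context + language prior -> semantic features.
    features = encode_semantic_features(rgbd_frames, masks_2d, cloud)
    # Step 4: lift 2D masks and features to a coherent 3D semantic mask set.
    masks_3d = lift_to_3d(masks_2d, features, cloud)
    # Step 5: parse the query, screen and score candidates, return the best.
    return match_query(query_text, masks_3d)
```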
  2. The vocabulary retrieval method based on dynamic granularity adjustment and multi-modal context fusion according to claim 1, wherein the specific steps of converting multi-frame RGB-D raw data into a structured 3D point cloud in step one are as follows: step 1.1, defining the input data: for each time step $t$, the inputs are an RGB image $I_t$ (with image height $H$, width $W$, and the 3 RGB channels), a depth map $D_t$ whose pixel values give the physical distance from each point to the camera in meters, the camera pose $T_t \in SE(3)$ (the special Euclidean group), comprising the camera's translation vector in the world coordinate system and the rotation matrix describing the camera's position and orientation, and the camera intrinsic matrix $K$, determined by the camera hardware parameters, of the form $K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$, where $f_x, f_y$ are the focal lengths along the x/y axes and $(c_x, c_y)$ are the coordinates of the image center point; step 1.2, for each pixel $(u, v)$ of the RGB image, with $u$ the horizontal and $v$ the vertical coordinate, first converting it to the homogeneous coordinate $[u, v, 1]^T$ and then, combining the depth map with the camera parameters, back-projecting it to a 3D point in the world coordinate system via the formula $P_{world} = T_t \cdot \left[\, D_t(u, v) \cdot K^{-1} [u, v, 1]^T ;\ 1 \,\right]$, where the matrix $K^{-1}$ is the inverse of the camera intrinsic matrix and converts pixel coordinates into normalized coordinates in the camera coordinate system, the depth value $D_t(u, v)$ of pixel $(u, v)$ scales the normalized coordinates to 3D coordinates in the camera coordinate system, and the camera pose matrix $T_t$ converts 3D coordinates from the camera coordinate system into the world coordinate system; step 1.3, de-duplicating and concatenating the back-projected 3D points of all time steps to finally generate a global 3D point cloud, whose total point count is typically in the millions.
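The back-projection of step 1.2 is the standard pinhole-camera inversion. A minimal numpy sketch, assuming a zero depth value marks an invalid pixel; the function and argument names are illustrative:

```python
import numpy as np

def backproject(depth, K, T_wc):
    """Back-project a depth map (H, W, meters) to world-frame 3D points.

    K    : (3, 3) camera intrinsic matrix
    T_wc : (4, 4) camera-to-world pose matrix
    Assumes depth == 0 marks an invalid pixel.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # homogeneous pixels
    rays = np.linalg.inv(K) @ pix                  # normalized camera-frame coordinates
    pts_cam = rays * depth.reshape(1, -1)          # scale by per-pixel depth d(u, v)
    pts_cam_h = np.vstack([pts_cam, np.ones((1, pts_cam.shape[1]))])
    pts_world = (T_wc @ pts_cam_h)[:3].T           # rotate/translate into world frame
    return pts_world[depth.reshape(-1) > 0]        # drop invalid (zero-depth) pixels
```

The cross-frame de-duplication of step 1.3 can then be approximated by rounding the concatenated points to a small voxel grid and keeping only unique cells.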
  3. The vocabulary retrieval method based on dynamic granularity adjustment and multi-modal context fusion according to claim 2, wherein the specific steps of generating the high-precision 2D target mask in step two are as follows: step 2.1, based on the Semantic-SAM segmentation model, first carrying out edge detection and connected-component analysis to count the number of preliminary candidate targets in the image, then computing the target density (the number of candidate targets per unit area) and adaptively determining the initial granularity from it: if the density exceeds a threshold, the initial granularity is set fine, otherwise coarse; step 2.2, based on the initial granularity, generating a 3-level granularity sequence in which the quality feedback of each level's masks adjusts the granularity of the next level, realizing progressive coarse-to-fine-to-refined optimization; step 2.3, optimizing the mask set generated by the three granularity levels in two steps, removing redundancy and noise: for redundancy filtering, each level's masks are compared against the mask sets of the preceding levels by maximum overlap ratio, and only masks whose maximum overlap ratio falls below a threshold are retained, avoiding repeated labeling; for noise filtering, ① masks whose area falls below a minimum-area threshold are removed, and ② DBSCAN clustering is applied to the pixel regions of the remaining masks.
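A sketch of the density-adaptive granularity choice and the redundancy/area filtering of claim 3; the values of `tau`, `min_area`, and `iou_max` are illustrative, since the patent's actual thresholds are not given in this text:

```python
import numpy as np

def initial_granularity(n_candidates, image_area, tau=0.001):
    # Candidate density rho = candidates per unit pixel area; tau is an
    # illustrative threshold, not the patent's value.
    rho = n_candidates / image_area
    return "fine" if rho > tau else "coarse"

def filter_masks(masks, min_area=100, iou_max=0.9):
    """Redundancy + noise filtering sketch for the 3-level mask set.

    masks: list of boolean (H, W) arrays, ordered by granularity level.
    """
    kept = []
    for m in masks:
        if m.sum() < min_area:                       # tiny-region noise
            continue
        ious = [np.logical_and(m, k).sum() / np.logical_or(m, k).sum()
                for k in kept]
        if not ious or max(ious) < iou_max:          # drop near-duplicates
            kept.append(m)
    return kept
```

The claim's final step would additionally run DBSCAN over each surviving mask's pixel coordinates to strip scattered outlier pixels.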
  4. The vocabulary retrieval method based on dynamic granularity adjustment and multi-modal context fusion according to claim 3, wherein the specific manner of the progressive coarse-to-fine-to-refined optimization in step two is as follows: level 1 uses the initial granularity to generate a first mask set; level 2 computes the mask quality score of the level-1 masks, combining their integrity and compactness (a compactness closer to 1 indicates a more regular shape): if the quality score exceeds a feedback threshold the granularity is refined, and vice versa coarsened, generating a second mask set; level 3 repeats the level-2 logic, adjusting the granularity according to the quality score of the level-2 masks to generate a third mask set.
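The patent combines integrity and compactness into a quality score, but the exact formula is elided from this text. A sketch under the assumption that compactness is the standard isoperimetric quotient $4\pi A / P^2$ (1.0 for a perfect disk) and integrity is a convex-hull fill ratio, equally weighted:

```python
import numpy as np
import cv2

def mask_quality(mask):
    """Quality-score sketch for claim 4 (integrity + compactness).

    Assumes compactness = 4*pi*A/P^2 and integrity = fill ratio of the
    mask against its convex hull; the equal weighting is illustrative.
    """
    m = mask.astype(np.uint8)
    contours, _ = cv2.findContours(m, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0.0
    area = float(m.sum())
    perimeter = sum(cv2.arcLength(c, True) for c in contours)
    compactness = 4.0 * np.pi * area / (perimeter ** 2 + 1e-6)
    hull = cv2.convexHull(np.vstack([c.reshape(-1, 2) for c in contours]))
    integrity = area / (cv2.contourArea(hull) + 1e-6)   # completeness proxy
    return 0.5 * compactness + 0.5 * integrity          # illustrative weighting
```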
  5. The vocabulary retrieval method based on dynamic granularity adjustment and multi-modal context fusion according to claim 1 or 2, wherein the specific steps of generating the highly discriminative semantic feature vectors in step three are as follows: step 3.1, for each mask, extracting 4 types of complementary visual context from the two dimensions of the RGB image and the 3D point cloud, covering local detail, neighbor association, global structure, and geometric attributes; step 3.2, extracting visual features and language priors: the 4 types of visual context are input respectively to a CLIP image encoder (a pre-trained model with a fixed output feature dimension) to obtain the corresponding visual feature vectors, and, according to the scene type, category prior features of the scene's common targets are extracted through a pre-trained language model, generating a language feature vector for each category; these are average-pooled into a scene-level language prior feature that provides semantic guidance information; step 3.3, dynamically assigning weights to the 4 types of visual features with a language-prior-guided attention mechanism to realize adaptive fusion of the multi-modal information: for attention weight computation, the cosine similarity between each type of visual feature and the language prior feature is computed and normalized through a Softmax function to obtain the attention weights (the higher the similarity, the greater the weight, ensuring that information related to scene semantics is preferentially retained); for multi-modal feature aggregation, the attention-weighted visual features are linearly fused with the language prior feature, under a language-prior weight that balances the contributions of visual and language information, to obtain the initial aggregated feature; for feature enhancement and normalization, to improve feature robustness a small Gaussian noise is added to the aggregated feature, which is then L2-normalized to ensure feature scale consistency, yielding the final multi-modal semantic feature.
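A sketch of the language-prior-guided attention fusion of claim 5, assuming pre-computed CLIP features; the fusion weight `lam` and noise scale `sigma` stand in for the patent's elided constants:

```python
import numpy as np

def fuse_contexts(vis_feats, lang_prior, lam=0.5, sigma=0.01):
    """Language-prior-guided fusion of the 4 visual context features.

    vis_feats  : (4, D) CLIP image features (local / neighbor / global / geometric)
    lang_prior : (D,) average-pooled scene-level language prior feature
    lam, sigma : illustrative values, not from the patent.
    """
    def l2(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    v, L = l2(vis_feats), l2(lang_prior)
    sims = v @ L                                   # cosine similarity per context
    w = np.exp(sims) / np.exp(sims).sum()          # softmax attention weights
    f = (w[:, None] * v).sum(0) + lam * L          # weighted fusion + language prior
    f = f + np.random.normal(0, sigma, f.shape)    # small Gaussian noise for robustness
    return l2(f)                                   # final L2-normalized feature
```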
  6. The vocabulary retrieval method based on dynamic granularity adjustment and multi-modal context fusion according to claim 2, wherein the specific steps of generating the coherent and accurate 3D semantic mask set in step four are as follows: step 4.1, for each 2D mask, first obtaining the corresponding 3D point set according to the back-projection logic of step one, then voxelizing it (with a fixed voxel size, i.e. each voxel represents a real-world cube region of a set number of centimeters) to generate a 3D voxel occupancy set, where each 3D voxel's spatial range is obtained by partitioning the world coordinate system according to the voxel size; step 4.2, merging redundant masks through semantic consistency judgment: first, semantic consistency scoring, in which for any two masks from different viewing angles a 3D semantic consistency score is defined that integrates their voxel overlap (the intersection-over-union of the two voxel sets, with volume measured over the voxel sets) and the cosine similarity of their semantic features; masks are then merged according to the rule that if the consistency score exceeds a merging threshold the two masks are merged into a single 3D mask, whose voxel set is the union of the two and whose semantic feature is the voxel-volume-weighted average of the two features; step 4.3, performing two-step noise filtering on the merged 3D mask set to improve semantic purity, specifically: tiny-target filtering, removing 3D masks whose voxel count falls below a minimum; and isolated-voxel filtering, fitting a 3D surface to each remaining 3D mask's voxel set with the RANSAC algorithm, computing the distance of each voxel to the fitted surface, and removing isolated voxels farther than 0.02 m, finally obtaining the global 3D semantic mask set and its corresponding semantic feature set.
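A sketch of the voxelization and consistency merge of claim 6; the blend weight `alpha`, merge threshold `thresh`, and 5 cm voxel size are illustrative, since the claim elides the exact score definition and constants:

```python
import numpy as np

def voxelize(points, voxel=0.05):
    # Map world-frame points to unique integer voxel cells (illustrative 5 cm grid).
    return {tuple(v) for v in np.unique(np.floor(points / voxel).astype(int), axis=0)}

def consistency_score(vox_a, vox_b, feat_a, feat_b, alpha=0.5):
    """3D semantic consistency: voxel IoU blended with feature cosine similarity."""
    iou = len(vox_a & vox_b) / max(len(vox_a | vox_b), 1)
    cos = float(np.dot(feat_a, feat_b) /
                (np.linalg.norm(feat_a) * np.linalg.norm(feat_b) + 1e-8))
    return alpha * iou + (1 - alpha) * cos

def maybe_merge(vox_a, vox_b, feat_a, feat_b, thresh=0.6):
    # Merge two multi-view masks only if their consistency score is high enough.
    if consistency_score(vox_a, vox_b, feat_a, feat_b) < thresh:
        return None
    merged_vox = vox_a | vox_b
    w_a, w_b = len(vox_a), len(vox_b)              # voxel-volume weights
    merged_feat = (w_a * feat_a + w_b * feat_b) / (w_a + w_b)
    return merged_vox, merged_feat
```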
  7. The vocabulary retrieval method based on dynamic granularity adjustment and multi-modal context fusion according to claim 1 or 2, wherein the specific manner of screening target- and reference-object-related candidates from the 3D semantic mask set of step four is as follows: text feature generation, in which CLIP text prompts are generated for the target and the reference object parsed from the query and input to a CLIP text encoder to obtain the corresponding text features; and semantic similarity computation, in which the cosine similarity of each 3D semantic mask's feature to the target text feature and to the reference text feature is computed respectively.
  8. The vocabulary retrieval method according to claim 7, wherein candidate screening retains the masks whose similarity to the target text feature exceeds a threshold as the target candidate set, and the masks whose similarity to the reference text feature exceeds a threshold as the reference candidate set, thereby preliminarily filtering out irrelevant targets.
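A combined sketch of claims 7 and 8: cosine screening of the 3D mask features against the CLIP text features of the target and the reference object; the threshold `tau` is illustrative, since the patent's value is elided:

```python
import numpy as np

def screen_candidates(masks_feats, target_feat, ref_feat, tau=0.25):
    """CLIP-space cosine screening of the 3D semantic mask set.

    masks_feats : (N, D) semantic features of the 3D masks
    target_feat, ref_feat : (D,) CLIP text features for target and reference
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-8)

    s_tgt, s_ref = cos(masks_feats, target_feat), cos(masks_feats, ref_feat)
    targets = np.flatnonzero(s_tgt > tau)          # candidate target set
    refs = np.flatnonzero(s_ref > tau)             # candidate reference set
    return targets, refs
```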
  9. The vocabulary retrieval method based on dynamic granularity adjustment and multi-modal context fusion according to claim 7, wherein the specific steps of refining the candidate set according to the view-angle and spatial constraints in the query in step five are as follows: first, view-angle constraint verification: ① computing each reference candidate's 3D center coordinates as the mean of its voxel center coordinates; ② determining the "frontal" direction of the reference object from the query's view-angle phrase (e.g. "facing the television cabinet"), as judged from 3D geometric features such as the planar orientation of the television cabinet, and generating a frontal view-angle coordinate system with the reference center as origin and the frontal direction as the positive z-axis; ③ computing each target candidate's center coordinates in the view-angle coordinate system and retaining the target candidates whose azimuth angle lies within the view-angle range (e.g. ±30°); then, spatial relationship verification: ① computing the 3D spatial vector from the reference candidate to the target candidate; ② projecting the vector onto the x-y plane of the view-angle coordinate system according to the spatial relation in the query and judging whether the spatial relation is satisfied; ③ retaining the target candidates meeting both the view-angle and the spatial constraints to obtain the final candidate set.
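A sketch of the view-angle constraint check in claim 9, keeping only target candidates inside the reference object's frontal cone; the ±30° window follows the claim's own example, and the function name is illustrative:

```python
import numpy as np

def within_view_cone(ref_center, ref_front, target_center, half_angle_deg=30.0):
    """Keep targets inside the reference object's frontal cone.

    ref_front is the reference object's "front" direction (e.g. the facing
    normal of a TV cabinet); the +/-30 degree window follows the claim's example.
    """
    d = target_center - ref_center
    d_xy = d[:2] / (np.linalg.norm(d[:2]) + 1e-8)        # project to x-y plane
    f_xy = ref_front[:2] / (np.linalg.norm(ref_front[:2]) + 1e-8)
    azimuth = np.degrees(np.arccos(np.clip(d_xy @ f_xy, -1.0, 1.0)))
    return azimuth <= half_angle_deg
```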
  10. The vocabulary retrieval method based on dynamic granularity adjustment and multi-modal context fusion according to claim 8, wherein the specific manner of computing the multi-modal matching score between each candidate target and the free text and selecting the highest-scoring candidate as the final retrieval result is as follows: for each candidate target in the final candidate set, a multi-modal matching score with the query is computed that combines the target candidate's semantic similarity to the query text with its degree of spatial-relation match; the highest-scoring candidate target is selected as the final retrieval result, and its 3D semantic mask, 3D center coordinates, and corresponding semantic feature are output, providing accurate spatial information for subsequent applications such as robot grasping and AR anchoring.
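A sketch of the final scoring of claim 10, blending semantic similarity with a spatial-relation match score; the weight `beta` stands in for the patent's elided weighting:

```python
import numpy as np

def best_match(candidates, feats, centers, query_feat, spatial_scores, beta=0.5):
    """Pick the candidate with the highest multi-modal matching score.

    Score = beta * semantic similarity + (1 - beta) * spatial-relation match;
    beta is illustrative, not the patent's value.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-8)

    scores = beta * cos(feats[candidates], query_feat) \
             + (1 - beta) * spatial_scores[candidates]
    best = candidates[int(np.argmax(scores))]
    return best, centers[best], feats[best]        # mask id, 3D center, feature
```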

Description

Vocabulary retrieval method based on dynamic granularity adjustment and multi-modal context fusion

Technical Field

The invention relates to the technical field at the intersection of computer vision and 3D scene understanding, and in particular to a vocabulary retrieval method based on dynamic granularity adjustment and multi-modal context fusion.

Background

Existing 3D open-vocabulary understanding methods have three core defects. First, masks generated by traditional 2D segmentation models are fragile and incomplete in cluttered indoor environments, and semantic continuity after 3D projection is poor. Second, vision-language model encoding lacks multi-modal context guidance and easily produces semantic ambiguity (such as misjudging a "desktop edge" as a "plank"). Third, natural-language retrieval struggles with the multi-layer logic of category semantics, spatial relations, and view-angle constraints, so positioning accuracy for complex queries is low.

Disclosure of the Invention

Aiming at the defects of the prior art, the invention provides a 3D semantic segmentation and natural-language-driven target retrieval method for open environments, which can be directly applied to tasks such as embodied intelligence (Embodied AI) interaction, autonomous robot navigation and manipulation, and augmented reality (AR) / virtual reality (VR) scene anchoring, realizing high-precision 3D scene semantic analysis and target positioning without 3D annotation data or task-specific training. To achieve the above purpose, the invention provides a vocabulary retrieval method based on dynamic granularity adjustment and multi-modal context fusion, comprising the following steps: step one, converting multi-frame RGB-D raw data into a structured 3D point cloud to provide a spatial carrier for subsequent semantic analysis; step two, generating high-precision 2D target masks through a density evaluation, quality feedback, and filtering optimization pipeline, providing a reliable 2D region basis for subsequent semantic encoding; step three, integrating the multi-modal information of multi-scale vision plus language priors to generate highly discriminative semantic feature vectors as the core basis for 3D semantic matching; step four, lifting the multi-view 2D masks of step two and the semantic features of step three into 3D space, and generating a coherent, accurate 3D semantic mask set through a voxelization, consistency merging, and noise filtering process, providing a 3D spatial semantic carrier for target retrieval; and step five, structurally parsing the user's free-text query, screening target and reference-object candidates from the 3D semantic mask set of step four, refining the candidate set according to the view-angle and spatial constraints in the query, computing a multi-modal matching score between each candidate target and the free text, and selecting the highest-scoring candidate as the final retrieval result.
As a further improvement of the present invention, the specific steps of converting the multi-frame RGB-D raw data into a structured 3D point cloud in step one are as follows: step 1.1, defining the input data: for each time step $t$, the inputs are an RGB image $I_t$ (with image height $H$, width $W$, and the 3 RGB channels), a depth map $D_t$ whose pixel values give the physical distance from each point to the camera in meters, the camera pose $T_t \in SE(3)$ (the special Euclidean group), comprising the camera's translation vector in the world coordinate system and the rotation matrix describing the camera's position and orientation, and the camera intrinsic matrix $K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$, determined by the camera hardware parameters, where $f_x, f_y$ are the focal lengths along the x/y axes and $(c_x, c_y)$ are the coordinates of the image center point; step 1.2, for each pixel $(u, v)$ of the RGB image, with $u$ the horizontal and $v$ the vertical coordinate, first converting it to the homogeneous coordinate $[u, v, 1]^T$ and then, combining the depth map with the camera parameters, back-projecting it to a 3D point in the world coordinate system via $P_{world} = T_t \cdot \left[\, D_t(u, v) \cdot K^{-1} [u, v, 1]^T ;\ 1 \,\right]$, where $K^{-1}$ converts pixel coordinates into normalized coordinates in the camera coordinate system, the depth value $D_t(u, v)$ scales the normalized coordinates to 3D coordinates in the camera coordinate system, and the pose matrix $T_t$ converts these into world coordinates; step 1.3, de-duplicating and concatenating the back-projected 3D points of all time steps to generate a global 3D point cloud whose total point count is typically in the millions. As a further improvement of the present invention, the specific steps for generating the high-precision 2D target mask in step two are as follows: based on the Semantic-SAM segmentation model, first carrying out edge detection and connected-component analysis