
CN-121999342-A - Spatial reasoning device, spatial encoder training device and electronic equipment

CN 121999342 A

Abstract

A spatial reasoning apparatus, a spatial encoder training apparatus, and an electronic device are disclosed. The spatial reasoning apparatus comprises one or more processors configured to: acquire a task instruction and a to-be-processed image corresponding to a first view, wherein the task instruction is used for acquiring a global spatial description of a target object in the to-be-processed image; process the to-be-processed image to obtain visual features and local spatial features corresponding to the first view; process the local spatial features through a pre-trained spatial encoder to obtain global spatial features of the to-be-processed image in three-dimensional space; and process the visual features, the global spatial features, and the task instruction to obtain the global spatial description of the target object in three-dimensional space. The apparatus can improve the accuracy and robustness of three-dimensional spatial understanding and reasoning.

Inventors

  • JIANG HAOYI
  • LIU LIU
  • WANG XINJIE
  • SUI WEI
  • SU ZHIZHONG
  • WANG XINGGANG

Assignees

  • 深圳地瓜机器人有限公司
  • 北京箩卜的壳科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2026-01-29

Claims (16)

  1. A spatial reasoning apparatus, comprising one or more processors configured to: acquire a task instruction and a to-be-processed image corresponding to a first view, wherein the task instruction is used for acquiring a global spatial description of a target object in the to-be-processed image; process the to-be-processed image to obtain visual features and local spatial features corresponding to the first view; process the local spatial features through a pre-trained spatial encoder to obtain global spatial features of the to-be-processed image in three-dimensional space; and process the visual features, the global spatial features, and the task instruction to obtain the global spatial description of the target object in three-dimensional space.
  2. The apparatus of claim 1, wherein processing the local spatial features through the pre-trained spatial encoder to obtain the global spatial features of the to-be-processed image in three-dimensional space comprises: constructing, by the pre-trained spatial encoder, query features from preset spatial query features and key-value features from the local spatial features, and performing feature fusion on the spatial query features and the local spatial features to obtain the global spatial features of the to-be-processed image in three-dimensional space.
  3. The apparatus of claim 1, wherein processing the visual features, the global spatial features, and the task instruction to obtain the global spatial description of the target object in three-dimensional space comprises: performing feature fusion on the visual features and the global spatial features to obtain fused features; performing feature extraction on the task instruction to obtain semantic features corresponding to the task instruction; and performing spatial reasoning on the fused features based on the semantic features to obtain the global spatial description, corresponding to the semantic features, of the target object in three-dimensional space.
  4. The apparatus of claim 3, wherein performing feature fusion on the visual features and the global spatial features to obtain the fused features comprises: constructing query features from the visual features and key-value features from the global spatial features, and performing feature fusion on the visual features and the global spatial features to obtain the fused features.
  5. A spatial encoder training apparatus, comprising one or more processors configured to: acquire a first image corresponding to a reference view and a second image corresponding to a target view; process the first image and the second image to obtain first spatial features corresponding to the reference view and second spatial features corresponding to the target view; perform feature fusion on preset spatial query features and the first spatial features through a spatial encoder to be trained to obtain global spatial features of the first image in three-dimensional space; perform camera pose estimation on the second spatial features to obtain a camera pose corresponding to the target view; process the global spatial features and the camera pose to obtain third spatial features corresponding to the target view; and optimize model parameters of the spatial encoder to be trained based on the second spatial features and the third spatial features to obtain a trained spatial encoder.
  6. The apparatus of claim 5, wherein processing the first image and the second image to obtain the first spatial features corresponding to the reference view and the second spatial features corresponding to the target view comprises: performing feature extraction on the first image to obtain first visual features corresponding to the reference view; performing feature extraction on the second image to obtain second visual features corresponding to the target view; and performing feature fusion on the first visual features and the second visual features based on a preset asymmetric mask to obtain the first spatial features corresponding to the reference view and the second spatial features corresponding to the target view.
  7. The apparatus of claim 6, wherein the asymmetric mask comprises a first retention mask corresponding to the first visual features and a second retention mask corresponding to the second visual features; and wherein performing feature fusion on the first visual features and the second visual features based on the preset asymmetric mask to obtain the first spatial features corresponding to the reference view and the second spatial features corresponding to the target view comprises: based on the first retention mask, constructing both query features and key-value features from the first visual features, and performing feature fusion on the first visual features to obtain the first spatial features corresponding to the reference view; and, based on the first retention mask and the second retention mask, constructing query features from the second visual features and key-value features from the second visual features and the first visual features, and performing feature fusion on the second visual features and the first visual features to obtain the second spatial features corresponding to the target view.
  8. The apparatus of claim 5, wherein performing feature fusion on the preset spatial query features and the first spatial features through the spatial encoder to be trained to obtain the global spatial features of the first image in three-dimensional space comprises: constructing, by the spatial encoder to be trained, query features from the preset spatial query features and key-value features from the first spatial features, and performing feature fusion on the spatial query features and the first spatial features to obtain the global spatial features of the first image in three-dimensional space.
  9. The apparatus of claim 5, wherein processing the global spatial features and the camera pose to obtain the third spatial features corresponding to the target view comprises: generating ray query features corresponding to the target view based on the camera pose; and constructing query features from the ray query features and key-value features from the global spatial features, and performing feature fusion on the ray query features and the global spatial features to obtain the third spatial features corresponding to the target view.
  10. A spatial reasoning method, comprising: acquiring a task instruction and a to-be-processed image corresponding to a first view, wherein the task instruction is used for acquiring a global spatial description of a target object in the to-be-processed image; processing the to-be-processed image to obtain visual features and local spatial features corresponding to the first view; processing the local spatial features through a pre-trained spatial encoder to obtain global spatial features of the to-be-processed image in three-dimensional space; and processing the visual features, the global spatial features, and the task instruction to obtain the global spatial description of the target object in three-dimensional space.
  11. The method of claim 10, wherein processing the local spatial features through the pre-trained spatial encoder to obtain the global spatial features of the to-be-processed image in three-dimensional space comprises: constructing, by the pre-trained spatial encoder, query features from preset spatial query features and key-value features from the local spatial features, and performing feature fusion on the spatial query features and the local spatial features to obtain the global spatial features of the to-be-processed image in three-dimensional space.
  12. The method of claim 10, wherein processing the visual features, the global spatial features, and the task instruction to obtain the global spatial description of the target object in three-dimensional space comprises: performing feature fusion on the visual features and the global spatial features to obtain fused features; performing feature extraction on the task instruction to obtain semantic features corresponding to the task instruction; and performing spatial reasoning on the fused features based on the semantic features to obtain the global spatial description, corresponding to the semantic features, of the target object in three-dimensional space.
  13. The method of claim 12, wherein performing feature fusion on the visual features and the global spatial features to obtain the fused features comprises: constructing query features from the visual features and key-value features from the global spatial features, and performing feature fusion on the visual features and the global spatial features to obtain the fused features.
  14. A spatial encoder training method, comprising: acquiring a first image corresponding to a reference view and a second image corresponding to a target view; processing the first image and the second image to obtain first spatial features corresponding to the reference view and second spatial features corresponding to the target view; performing feature fusion on preset spatial query features and the first spatial features through a spatial encoder to be trained to obtain global spatial features of the first image in three-dimensional space; performing camera pose estimation on the second spatial features to obtain a camera pose corresponding to the target view; processing the global spatial features and the camera pose to obtain third spatial features corresponding to the target view; and optimizing model parameters of the spatial encoder to be trained based on the second spatial features and the third spatial features to obtain a trained spatial encoder.
  15. An electronic device, comprising the apparatus of any one of claims 1 to 4 or claims 5 to 9; or comprising a processor and a memory for storing instructions executable by the processor, wherein the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 10 to 13 or claim 14.
  16. A computer-readable storage medium storing a computer program which, when executed by a processor, performs the method of any one of claims 10 to 13 or claim 14.
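The query and key-value construction recited in claims 2, 4, 8, and 9 all follow the same cross-attention pattern: one feature set supplies the queries, another supplies the keys and values, and feature fusion is the attention-weighted combination. A minimal single-head NumPy sketch of that pattern (all dimensions, names, and the use of un-projected features are illustrative assumptions, not details from the patent):

```python
import numpy as np

def cross_attention_fusion(queries, keys_values):
    """Single-head cross-attention: `queries` attend to `keys_values`.

    Learned Q/K/V projection weights are omitted for brevity; a real
    spatial encoder would apply them before computing scores.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)    # (Nq, Nkv) similarity
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ keys_values                     # fused output, (Nq, d)

# Claim 2 instance: preset spatial query features (queries) attend to
# local spatial features (key-values) to produce global spatial features.
rng = np.random.default_rng(0)
spatial_queries = rng.standard_normal((16, 64))   # hypothetical 16 preset queries
local_features  = rng.standard_normal((196, 64))  # hypothetical 14x14 patch grid
global_features = cross_attention_fusion(spatial_queries, local_features)
print(global_features.shape)  # (16, 64)
```

Claims 4, 8, and 9 differ only in which tensors play the query and key-value roles (visual features vs. global spatial features, preset queries vs. first spatial features, ray queries vs. global spatial features).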

Description

Spatial reasoning device, spatial encoder training device and electronic equipment

Technical Field

The present disclosure relates to the field of computer technology, and in particular to a spatial reasoning apparatus, a spatial encoder training apparatus, and an electronic device.

Background

In recent years, Vision Language Models (VLMs) have made significant progress in two-dimensional visual reasoning tasks such as image captioning, visual question answering, and cross-modal retrieval. A visual language model first extracts visual features from an input image through a visual encoder and semantic features from an input text through a large language model. It then maps the visual features and the semantic features into a unified latent space through a multi-modal fusion mechanism, enabling alignment and interaction of cross-modal features. Finally, the model is optimized end-to-end through pre-training objectives or instruction fine-tuning on downstream tasks, so that it can perform two-dimensional visual reasoning tasks such as image captioning, visual question answering, and cross-modal retrieval. However, the visual encoders of existing visual language models extract features mainly from two-dimensional images, focusing on appearance-level and semantic-level visual representations, and have limited capability to model spatial properties of a scene such as depth information, three-dimensional geometric structure, and spatial topological relations. As a result, the accuracy and robustness of existing visual language models remain deficient in three-dimensional spatial understanding and reasoning tasks.
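The background's multi-modal fusion step amounts to projecting each modality into one shared latent space so the language model can treat both as a single token sequence. A toy NumPy sketch of that idea (all dimensions and the random "learned" projections are hypothetical, not taken from any model described here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; real model sizes are not specified in the patent.
d_visual, d_text, d_latent = 768, 1024, 512

# Stand-ins for learned linear projections into the shared latent space.
W_visual = rng.standard_normal((d_visual, d_latent)) * 0.02
W_text   = rng.standard_normal((d_text, d_latent)) * 0.02

visual_features = rng.standard_normal((196, d_visual))  # e.g. 14x14 image patches
text_features   = rng.standard_normal((12, d_text))     # e.g. 12 text tokens

# Once projected, the two modalities share one space and can be
# concatenated into a single sequence for downstream reasoning.
fused_sequence = np.concatenate(
    [visual_features @ W_visual, text_features @ W_text], axis=0)
print(fused_sequence.shape)  # (208, 512)
```

This is the 2D-only pipeline the disclosure improves on: nothing in it carries depth, geometry, or spatial topology, which is the gap the spatial encoder below addresses.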
Disclosure of Invention

To solve the above technical problems, the present disclosure provides a spatial reasoning apparatus, a spatial encoder training apparatus, and an electronic device, so as to improve the accuracy and robustness of three-dimensional spatial understanding and reasoning.

In a first aspect of the present disclosure, there is provided a spatial reasoning apparatus comprising one or more processors configured to: acquire a task instruction and a to-be-processed image corresponding to a first view, wherein the task instruction is used for acquiring a global spatial description of a target object in the to-be-processed image; process the to-be-processed image to obtain visual features and local spatial features corresponding to the first view; process the local spatial features through a pre-trained spatial encoder to obtain global spatial features of the to-be-processed image in three-dimensional space; and process the visual features, the global spatial features, and the task instruction to obtain the global spatial description of the target object in three-dimensional space.
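The first-aspect pipeline can be sketched end to end: extract features, lift local spatial features to global ones via preset queries, fuse with visual features, then hand everything to a language model. The NumPy sketch below uses a single-head attention stand-in for every "feature fusion" step and placeholder tensors for the backbone and language model, all of which are assumptions for illustration only:

```python
import numpy as np

def attend(q, kv):
    """Minimal single-head attention used as a stand-in for feature fusion."""
    s = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

def spatial_reasoning(image_patches, spatial_queries, instruction_tokens):
    # Step 1 (backbone stand-in): a real model would produce distinct
    # visual and local spatial features; here one array plays both roles.
    visual_features = image_patches
    local_spatial_features = image_patches

    # Step 2: the pre-trained spatial encoder fuses preset spatial query
    # features with the local spatial features into global spatial features.
    global_spatial = attend(spatial_queries, local_spatial_features)

    # Step 3: visual features (queries) fuse with global spatial
    # features (key-values) into fused features.
    fused = attend(visual_features, global_spatial)

    # Step 4: a language model would reason over the instruction plus the
    # fused features to emit the global spatial description; we return
    # the token sequence it would consume.
    return np.concatenate([instruction_tokens, fused], axis=0)

rng = np.random.default_rng(1)
tokens = spatial_reasoning(rng.standard_normal((196, 64)),   # 14x14 patches
                           rng.standard_normal((16, 64)),    # 16 preset queries
                           rng.standard_normal((8, 64)))     # 8 instruction tokens
print(tokens.shape)  # (204, 64)
```

The final text generation step is deliberately omitted; only the feature flow up to the language model is shown.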
In a second aspect of the present disclosure, there is provided a spatial encoder training apparatus comprising one or more processors configured to: acquire a first image corresponding to a reference view and a second image corresponding to a target view; process the first image and the second image to obtain first spatial features corresponding to the reference view and second spatial features corresponding to the target view; perform feature fusion on preset spatial query features and the first spatial features through a spatial encoder to be trained to obtain global spatial features of the first image in three-dimensional space; perform camera pose estimation on the second spatial features to obtain a camera pose corresponding to the target view; process the global spatial features and the camera pose to obtain third spatial features corresponding to the target view; and optimize model parameters of the spatial encoder to be trained based on the second spatial features and the third spatial features to obtain a trained spatial encoder.

In a third aspect of the present disclosure, there is provided a spatial reasoning method, comprising: acquiring a task instruction and a to-be-processed image corresponding to a first view, wherein the task instruction is used for acquiring a global spatial description of a target object in the to-be-processed image; processing the to-be-processed image to obtain visual features and local spatial features corresponding to the first view; processing the local spatial features through a pre-trained spatial encoder to obtain global spatial features of the to-be-processed image in three-dimensional space; and processing the visual features, the global spatial features, and the task instruction to obtain the global spatial description of the target object in three-dimensional space.
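The second-aspect training loop is self-supervised: the encoder must predict the target view's spatial features from the reference view's global features plus the target camera pose. A NumPy sketch of one training step, where pose estimation is skipped and pose-conditioned ray query features (claim 9's mechanism) are stand-in tensors, and a simple L2 objective stands in for the patent's unspecified loss:

```python
import numpy as np

def attend(q, kv):
    """Minimal single-head attention used as a stand-in for feature fusion."""
    s = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

def training_step(first_spatial, second_spatial, spatial_queries, ray_queries):
    """One self-supervised step for the spatial encoder under training.

    `ray_queries` stand in for query features generated from the camera
    pose estimated for the target view; pose estimation itself is omitted.
    """
    # Encoder under training: preset spatial queries attend to the
    # reference-view spatial features, yielding global spatial features.
    global_spatial = attend(spatial_queries, first_spatial)

    # Pose-conditioned ray queries attend to the global features to
    # predict the target view's features (the third spatial features).
    third_spatial = attend(ray_queries, global_spatial)

    # Supervision: the prediction should match the directly extracted
    # target-view features; gradients through this loss would update
    # the encoder's parameters.
    return np.mean((third_spatial - second_spatial) ** 2)

rng = np.random.default_rng(2)
loss = training_step(rng.standard_normal((196, 64)),   # reference-view features
                     rng.standard_normal((196, 64)),   # target-view features
                     rng.standard_normal((16, 64)),    # preset spatial queries
                     rng.standard_normal((196, 64)))   # pose-derived ray queries
print(loss >= 0.0)  # True
```

Because the target-view features never pass through the encoder's bottleneck, minimizing this loss forces the global spatial features to encode view-independent 3D structure.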
In a fourth aspect of the present disclosure, there is provided a spatial encoder training method, comprising: acquiring a first imag