CN-121746643-B - Three-dimensional indoor scene generation method and related equipment
Abstract
The application discloses a three-dimensional indoor scene generation method and related equipment. The method comprises: obtaining a text description of an indoor scene; generating an indoor layout through a preset diffusion model; converting the indoor layout into a two-dimensional height field and a two-dimensional semantic map; predicting a neural radiance field based on the two-dimensional height field and the two-dimensional semantic map; sampling a plurality of viewpoints in the predicted neural radiance field and rendering them to obtain the RGB-D image corresponding to each viewpoint, wherein RGB is the color information of the two-dimensional image and D is the depth information of the two-dimensional image; constructing a truncated signed distance field according to the RGB-D images corresponding to the viewpoints; and generating mesh triangle patches based on the truncated signed distance field to obtain the indoor three-dimensional scene. The method can generate a high-quality three-dimensional indoor scene without real multi-view images, improves the efficiency of three-dimensional scene generation, and can be widely applied in the technical field of computer vision.
Inventors
- GUO YULAN
- CHEN MINGLIN
- ZHANG YE
Assignees
- Sun Yat-sen University (中山大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-25
Claims (8)
- 1. A three-dimensional indoor scene generation method, comprising: acquiring a text description of an indoor scene; generating an indoor layout from the text description through a preset diffusion model, which comprises: encoding the text description to obtain a text embedding vector; acquiring a preset initial scene layout; gradually adding Gaussian noise to the initial scene layout through the preset diffusion model until an initial scene code conforming to a Gaussian distribution is obtained; iteratively denoising the initial scene code through the preset diffusion model based on the text embedding vector and the initial scene code, so as to obtain a denoised initial scene code; and decoding the denoised initial scene code to obtain the indoor layout, wherein the indoor layout comprises a bounding box corresponding to each object, and a semantic tag, a center position, a size and a rotation angle of each bounding box; converting the indoor layout into a two-dimensional height field and a two-dimensional semantic map; predicting a neural radiance field based on the two-dimensional height field and the two-dimensional semantic map to obtain a predicted neural radiance field; randomly sampling a plurality of viewpoints in the predicted neural radiance field, and rendering the viewpoints to obtain an RGB-D image corresponding to each viewpoint, wherein RGB represents color information of a two-dimensional image and D represents depth information of the two-dimensional image; and constructing a truncated signed distance field according to the RGB-D image corresponding to each viewpoint, and generating mesh triangle patches based on the truncated signed distance field to obtain an indoor three-dimensional scene; wherein the generating mesh triangle patches based on the truncated signed distance field comprises: fusing the RGB-D images rendered from the plurality of viewpoints frame by frame into a three-dimensional voxel space to construct the truncated signed distance field, wherein the truncated signed distance field comprises each voxel cube and the truncated signed distance value corresponding to each voxel cube; traversing each voxel cube in the truncated signed distance field, and determining, according to the signed distance values corresponding to each voxel cube, the edges of the voxel cube that are crossed by an object surface; and calculating the intersection points between the object surfaces and those edges through linear interpolation, and generating the mesh triangle patches according to the intersection points.
- 2. The three-dimensional indoor scene generation method according to claim 1, wherein the converting the indoor layout into a two-dimensional height field and a two-dimensional semantic map comprises: determining a projection area of the bounding box of each object on a two-dimensional plane according to the center position and the size of the bounding box of each object; determining a maximum height and a minimum height of the bounding box of each object in the projection area according to the size of the bounding box of each object; obtaining the two-dimensional height field according to the maximum height and the minimum height; and assigning the semantic tag of the bounding box of each object to the corresponding projection area to obtain the two-dimensional semantic map.
- 3. The three-dimensional indoor scene generation method according to claim 1, wherein the predicting a neural radiance field based on the two-dimensional height field and the two-dimensional semantic map to obtain a predicted neural radiance field comprises: dividing the indoor layout into a plurality of local areas, and determining a local two-dimensional height field and a local two-dimensional semantic map corresponding to each local area; generating local features of each local area according to each local two-dimensional height field and the corresponding local two-dimensional semantic map; and predicting the neural radiance field according to the local features to obtain the predicted neural radiance field.
- 4. The three-dimensional indoor scene generation method according to claim 3, wherein the predicting the neural radiance field according to the local features to obtain a predicted neural radiance field comprises: acquiring the position of any three-dimensional sampling point in the indoor layout; determining the local features corresponding to the position of the three-dimensional sampling point; calculating, through a hash function, a hash index for querying the color feature and the volume density feature of the three-dimensional sampling point by using the position of the three-dimensional sampling point and the local features, wherein the hash index is expressed as: $h(\mathbf{x}) = \big( \bigoplus_{i=1}^{3} x_i \pi_i \big) \bmod T$; in the formula, $h(\mathbf{x})$ represents the hash index, $\mathbf{x} = (x_1, x_2, x_3)$ represents the position of the three-dimensional sampling point, $\oplus$ represents the bitwise exclusive-or operation, $\pi_i$ represents a prime number, $T$ represents the capacity of the hash table, $i$ represents the index of the coordinate in three-dimensional space, and $\bmod$ represents the modulo operation; and extracting the color feature and the volume density feature corresponding to the three-dimensional sampling point from the local features according to the hash index, to obtain the predicted neural radiance field (an illustrative sketch of this hash lookup is given after the claims).
- 5. The three-dimensional indoor scene generation method according to claim 1, wherein the randomly sampling a plurality of viewpoints in the predicted neural radiance field and rendering the viewpoints to obtain an RGB-D image corresponding to each viewpoint comprises: acquiring camera parameters of each viewpoint, and determining camera rays corresponding to each viewpoint according to the camera parameters; performing an integral calculation on the predicted neural radiance field along the camera rays corresponding to each viewpoint to obtain an RGB image and a depth map corresponding to each viewpoint; and concatenating the RGB image and the depth map to obtain the RGB-D image.
- 6. A three-dimensional indoor scene generation apparatus, the apparatus comprising: a text description acquisition module, used for acquiring a text description of an indoor scene; an indoor layout generation module, used for generating an indoor layout from the text description through a preset diffusion model, which comprises: encoding the text description to obtain a text embedding vector; acquiring a preset initial scene layout; gradually adding Gaussian noise to the initial scene layout through the preset diffusion model until an initial scene code conforming to a Gaussian distribution is obtained; iteratively denoising the initial scene code through the preset diffusion model based on the text embedding vector and the initial scene code, so as to obtain a denoised initial scene code; and decoding the denoised initial scene code to obtain the indoor layout, wherein the indoor layout comprises a bounding box corresponding to each object, and a semantic tag, a center position, a size and a rotation angle of each bounding box; a layout conversion module, used for converting the indoor layout into a two-dimensional height field and a two-dimensional semantic map; a neural radiance field prediction module, used for predicting a neural radiance field based on the two-dimensional height field and the two-dimensional semantic map to obtain a predicted neural radiance field; a rendering module, used for randomly sampling a plurality of viewpoints in the predicted neural radiance field and rendering the viewpoints to obtain an RGB-D image corresponding to each viewpoint, wherein RGB represents color information of a two-dimensional image and D represents depth information of the two-dimensional image; and a triangle patch generation module, used for constructing a truncated signed distance field according to the RGB-D image corresponding to each viewpoint and generating mesh triangle patches based on the truncated signed distance field to obtain an indoor three-dimensional scene, wherein the generating mesh triangle patches based on the truncated signed distance field comprises: fusing the RGB-D images rendered from the plurality of viewpoints frame by frame into a three-dimensional voxel space to construct the truncated signed distance field, wherein the truncated signed distance field comprises each voxel cube and the truncated signed distance value corresponding to each voxel cube; traversing each voxel cube in the truncated signed distance field, and determining, according to the signed distance values corresponding to each voxel cube, the edges of the voxel cube that are crossed by an object surface; and calculating the intersection points between the object surfaces and those edges through linear interpolation, and generating the mesh triangle patches according to the intersection points.
- 7. An electronic device, comprising a processor and a memory, wherein the memory is used for storing a program, and the processor executes the program to implement the three-dimensional indoor scene generation method according to any one of claims 1 to 5.
- 8. A computer program product comprising a computer program which, when executed by a processor, implements the three-dimensional indoor scene generation method according to any one of claims 1 to 5.
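The hash-indexed feature lookup described in claim 4 resembles a multiresolution-hash-style spatial hash. Below is a minimal illustrative sketch in Python; the per-axis primes, hash-table capacity, voxel size and the random feature table are assumptions for illustration only and stand in for the learned color and volume-density features, not values taken from the patent.

```python
# Illustrative sketch of the spatial-hash lookup in claim 4 (assumed constants).
import numpy as np

PRIMES = (1, 2654435761, 805459861)   # assumed per-axis prime numbers pi_i
TABLE_SIZE = 2 ** 19                  # assumed hash-table capacity T

def hash_index(grid_coord):
    """h(x) = (XOR_i x_i * pi_i) mod T, over an integer 3-D grid coordinate."""
    h = 0
    for i in range(3):
        h ^= int(grid_coord[i]) * PRIMES[i]
    return h % TABLE_SIZE

# The table would hold learned color / volume-density features; random here.
feature_table = np.random.randn(TABLE_SIZE, 4).astype(np.float32)

def lookup_features(sample_position, voxel_size=0.05):
    """Quantise a 3-D sampling point to the voxel grid and fetch its features."""
    grid_coord = np.floor(np.asarray(sample_position) / voxel_size).astype(np.int64)
    return feature_table[hash_index(grid_coord)]

print(lookup_features([1.23, 0.40, 2.05]))  # e.g. a 4-D (color + density) feature
```

The per-axis primes spread nearby integer coordinates across the table so that collisions stay rare at useful resolutions; the claims leave the concrete constants to the implementation.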
Description
Three-dimensional indoor scene generation method and related equipment
Technical Field
The application relates to the technical field of computer vision, and in particular to a three-dimensional indoor scene generation method and related equipment.
Background
In fields such as embodied intelligence, virtual reality and indoor design, the generation of highly realistic and structured three-dimensional indoor scenes is a key basis for constructing efficient simulation environments and for training and evaluating algorithms. In the related art, conventional indoor three-dimensional scene modeling mainly includes three-dimensional reconstruction based on real multi-view images and computer-aided-design three-dimensional scene modeling. Three-dimensional reconstruction based on multi-view images collects real images of a scene from different viewpoints and reconstructs the scene from them; it is highly dependent on obtaining a real multi-view image dataset from the real scene in advance as prior information about the three-dimensional scene, which limits the efficiency of three-dimensional scene modeling. Computer-aided-design three-dimensional scene modeling requires a professional modeler to spend a great deal of time designing geometric and texture details, and its modeling efficiency is also affected by the complexity of the scene.
Disclosure of Invention
The embodiments of the application mainly aim to provide a three-dimensional indoor scene generation method and related equipment, which can generate a high-quality three-dimensional indoor scene without real multi-view images and improve the efficiency of generating the three-dimensional scene. In order to achieve the above object, one aspect of the embodiments of the present application provides a three-dimensional indoor scene generation method, comprising: acquiring a text description of an indoor scene; generating an indoor layout from the text description through a preset diffusion model; converting the indoor layout into a two-dimensional height field and a two-dimensional semantic map; predicting a neural radiance field based on the two-dimensional height field and the two-dimensional semantic map to obtain a predicted neural radiance field; randomly sampling a plurality of viewpoints in the predicted neural radiance field, and rendering the viewpoints to obtain an RGB-D image corresponding to each viewpoint, wherein RGB represents color information of a two-dimensional image and D represents depth information of the two-dimensional image; and constructing a truncated signed distance field according to the RGB-D image corresponding to each viewpoint, and generating mesh triangle patches based on the truncated signed distance field to obtain an indoor three-dimensional scene.
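As an illustration of the rendering step above (integrating the predicted radiance field along camera rays to obtain an RGB image and a depth map, which are concatenated into an RGB-D image), the following Python sketch applies the standard volume-rendering quadrature along a single ray. The `radiance_field` stand-in, the near/far range and the sample count are assumptions for illustration, not the patent's trained field or settings.

```python
# Hedged sketch of rendering one RGB-D pixel from a radiance field.
import numpy as np

def radiance_field(points):
    """Stand-in field: returns (color in [0,1], non-negative density) per point."""
    color = 0.5 + 0.5 * np.sin(points)                   # (N, 3)
    density = np.exp(-np.linalg.norm(points, axis=-1))   # (N,)
    return color, density

def render_ray(origin, direction, near=0.1, far=4.0, n_samples=64):
    t = np.linspace(near, far, n_samples)                        # sample depths along the ray
    points = origin + t[:, None] * direction                     # (N, 3) sample positions
    color, sigma = radiance_field(points)
    delta = np.diff(t, append=t[-1] + (t[-1] - t[-2]))           # spacing between samples
    alpha = 1.0 - np.exp(-sigma * delta)                         # opacity per segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # accumulated transmittance
    w = trans * alpha                                            # per-sample weights
    rgb = (w[:, None] * color).sum(axis=0)                       # accumulated pixel color
    depth = (w * t).sum()                                        # expected termination depth
    return rgb, depth

rgb, depth = render_ray(origin=np.zeros(3), direction=np.array([0.0, 0.0, 1.0]))
print("RGB:", rgb, "D:", depth)   # concatenating RGB and D per pixel yields the RGB-D image
```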
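The final fusion-and-meshing step fuses rendered RGB-D frames into a truncated signed distance field and extracts surface geometry where the field changes sign along voxel edges, locating each crossing by linear interpolation. The sketch below illustrates only that sign-change-and-interpolation idea on a toy grid observed along a single axis; the grid resolution, truncation distance and depth values are assumptions, and the full triangle-table lookup of a marching-cubes-style extractor is omitted.

```python
# Toy TSDF fusion plus edge-crossing extraction by linear interpolation.
import numpy as np

RES, TRUNC = 64, 0.08            # assumed grid resolution and truncation distance
tsdf = np.ones((RES, RES, RES), dtype=np.float32)
weights = np.zeros_like(tsdf)

xs = (np.arange(RES) + 0.5) / RES                       # voxel centres of a unit cube
grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)

def integrate(depth_along_z, frame_weight=1.0):
    """Fuse one synthetic depth observation (taken along +z) into the TSDF."""
    global tsdf, weights
    sdf = np.clip(depth_along_z - grid[..., 2], -TRUNC, TRUNC) / TRUNC
    tsdf = (weights * tsdf + frame_weight * sdf) / (weights + frame_weight)
    weights += frame_weight

for d in (0.52, 0.50, 0.51):     # three "frames" observing a plane near z = 0.5
    integrate(d)

# Surface crossings along z edges: interpolate between the two voxel samples
# whose TSDF values have opposite signs.
a, b = tsdf[..., :-1], tsdf[..., 1:]
crossing = (a * b) < 0
t = a / (a - b + 1e-8)                               # interpolation factor in [0, 1]
z_surface = (grid[..., :-1, 2] + t / RES)[crossing]  # interpolated z of the surface
print("mean extracted surface height:", z_surface.mean())
```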
In some embodiments, the generating the indoor layout from the text description through a preset diffusion model includes: encoding the text description to obtain a text embedding vector; acquiring a preset initial scene layout; gradually adding Gaussian noise to the initial scene layout through the preset diffusion model until an initial scene code conforming to a Gaussian distribution is obtained; iteratively denoising the initial scene code through the preset diffusion model based on the text embedding vector and the initial scene code, so as to obtain a denoised initial scene code; and decoding the denoised initial scene code to obtain the indoor layout, wherein the indoor layout comprises a bounding box corresponding to each object, and a semantic tag, a center position, a size and a rotation angle of each bounding box.
In some embodiments, the converting the indoor layout into a two-dimensional height field and a two-dimensional semantic map includes: determining a projection area of the bounding box of each object on a two-dimensional plane according to the center position and the size of the bounding box of each object; determining a maximum height and a minimum height of the bounding box of each object in the projection area according to the size of the bounding box of each object; obtaining the two-dimensional height field according to the maximum height and the minimum height; and assigning the semantic tag of the bounding box of each object to the corresponding projection area to obtain the two-dimensional semantic map.
In some embodiments, the predicting the neural radiance field based on the two-dimensional height field and the two-dimensional semantic map to obtain a predicted neural radiance field includes: dividing the indoor layout into a plurality of local areas, and determining a local two-dimensional height field and a local two-dimensional semantic map corresponding to each local area; generating local features of each local area according to each local two-dimensional height field and the corresponding local two-dimensional semantic map.
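To make the layout-generation step above concrete, the following Python sketch runs a toy forward-noising and reverse-denoising loop over a scene layout code, conditioned on a text embedding. The noise schedule, the number of steps, the layout parameterisation and especially the `encode_text` and `denoiser` stand-ins are assumptions; a real system would use the trained diffusion network and text encoder described in the embodiments.

```python
# Schematic, untrained stand-in for text-conditioned diffusion over a layout code.
import numpy as np

rng = np.random.default_rng(0)
STEPS = 50
betas = np.linspace(1e-4, 0.02, STEPS)          # assumed noise schedule
alphas_bar = np.cumprod(1.0 - betas)

def encode_text(prompt):
    """Placeholder text encoder: a fixed random embedding per prompt."""
    seed = abs(hash(prompt)) % (2 ** 32)
    return np.random.default_rng(seed).standard_normal(64)

def denoiser(noisy_code, text_emb, step):
    """Untrained stand-in for the diffusion network's noise prediction."""
    return 0.1 * noisy_code + 0.01 * text_emb.mean()

# layout code: e.g. 10 objects x (label, centre, size, rotation) = 8 values each
layout_code = rng.standard_normal((10, 8))
text_emb = encode_text("a cosy bedroom with a bed and a desk")

# forward process: add Gaussian noise until the code is approximately Gaussian
x = np.sqrt(alphas_bar[-1]) * layout_code \
    + np.sqrt(1 - alphas_bar[-1]) * rng.standard_normal(layout_code.shape)

# reverse process: iteratively remove predicted noise, conditioned on the text
for t in reversed(range(STEPS)):
    eps = denoiser(x, text_emb, t)
    x = (x - np.sqrt(1 - alphas_bar[t]) * eps) / np.sqrt(alphas_bar[t])
    if t > 0:   # re-noise to the previous step's noise level
        x = np.sqrt(alphas_bar[t - 1]) * x \
            + np.sqrt(1 - alphas_bar[t - 1]) * rng.standard_normal(x.shape)

print("denoised layout code shape:", x.shape)   # decoded into bounding boxes downstream
```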
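The conversion from layout bounding boxes to a two-dimensional height field and semantic map described above can be sketched as a simple footprint rasterisation. The grid resolution, room extent and example boxes below are assumptions, boxes are treated as axis-aligned (the rotation angle is ignored), and only the maximum height per cell is kept.

```python
# Minimal sketch: rasterise 3-D bounding boxes into a 2-D height field and semantic map.
import numpy as np

GRID, ROOM = 128, 6.0                  # 128 x 128 cells over an assumed 6 m x 6 m floor
height_field = np.zeros((GRID, GRID), dtype=np.float32)
semantic_map = np.zeros((GRID, GRID), dtype=np.int32)   # 0 = empty floor

# each box: (label, centre_x, centre_y, bottom_z, size_x, size_y, size_z)
boxes = [
    (1, 2.0, 2.0, 0.0, 2.0, 1.6, 0.5),   # e.g. a bed
    (2, 4.5, 1.0, 0.0, 1.2, 0.6, 0.75),  # e.g. a desk
]

def to_cell(metres):
    return int(np.clip(metres / ROOM * GRID, 0, GRID - 1))

for label, cx, cy, z0, sx, sy, sz in boxes:
    x0, x1 = to_cell(cx - sx / 2), to_cell(cx + sx / 2)    # footprint in grid cells
    y0, y1 = to_cell(cy - sy / 2), to_cell(cy + sy / 2)
    top = z0 + sz                                          # maximum height of the box
    region = height_field[y0:y1 + 1, x0:x1 + 1]
    np.maximum(region, top, out=region)                    # keep the tallest object per cell
    semantic_map[y0:y1 + 1, x0:x1 + 1] = label             # assign the box's semantic label

print("occupied cells:", int((semantic_map > 0).sum()))
```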