CN-121982214-A - Training method, apparatus, device, and storage medium for a three-dimensional semantic reconstruction model
Abstract
The application discloses a training method, apparatus, device, and storage medium for a three-dimensional semantic reconstruction model, and relates to the technical field of computer vision. The method comprises: acquiring a training sample of the three-dimensional semantic reconstruction model, the training sample comprising a target view image and at least one auxiliary view image belonging to the same scene; performing first-stage training on the three-dimensional semantic reconstruction model based on the target view image and the at least one auxiliary view image to obtain a pre-trained three-dimensional semantic reconstruction model; and performing second-stage training on the pre-trained three-dimensional semantic reconstruction model based on the target view image and the at least one auxiliary view image to obtain a trained three-dimensional semantic reconstruction model. The method realizes a hybrid training strategy that transitions from two-dimensional semantic supervision to three-dimensional semantic enhancement, effectively improving the cross-view similarity and generalization of the three-dimensional semantic reconstruction model.
Inventors
- Request for anonymity
- Request for anonymity
Assignees
- Moore Threads Intelligent Technology (Beijing) Co., Ltd. (摩尔线程智能科技(北京)股份有限公司)
Dates
- Publication Date
- 20260505
- Application Date
- 20260128
Claims (16)
- 1. A method for training a three-dimensional semantic reconstruction model, the method comprising: acquiring a training sample of the three-dimensional semantic reconstruction model, wherein the training sample comprises a target view image and at least one auxiliary view image belonging to the same scene, and the target view image and the at least one auxiliary view image respectively correspond to different views of the scene; performing first-stage training on the three-dimensional semantic reconstruction model based on the target view image and the at least one auxiliary view image to obtain a pre-trained three-dimensional semantic reconstruction model, wherein the first-stage training is used for lifting two-dimensional semantic information in each auxiliary view image into a three-dimensional Gaussian representation of the three-dimensional semantic reconstruction model; and performing second-stage training on the pre-trained three-dimensional semantic reconstruction model based on the target view image and the at least one auxiliary view image to obtain a trained three-dimensional semantic reconstruction model, wherein the second-stage training is used for improving the similarity of the three-dimensional semantic features of the pre-trained three-dimensional semantic reconstruction model across different views.
- 2. The method of claim 1, wherein performing the second-stage training on the pre-trained three-dimensional semantic reconstruction model based on the target view image and the at least one auxiliary view image to obtain the trained three-dimensional semantic reconstruction model comprises: generating a Gaussian primitive set for each auxiliary view image through the pre-trained three-dimensional semantic reconstruction model, wherein the Gaussian primitive set of each auxiliary view image comprises a Gaussian primitive parameter set for at least one pixel point included in the auxiliary view image, and the Gaussian primitive parameter set of a pixel point is used for indicating the attributes of one Gaussian primitive in a three-dimensional Gaussian field corresponding to the scene; generating a region mask of the target view image and a region mask of each auxiliary view image based on the target view image and the Gaussian primitive set of each auxiliary view image, wherein the region mask of the target view image is used for indicating an object region containing a target object in the target view image, the region mask of an auxiliary view image is used for indicating an object region containing the target object in that auxiliary view image, and the target object is used for aligning three-dimensional semantic information under different views in the three-dimensional Gaussian field; and adjusting parameters of the pre-trained three-dimensional semantic reconstruction model based on the target view image, the region mask of each auxiliary view image, and the Gaussian primitive set of each auxiliary view image to obtain the trained three-dimensional semantic reconstruction model.
- 3. The method according to claim 2, wherein adjusting the parameters of the pre-trained three-dimensional semantic reconstruction model based on the target view image, the region mask of each auxiliary view image, and the Gaussian primitive set of each auxiliary view image to obtain the trained three-dimensional semantic reconstruction model comprises: determining a semantic loss value based on the target view image and the Gaussian primitive set of each auxiliary view image, wherein the semantic loss value is used for indicating the degree of difference between the two-dimensional semantic information and the three-dimensional semantic information of the same view of the scene; determining a consistency loss value based on the region mask of the target view image, the region mask of each auxiliary view image, and the Gaussian primitive set of each auxiliary view image, wherein the consistency loss value is used for indicating the degree of difference between the three-dimensional semantic information corresponding to the object region of the target view image and the three-dimensional semantic information corresponding to the object region of each auxiliary view image; determining a prototype alignment loss value based on the region mask of each auxiliary view image and the Gaussian primitive set of each auxiliary view image, wherein the prototype alignment loss value is used for indicating the semantic stability of the target object in the three-dimensional Gaussian field; and adjusting the parameters of the pre-trained three-dimensional semantic reconstruction model based on the semantic loss value, the consistency loss value, and the prototype alignment loss value to obtain the trained three-dimensional semantic reconstruction model.
- 4. The method of claim 3, wherein determining the consistency loss value based on the region mask of the target view image, the region mask of each auxiliary view image, and the Gaussian primitive set of each auxiliary view image comprises: determining an object three-dimensional semantic feature of the target view image based on the region mask of the target view image and the Gaussian primitive set of each auxiliary view image, wherein the object three-dimensional semantic feature of the target view image is used for indicating the three-dimensional feature information corresponding to the object region of the target view image; for each auxiliary view image, determining an object three-dimensional semantic feature of the auxiliary view image based on the region mask of the auxiliary view image and the Gaussian primitive set of the auxiliary view image, wherein the object three-dimensional semantic feature of the auxiliary view image is used for indicating the three-dimensional feature information corresponding to the object region of the auxiliary view image; and determining the consistency loss value based on the object three-dimensional semantic feature of the target view image and the object three-dimensional semantic features of the auxiliary view images.
- 5. The method of claim 4, wherein the region mask of the auxiliary view image is used for indicating at least one pixel point belonging to the object region of the auxiliary view image, the Gaussian primitive parameter set of a pixel point includes a semantic feature vector, and the semantic feature vector is used for indicating the three-dimensional semantic information of the Gaussian primitive corresponding to the pixel point; and determining the object three-dimensional semantic feature of the auxiliary view image based on the region mask of the auxiliary view image and the Gaussian primitive set of the auxiliary view image comprises: determining the object three-dimensional semantic feature of the auxiliary view image based on the semantic feature vectors of all pixel points belonging to the object region of the auxiliary view image.
- 6. A method according to claim 3, wherein the Gaussian primitive parameter set of a pixel point includes a three-dimensional center position used for indicating the position of the Gaussian primitive corresponding to the pixel point in the three-dimensional Gaussian field, and a semantic feature vector used for indicating the three-dimensional semantic information of the Gaussian primitive corresponding to the pixel point; and determining the prototype alignment loss value based on the region mask of each auxiliary view image and the Gaussian primitive set of each auxiliary view image comprises: determining, based on the region mask of each auxiliary view image, target pixel points belonging to the target object from the at least one pixel point included in each auxiliary view image, to obtain at least one target pixel point; determining a region prototype vector of the target object based on the semantic feature vector of the at least one target pixel point, the region prototype vector being used for indicating the position of the target object in the three-dimensional Gaussian field; determining, based on the three-dimensional center position of the at least one target pixel point, an update weight value for each of the at least one target pixel point, wherein the update weight value of a target pixel point is used for indicating the degree of influence of the target pixel point on the parameters of the three-dimensional semantic reconstruction model; and determining the prototype alignment loss value based on the region prototype vector, the update weight value of the at least one target pixel point, and the semantic feature vector of the at least one target pixel point.
- 7. The method of claim 6, wherein determining the update weight values of the at least one target pixel point based on the three-dimensional center position of the at least one target pixel point comprises: determining a three-dimensional average distance for each target pixel point based on the three-dimensional center position of the at least one target pixel point, wherein the three-dimensional average distance of a target pixel point is used for indicating the average distance between the Gaussian primitive corresponding to that target pixel point and the Gaussian primitives corresponding to the other target pixel points in the three-dimensional Gaussian field; determining a total three-dimensional average distance by summing the three-dimensional average distances of all target pixel points; and for each target pixel point, determining the update weight value of the target pixel point based on the three-dimensional average distance of the target pixel point and the total three-dimensional average distance.
- 8. A method according to claim 3, wherein determining the semantic loss value based on the target view image and the Gaussian primitive set of each auxiliary view image comprises: determining rendered semantic features of a rendered image based on the Gaussian primitive sets of the auxiliary view images, wherein the rendered semantic features of the rendered image are used for indicating the three-dimensional semantic information corresponding to the rendered image, the rendered image is rendered based on the Gaussian primitive sets of the auxiliary view images, and the rendered image and the target view image correspond to the same view of the scene; extracting two-dimensional semantic features of the target view image, wherein the two-dimensional semantic features of the target view image are used for indicating the two-dimensional semantic information of the target view image; and determining the semantic loss value based on the rendered semantic features of the rendered image and the two-dimensional semantic features of the target view image.
- 9. The method of claim 8, wherein generating the region mask of the target view image and the region mask of each auxiliary view image based on the target view image and the Gaussian primitive set of each auxiliary view image comprises: determining a target prompt point based on the rendered semantic features of the rendered image and the two-dimensional semantic features of the target view image, wherein the target prompt point is a pixel point representing the target object in the target view image; and generating the region mask of the target view image and the region mask of each auxiliary view image based on the target prompt point.
- 10. The method of claim 9, wherein determining the target prompt point based on the rendered semantic features of the rendered image and the two-dimensional semantic features of the target view image comprises: determining a semantic error map based on the rendered semantic features of the rendered image and the two-dimensional semantic features of the target view image, the semantic error map being used for indicating semantic differences between the rendered image and the target view image; and determining the target prompt point based on the semantic error map.
- 11. The method of claim 10, wherein the semantic error map includes a semantic difference value for at least one pixel point included in the target view image, the semantic difference value of a pixel point being used for indicating the difference between the two-dimensional semantic information corresponding to the pixel point in the target view image and the three-dimensional semantic information corresponding to the pixel point in the rendered semantic features; and determining the target prompt point based on the semantic error map comprises: determining the pixel point with the largest semantic difference value in the semantic error map as the target prompt point.
- 12. The method of claim 1, wherein performing the first-stage training on the three-dimensional semantic reconstruction model based on the target view image and the at least one auxiliary view image to obtain the pre-trained three-dimensional semantic reconstruction model comprises: generating a Gaussian primitive set for each auxiliary view image through the three-dimensional semantic reconstruction model, wherein the Gaussian primitive set of each auxiliary view image comprises a Gaussian primitive parameter set for at least one pixel point included in the auxiliary view image, and the Gaussian primitive parameter set of a pixel point is used for indicating the attributes of one Gaussian primitive in a three-dimensional Gaussian field corresponding to the scene; determining rendered semantic features of a rendered image based on the Gaussian primitive sets of the auxiliary view images, wherein the rendered semantic features of the rendered image are used for indicating the three-dimensional semantic information corresponding to the rendered image, the rendered image is rendered based on the Gaussian primitive sets of the auxiliary view images, and the rendered image and the target view image correspond to the same view of the scene; extracting two-dimensional semantic features of the target view image, wherein the two-dimensional semantic features of the target view image are used for indicating the two-dimensional semantic information of the target view image; determining a first loss function value based on the rendered semantic features of the rendered image and the two-dimensional semantic features of the target view image; and adjusting parameters of the three-dimensional semantic reconstruction model based on the first loss function value to obtain the pre-trained three-dimensional semantic reconstruction model.
- 13. A training apparatus for a three-dimensional semantic reconstruction model, the apparatus comprising: an acquisition module, configured to acquire a training sample of the three-dimensional semantic reconstruction model, wherein the training sample comprises a target view image and at least one auxiliary view image belonging to the same scene, and the target view image and the at least one auxiliary view image respectively correspond to different views of the scene; a first training module, configured to perform first-stage training on the three-dimensional semantic reconstruction model based on the target view image and the at least one auxiliary view image to obtain a pre-trained three-dimensional semantic reconstruction model, wherein the first-stage training is used for lifting two-dimensional semantic information in each auxiliary view image into a three-dimensional Gaussian representation of the three-dimensional semantic reconstruction model; and a second training module, configured to perform second-stage training on the pre-trained three-dimensional semantic reconstruction model based on the target view image and the at least one auxiliary view image to obtain a trained three-dimensional semantic reconstruction model, wherein the second-stage training is used for improving the similarity of the three-dimensional semantic features of the pre-trained three-dimensional semantic reconstruction model across different views.
- 14. A computer device comprising a processor and a memory, the memory having stored therein a computer program that is loaded and executed by the processor to implement the method of any of claims 1 to 12.
- 15. A computer readable storage medium, characterized in that the storage medium has stored therein a computer program for execution by a processor for implementing the method of any one of claims 1 to 12.
- 16. A computer program product, characterized in that the computer program product comprises a computer program that is loaded and executed by a processor to implement the method of any one of claims 1 to 12.
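Taken together, claims 6 and 7 describe a distance-weighted prototype alignment: each target pixel's Gaussian primitive receives an update weight derived from its average distance to the other primitives, a region prototype vector is aggregated from the weighted semantic feature vectors, and the loss measures how far each feature drifts from the prototype. A minimal NumPy sketch of this computation; the normalization of the weights and the cosine form of the misalignment term are assumptions, since the claims leave both unspecified:

```python
import numpy as np

def update_weights(centers):
    """Update weight values per claim 7.
    centers: (N, 3) three-dimensional center positions of the Gaussian
    primitives of the target pixel points."""
    n = len(centers)
    # pairwise distances between primitives in the 3-D Gaussian field
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    # per-pixel average distance to the other primitives
    avg = d.sum(axis=1) / max(n - 1, 1)
    # normalizing by the total average distance is an assumption; the claim
    # only states the weight is derived from the two quantities
    return avg / avg.sum()

def prototype_alignment_loss(features, centers):
    """Prototype alignment loss per claim 6 (cosine form assumed).
    features: (N, D) semantic feature vectors of the target pixel points."""
    w = update_weights(centers)
    proto = (w[:, None] * features).sum(axis=0)   # region prototype vector
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = proto / np.linalg.norm(proto)
    # weighted misalignment of each feature from the prototype
    return float((w * (1.0 - f @ p)).sum())
```

With this weighting, isolated primitives (large average distance) influence the prototype and the loss more than tightly clustered ones, which is one plausible reading of "degree of influence" in claim 6.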
Description
Technical Field
The embodiments of the present application relate to the technical field of computer vision, and in particular to a training method, apparatus, and device for a three-dimensional semantic reconstruction model, and a storage medium.
Background
In recent years, with the rapid development of computer vision technology, three-dimensional scene reconstruction has become necessary in many fields. For example, in the field of architectural design, it is desirable to digitally model a building, or the complex structure of the scene in which the building is located, through three-dimensional reconstruction. In the related art, two-dimensional semantic features of the auxiliary view images are extracted through an image processing model, and these two-dimensional semantic features are then used as supervision signals to adjust the parameters of the three-dimensional semantic reconstruction model and obtain the trained model. When the two-dimensional semantic features of the same object in the scene are inconsistent across different auxiliary view images, this inconsistent supervision can lower the accuracy of the trained three-dimensional semantic reconstruction model.
Disclosure of Invention
The embodiments of the present application provide a training method, apparatus, device, and storage medium for a three-dimensional semantic reconstruction model.
The technical solutions provided by the embodiments of the present application are as follows. According to one aspect of the embodiments of the present application, there is provided a training method for a three-dimensional semantic reconstruction model, the method comprising: acquiring a training sample of the three-dimensional semantic reconstruction model, wherein the training sample comprises a target view image and at least one auxiliary view image belonging to the same scene, and the target view image and the at least one auxiliary view image respectively correspond to different views of the scene; performing first-stage training on the three-dimensional semantic reconstruction model based on the target view image and the at least one auxiliary view image to obtain a pre-trained three-dimensional semantic reconstruction model, wherein the first-stage training is used for lifting two-dimensional semantic information in each auxiliary view image into a three-dimensional Gaussian representation of the three-dimensional semantic reconstruction model; and performing second-stage training on the pre-trained three-dimensional semantic reconstruction model based on the target view image and the at least one auxiliary view image to obtain a trained three-dimensional semantic reconstruction model, wherein the second-stage training is used for improving the similarity of the three-dimensional semantic features of the pre-trained three-dimensional semantic reconstruction model across different views.
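The two-stage scheme can thus be read as a curriculum: the first stage optimizes only the semantic term (two-dimensional supervision), and the second stage adds the cross-view consistency and prototype alignment terms (three-dimensional enhancement). A minimal NumPy sketch of the loss terms; the L1/L2 metrics and the weighting coefficients `lam_c` and `lam_p` are assumptions, not values fixed by the application:

```python
import numpy as np

def semantic_loss(rendered_feat, target_feat):
    """Semantic term (claims 3, 8, 12): the 2-D features of the target view
    supervise the semantics rendered from the 3-D Gaussian field.
    L1 distance is an assumption; the claims leave the metric open."""
    return float(np.abs(rendered_feat - target_feat).mean())

def consistency_loss(target_obj_feat, aux_obj_feats):
    """Cross-view consistency term (claim 4): pull the object feature of
    each auxiliary view toward that of the target view (L2 assumed)."""
    return float(np.mean([np.linalg.norm(target_obj_feat - f)
                          for f in aux_obj_feats]))

def stage2_loss(sem, cons, proto, lam_c=0.1, lam_p=0.1):
    """Second-stage objective (claim 3): semantic + consistency +
    prototype alignment; lam_c and lam_p are assumed coefficients."""
    return sem + lam_c * cons + lam_p * proto
```

In the first stage only `semantic_loss` would drive the parameter updates; in the second stage `stage2_loss` combines all three terms.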
According to one aspect of the embodiments of the present application, there is provided a training apparatus for a three-dimensional semantic reconstruction model, the apparatus comprising: an acquisition module, configured to acquire a training sample of the three-dimensional semantic reconstruction model, wherein the training sample comprises a target view image and at least one auxiliary view image belonging to the same scene, and the target view image and the at least one auxiliary view image respectively correspond to different views of the scene; a first training module, configured to perform first-stage training on the three-dimensional semantic reconstruction model based on the target view image and the at least one auxiliary view image to obtain a pre-trained three-dimensional semantic reconstruction model, wherein the first-stage training is used for lifting two-dimensional semantic information in each auxiliary view image into a three-dimensional Gaussian representation of the three-dimensional semantic reconstruction model; and a second training module, configured to perform second-stage training on the pre-trained three-dimensional semantic reconstruction model based on the target view image and the at least one auxiliary view image to obtain a trained three-dimensional semantic reconstruction model, wherein the second-stage training is used for improving the similarity of the three-dimensional semantic features of the pre-trained three-dimensional semantic reconstruction model across different views. According to one aspect of the embodiments of the present application, there is provided a computer device comprising a processor and a memory, the memory having a computer program stored therein, the computer program being loaded and executed by the processor to implement the above training method of the three-dimensional semantic reconstruction model.
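The prompt-point selection of claims 10 and 11 reduces to an argmax over a per-pixel semantic error map. A minimal NumPy sketch, assuming (H, W, D) feature maps and an L2 per-pixel difference (the application does not fix the metric):

```python
import numpy as np

def target_prompt_point(rendered_feat, target_feat):
    """Claims 10-11: build a semantic error map between the rendered 3-D
    semantics and the 2-D semantics of the target view, then return the
    pixel with the largest semantic difference as the target prompt point.
    Inputs: (H, W, D) feature maps; the L2 metric is an assumption."""
    error_map = np.linalg.norm(rendered_feat - target_feat, axis=-1)  # (H, W)
    row, col = np.unravel_index(np.argmax(error_map), error_map.shape)
    return (int(row), int(col)), error_map
```

The returned pixel coordinate could then serve as the point prompt from which the region masks of the target and auxiliary view images are generated (claim 9).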