CN-121788688-B - Text-driven 3D scene stylization method, device, equipment and storage medium

CN121788688B

Abstract

The invention relates to the technical field of 3D scene stylization, and in particular to a text-driven 3D scene stylization method, apparatus, device, and storage medium. The method first constructs a hybrid radial basis field representation architecture, then builds an initial style radial basis function model on that architecture and performs multiple rounds of iterative training on it to obtain a style radial basis function model. A multi-dimensional loss is then obtained, and the style radial basis function model is fine-tuned based on the multi-dimensional loss to obtain a target style radial basis function model. Finally, a text instruction is obtained, and the target style radial basis function model is driven by the text instruction to generate a 3D scene stylized rendering result. The method addresses the blurring artifacts and uneven style coverage of existing text-guided 3D scene stylization techniques, achieving 3D scene stylization that is cross-view consistent, artifact-free, high-fidelity, and stylistically uniform.
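The multi-round iterative training summarized above fits a radiance field to the captured views via differentiable volume rendering and a reconstruction loss. A minimal NumPy sketch of that rendering and loss step (standard alpha compositing; all names are illustrative and not taken from the patent):

```python
import numpy as np

def composite(sigmas, colors, deltas):
    """Alpha-composite per-sample densities and colors along one camera ray
    into a predicted pixel color (standard volume rendering)."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                        # (S,)
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))  # transmittance
    weights = trans * alphas                                       # (S,)
    return (weights[:, None] * colors).sum(axis=0)                 # (3,)

def reconstruction_loss(pred_rgb, true_rgb):
    # MSE between the rendered color and the captured color, driving the
    # back-propagation updates of the kernel feature weights.
    return float(((np.asarray(pred_rgb) - np.asarray(true_rgb)) ** 2).mean())
```

A ray whose first sample is nearly opaque returns essentially that sample's color, since the accumulated transmittance behind it is close to zero.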

Inventors

  • WU XINGCHEN

Assignees

  • Ji Hua Laboratory (季华实验室)

Dates

Publication Date
2026-05-08
Application Date
2026-03-04

Claims (8)

  1. A text-driven 3D scene stylization method, comprising: obtaining a learnable parameter set and defining an adaptive radial basis function kernel set based on the learnable parameter set; performing kernel capacity expansion on the adaptive radial basis function kernel set using a multi-frequency sinusoidal synthesis algorithm to obtain an optimized radial basis function kernel set; constructing a coarse-grid backbone network that forms a complementary feature extraction branch with the optimized radial basis function kernel set, and setting a neural network decoder, completing construction of a hybrid radial basis field representation architecture; acquiring actual scene multi-view images and a preset weight distribution rule; obtaining a plurality of initial 3D sampling points based on the multi-view images, and assigning a weight to each initial 3D sampling point according to the weight distribution rule to obtain weighted 3D sampling points; clustering the weighted 3D sampling points to obtain a plurality of cluster centers, and initializing the position and shape of the optimized radial basis function kernel set based on the cluster centers to construct an initial style radial basis function model; performing multiple rounds of iterative training on the initial style radial basis function model to obtain a style radial basis function model; obtaining a multi-dimensional loss and fine-tuning the style radial basis function model based on the multi-dimensional loss to obtain a target style radial basis function model; and obtaining a text instruction and driving the target style radial basis function model based on the text instruction to generate a 3D scene stylized rendering result.
  2. The text-driven 3D scene stylization method of claim 1, wherein the learnable parameter set further comprises feature weights, and wherein performing multiple rounds of iterative training on the initial style radial basis function model to obtain a style radial basis function model comprises: sampling camera ray data based on the actual scene multi-view images, and obtaining actual color data corresponding to the camera ray data; inputting the camera ray data into the initial style radial basis function model for volume rendering to obtain predicted color data; calculating a model reconstruction loss based on the actual color data and the predicted color data; iteratively updating the feature weights of the optimized radial basis function kernel set of the initial style radial basis function model by back propagation based on the model reconstruction loss; and outputting the style radial basis function model when a preset iteration stop condition is met.
  3. The text-driven 3D scene stylization method of claim 2, wherein obtaining a multi-dimensional loss and fine-tuning the style radial basis function model based on the multi-dimensional loss to obtain a target style radial basis function model comprises: inputting the camera ray data into the style radial basis function model for volume rendering to obtain stylized predicted color data, and stitching the stylized predicted color data into a stylized scene image; acquiring a target text corresponding to the stylized scene image and a fixed text corresponding to the actual scene multi-view images; invoking a pre-trained CLIP model to perform a multi-dimensional loss calculation based on the stylized scene image, the actual scene multi-view images, the target text, and the fixed text, obtaining the multi-dimensional loss; iteratively updating the feature weights of the pre-trained radial basis function kernel set of the style radial basis function model by back propagation based on the multi-dimensional loss; and outputting the target style radial basis function model when a preset iteration stop condition is met.
  4. The method of claim 3, wherein the CLIP model comprises a text encoder, an image encoder, and a multi-dimensional loss calculation module, the text encoder and the image encoder each being coupled to the multi-dimensional loss calculation module, and wherein invoking the pre-trained CLIP model to perform the multi-dimensional loss calculation based on the stylized scene image, the actual scene multi-view images, the target text, and the fixed text comprises: obtaining the pre-trained radial basis function kernel set from the style radial basis function model, projecting the kernel set onto the camera plane to obtain elliptical coverage regions, and generating a scene saliency map from the elliptical coverage regions; constructing an offset sampling rule based on the scene saliency map, randomly sampling the stylized scene image under the offset sampling rule to obtain a stylized local image block set, and acquiring the corresponding original local image block set from the actual scene multi-view images; encoding the target text and the fixed text with the text encoder to obtain a target text embedding vector and a fixed text embedding vector; encoding the stylized scene image, the actual scene multi-view images, the stylized local image block set, and the original local image block set with the image encoder to obtain a stylized image embedding vector, an original image embedding vector, a stylized image block embedding vector, and an original image block embedding vector; and performing, by the multi-dimensional loss calculation module, the multi-dimensional loss calculation according to the target text embedding vector, the fixed text embedding vector, the stylized image embedding vector, the original image embedding vector, the stylized image block embedding vector, and the original image block embedding vector to obtain the multi-dimensional loss.
  5. The method of claim 4, wherein the multi-dimensional loss calculation module comprises a relative directional loss calculation sub-module, a global contrast loss calculation sub-module, a local contrast loss calculation sub-module, and an auxiliary loss calculation sub-module; the relative directional, global contrast, and local contrast sub-modules are each connected to both the text encoder and the image encoder, and the auxiliary loss calculation sub-module is connected to the image encoder; and wherein performing the multi-dimensional loss calculation comprises: calculating a relative directional loss from the target text embedding vector, the fixed text embedding vector, the stylized image embedding vector, and the original image embedding vector with the relative directional loss calculation sub-module; calculating a global contrast loss from the target text embedding vector, the stylized image embedding vector, and the original image embedding vector with the global contrast loss calculation sub-module; calculating a local contrast loss from the stylized image block embedding vector and the original image block embedding vector with the local contrast loss calculation sub-module; calculating an auxiliary loss from the stylized scene image and the actual scene multi-view images with the auxiliary loss calculation sub-module; and obtaining a preset weight coefficient setting rule and computing a weighted sum of the relative directional loss, the global contrast loss, the local contrast loss, and the auxiliary loss under that rule to obtain the multi-dimensional loss.
  6. A text-driven 3D scene stylization apparatus, comprising: an architecture construction module, configured to obtain a learnable parameter set including a position and a shape, define an adaptive radial basis function kernel set based on the learnable parameter set, perform kernel capacity expansion on the adaptive radial basis function kernel set using a multi-frequency sinusoidal synthesis algorithm to obtain an optimized radial basis function kernel set, construct a coarse-grid backbone network that forms a complementary feature extraction branch with the optimized radial basis function kernel set, and set a neural network decoder, completing construction of a hybrid radial basis field representation architecture; a model construction module, configured to acquire actual scene multi-view images and a preset weight distribution rule, obtain a plurality of initial 3D sampling points based on the multi-view images, assign a weight to each initial 3D sampling point according to the weight distribution rule to obtain weighted 3D sampling points, and construct an initial style radial basis function model; a model training module, configured to perform multiple rounds of iterative training on the initial style radial basis function model to obtain a style radial basis function model; a model fine-tuning module, configured to obtain a pre-constructed multi-dimensional loss constraint system and fine-tune the style radial basis function model based on the multi-dimensional loss constraint system to obtain a target style radial basis function model; and an instruction driving module, configured to obtain a text instruction and drive the target style radial basis function model based on the text instruction to generate a 3D scene stylized rendering result.
  7. A text-driven 3D scene stylization device, comprising a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the text-driven 3D scene stylization device to perform the steps of the text-driven 3D scene stylization method of any one of claims 1 to 5.
  8. A computer-readable storage medium having instructions stored thereon which, when executed by a processor, implement the steps of the text-driven 3D scene stylization method of any one of claims 1 to 5.
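Claims 3 to 5 describe a CLIP-based multi-dimensional loss combining a relative directional term with global contrast, local contrast, and auxiliary terms. A hedged sketch of the directional term and the weighted sum, on toy embedding vectors (the weight coefficients and function names are illustrative assumptions; the patent leaves the weights to a preset setting rule):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def relative_directional_loss(img_styl, img_orig, txt_target, txt_fixed):
    # The stylized-vs-original image embedding shift should align with the
    # target-vs-fixed text embedding shift (CLIP directional objective).
    return 1.0 - cosine(img_styl - img_orig, txt_target - txt_fixed)

def multi_dimensional_loss(l_dir, l_global, l_local, l_aux,
                           weights=(1.0, 1.0, 0.5, 0.1)):
    # Weighted sum of the four sub-losses, as in claim 5.
    w1, w2, w3, w4 = weights
    return w1 * l_dir + w2 * l_global + w3 * l_local + w4 * l_aux
```

When the image-embedding shift exactly matches the text-embedding shift, the directional loss vanishes, which is the intended optimum of the fine-tuning step.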

Description

Text-driven 3D scene stylization method, device, equipment and storage medium

Technical Field

The present invention relates to the field of 3D scene stylization technologies, and in particular to a text-driven 3D scene stylization method, apparatus, device, and storage medium.

Background

In digital content authoring, generating three-dimensional content that is intuitive to edit and consistent across viewpoints has been a central need, and the advent of NeRF (neural radiance field) editing technology has provided a viable path toward this goal. By learning a scene's radiance field, the technology enables text-guided three-dimensional scene editing, promising to overcome the complexity and inefficiency of traditional three-dimensional creation workflows, with broad application prospects in film and television production, game development, virtual simulation, and other fields. However, existing NeRF-based text-guided editing schemes still have clear limitations and struggle to combine creative flexibility with stable results. In particular, schemes that allow scene geometry to move to meet style requirements are prone to serious problems: they produce blurring artifacts that degrade visual fidelity, and they yield uneven style coverage, so that the style effect in some scene regions is missing or excessive, harming overall consistency.

Disclosure of the Invention

To overcome the defects of the prior art, the invention provides a text-driven 3D scene stylization method, apparatus, device, and storage medium, aiming to solve the blurring-artifact and uneven-style-coverage problems of existing text-guided 3D scene stylization techniques and to achieve 3D scene stylization that is cross-view consistent, artifact-free, high-fidelity, and stylistically uniform.
A first aspect of the invention provides a text-driven 3D scene stylization method, comprising: constructing a hybrid radial basis field representation architecture; constructing an initial style radial basis function model based on the hybrid radial basis field representation architecture; performing multiple rounds of iterative training on the initial style radial basis function model to obtain a style radial basis function model; obtaining a multi-dimensional loss and fine-tuning the style radial basis function model based on the multi-dimensional loss to obtain a target style radial basis function model; and obtaining a text instruction and driving the target style radial basis function model based on the text instruction to generate a 3D scene stylized rendering result. Optionally, in a first implementation of the first aspect, constructing the hybrid radial basis field representation architecture comprises: obtaining a learnable parameter set and defining an adaptive radial basis function kernel set based on the learnable parameter set; performing kernel capacity expansion on the adaptive radial basis function kernel set using a multi-frequency sinusoidal synthesis algorithm to obtain an optimized radial basis function kernel set; and constructing a coarse-grid backbone network that forms a complementary feature extraction branch with the optimized radial basis function kernel set, and setting a neural network decoder, completing construction of the hybrid radial basis field representation architecture.
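The hybrid representation above combines adaptive RBF kernels, expanded by multi-frequency sinusoids, with a coarse-grid branch feeding a decoder. A minimal NumPy sketch of the kernel-feature computation (the patent does not give the concrete synthesis formula, so the modulation below is an assumption for illustration; isotropic Gaussian kernels are used for brevity):

```python
import numpy as np

def rbf_features(x, centers, sigmas, freqs):
    """Gaussian RBF kernels (learnable position/shape) expanded by
    multi-frequency sinusoidal modulation.

    x: (N, 3) query points; centers: (K, 3); sigmas: (K,); freqs: (F,).
    Returns an (N, K * F) feature matrix to be fed to the neural decoder.
    """
    # Squared distance from every query point to every kernel center.
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)      # (N, K)
    phi = np.exp(-d2 / (2.0 * sigmas[None, :] ** 2))               # base kernels
    # Kernel capacity expansion: modulate each kernel at several frequencies.
    feats = phi[..., None] * np.sin(np.sqrt(d2)[..., None] * freqs)  # (N, K, F)
    return feats.reshape(x.shape[0], -1)
```

Each kernel thus contributes F features instead of one, which is one way to realize the "kernel capacity expansion" the claims describe without adding more kernels.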
Optionally, in a second implementation of the first aspect, the learnable parameter set includes a position and a shape, and constructing the initial style radial basis function model based on the hybrid radial basis field representation architecture comprises: acquiring actual scene multi-view images and a preset weight distribution rule; obtaining a plurality of initial 3D sampling points from the multi-view images and assigning a weight to each initial 3D sampling point according to the weight distribution rule to obtain weighted 3D sampling points; clustering the weighted 3D sampling points with a clustering algorithm to obtain a plurality of cluster centers; and initializing the position and shape of the optimized radial basis function kernel set based on the cluster centers to construct the initial style radial basis function model. Optionally, in a third implementation of the first aspect, the learnable parameter set further includes feature weights, and performing multiple rounds of iterative training on the initial style radial basis function model to obtain a style radial basis function model comprises: sampling camera ray data based on the actual scene multi-view images, obtaining actual color data corresponding to the camera ray data, inputting the camera ray data into the ini