CN-121999112-A - Three-dimensional target coloring method based on rendering graph
Abstract
The invention discloses a three-dimensional target coloring method based on a rendering graph, belonging to the technical field of computer vision and graphic image processing. The method comprises the following steps: 1) obtain colored rendering images of a three-dimensional model with a visible light sensor or a rendering program to construct an input image set; 2) extract a three-dimensional mesh from the triangular-face representation of the three-dimensional model; 3) using a ViT model and a transformer model, obtain an initial image feature sequence from the rendering images by image patch segmentation, and an initial mesh feature sequence from the three-dimensional mesh via triangular-face feature vectors; 4) obtain the image feature sequence and the mesh feature sequence with a self-attention mechanism; 5) compute the image feature sequence and the mesh feature sequence with a multi-layer alternating attention mechanism to obtain a camera feature sequence; 6) slice out the camera-head features of the camera feature sequence and map them to explicit camera parameters; and 7) color the three-dimensional model using the explicit camera parameters and simulated ray casting.
Inventors
- LI KAN
- MAO ZHUQING
Assignees
- Beijing Institute of Technology (北京理工大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-01
Claims (7)
- 1. A three-dimensional target coloring method based on a rendering graph, characterized by comprising the following steps (illustrative sketches of the key steps follow the claims):
Step 1: obtain colored rendering images of a three-dimensional model with a visible light sensor or a rendering program to construct an input image set $\{I_1, I_2, \ldots, I_N\}$, where $N$ is the total number of input images.
Step 2: extract a three-dimensional mesh $M = \{P, F\}$ from the triangular-face representation of the three-dimensional model, where $P$ denotes the nodes and $F$ the triangular faces.
Step 3: using a ViT model and a transformer model, obtain an initial image feature sequence from the rendering images by image patch segmentation, and an initial mesh feature sequence from the three-dimensional mesh via triangular-face feature vectors.
Step 4: obtain the image feature sequence and the mesh feature sequence with a self-attention mechanism.
Step 5: compute the image feature sequence and the mesh feature sequence with a multi-layer alternating attention mechanism to obtain a camera feature sequence.
Step 5.1: add a camera-head feature to each image feature sequence and stack the results into the initial global image feature sequence of formula (16): $E'_{global} = \mathrm{stack}\{E_i\},\ i = 1, 2, \ldots, N$ (16), where $E_i = \mathrm{stack}\{f_{cls}, y_i\}$, $f_{cls}$ denotes the camera-head feature, and $E_i$ denotes the feature sequence of the $i$-th rendering image after the camera-head feature is added.
Step 5.2: position-encode the initial global image feature sequence as in formula (17) to obtain the layer-1 initial global image feature sequence: $E_{global} = E'_{global} + Pos_2$ (17), where $Pos_2$ is the position encoding.
Step 5.3: obtain the global image feature sequence of the current layer with a global self-attention mechanism.
Step 5.3.1: compute the Query, Key, and Value matrices from the current layer's initial global image feature sequence as in formula (18): $Q_c = E^c_{global} W^c_Q,\ K_c = E^c_{global} W^c_K,\ V_c = E^c_{global} W^c_V$ (18), where $W^c_Q$, $W^c_K$, and $W^c_V$ are learnable parameters and $c$ denotes the current layer index.
Step 5.3.2: apply multi-head decomposition to the Query, Key, and Value matrices of formula (18) to obtain the global sub-hidden-state sequences of formula (19): $\{Q^b_c\}_{b=1}^{H_3} = \mathrm{split}(Q_c),\ \{K^b_c\}_{b=1}^{H_3} = \mathrm{split}(K_c),\ \{V^b_c\}_{b=1}^{H_3} = \mathrm{split}(V_c)$ (19), where split decomposes the input matrix along the lowest dimension and stacks along the highest, $H_3$ is the number of attention heads, and $d_{h_3} = h / H_3$ is the hidden-layer dimension of each head.
Step 5.3.3: compute the scaled dot-product attention of each head to obtain the single-head self-attention weighted value of formula (20): $A^b_c = \mathrm{softmax}\big(Q^b_c (K^b_c)^{\top} / \sqrt{d_{h_3}}\big) V^b_c$ (20), where $A^b_c$ denotes the self-attention weighted value of the $b$-th head of the layer-$c$ network, $b = 1, 2, \ldots, H_3$.
Step 5.3.4: concatenate and linearly transform the single-head self-attention weighted values along the hidden-layer dimension as in formula (21): $M_c = \mathrm{concat}(A^1_c, \ldots, A^{H_3}_c)\, W^c_O$ (21), where $W^c_O$ is a trainable parameter and concat denotes concatenation along the lowest dimension.
Step 5.3.5: compute the global image feature sequence of the current layer through a two-layer feed-forward neural network as in formula (22): $\hat{E}^c = \mathrm{LN}\big(s_c + \max(0,\, s_c W_1 + b_1)\, W_2 + b_2\big)$ with $s_c = \mathrm{LN}(E^c_{global} + M_c)$ (22), where $W_1$, $b_1$, $W_2$, and $b_2$ are trainable parameters, $d_{ff}$ is the hidden-layer dimension of the feed-forward neural network, and LN denotes layer normalization.
Step 5.4: locally decompose the global image feature sequence of the current layer into local image feature sequences, compute attention between the local image feature sequences and the mesh feature sequence with a local cross-attention mechanism, and splice the results to obtain the global image feature sequence of the next layer.
Step 5.4.1: decompose the global image feature sequence of the current layer along its first dimension as in formula (23): $\{\hat{E}^c_i\}_{i=1}^{N} = \mathrm{split}_1(\hat{E}^c)$ (23), where $\hat{E}^c_i$ denotes the local image feature sequence corresponding to each image.
Step 5.4.2: obtain the Query matrix from each local image feature sequence and the Key and Value matrices from the mesh feature sequence, as in formula (24): $Q_i = \hat{E}^c_i U^c_Q,\ K_i = t\, U^c_K,\ V_i = t\, U^c_V$ (24), where $U^c_Q$, $U^c_K$, and $U^c_V$ are learnable parameters and $t$ is the mesh feature sequence.
Step 5.4.3: apply multi-head decomposition to the Query, Key, and Value matrices of formula (24) to obtain the local sub-hidden-state sequences of formula (25): $\{Q^e_i\}_{e=1}^{H_4} = \mathrm{split}(Q_i),\ \{K^e_i\}_{e=1}^{H_4} = \mathrm{split}(K_i),\ \{V^e_i\}_{e=1}^{H_4} = \mathrm{split}(V_i)$ (25), where split decomposes the input matrix along the lowest dimension and stacks along the highest, $H_4$ is the number of attention heads, and $d_{h_4} = h / H_4$ is the hidden-layer dimension of each head.
Step 5.4.4: compute the scaled dot-product attention of each head to obtain the single-head cross-attention weighted value of formula (26): $B^e_c = \mathrm{softmax}\big(Q^e_i (K^e_i)^{\top} / \sqrt{d_{h_4}}\big) V^e_i$ (26), where $B^e_c$ denotes the cross-attention weighted value of the $e$-th head of the layer-$c$ network, $e = 1, 2, \ldots, H_4$.
Step 5.4.5: concatenate and linearly transform the single-head cross-attention weighted values along the hidden-layer dimension as in formula (27): $M'_c = \mathrm{concat}(B^1_c, \ldots, B^{H_4}_c)\, U^c_O$ (27), where $U^c_O$ is a trainable parameter and concat denotes concatenation along the lowest dimension.
Step 5.4.6: compute the global image feature sequence of the next layer through a two-layer feed-forward neural network as in formula (28): $E^{c+1}_{global} = \mathrm{LN}\big(s'_c + \max(0,\, s'_c W'_1 + b'_1)\, W'_2 + b'_2\big)$ with $s'_c = \mathrm{LN}(\hat{E}^c + M'_c)$ (28), where $W'_1$, $b'_1$, $W'_2$, and $b'_2$ are trainable parameters, $d_{ff}$ is the hidden-layer dimension of the feed-forward neural network, and LN denotes layer normalization.
Set a number of iterations and execute steps 5.3 to 5.4 cyclically until the iteration count is reached, thereby obtaining the camera feature sequence.
Step 6: slice out the camera-head features of the camera feature sequence and map them to the explicit camera parameters of formula (29): $cam_i = \mathrm{MLP}(f^i_{cls})$ (29), where the camera extrinsics, comprising a quaternion and an offset, and the camera intrinsics, comprising the focal length and the principal-point coordinates, are obtained directly from $cam_i$ by slicing.
Step 7: color the three-dimensional model using the explicit camera parameters and simulated ray casting.
- 2. The three-dimensional target coloring method based on a rendering graph according to claim 1, wherein step 3 is implemented as follows: Step 3.1: obtain the initial image feature sequence with the ViT model by patch segmentation and position encoding; Step 3.2: obtain the initial mesh feature sequence from the three-dimensional mesh via triangular-face feature vectors.
- 3. The three-dimensional target coloring method based on a rendering graph according to claim 2, wherein step 3.1 is implemented as follows (a patch-embedding sketch follows the claims):
Step 3.1.1: segment each image into the patch space sequence of formula (1): $X_i = \mathrm{stack}\{x_1, x_2, \ldots, x_K\}$ (1), where stack denotes stacking vectors along the first dimension, $X_i$ denotes the patch sequence corresponding to the $i$-th rendering image, $x_k \in \mathbb{R}^{P^2 C}$ is the result of flattening the pixels of each patch from low to high dimensions, $P$ is the image patch size, $C$ is the number of color channels of the image, and $K$ is the total number of patches.
Step 3.1.2: map the patch sequence into the ViT hidden-layer dimension to obtain the embedded representation of formula (2): $e_k = x_k W_E + b_E,\ E_i = \mathrm{stack}\{e_k\}$ (2), where $W_E$ and $b_E$ are trainable parameters, $h$ denotes the hidden-layer dimension, and $e_k$ and $E_i$ are respectively the embedded representation of each patch and the stacked patch embedded representation.
Step 3.1.3: apply the position encoding of the patch space sequence to the embedded representation as in formula (3) to obtain the layer-1 initial image feature sequence of ViT: $y^1_i = E_i + Pos_1$ (3), where $Pos_1$ is the position encoding; the layer-1 initial image feature sequence of ViT is denoted $y^1_i$.
- 4. The three-dimensional target coloring method based on a rendering graph according to claim 2, wherein step 3.2 is implemented as follows:
Step 3.2.1: obtain the 10-dimensional face feature vector of the three-dimensional mesh.
Step 3.2.2: apply a multi-resolution adaptive parameterization method to the three-dimensional mesh M, progressively simplifying the mesh and establishing a globally consistent parameterization map, and extract the mesh features of each partitioned sub-region as in formula (9): $T'_i = \mathrm{concat}\{m(f_i)\},\ f_i \in F_i$ (9), where the three-dimensional mesh M is divided into R sub-regions $M_1, M_2, \ldots, M_R$, all faces contained in a sub-region $M_i$ are denoted $F_i$, concat denotes vector splicing, and $n$, the number of faces contained in sub-region $M_i$, is a fixed value.
Step 3.2.3: apply one linear transformation to $T'_i$ as in formula (10) so that the mesh feature dimension matches the image feature dimension: $T_i = T'_i W_T + b_T$ (10), where $W_T$ and $b_T$ are learnable parameters and $T_i$ is the normalized mesh feature corresponding to sub-region $M_i$.
Step 3.2.4: add position codes to the sub-region mesh features to obtain the initial mesh feature sequence of formula (11): $t = \mathrm{stack}\{T_1, T_2, \ldots, T_R\} + Pos_2$ (11), where stack denotes stacking along the highest dimension, $Pos_2$ is the position encoding, and the mesh feature sequence at layer $l$ is denoted $t^l$.
- 5. The three-dimensional target coloring method based on a rendering graph according to claim 4, wherein step 3.2.1 is implemented as follows (a face-feature sketch follows the claims):
Step 3.2.1.1: for each triangular face $f_i$ in F with vertices $\{a, b, c\} = \{(x_1, y_1, z_1), (x_2, y_2, z_2), (x_3, y_3, z_3)\}$, set $u = b - a$, $v = c - a$, and $w = b - c$, and obtain the area of the triangular face as in formula (4): $A(f_i) = \tfrac{1}{2}\,\|u \times v\|$ (4).
Step 3.2.1.2: obtain the face normal of the triangular face $f_i$ as in formula (5): $n_f = (u \times v) / \|u \times v\|$ (5).
Step 3.2.1.3: obtain the point normal $n_a$ as in formula (6): $n_a = \sum_{f \in adj(a)} w_f\, n_f \,\big/\, \big\|\sum_{f \in adj(a)} w_f\, n_f\big\|$ (6), where adj(a) denotes all faces adjacent to the vertex $a$ and $w_f$ is a weight, generally taken as the area $A(f)$ of the face $f$.
Step 3.2.1.4: obtain the three interior angles of the triangular face as in formula (7): $\alpha = \arccos\frac{u \cdot v}{\|u\|\,\|v\|},\ \beta = \arccos\frac{(-u) \cdot (-w)}{\|u\|\,\|w\|},\ \gamma = \arccos\frac{(-v) \cdot w}{\|v\|\,\|w\|}$ (7).
Step 3.2.1.5: obtain the 10-dimensional face feature vector as in formula (8): $m(f_i) = \big(A(f_i), \alpha, \beta, \gamma, n_f, n_f \cdot n_a, n_f \cdot n_b, n_f \cdot n_c\big)$ (8).
- 6. The three-dimensional target coloring method based on a rendering graph according to claim 1, wherein step 4 is implemented as follows (an attention-layer sketch follows the claims):
Step 4.1: compute the Query, Key, and Value matrices from the initial image feature sequence and the initial mesh feature sequence as in formula (12): $Q = Z W_Q,\ K = Z W_K,\ V = Z W_V$ (12), where $W_Q$, $W_K$, and $W_V$ are learnable parameters and $Z$ denotes the initial image feature sequence or the initial mesh feature sequence.
Step 4.2: apply multi-head decomposition to the Query, Key, and Value matrices to obtain the sub-hidden-state sequences of formula (13): $\{Q^a\}_{a=1}^{H} = \mathrm{split}(Q),\ \{K^a\}_{a=1}^{H} = \mathrm{split}(K),\ \{V^a\}_{a=1}^{H} = \mathrm{split}(V)$ (13), where split decomposes the input matrix along the lowest dimension and stacks along the highest, $H$ is the number of attention heads, $d_h = h / H$ is the hidden-state dimension of each head, and $K_1$ denotes the length of the initial image feature sequence and of the initial mesh feature sequence.
Step 4.3: compute the scaled dot-product attention of each head to obtain the single-head self-attention weighted value of formula (14): $A^a_l = \mathrm{softmax}\big(Q^a (K^a)^{\top} / \sqrt{d_h}\big) V^a$ (14), where $A^a_l$ denotes the self-attention weighted value of the $a$-th head of the layer-$l$ network, $a = 1, 2, \ldots, H$.
Step 4.4: concatenate and linearly transform the single-head self-attention weighted values along the hidden-layer dimension as in formula (15): $M_l = \mathrm{concat}(A^1_l, \ldots, A^H_l)\, W_O$ (15), where $W_O$ is a trainable parameter and concat denotes concatenation along the lowest dimension.
Step 4.5: obtain the output of this layer's network through a two-layer feed-forward neural network as in formula (16): $z^{l+1} = \mathrm{LN}\big(s_l + \max(0,\, s_l W_1 + b_1)\, W_2 + b_2\big)$ with $s_l = \mathrm{LN}(z^l + M_l)$ (16), where $W_1$, $b_1$, $W_2$, and $b_2$ are trainable parameters and $d_{ff}$ is the hidden-layer dimension of the feed-forward neural network.
Step 4.6: set a number of iterations and execute steps 4.1 to 4.5 cyclically until the iteration count is reached, obtaining the image feature sequence $y_i$ and the mesh feature sequence T.
- 7. The three-dimensional target coloring method based on a rendering graph according to claim 1, wherein step 7 is implemented as follows (a ray-casting sketch follows the claims):
Step 7.1: sample a single rendering image in two-dimensional space with equidistant steps.
Step 7.2: initialize the color set of each node of the three-dimensional model to NULL, and construct for each sampling position the simulated ray L(x) of formula (30) from the pinhole-camera center coordinates and bilinear interpolation: $L(x) = O + xD$ (30), where, with the extrinsic rotation matrix R and offset t of the pinhole camera model, the ray starts at the camera center $O = -R^{\top} t$ with direction vector D pointing toward the coordinate $A = (u, v)$, $(u, v)$ being any point on the image; the color corresponding to the simulated ray L(x), i.e., the color at the point $(u, v)$, is obtained by bilinear interpolation.
Step 7.3: obtain the intersection of the simulated ray L(x) with the surface of the three-dimensional model as the hit point, and add the color carried by the ray to the color set of the surface node closest to the hit point.
Step 7.4: execute steps 7.1 to 7.3 cyclically until all rendering images have been processed.
Step 7.5: color the three-dimensional model with the mean value of each non-empty surface-node color set.
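The patch embedding of claim 3 follows the standard ViT recipe. Below is a minimal sketch of steps 3.1.1 to 3.1.3, assuming image sides that are multiples of the patch size; `w_e`, `b_e`, and `pos` stand in for the trainable projection $W_E$, $b_E$ and the position encoding $Pos_1$, and all names are illustrative assumptions rather than the patent's reference implementation.

```python
import torch

def patch_embed(image, patch_size, w_e, b_e, pos):
    """image: (C, H, W) tensor; w_e: (P*P*C, h); b_e: (h,); pos: (K, h) position codes."""
    c = image.shape[0]
    # Formula (1): cut the image into K = (H/P)*(W/P) patches of P*P*C values each.
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, patch_size * patch_size * c)
    # Formula (2): linear projection into the ViT hidden-layer dimension h.
    embedded = patches @ w_e + b_e
    # Formula (3): add the position encoding to get ViT's layer-1 feature sequence.
    return embedded + pos
```

For a 224 x 224 RGB image with P = 16 this yields a 196 x h sequence, which the ViT encoder then processes layer by layer.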
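Claim 5's 10-dimensional face feature packs the area, the three interior angles, the unit face normal, and the dot products of the face normal with the three vertex point-normals (formulas (4) to (8)). A hedged NumPy sketch, with the point normals assumed precomputed via formula (6):

```python
import numpy as np

def face_feature(a, b, c, n_a, n_b, n_c):
    """a, b, c: vertex coordinates (3,); n_a, n_b, n_c: unit point normals (3,)."""
    u, v, w = b - a, c - a, b - c
    cross = np.cross(u, v)
    area = 0.5 * np.linalg.norm(cross)        # formula (4): triangle area
    n_f = cross / np.linalg.norm(cross)       # formula (5): unit face normal

    def angle(x, y):
        # Angle between two edge vectors, clipped for numerical safety.
        return np.arccos(np.clip(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)), -1.0, 1.0))

    alpha = angle(u, v)      # formula (7): angle at vertex a
    beta = angle(-u, -w)     # angle at vertex b (edges b->a and b->c)
    gamma = angle(-v, w)     # angle at vertex c (edges c->a and c->b)
    # Formula (8): (A, alpha, beta, gamma, n_f, n_f.n_a, n_f.n_b, n_f.n_c) -> 10 dims.
    return np.concatenate([[area, alpha, beta, gamma], n_f,
                           [n_f @ n_a, n_f @ n_b, n_f @ n_c]])
```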
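The self-attention layer of claim 6 (and the global self-attention of claim 1, step 5.3) is standard multi-head scaled dot-product attention followed by a two-layer feed-forward network. A compact functional sketch of formulas (12) to (16); every weight name and shape here is an assumption:

```python
import math
import torch.nn.functional as F

def attention_layer(z, w_q, w_k, w_v, w_o, w_1, b_1, w_2, b_2, num_heads):
    """z: (K1, h) feature sequence; w_*: projection weights; returns the layer output."""
    k_len, h = z.shape
    d_head = h // num_heads
    # Formula (12): project the feature sequence into Query/Key/Value.
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    # Formula (13): split the lowest dimension into H heads, stack along the highest.
    q, k, v = (x.reshape(k_len, num_heads, d_head).transpose(0, 1) for x in (q, k, v))
    # Formula (14): scaled dot-product attention per head.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)
    heads = F.softmax(scores, dim=-1) @ v
    # Formula (15): concatenate heads and apply the output projection.
    m = heads.transpose(0, 1).reshape(k_len, h) @ w_o
    # Formula (16): residual + layer norm, then the two-layer feed-forward network.
    s = F.layer_norm(z + m, (h,))
    ffn = F.relu(s @ w_1 + b_1) @ w_2 + b_2
    return F.layer_norm(s + ffn, (h,))
```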
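Steps 5 and 6 of claim 1 alternate global self-attention over all image tokens with per-view cross-attention against the mesh features, then map each view's camera-head token to explicit parameters. The sketch below approximates that loop with PyTorch's built-in attention module; the module layout, the 11-value parameter packing (quaternion 4 + offset 3 + focal length 2 + principal point 2), and all names are assumptions, not the patent's architecture:

```python
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.ln1, self.ln2, self.ln3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, e_global, mesh_seq):
        """e_global: (N, K, dim) image tokens; mesh_seq: (1, R, dim) mesh tokens."""
        # Step 5.3: global self-attention over all views' tokens stacked together.
        n, k, d = e_global.shape
        flat = e_global.reshape(1, n * k, d)
        attn, _ = self.global_attn(flat, flat, flat)
        flat = self.ln1(flat + attn)
        # Step 5.4: per-view local cross-attention against the mesh feature sequence.
        local = flat.reshape(n, k, d)
        mesh = mesh_seq.expand(n, -1, -1)
        cross, _ = self.cross_attn(local, mesh, mesh)
        local = self.ln2(local + cross)
        # Feed-forward network with residual, roughly formulas (22)/(28).
        return self.ln3(local + self.ffn(local))

def slice_camera_params(camera_feats, head):
    """camera_feats: (N, K+1, dim) with the camera-head token at index 0;
    head: nn.Linear(dim, 11) realizing formula (29)."""
    cam = head(camera_feats[:, 0])
    # Slicing into extrinsics and intrinsics; the layout is an assumption.
    quat, offset, focal, principal = cam.split([4, 3, 2, 2], dim=-1)
    return quat, offset, focal, principal
```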
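The second stage (claim 7) needs no learning: it casts a ray per sampled pixel from the camera center $O = -R^{\top}t$ and votes colors onto the nearest mesh node. A hedged sketch, where `mesh.intersect`, `mesh.closest_vertex`, and `pixel_to_dir` are hypothetical helpers (a library such as trimesh provides comparable ray-casting utilities), and the nearest-pixel color stands in for the bilinear interpolation of formula (30):

```python
import numpy as np

def color_nodes(images, cameras, mesh, step=4):
    """cameras: per-view (R, t, pixel_to_dir) triples; mesh exposes hypothetical
    vertices / intersect / closest_vertex members."""
    color_sets = [[] for _ in mesh.vertices]        # step 7.2: initially empty (NULL)
    for img, (r_mat, t, pixel_to_dir) in zip(images, cameras):
        origin = -r_mat.T @ t                       # camera center O = -R^T t
        h, w, _ = img.shape
        for v in range(0, h, step):                 # step 7.1: equidistant sampling
            for u in range(0, w, step):
                direction = pixel_to_dir(u, v)      # L(x) = O + xD, formula (30)
                hit = mesh.intersect(origin, direction)   # step 7.3: hit point
                if hit is not None:
                    node = mesh.closest_vertex(hit)
                    color_sets[node].append(img[v, u])    # bilinear in practice
    # Step 7.5: color each node with the mean of its non-empty color set.
    return [np.mean(cs, axis=0) if cs else None for cs in color_sets]
```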
Description
Three-dimensional target coloring method based on rendering graph

Technical Field
The invention relates to a three-dimensional target coloring method based on a rendering graph, and belongs to the technical field of computer vision and graphic image processing.

Background
In computer graphics and computer vision, appearance reconstruction of three-dimensional target models is an important research direction. Appearance reconstruction refers to generating textures for, or coloring the mesh nodes of, a three-dimensional model according to given reference information. Traditional appearance reconstruction methods depend on manual UV mapping and texture painting; they require considerable labor and expertise, offer little automation, and adapt poorly to large-scale appearance reconstruction. With the development of techniques such as differentiable rendering and neural radiance fields (NeRF), researchers have begun to explore reconstructing the appearance of three-dimensional models from two-dimensional rendering images to improve coloring efficiency. However, most existing three-dimensional target coloring methods reconstruct the model and its appearance simultaneously from two-dimensional rendering images; in application scenarios where the three-dimensional model is already known, the coloring process cannot perceive or exploit that known model. Some existing techniques understand a prompt image end to end from the image and the three-dimensional model, generate textures with diffusion models and the like, and then convert the textures into node colors; the shading of such methods rests on an abstract understanding of the image content and lacks interpretability and traceability. How to extract color information from rendering images to color a three-dimensional target automatically has therefore become a problem to be solved.

Disclosure of Invention
The invention aims to solve the technical problem of automatically coloring a three-dimensional target by extracting color information from rendering images, and provides a three-dimensional target coloring method based on the rendering graph. The invention adopts a two-stage workflow: the first stage lets the model fully understand the content of the known three-dimensional model and the images and predicts the positional relationship between the two; the second stage transfers color from the rendering images to the surface of the three-dimensional model according to that relationship. The method first encodes each input image with a DINOv2 pre-trained ViT image encoder and prepends a classification feature to the feature sequence of each image; it then extracts the point cloud and face normals from the known three-dimensional model and encodes the partitioned three-dimensional mesh into a model feature sequence with a transformer network; finally, several alternating cross-attention layers compute cross-attention among all the image feature sequences, the feature sequence within a single image, and the model feature sequence, with self-attention computed after each cross-attention.
After all attention layers have been computed, the head classification feature of each image's feature sequence is extracted and fed into a camera-head network based on a multi-layer perceptron; the result is the camera parameters corresponding to the rendering image, including the camera intrinsics and extrinsics. In the second stage, according to the camera parameters obtained in the first stage, rays are emitted for each image from the camera optical center through the rendering image; the intersections of all rays with the model surface are obtained, and each surface intersection is colored according to the color of the image position the ray passed through. Combining the coloring results from multiple viewing angles yields the colored three-dimensional model.
The aim of the invention is realized by the following technical scheme. The invention discloses a rendering-graph-based three-dimensional target coloring method, applied to appearance reconstruction of a three-dimensional target model, comprising the following steps:
Step 1: obtain colored rendering images of a three-dimensional model with a visible light sensor or a rendering program to construct an input image set $\{I_1, I_2, \ldots, I_N\}$, where N is the total number of input images.
Step 2: extract a three-dimensional mesh $M = \{P, F\}$ from the triangular-face representation of the three-dimensional model, where P represents the nodes and F the triangular faces.
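Putting the two stages together, the workflow described above amounts to the following driver; each call is a placeholder for the corresponding sketch after the claims, not the patent's reference code:

```python
# Illustrative two-stage driver. Stage 1 predicts per-view camera parameters
# from the rendering images and the known mesh; stage 2 transfers color by
# simulated ray casting. All functions are hypothetical placeholders.
def color_three_d_target(images, mesh):
    image_seqs = [patch_embed_and_encode(img) for img in images]       # claim 3
    mesh_seq = encode_mesh_features(mesh)                              # claims 4-5
    image_seqs, mesh_seq = self_attention_stage(image_seqs, mesh_seq)  # claim 6
    camera_feats = alternating_attention(image_seqs, mesh_seq)         # claim 1, step 5
    cameras = slice_cameras(camera_feats)                              # claim 1, step 6
    return color_nodes(images, cameras, mesh)                          # claim 7
```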