US-20260127825-A1 - GENERATING A RENDERED IMAGE OF A THREE-DIMENSIONAL OBJECT
Abstract
Data representing a 3D object to be rendered is obtained at a user device, the data comprising: a mesh structure defining 3D co-ordinates for each of a plurality of object vertices; and a feature vector encoding visual characteristics for an object surface defined by the plurality of object vertices. An artificial neural network, ANN, is selected from a plurality of ANNs stored on the user device, each of the plurality of ANNs being configured to output pixel colour values for the object surface based on at least some of the visual characteristics encoded in the feature vector, wherein different ANNs have different numbers of layers and/or different numbers of parameters, wherein the selection is based on a resource characteristic of the user device. The mesh structure is processed using the selected ANN to generate a rendered image of the three-dimensional object.
Inventors
- Ioannis Andreopoulos
- Matthias Sebastian Treder
- Sebastian Alexander Lutz
- Pinaki Nath Chowdhury
- Jia-Jie Lim
Assignees
- SONY INTERACTIVE ENTERTAINMENT EUROPE LIMITED
Dates
- Publication Date
- 20260507
- Application Date
- 20251121
Claims (20)
- 1 . A computer-implemented method for generating a rendered image of a three-dimensional object, the method comprising, at a user device: obtaining data representing a three-dimensional object to be rendered, the data comprising: a mesh structure modelling a geometry of the three-dimensional object, the mesh structure defining three-dimensional co-ordinates for each of a plurality of object vertices; and a feature vector encoding visual characteristics of an object surface defined by the plurality of object vertices; selecting an artificial neural network, ANN, from a plurality of ANNs stored on the user device, each of the plurality of ANNs being configured to output pixel colour values for the object surface based on at least some of the visual characteristics encoded in the feature vector, wherein different ANNs in the plurality of ANNs have different numbers of layers and/or different numbers of parameters, wherein the selection is based on a resource characteristic of the user device; and processing the mesh structure using the selected ANN to generate a rendered image of the three-dimensional object.
- 2 . The method according to claim 1 , wherein different ANNs in the plurality of ANNs are configured to use different amounts of information encoded in the feature vector to determine the pixel colour values for the object surface.
- 3 . The method according to claim 1 , wherein the feature vector encodes a radiance field for the three-dimensional object.
- 4 . The method according to claim 1 , wherein the feature vector comprises a multi-dimensional map of texture features configured to map any point on any surface of the mesh structure to at least one texture feature of the multi-dimensional map.
- 5 . The method according to claim 4 , wherein a first ANN in the plurality of ANNs is configured to take only a first portion of the multi-dimensional map as input, and wherein a second ANN in the plurality of ANNs is configured to take the entire multi-dimensional map as input.
- 6 . The method according to claim 1 , wherein the selected ANN comprises a first ANN, and wherein the method comprises: determining a change in the resource characteristic of the user device; based on the determined change, selecting a second, different, ANN from the plurality of ANNs stored on the user device; and processing the mesh structure using the second ANN to generate a further rendered image of the three-dimensional object.
- 7 . The method according to claim 6 , wherein determining the change in the resource characteristic comprises determining a current computational load of the user device.
- 8 . The method according to claim 6 , wherein determining the change in the resource characteristic comprises determining whether the user device is in a power-saving mode and/or is being powered by a battery.
- 9 . The method according to claim 1 , wherein each ANN in the plurality of ANNs comprises a multilayer perceptron, MLP, configured to transform at least some of the visual characteristics encoded in the feature vector into the pixel colour values.
- 10 . The method according to claim 1 , the method comprising: receiving, in an initial or offline stage, at least one of the mesh structure and the feature vector; and storing the at least one of the mesh structure and the feature vector in storage of the user device.
- 11 . The method according to claim 1 , wherein the resource characteristic of the user device comprises one or more of: processing resources of the user device; memory resources of the user device; power resources of the user device; and a display size associated with the user device.
- 12 . The method according to claim 1 , wherein each ANN in the plurality of ANNs is trained by minimising a loss function between pixel colour values predicted by the ANN and pixel colour values of at least one existing image, the loss function comprising at least one of a photometric loss function, a silhouette loss function and a regularisation loss function.
- 13 . The method according to claim 1 , wherein the three-dimensional object comprises one of: a human head avatar for videoconferencing; and a video game character.
- 14 . The method according to claim 1 , wherein obtaining the data representing the three-dimensional object comprises retrieving the mesh structure from storage of the user device.
- 15 . The method according to claim 1 , wherein obtaining the data representing the three-dimensional object comprises retrieving the feature vector from storage of the user device.
- 16 . The method according to claim 1 , wherein the method comprises receiving motion information indicative of motion of the three-dimensional object in a scene, and wherein processing the mesh structure comprises: deforming the mesh structure by adjusting the co-ordinates for one or more of the plurality of object vertices based on the received motion information; and processing the deformed mesh structure using the selected ANN to generate the rendered image of the three-dimensional object.
- 17 . The method according to claim 1 , wherein processing the mesh structure comprises processing the mesh structure using a graphics pipeline comprising a vertex shader and a fragment shader, wherein the fragment shader comprises the selected ANN.
- 18 . The method according to claim 1 , wherein the feature vector and the plurality of ANNs are trained simultaneously in an end-to-end manner using back-propagation of errors.
- 19 . A computing device comprising: a processor; and memory; wherein the computing device is arranged to perform, using the processor, operations comprising: obtaining data representing a three-dimensional object to be rendered, the data comprising: a mesh structure modelling a geometry of the three-dimensional object, the mesh structure defining three-dimensional co-ordinates for each of a plurality of object vertices; and a feature vector encoding visual characteristics of an object surface defined by the plurality of object vertices; selecting an artificial neural network, ANN, from a plurality of ANNs stored on the computing device, each of the plurality of ANNs being configured to output pixel colour values for the object surface based on at least some of the visual characteristics encoded in the feature vector, wherein different ANNs in the plurality of ANNs have different numbers of layers and/or different numbers of parameters, wherein the selection is based on a resource characteristic of the computing device; and processing the mesh structure using the selected ANN to generate a rendered image of the three-dimensional object.
- 20 . A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising: obtaining data representing a three-dimensional object to be rendered, the data comprising: a mesh structure modelling a geometry of the three-dimensional object, the mesh structure defining three-dimensional co-ordinates for each of a plurality of object vertices; and a feature vector encoding visual characteristics of an object surface defined by the plurality of object vertices; selecting an artificial neural network, ANN, from a plurality of ANNs stored on a user device, each of the plurality of ANNs being configured to output pixel colour values for the object surface based on at least some of the visual characteristics encoded in the feature vector, wherein different ANNs in the plurality of ANNs have different numbers of layers and/or different numbers of parameters, wherein the selection is based on a resource characteristic of the user device; and processing the mesh structure using the selected ANN to generate a rendered image of the three-dimensional object.
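By way of non-limiting illustration only, the network-selection step of claims 1, 6, 7, 8 and 11 might be sketched as follows. This is a minimal sketch assuming a PyTorch environment; the network depths and widths, the feature dimensionality, the resource probes (gpu_load, on_battery) and the selection thresholds are all illustrative assumptions introduced here, not details fixed by the claims.

```python
# Illustrative sketch only: a plurality of MLPs of different sizes stored on
# the device, with one selected per frame from a resource characteristic.
import torch
import torch.nn as nn

def make_mlp(in_features: int, hidden: int, depth: int) -> nn.Sequential:
    """Build an MLP mapping surface feature vectors to RGB pixel colour values."""
    layers, width = [], in_features
    for _ in range(depth):
        layers += [nn.Linear(width, hidden), nn.ReLU()]
        width = hidden
    layers += [nn.Linear(width, 3), nn.Sigmoid()]  # RGB in [0, 1]
    return nn.Sequential(*layers)

FEATURE_DIM = 32  # assumed feature-vector dimensionality

# Different numbers of layers/parameters per network; the smaller networks
# consume only a leading slice of the feature vector (cf. claims 2 and 5).
ANN_VARIANTS = {
    "small":  (8,  make_mlp(8,  32, 2)),          # (input slice, network)
    "medium": (16, make_mlp(16, 64, 3)),
    "large":  (FEATURE_DIM, make_mlp(FEATURE_DIM, 128, 4)),
}

def select_ann(gpu_load: float, on_battery: bool):
    """Pick a network from a resource characteristic of the device
    (cf. claims 7, 8 and 11); thresholds here are assumptions."""
    if on_battery or gpu_load > 0.8:
        return ANN_VARIANTS["small"]
    if gpu_load > 0.5:
        return ANN_VARIANTS["medium"]
    return ANN_VARIANTS["large"]

def shade(features: torch.Tensor, gpu_load: float, on_battery: bool) -> torch.Tensor:
    """Map per-pixel feature vectors (N, FEATURE_DIM) to colours (N, 3)."""
    slice_dim, ann = select_ann(gpu_load, on_battery)
    return ann(features[:, :slice_dim])
```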
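Similarly, the multi-dimensional texture-feature map of claims 4 and 5 might be realised as a learnable 2D grid of feature channels sampled by bilinear interpolation, assuming a UV parameterisation of the mesh surface; the map resolution and channel count below are assumptions for illustration.

```python
# Illustrative sketch only: map any surface point (via its UV co-ordinates)
# to a feature vector drawn from a multi-dimensional feature map.
import torch
import torch.nn.functional as F

feature_map = torch.randn(1, 32, 256, 256)  # (batch, channels, H, W), assumed shape

def sample_features(uv: torch.Tensor) -> torch.Tensor:
    """Map surface points with UV co-ordinates in [0, 1]^2, shape (N, 2),
    to feature vectors of shape (N, 32) by bilinear interpolation."""
    grid = uv.view(1, -1, 1, 2) * 2.0 - 1.0  # rescale to [-1, 1] for grid_sample
    feats = F.grid_sample(feature_map, grid, align_corners=True)  # (1, 32, N, 1)
    return feats.squeeze(0).squeeze(-1).T  # (N, 32)

# Per claim 5, a lightweight ANN may consume only the first k channels,
# features[:, :k], while the largest ANN consumes the entire map.
```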
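The motion-driven deformation of claim 16 and the neural fragment-shading stage of claim 17 could then combine along the lines below. The rasterise and sample_features callables are placeholder names introduced here for illustration; any conventional vertex-shader/rasterisation stage could supply them.

```python
# Illustrative sketch only: deform the mesh from motion information, then use
# the selected ANN as the fragment-shading stage of the graphics pipeline.
import torch

def deform(vertices: torch.Tensor, displacements: torch.Tensor) -> torch.Tensor:
    """Adjust vertex co-ordinates (V, 3) by per-vertex motion offsets (V, 3)."""
    return vertices + displacements

def render_frame(vertices, displacements, rasterise, sample_features, ann):
    deformed = deform(vertices, displacements)   # claim 16: motion-based deformation
    surface_points = rasterise(deformed)         # vertex-shader/rasterisation stage
    features = sample_features(surface_points)   # per-pixel feature vectors (N, F)
    return ann(features)                         # claim 17: ANN as fragment shader
```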
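Finally, the training regime of claims 12 and 18 amounts to minimising a weighted sum of photometric, silhouette and regularisation terms while back-propagating into the feature map and all networks jointly; the loss weights, architectures and map shape below are illustrative assumptions.

```python
# Illustrative sketch only: joint end-to-end training of the feature map and
# every network in the plurality (claims 12 and 18).
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder plurality of colour networks; architectures are assumptions.
anns = [nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 3))
        for _ in range(3)]
feature_map = torch.randn(256, 256, 32, requires_grad=True)  # assumed resolution

def composite_loss(pred_rgb, gt_rgb, pred_mask, gt_mask,
                   w_photo=1.0, w_sil=0.1, w_reg=1e-4):
    """Weighted sum of the three loss terms named in claim 12.
    pred_mask/gt_mask are silhouette probabilities in [0, 1]."""
    photometric = F.mse_loss(pred_rgb, gt_rgb)                # colour fidelity
    silhouette = F.binary_cross_entropy(pred_mask, gt_mask)   # object outline
    regulariser = feature_map.pow(2).mean()                   # keep features small
    return w_photo * photometric + w_sil * silhouette + w_reg * regulariser

# One optimiser over the feature map and all networks, so errors back-propagate
# into both simultaneously (claim 18).
params = [feature_map] + [p for ann in anns for p in ann.parameters()]
optimiser = torch.optim.Adam(params, lr=1e-3)
```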
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a continuation of PCT Application No. PCT/GB2024/051348, filed on May 24, 2024, which claims priority to U.S. Provisional Application No. 63/503,981, filed on May 24, 2023, the disclosures of which are incorporated by reference.
TECHNICAL FIELD
The present disclosure concerns computer-implemented methods of processing image data, and in particular of generating rendered images of three-dimensional objects.
BACKGROUND
In traditional graphics pipelines, scenes and their constituent objects are represented by one or more meshes (e.g. contiguous sets of triangles) that capture the 3D structure, combined with several auxiliary texture and feature maps that represent additional surface and reflectance properties such as texture, roughness, bumps, specularities, etc. While these techniques are pervasive and widely used in graphics applications from computer games to augmented and virtual reality, architecture and car design, they have limitations due to their discrete approximation of potentially complex and dynamic objects with a set of triangles. Fidelity and realism typically increase with the number of triangles and the size of the texture maps, but large meshes pose computational limits to graphics pipelines.
There are a number of examples of objects that lead to prohibitively large meshes and texture maps, which may in turn lead to performance bottlenecks. First, highly dynamic structures such as fluids and clouds are difficult to represent with a static mesh. Second, due to the high human sensitivity to faces and facial expressions, and the high degree of idiosyncrasy across individuals, representing human faces accurately requires a potentially prohibitive amount of resources, including fine meshes to capture individual head geometry, multiple high-resolution texture maps to accurately model details such as bumps, hair, pimples and beauty spots, and high-dimensional motion primitives to capture idiosyncratic facial expressions that are closely tied to an individual's identity. For computational reasons, real-world faces are typically approximated by blendshapes: parametric face models that approximate novel faces as a linear combination of basis functions or eigenfaces. The expressivity of these models is limited by the crudeness of using a linear basis for the approximation, and despite many improvements and non-linear extensions of these models over the years, the ‘uncanny valley’ effect persists.
In recent years, an alternative approach known as neural radiance fields (NeRFs) has become hugely popular in academic research and offers a new way to represent 3D scenes. NeRFs implicitly learn object shape, structure, reflectance, texture, etc. directly from a set of training images and camera view parameters. If done well, the scene can be rendered from novel viewpoints at high fidelity and with spatial consistency. That is, rendered images are photo-realistic, and the uncanny-valley effect that is so prevalent in mesh-based approaches is much reduced or absent. However, neural radiance fields also come with some drawbacks. First, their rendering time is very slow compared to the modern 3D graphics pipeline. While real-time rendering of a NeRF is possible on state-of-the-art hardware, it comes at the cost of higher usage of computational resources, does not scale well, and is much more difficult to achieve on mobile devices.
Second, NeRFs are generative methods and come with the common pitfalls of such approaches: if a scene is rendered from a viewpoint close to the ones seen during training, the rendered images generally look adequate. Rendered from viewpoints far from those seen in the training data, however, the rendered image can contain many artefacts that severely degrade its quality. Since the 3D geometry of the scene is learned implicitly, it is possible that the method learns structures that look viable from the training viewpoints but are inconsistent when viewed from new angles. This problem is exacerbated when moving from static scenes without motion to dynamic scenes that involve movements of parts or other time-contingent visual changes. In dynamic scenes, it is significantly more difficult to disentangle the base structure of a scene from its dynamic motions. This is less of a problem when using traditional methods and meshes, since those have a fixed 3D geometry, which limits the range of possible artefacts, and there is a large body of research available on animating them.
The present disclosure seeks to solve or mitigate some or all of these above-mentioned problems. Alternatively and/or additionally, aspects of the present disclosure seek to provide improved methods of generating rendered images of three-dimensional objects.
SUMMARY
In accordance with a first aspect of the present disclosure there is provided a computer-implemented method for generating a rendered image of a three-dimensional object, the method comprising, at a user devi