
US-12627783-B2 - Generation of images for three-dimensional video for different viewpoints

US 12627783 B2

Abstract

An apparatus comprises a receiver (601) receiving captured video data for a real-world scene, the video data being linked with a capture pose region. A store (615) stores a 3D mesh model of the real-world scene. A renderer (605) generates an output image for a viewport for a viewing pose. The renderer (605) comprises a first circuit (607) arranged to generate first image data for the output image by projection of the captured video data to the viewing pose and a second circuit (609) arranged to determine second image data for a first region of the output image in response to the three-dimensional mesh model. A third circuit (611) generates the output image to include at least some of the first image data and to include the second image data for the first region. A fourth circuit (613) determines the first region based on a deviation of the viewing pose relative to the capture pose region.
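The flow described above can be made concrete with a minimal sketch in Python. Everything here is illustrative: the names (angular_deviation_deg, composite_output), the per-pixel confidence map for the view shift, and the 30-degree falloff are assumptions, not details fixed by the patent, which leaves the deviation measure and the blending rule open.

```python
import numpy as np

def angular_deviation_deg(view_yaw_deg, region_min_deg, region_max_deg):
    # Deviation (degrees) of the viewing pose from the capture pose
    # region: zero while the pose stays inside the region.
    if view_yaw_deg < region_min_deg:
        return region_min_deg - view_yaw_deg
    if view_yaw_deg > region_max_deg:
        return view_yaw_deg - region_max_deg
    return 0.0

def composite_output(first_image, first_confidence, second_image,
                     deviation_deg, full_mesh_at_deg=30.0):
    # Combine first image data (view-shifted captured video) with second
    # image data (a render of the static 3D mesh model). The first region,
    # filled from the mesh render, grows as the viewing pose deviates
    # further from the capture pose region.
    threshold = min(1.0, deviation_deg / full_mesh_at_deg)
    # Pixels whose view-shift confidence falls below the deviation-
    # dependent threshold form the first region.
    first_region = first_confidence < threshold
    output = np.where(first_region[..., None], second_image, first_image)
    return output, first_region
```

At zero deviation the whole viewport comes from the view-shifted video; as the viewing pose leaves the capture pose region, progressively more of the viewport is taken from the mesh model render.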

Inventors

  • Christiaan Varekamp
  • Bartholomeus Wilhelmus Damianus van Geest

Assignees

  • Koninklijke Philips N.V.

Dates

Publication Date
2026-05-12
Application Date
2022-06-24
Priority Date
2021-06-29

Claims (20)

  1. An apparatus, comprising: a first receiver circuit, wherein the first receiver circuit is arranged to receive captured video data, wherein the captured video data provides a dynamic representation of a real world scene, wherein the video data is linked with a capture pose region; a storage circuit, wherein the storage circuit is arranged to store a three-dimensional mesh model, wherein the three-dimensional mesh model provides a static representation of a portion of the real world scene; a second receiver circuit, wherein the second receiver circuit is arranged to receive a viewing pose; and a renderer circuit, wherein the renderer circuit is arranged to generate an output image for a viewport of the viewing pose; wherein the renderer circuit comprises a first portion, a second portion, a third portion and a fourth portion, wherein the first portion is arranged to generate first image data of the viewport for a portion of the output image by view-shifting the captured video data from a capture pose to the viewing pose, wherein the second portion is arranged to generate second image data of the viewport for at least a first region of the output image using the three-dimensional mesh model, wherein the third portion is arranged to generate the output image so as to comprise at least a portion of the first image data, wherein the third portion is arranged to generate the output image so as to comprise the second image data of the first region, wherein the fourth portion is arranged to determine the first region based on a deviation of the viewing pose relative to the capture pose region.
  2. The apparatus of claim 1, wherein the renderer circuit is arranged to determine whether a quality of first image data generated by the first portion does not meet a quality criterion.
  3. The apparatus of claim 1, wherein the third portion is arranged to determine the first region based on a difference between the viewing pose and the capture pose region.
  4. The apparatus of claim 3, wherein the difference is an angular difference.
  5. The apparatus of claim 1, wherein the renderer circuit is arranged to change the second image data based on the captured video data.
  6. The apparatus of claim 1, wherein the renderer circuit is arranged to adapt the first image data based on the three-dimensional mesh model.
  7. The apparatus of claim 1, wherein the renderer circuit is arranged to change the second image data based on the first image data.
  8. The apparatus of claim 1, wherein the renderer circuit is arranged to change the first image data based on the second image data.
  9. The apparatus of claim 1, wherein the renderer circuit is arranged to change the three-dimensional mesh model based on the first image data.
  10. The apparatus of claim 1, further comprising a model generator circuit, wherein the model generator circuit is arranged to generate the three-dimensional mesh model based on the captured video data.
  11. The apparatus of claim 1, wherein the first receiver circuit is arranged to receive the video data from a remote source, wherein the first receiver circuit is arranged to receive the three-dimensional mesh model from the remote source.
  12. The apparatus of claim 1, wherein the second portion is arranged to vary a detail level of the first region based on the deviation of the viewing pose relative to the capture pose region.
  13. The apparatus of claim 1, wherein the first receiver circuit is arranged to receive second captured video data of the real world scene, wherein the second captured video data is linked with a second capture pose region, wherein the first portion is arranged to determine third image data for a portion of the output image by projection of the second captured video data to the viewing pose, wherein the third portion is arranged to determine the first region based on a deviation of the viewing pose with respect to the second capture pose region.
  14. A method, comprising: receiving captured video data, wherein the captured video data provides a dynamic representation of a real world scene, wherein the video data is linked with a capture pose region; storing a three-dimensional mesh model, wherein the three-dimensional mesh model provides a static representation of a portion of the real world scene; receiving a viewing pose; and generating an output image for a viewport of the viewing pose; the generating comprising: generating first image data of the viewport for a portion of the output image by view-shifting the captured video data from a capture pose to the viewing pose; generating second image data of the viewport for at least a first region of the output image using the three-dimensional mesh model; generating the output image so as to comprise at least a portion of the first image data and the second image data of the first region; and determining the first region based on a deviation of the viewing pose relative to the capture pose region.
  15. A computer program stored on a non-transitory medium, wherein the computer program when executed on a processor performs the method of claim 14.
  16. The method of claim 14, further comprising determining whether a quality of first image data does not meet a quality criterion.
  17. The method of claim 14, further comprising determining the first region based on a difference between the viewing pose and the capture pose region.
  18. The method of claim 17, wherein the difference is an angular difference.
  19. The method of claim 14, further comprising changing the second image data based on the captured video data.
  20. The method of claim 14, further comprising changing the first image data based on the three-dimensional mesh model.
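As a rough illustration of how the angular difference of claims 4 and 18 and the deviation-dependent detail level of claim 12 might be realized, consider the sketch below. The angle measure and the thresholds are assumptions for illustration; the claims fix neither.

```python
import math

def angular_difference_deg(view_dir, capture_dir):
    # One possible "angular difference" (claims 4 and 18): the angle
    # between the viewing direction and the nearest capture direction,
    # both given as 3D unit vectors.
    dot = sum(a * b for a, b in zip(view_dir, capture_dir))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding drift
    return math.degrees(math.acos(dot))

def mesh_detail_level(deviation_deg):
    # Claim 12 sketch: reduce the detail level of the first region as the
    # deviation from the capture pose region grows (illustrative thresholds).
    if deviation_deg <= 5.0:
        return "high"    # full-resolution mesh and textures
    if deviation_deg <= 15.0:
        return "medium"  # simplified mesh, reduced textures
    return "low"         # coarse proxy geometry only
```

For example, a viewing direction about 14 degrees away from the nearest capture direction would select the "medium" level under these illustrative thresholds.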

Description

CROSS-REFERENCE TO PRIOR APPLICATIONS

This application is the U.S. National Phase application under 35 U.S.C. § 371 of International Application No. PCT/EP2022/067371, filed on Jun. 24, 2022, which claims the benefit of EP Patent Application No. EP 21182528.6, filed on Jun. 29, 2021. These applications are hereby incorporated by reference herein.

FIELD OF THE INVENTION

The invention relates to an image generation approach and in particular, but not exclusively, to the generation of images for a three-dimensional video signal for different viewpoints.

BACKGROUND OF THE INVENTION

The variety and range of image and video applications have increased substantially in recent years, with new services and ways of utilizing and consuming video and images being continuously developed and introduced. For example, one increasingly popular service is the provision of image sequences in such a way that the viewer is able to actively and dynamically interact with the view of the scene, such that the viewer can change the viewing position or direction in the scene, with the presented video adapting to present a view from the changed position or direction.

Three-dimensional video capture, distribution, and presentation is becoming increasingly popular and desirable in some applications and services. A particular approach is known as immersive video and typically includes the provision of views of a real-world scene, and often a real-time event, that allow small viewer movements, such as relatively small head movements and rotations. For example, a real-time video broadcast of e.g. a sports event that allows local client-based generation of views following small head movements of a viewer may provide the impression of a user being seated in the stands watching the sports event. The user can e.g. look around and will have a natural experience similar to the experience of a spectator being present at that position in the stand.

Recently, there has been an increasing prevalence of display devices with positional tracking and 3D interaction supporting applications based on 3D capturing of real-world scenes. Such display devices are highly suitable for immersive video applications providing an enhanced three-dimensional user experience.

In order to provide such services for a real-world scene, the scene is typically captured from different positions using different camera capture poses. As a result, the relevance and importance of multi-camera capturing and e.g. 6DoF (6 Degrees of Freedom) processing is quickly increasing. Applications include live concerts, live sports, and telepresence. The freedom of selecting one's own viewpoint enriches these applications by increasing the feeling of presence over regular video. Furthermore, immersive scenarios can be conceived where an observer may navigate and interact with a live captured scene. For broadcast applications this may require real-time depth estimation on the production side and real-time view synthesis at the client device. Both depth estimation and view synthesis introduce errors, and these errors depend on the implementation details of the algorithms employed.

In many such applications, three-dimensional scene information is often provided that allows high quality view image synthesis for viewpoints that are relatively close to the reference viewpoint(s), but which deteriorates if the viewpoint deviates too much from the reference viewpoints.
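To make the view synthesis step above concrete, the following is a minimal depth-image-based rendering sketch in Python: each source pixel is back-projected using its depth, moved by the relative pose (R, t), and re-projected through the camera intrinsics K. This is a generic textbook warp offered as an assumption of how such a step may look, not the implementation used in this patent.

```python
import numpy as np

def forward_warp(image, depth, K, R, t):
    # image: (h, w, 3) colors; depth: (h, w) depths in the source camera.
    # Illustrative only: real renderers also handle occlusion ordering,
    # hole filling and resampling filters.
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).T.astype(float)
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)  # 3D points, source frame
    pts_t = R @ pts + t.reshape(3, 1)                    # 3D points, target frame
    proj = K @ pts_t                                     # re-project to pixels
    u = np.round(proj[0] / proj[2]).astype(int)
    v = np.round(proj[1] / proj[2]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (pts_t[2] > 0)
    out = np.zeros_like(image)                           # unfilled pixels stay black
    out[v[valid], u[valid]] = image.reshape(-1, image.shape[-1])[valid]
    return out
```

The holes that remain are exactly the de-occlusion problem discussed below: pixels for which the source view carries no data at the new viewpoint. In the approach of this patent, such regions can instead be filled from the static mesh model.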
A set of video cameras that are offset with respect to each other may capture a scene in order to provide three-dimensional image data, for example in the form of multiple 2D images from offset positions and/or as image data plus depth data. A rendering device may dynamically process the three-dimensional data to generate images for different view positions/directions as these change. The rendering device can dynamically perform e.g. viewpoint shifting or projection to dynamically follow the user movements.

An issue with e.g. immersive video is that the viewing space, being a space wherein a viewer has an experience of sufficient quality, is limited. As the viewer moves outside the viewing space, degradations and errors resulting from synthesizing the view images become increasingly significant, and an unacceptable user experience may result. Errors, artefacts, and inaccuracies in the generated view images may specifically occur due to the provided 3D video data not providing sufficient information for the view synthesis (e.g. de-occlusion data).

For example, typically when multiple cameras are used to capture a 3D representation of a scene, playback on a virtual reality headset tends to be spatially limited to virtual viewpoints that lie close to the original camera locations. This ensures that the render quality of the virtual viewpoints does not show artefacts, typically the result of missing information (occluded data) or 3D estimation errors. Inside the so-called sweet spot viewing zone, rendering can be done directly from on