EP-4736118-A1 - SELF-SUPERVISED METHOD FOR OBTAINING DEPTH, ALBEDO AND SURFACE ORIENTATION ESTIMATES OF A SPACE ILLUMINATED BY A LIGHT SOURCE

EP 4736118 A1

Abstract

The present invention discloses a method for obtaining data of a space (1) illuminated by a light source (10) comprised by a joint camera-light source system (2), in which a neural network (3) is trained using the illumination decline profile of every pixel in a training set of 2D image data. In this way, the neural network (3) can estimate the depth (5), albedo (6) and surface orientation (7) of each pixel of 2D input image data (8), allowing for a subsequent 3D reconstruction (9) of the space (1) depicted in said 2D input image data, including its structures, shapes and colors. The use of the method of the invention may be particularly relevant for processing endoscopic images, among other applications. A system comprising means to carry out the method of the invention and a computer program comprising instructions to carry out the method of the invention are also disclosed.
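The illumination-decline profile mentioned in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical rendering of a single pixel under an inverse-square falloff model with a Lambertian surface and a light source co-located with the camera; the function name and the `light_gain` calibration constant are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def render_intensity(depth, albedo, normal, ray_dir, light_gain=1.0):
    """Render one pixel intensity under an inverse-square illumination-decline
    model for a light source co-located with the camera (illustrative sketch).

    depth      : distance from the camera/light to the surface point
    albedo     : surface reflectance in [0, 1]
    normal     : unit surface normal at the point (3-vector)
    ray_dir    : unit direction from the camera towards the point (3-vector)
    light_gain : photometric calibration constant of the light source
    """
    # Lambertian shading: light arrives at the surface along -ray_dir
    cos_theta = max(0.0, float(np.dot(normal, -ray_dir)))
    # Illumination decline: irradiance falls off with the square of distance
    return light_gain * albedo * cos_theta / depth**2
```

Under this model, doubling the depth quarters the observed intensity, which is the per-pixel cue a network can exploit to recover depth, albedo and surface orientation jointly.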

Inventors

  • RODRÍGUEZ PUIGVERT, Javier
  • MARTÍNEZ BATLLE, Víctor
  • TARDÓS SOLANO, Juan Domingo
  • FUA, Pascal
  • MARTÍNEZ MONTIEL, José María
  • CIVERA SANCHO, Javier
  • MARTÍNEZ CANTIN, Rubén

Assignees

  • Universidad De Zaragoza

Dates

Publication Date
2026-05-06
Application Date
2024-06-18

Claims (15)

  1. Computer-implemented method for obtaining data of a space (1) illuminated by a light source (10), said light source (10) being comprised by a joint camera-light source system (2), and the method comprising performing the following steps: a) providing a set of training joint camera-light source system 2D image data (4) of at least one space (1) comprising illumination information; b) delivering first training 2D image data (4) to a neural network (3); and characterized in that the method further comprises performing the following steps: c) obtaining, for each pixel of the first training 2D image data (4), a depth estimate (5), an albedo estimate (6) and a surface orientation estimate (7) corresponding to each pixel of the first training 2D image data (4); d) training the neural network (3) by using the depth (5), albedo (6) and surface orientation (7) estimates obtained in the previous step and by applying the illumination decline principle; e) delivering first input 2D image data (8) of a space (1) illuminated by a light source (10) comprised by a joint camera-light source system (2) to the trained neural network (3); f) obtaining, using the trained neural network (3), a final depth estimate (5'), a final albedo estimate (6') and a final surface orientation estimate (7') for each pixel of the first input 2D image data (8).
  2. Method according to the preceding claim, wherein, after step f), the following step is performed: g) obtaining a 2D or 3D reconstruction (9) of the space (1) depicted in the first input 2D image data (8) by using the depth, albedo and surface orientation estimates obtained in step f).
  3. Method according to any of the preceding claims, wherein the set of training 2D image data (4) comprises one or more, monocular or stereo, related or unrelated, consecutive or non-consecutive, real or simulated images.
  4. Method according to any of the preceding claims, wherein the neural network (3) comprises parameters with an initial value and wherein the training of the neural network (3) in step d) is carried out by: - obtaining, from the depth (5), albedo (6), and surface orientation (7) estimates obtained in step c), first synthesized 2D image data (13) by using a rendering system (12) based on illumination decline; - updating the parameter values of the neural network by minimizing a loss function (14) that comprises the differences between the first training 2D image data (4) and the first synthesized 2D image data (13); and - repeating steps b)-d) iteratively for different training 2D image data (4) until the neural network is trained.
  5. Method according to any of the preceding claims, wherein the training of the neural network (3) in step d) is further based on geometric and/or photometric calibration parameters (18) of a joint camera-light source system (2).
  6. Method according to any of the preceding claims, wherein steps e), f) and/or g) are carried out in real time.
  7. Method according to any of the preceding claims, further comprising performing, after step e), the following step: e1) re-training the neural network (3) using as first training 2D image data (4) the input 2D image data (8) delivered in step e).
  8. Method according to any of the preceding claims, wherein the training of the neural network (3) in steps d) and/or e1) comprises a self-supervised training.
  9. Method according to any of the preceding claims, wherein steps e)-f) are repeated for each of the frames comprised by a monocular or stereo video sequence, using said frames as 2D input image data, and wherein the depth (5), albedo (6) and surface orientation (7) estimates corresponding to each of said image data are fused to obtain: - the joint camera-light source system (2) position and orientation in each image data; and/or - a 3D reconstruction (9) of an extended portion of the space (1) appearing in said image data.
  10. Method according to any of the preceding claims, further comprising, after step g), displaying in a display at least one of the following: - the final depth estimate (5'); - the final albedo estimate (6'); - the final surface orientation estimate (7'); and - the 2D or 3D reconstruction (9) of the space obtained in step g).
  11. Method according to any of the preceding claims, wherein, after step f), the resulting depth (5), albedo (6) and surface orientation (7) estimates are processed by using digital image processing means such that: - different structures comprised by the space (1) are identified; - measurements of lengths, areas, volumes and/or albedos of structures appearing in the image data are obtained; and/or - the percentage of observed structures and/or the percentage of space occluded by dirt, liquids or other obstacles is calculated.
  12. Method according to any of the preceding claims, wherein the space (1) is one of the following: - a hollow organ; - a cavity of the body; - a region of the seabed and/or other underwater space; - a cavity comprised by a rigid structure; - a pipe; - a region comprised by a sewage system; and - an underground environment.
  13. System comprising a joint camera-light source system (2) adapted to acquire 2D image data, the joint camera-light source system (2) being connected to computing means, and characterized in that the computing means comprise hardware and/or software means adapted to perform a method according to any of claims 1-11.
  14. System according to the preceding claim, wherein: - the joint camera-light source system (2) is comprised by an endoscope, a capsule endoscope, a borescope, an augmented/virtual reality device, a mobile device, a wearable device, a robotic device and/or a vehicle; - the system optionally comprises additional sensors adapted to obtain sensor data of a space (1) and/or sensor data of the joint camera-light source system (2); - the computing means are adapted to provide combined data of said space (1) by combining the sensor data, the final depth (5'), albedo (6') and surface orientation (7') estimates, and, optionally, the joint camera-light source system (2) position and orientation.
  15. Computer program comprising instructions which, when the program is executed by computing means, cause the computing means to carry out a method according to any of claims 1-12.
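Claims 1 and 4 together describe a self-supervised loop: the network's per-pixel depth, albedo and surface-orientation estimates are passed through a rendering system (12) based on illumination decline, and the loss function (14) compares the resulting synthesized image (13) with the training image (4). The following is a minimal NumPy sketch of that rendering and loss, assuming a Lambertian model and a light source co-located with the camera; all function names and the `gain` constant are illustrative, not from the patent.

```python
import numpy as np

def synthesize_image(depths, albedos, normals, ray_dirs, gain=1.0):
    """Render synthesized per-pixel intensities from depth, albedo and
    surface-orientation estimates using inverse-square illumination decline.

    depths, albedos : arrays of shape (N,)
    normals, ray_dirs : arrays of shape (N, 3), unit vectors per pixel
    """
    # Row-wise Lambertian term: dot(normal, direction towards the light)
    cos = np.clip(np.einsum('ij,ij->i', normals, -ray_dirs), 0.0, None)
    return gain * albedos * cos / depths**2

def photometric_loss(real, depths, albedos, normals, ray_dirs, gain=1.0):
    """Self-supervised loss: mean squared difference between the training
    image and the image re-rendered from the network's estimates."""
    synth = synthesize_image(depths, albedos, normals, ray_dirs, gain)
    return float(np.mean((real - synth) ** 2))
```

In a full training loop, the network parameters would be updated by minimizing this loss over many training images, iterating steps b)-d) of claim 1 as claim 4 describes; the sketch only shows the rendering and comparison, not the gradient-based update.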

Description

DESCRIPTION

SELF-SUPERVISED METHOD FOR OBTAINING DEPTH, ALBEDO AND SURFACE ORIENTATION ESTIMATES OF A SPACE ILLUMINATED BY A LIGHT SOURCE

FIELD OF THE INVENTION

The present invention belongs to the field of image processing and 2D-3D reconstruction employing artificial intelligence methods such as neural networks. More specifically, the invention describes a method for obtaining data of a space depicted in an image, such as depth, albedo, and surface orientation, by means of a neural network trained with a self-supervision technique based on illumination decline. This allows image data to be used both for training and for inference, leading to an accurate, robust, and domain-shift-free reconstruction of the mentioned space.

BACKGROUND OF THE INVENTION

Comprehension of the structures, colors, and characteristics of a space, cavity or volume is relevant for many industrial applications. Nevertheless, in many situations the only available information about these spaces comes from image data, i.e., from a two-dimensional (2D) representation of said spaces, while three-dimensional (3D) information is often needed. These situations arise in a plethora of procedures, typically when the space of interest is narrow, small, or otherwise inaccessible. In these cases, probes incorporating a camera and a light source are employed to survey the space and, by acquiring 2D images, the characteristics of the space are reconstructed. In many of these cases, the size of the probe is especially relevant, since it may need to be as small as possible in order to fit in the space of interest and/or not to alter or interact with the structures present in said space. An example of such a situation is medical imaging procedures involving endoscopic techniques such as gastroscopies, colonoscopies, or bronchoscopies.
These procedures require the employed imaging instruments to be small-sized in order to remain as non-invasive as possible for the patient. For this reason, endoscopes typically comprise a single camera and several illumination points within their housing, but rarely depth or stereo cameras, which would increase the size of the endoscope considerably. However, 3D comprehension of the explored structures is very relevant in such procedures since, for instance, obtaining accurate estimates of the size and shape of tumors or other lesions may lead to faster and more precise diagnoses and treatments. There is therefore a great need for accurate and robust 3D reconstruction techniques suitable for endoscopic images, among other applications.

In this regard, Artificial Intelligence (AI) algorithms and techniques have been proposed to obtain 3D reconstruction methods that provide good results for these applications, both with single-view and with multi-view depth estimation. Different works have demonstrated the effectiveness of deep neural networks for supervised pixel-wise depth regression in generic, single-view, natural images. Complementary research efforts have also contributed in many directions, such as, to name a few, network architectures that evolved to fully convolutional designs and, more recently, to transformers. In some works, the continuous depth space is discretized into bins and the problem is formulated as an ordinal regression. Other advances include interpretability, uncertainty quantification, and modelling of camera intrinsic characteristics. All these approaches rely on supervised training of neural networks and therefore require depth ground-truth data, which can be difficult and expensive to acquire.
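One of the supervised formulations mentioned above discretizes the continuous depth range into bins and treats depth prediction as ordinal regression over bin indices. The sketch below illustrates such a discretization with log-spaced bins; the depth range and bin count are arbitrary illustrative assumptions, not values from the patent or any cited work.

```python
import numpy as np

def depth_to_bins(depth, d_min=0.01, d_max=1.0, n_bins=64):
    """Discretize a continuous depth value into one of n_bins log-spaced
    bins, as in ordinal-regression depth formulations (illustrative)."""
    edges = np.geomspace(d_min, d_max, n_bins + 1)  # n_bins + 1 bin edges
    # searchsorted finds the first edge >= depth; subtract 1 for the bin index
    return int(np.clip(np.searchsorted(edges, depth) - 1, 0, n_bins - 1))

def bins_to_depth(b, d_min=0.01, d_max=1.0, n_bins=64):
    """Recover a representative depth as the geometric centre of bin b."""
    edges = np.geomspace(d_min, d_max, n_bins + 1)
    return float(np.sqrt(edges[b] * edges[b + 1]))
```

A network trained this way outputs a distribution (or ordinal scores) over the 64 bin indices per pixel; the round trip through `depth_to_bins` and `bins_to_depth` bounds the quantization error by the bin width.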
Some proposals use computerized tomography (CT) renderings for depth supervision in bronchoscopies, but CT scans in particular, and ground-truth depth data in general, are very rare in endoscopy and other applications. Self-supervised methods seek to overcome this limitation and reduce the need for ground-truth data, often by exploiting multi-view photometric consistency. This also enables depth refinement at test time, but, unfortunately, this kind of supervision can be noisy due to inaccuracies in camera motion estimation, perspective distortions, occlusions or non-Lambertian effects, among others. As a result, state-of-the-art self-supervised methods typically suffer from significantly larger inaccuracies than supervised ones. Furthermore, many works have explored multi-view integration combined with tracking and simultaneous localization and mapping (SLAM) pipelines, while others propose video-based training schemes. Unfortunately, multi-view self-supervision is even more challenging in endoscopy than in other areas due to the presence of deformations and weak texture. Another widely explored issue regarding endoscopic 3D reconstruction is domain shift, i.e., transferring simulated results into predictions for real scenarios. In this regard, different works propose conditional GANs for depth recovery while