EP-4742174-A1 - METHODS AND DEVICES FOR DETECTION OF ARBITRARY THREE-DIMENSIONAL OBJECTS
Abstract
A method is disclosed. The method comprises obtaining first data from a first 2D image of a scene and second data from a second 2D image of the scene. The method comprises determining a first 2D object mask of an object that appears in the first and second 2D images. The method comprises determining a second 2D object mask of the object. The method comprises generating, for the first 2D object mask, a first depth map comprising first information indicative of a distance from the object to a camera. The method comprises determining a first scale parameter and a first shift parameter based on projecting the first information onto the second 2D image. The method comprises generating a second depth map based on the first scale parameter and the first shift parameter, the second depth map comprising second information indicative of the distance from the object to the camera.
Inventors
- MOLINER, OLIVIER
Assignees
- Sony Group Corporation
Dates
- Publication Date: 2026-05-13
- Application Date: 2025-10-15
Claims (15)
- A method, performed by an electronic device, for detection of arbitrary three-dimensional, 3D, objects, the method comprising:
  - obtaining (S102) first data from a first two-dimensional, 2D, image of a scene and second data from a second 2D image of the scene;
  - determining (S104), based on the first data, a first 2D object mask of an object that appears in the first 2D image and the second 2D image;
  - determining (S106), based on the second data, a second 2D object mask of the object;
  - generating (S108), for the first 2D object mask, a first depth map comprising first information indicative of a distance from the object to a camera;
  - determining (S110) a first scale parameter and a first shift parameter based on projecting the first information onto the second 2D image; and
  - generating (S114), for the first 2D object mask, a second depth map based on the first scale parameter and the first shift parameter, the second depth map comprising second information indicative of the distance from the object to the camera.
- The method according to claim 1, wherein generating the first depth map comprises applying a monocular depth estimation to the first 2D image.
- The method according to any of the previous claims, the method comprising extracting a feature map comprising at least one point cloud based on the first 2D object mask and/or the first information, wherein generating the second depth map comprises sampling first additional data from the feature map.
- The method according to any of the previous claims, wherein generating the second depth map comprises selecting a first set of points within the first 2D object mask, wherein the first set of points comprises first coordinates within the first information.
- The method according to claim 4, wherein the first set of points comprises random points within the first 2D object mask; and/or wherein the first set of points comprises at least 200 points.
- The method according to any of claims 4-5, wherein projecting the first information onto the second 2D image comprises reprojecting the first set of points onto the second 2D image based on the first coordinates and camera parameters, wherein the second depth map is generated based on the reprojecting.
- The method according to claim 6, wherein the reprojecting yields second coordinates that are based on the first coordinates and the camera parameters.
- The method according to any of claims 6-7, wherein the camera parameters comprise intrinsics and poses for the camera.
- The method according to any of claims 7-8, the method comprising:
  - filtering a subset of points from the first set of points reprojected on the second 2D image, wherein a point of the subset has a misalignment that is greater than a threshold value; and
  - sampling second additional data from the feature map based on the second coordinates and the filtering;
  wherein the first scale parameter and/or the first shift parameter are based on a relationship between the first additional data and the second additional data.
- The method according to any of the previous claims, the method comprising:
  - outputting a first 3D representation of the object based on the first 2D object mask and the second depth map; and/or
  - outputting a second 3D representation of the object, the second 3D representation being based on the second 2D object mask and a third depth map that comprises third information indicative of another distance between the object and another camera.
- The method according to claim 10, the method comprising:
  - determining a first bounding box for the object based on the first 3D representation; and
  - determining a second bounding box for the object based on the first bounding box and the second 3D representation.
- The method according to any of the previous claims, the method comprising obtaining at least one text string representative of the object, wherein the first 2D object mask and/or the second 2D object mask are determined based on the at least one text string; wherein the at least one text string comprises an arbitrary value exclusive of training information for any neural network of the electronic device.
- The method according to any of the previous claims, wherein obtaining the first data from the first two-dimensional, 2D, image of the scene and the second data from the second 2D image of the scene comprises:
  - capturing the first 2D image with an image sensor of the electronic device, wherein the electronic device is the camera and the first data is obtained from the image sensor; and/or
  - receiving signaling that comprises the first data and/or the second data, wherein the electronic device comprises a processing device external to the camera.
- The method according to any of the previous claims, wherein the first 2D image and the second 2D image each comprise red, green and blue, RGB, image data exclusive of depth information for any pixel of any of the first 2D image and the second 2D image.
- An electronic device comprising memory circuitry and processor circuitry, wherein the electronic device is configured to:
  - obtain first data from a first two-dimensional, 2D, image of a scene and second data from a second 2D image of the scene;
  - determine, based on the first data, a first 2D object mask of an object that appears in the first 2D image and the second 2D image;
  - determine, based on the second data, a second 2D object mask of the object;
  - generate, for the first 2D object mask, a first depth map comprising first information indicative of a distance from the object to a camera;
  - determine a first scale parameter and a first shift parameter based on projecting the first information onto the second 2D image; and
  - generate, for the first 2D object mask, a second depth map based on the first scale parameter and the first shift parameter, the second depth map comprising second information indicative of the distance from the object to the camera.
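For illustration only, and not as part of the claims: the point sampling, reprojection and misalignment filtering recited in claims 4-9 above could be sketched as follows. The sketch assumes a pinhole camera model, boolean NumPy masks, per-pixel depth maps, and known intrinsics K and relative pose (R, t) between the two cameras; every function and parameter name is illustrative and does not appear in the patent.

```python
# Illustrative sketch (an assumption-laden reading, not the patented method)
# of the point sampling, reprojection and misalignment filtering of claims 4-9.
import numpy as np

def sample_mask_points(mask, n=200, rng=None):
    """Select n random pixel coordinates within a boolean 2D object mask
    (claims 4-5 mention random points and at least 200 points)."""
    if rng is None:
        rng = np.random.default_rng()
    ys, xs = np.nonzero(mask)
    idx = rng.choice(len(xs), size=min(n, len(xs)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1).astype(float)  # (n, 2) "first coordinates"

def reproject_points(pts, depth_map, K, R, t):
    """Lift sampled pixels with their depth values into 3D and project them
    onto the second 2D image (claims 6-8), using the camera parameters:
    intrinsics K and the relative pose (R, t) from camera 1 to camera 2."""
    d = depth_map[pts[:, 1].astype(int), pts[:, 0].astype(int)]
    homo = np.hstack([pts, np.ones((len(pts), 1))])            # homogeneous pixels
    xyz = (np.linalg.inv(K) @ homo.T).T * d[:, None]           # 3D points, camera-1 frame
    xyz2 = (R @ xyz.T).T + t                                   # rigid transform to camera 2
    proj = (K @ xyz2.T).T
    return proj[:, :2] / proj[:, 2:3], xyz2[:, 2]              # "second coordinates", depths

def filter_misaligned(pts2, mask2, max_dist=5.0):
    """Keep points whose reprojection misalignment stays within a threshold
    (claim 9). As one possible choice, misalignment is measured here as the
    pixel distance to the second 2D object mask."""
    ys, xs = np.nonzero(mask2)
    mask_px = np.stack([xs, ys], axis=1).astype(float)
    dists = np.linalg.norm(pts2[:, None, :] - mask_px[None, :, :], axis=2).min(axis=1)
    return dists <= max_dist                                   # boolean keep-flags
```

The misalignment measure and the threshold are design choices; the claims only require that points whose misalignment exceeds a threshold value are filtered out before further data is sampled.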
Description
The present disclosure relates to methods for detection of arbitrary three-dimensional, 3D, objects and to related devices.

BACKGROUND

Three-dimensional, 3D, object detectors typically employ closed-set methods, which rely on models trained to detect a limited set of predefined object categories. Extending these models to new domains requires collecting and annotating new data and retraining the models, which is both costly and time-consuming. Open-vocabulary 3D object detection is an emerging field in computer vision that aims to identify and localize arbitrary, previously unseen objects in 3D by leveraging semantic information from large-scale language models. However, existing open-vocabulary 3D object detection schemes often lack scalability and generalizability.

SUMMARY

As described herein, some open-vocabulary 3D detection models may operate on point-cloud data and/or more complex data known as red, green, blue, depth, RGB-D, data. Handling such complex data may involve prohibitively complex hardware and/or unreasonably large amounts of image data. Some open-vocabulary 3D detection models also rely on models trained on certain datasets and may not perform well when applied to other data unless the models are retrained. Accordingly, there is a need for devices and methods for detection of arbitrary three-dimensional, 3D, objects, which mitigate, alleviate or address the existing shortcomings and provide for obtaining and refining two-dimensional, 2D, image data to support reliable 3D object representations without undue hardware complexity, without unreasonably large data sets, and without the need to retrain models.

A method is disclosed, performed by an electronic device, for detection of arbitrary 3D objects. The method comprises obtaining first data from a first 2D image of a scene and optionally second data from a second 2D image of the scene. The method comprises determining, based on the first data, a first 2D object mask of an object that appears in the first 2D image and the second 2D image. Optionally, the method comprises determining, based on the second data, a second 2D object mask of the object. The method comprises generating, for the first 2D object mask, a first depth map comprising first information indicative of a distance from the object to a camera. The method comprises determining a first scale parameter and a first shift parameter based on projecting the first information onto the second 2D image. The method comprises generating, for the first 2D object mask, a second depth map based on the first scale parameter and the first shift parameter. The second depth map comprises second information indicative of the distance from the object to the camera.
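As a continuation of the sketch following the claims, and as one plausible reading of this summary, the scale and shift parameters could be fitted by least squares, assuming the monocular depth estimate is defined only up to an affine (scale and shift) ambiguity that the cross-view projection resolves. The helper names and the choice of reference depths are assumptions, not taken from the patent.

```python
# Minimal sketch, under the assumptions stated above, of determining the first
# scale and shift parameters and of generating the second depth map.
import numpy as np

def fit_scale_shift(d_mono, d_ref):
    """Least-squares scale s and shift b such that s * d_mono + b ~= d_ref,
    where d_ref holds depths for the sampled points that are consistent with
    the second 2D image (for example, obtained by triangulating the sampled
    points against their locations in the second view)."""
    A = np.stack([d_mono, np.ones_like(d_mono)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, d_ref, rcond=None)
    return s, b

def second_depth_map(first_depth, mask, s, b):
    """Apply the fitted parameters to the first depth map within the first
    2D object mask, yielding the second depth map."""
    out = np.full_like(first_depth, np.nan, dtype=float)
    out[mask] = s * first_depth[mask] + b
    return out
```

A hypothetical end-to-end use would sample points in the first 2D object mask, reproject them onto the second 2D image, filter misaligned points, fit (s, b) against view-consistent reference depths, and apply second_depth_map to the monocular depth map; the resulting masked depth can then be lifted to a 3D representation or bounding box as in claims 10-11.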
Further, an electronic device is provided. The electronic device comprises memory circuitry and processor circuitry. The electronic device is configured to obtain first data from a first 2D image of a scene and optionally second data from a second 2D image of the scene. The electronic device is configured to determine, based on the first data, a first 2D object mask of an object that appears in the first 2D image and the second 2D image. The electronic device may optionally be configured to determine, based on the second data, a second 2D object mask of the object. The electronic device is configured to generate, for the first 2D object mask, a first depth map comprising first information indicative of a distance from the object to a camera. The electronic device is configured to determine a first scale parameter and a first shift parameter based on projecting the first information onto the second 2D image. The electronic device is configured to generate, for the first 2D object mask, a second depth map based on the first scale parameter and the first shift parameter, the second depth map comprising second information indicative of the distance from the object to the camera.

It is an advantage of the disclosed method and electronic device that arbitrary 3D objects may be detected using an open-vocabulary scheme with 2D images as input, allowing for reduced hardware complexity. Further, the disclosed method and electronic device may advantageously enable arbitrary 3D objects, such as objects different from those used to train a neural network and/or model, to be detected, such as identified, without the need to retrain such neural networks and/or models. Further, the disclosed method and electronic device may advantageously facilitate interaction between virtual and physical worlds and/or may provide for analytics and/or evaluation of environments (real and/or built) and of the people interacting with those environments. It is a further advantage of the present disclosure that it may allow for occupancy prediction, utilization analytics, asset tracking, and/or safety monitoring of indoor and/or outdoor environments, which may support planning and operational objectives for such environments.

BRIEF DESCRIPTION OF THE DRAWINGS