WO-2026093367-A1 - METHOD FOR DETECTING AT LEAST ONE OBJECT, AND CORRESPONDING ELECTRONIC DEVICE, SYSTEM, COMPUTER PROGRAM PRODUCT AND MEDIUM

Abstract

The invention relates to a method for detecting an object, implemented in an electronic device, the method comprising:
• Obtaining multimodal images representing the same real and/or virtual spatio-temporal scene;
• Obtaining a global image from the multimodal images, by applying an attention model;
• Detecting an object present in the scene by means of inference by an artificial intelligence model using the global image and/or a tensor representing the global image as input.
The invention also relates to the corresponding electronic device, computer program product and medium.

Inventors

DUPAS, Yoann
HOTEL, Olivier
LEFEBVRE, Grégoire

Assignees

ORANGE

Dates

Publication Date: 20260507
Application Date: 20251029
Priority Date: 20241031

Claims (13)

1. Object detection method comprising:
   • Obtaining multimodal images representing the same real and/or virtual spatio-temporal scene;
   • Obtaining a global image from said multimodal images, by applying at least one attention model;
   • Detection of at least one object present in said scene by inference of an artificial intelligence model taking said global image as input.
2. Object detection method according to claim 1 wherein said artificial intelligence model is adapted to object detection in a single-mode 2D image.
3. Object detection method according to claim 1 or 2 wherein said global image is obtained by applying at least one first attention model to an image resulting from a fusion of said multimodal images.
4. Object detection method according to any one of claims 1 to 3 wherein at least a first of the fused multimodal images is pre-filtered by application of at least a second attention model.
5. Object detection method according to any one of claims 1 to 4 wherein at least a second of the fused multimodal images is pre-filtered by application of said second attention model.
6. Object detection method according to any one of claims 1 to 5 wherein all of said fused multimodal images are pre-filtered by application of said second attention model.
7. Object detection method according to any one of claims 1 to 4 wherein at least a second of the fused multimodal images is pre-filtered by application of at least one third attention model, different from said second attention model.
8. Object detection method according to any one of claims 1 to 7 wherein the method comprises obtaining a global image from each component of said multimodal images.
9. Object detection method according to any one of claims 1 to 8 wherein the method comprises a joint rendering of at least one geometric shape encompassing said at least one detected object in said global image and of a label representative of the classification of said detected object by said artificial intelligence model.
10. Object detection method according to any one of claims 1 to 9 wherein the method includes joint learning of said attention models and said artificial intelligence model.
11. Electronic device comprising at least one processor, said at least one processor being configured for object detection comprising:
   • Obtaining multimodal images representing the same real and/or virtual spatio-temporal scene;
   • Obtaining a global image from said multimodal images, by applying at least one attention model;
   • Detection of at least one object present in said scene by inference of an artificial intelligence model taking said global image as input.
12. A computer program product comprising instructions for the implementation, when the program is executed by a processor of an electronic device, of an object detection method implemented in an electronic device, said method comprising:
   • Obtaining multimodal images representing the same real and/or virtual spatio-temporal scene;
   • Obtaining a global image from said multimodal images, by applying at least one attention model;
   • Detection of at least one object present in said scene by inference of an artificial intelligence model taking said global image as input.
13. A processor-readable recording medium on which is recorded a computer program comprising instructions for the implementation, when the program is executed by a processor of an electronic device, of an object detection method, said method comprising:
   • Obtaining multimodal images representing the same real and/or virtual spatio-temporal scene;
   • Obtaining a global image from said multimodal images, by applying at least one attention model;
   • Detection of at least one object present in said scene by inference of an artificial intelligence model taking said global image as input.
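
Purely by way of non-limiting illustration, the three steps recited in claim 1 (obtaining multimodal images, obtaining a global image by applying an attention model, detecting an object by inference) can be sketched in Python with NumPy. Everything named below is an assumption made for the example and is not taken from the application: the fusion by channel concatenation, the toy channel-attention gate `channel_attention` (softmax weights computed from global average pooling), the matrices `w` and `proj` standing in for learned parameters, and `toy_detector`, a simple thresholding routine used in place of a trained detection model.

```python
import numpy as np


def fuse(images):
    """Stack co-registered modalities along the channel axis (assumed fusion)."""
    return np.concatenate(images, axis=-1)


def channel_attention(fused, w):
    """Toy attention gate: one softmax weight per channel.

    fused: (H, W, C) array; w: (C, C) matrix standing in for learned parameters.
    """
    pooled = fused.mean(axis=(0, 1))       # global average pooling -> (C,)
    scores = w @ pooled
    scores = scores - scores.max()         # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return fused * weights, weights


def to_global_image(attended, proj):
    """Project the C attended channels onto 3 channels, shape (H, W, 3)."""
    return attended @ proj


def toy_detector(image):
    """Hypothetical stand-in for the detection model: box around the brightest region."""
    intensity = image.mean(axis=-1)
    mask = intensity > 0.5 * intensity.max()
    if not mask.any():
        return []
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    # Box as (x_min, y_min, x_max, y_max), with a placeholder class label.
    box = (int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1]))
    return [{"box": box, "label": "object"}]


# Synthetic night scene: the colour image is black, the infrared image shows a warm blob.
rgb = np.zeros((8, 8, 3))
infrared = np.zeros((8, 8, 1))
infrared[2:5, 3:6, 0] = 1.0

fused = fuse([rgb, infrared])                                    # (8, 8, 4)
attended, weights = channel_attention(fused, w=np.eye(4))
global_image = to_global_image(attended, proj=np.ones((4, 3)))   # (8, 8, 3)
detections = toy_detector(global_image)
print(global_image.shape, detections)
```

In this sketch the global image has three channels so that it could be passed to a detector designed for a single-mode 2D image, in the spirit of claim 2; a real implementation would replace `toy_detector` with a trained model and learn `w` and `proj`, possibly jointly with that model as contemplated in claim 10.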

Description

DESCRIPTION

Title of the invention: Method for detecting at least one object, and corresponding electronic device, system, computer program product and medium.

1. Technical field

This application relates to the field of automatic object detection (sometimes referred to as "computer vision") in a real or virtual scene, for example detection techniques implementing deep learning neural network architectures. This application relates in particular to a method for detecting at least one object, as well as a corresponding electronic device, system, computer program product and information recording medium.

2. State of the art

For several decades, automated systems have become increasingly common in various fields such as industry, services, transport, agriculture and, more recently, the private and public spheres (smart homes, smart offices, smart cities). These automated systems often rely on automatic detection of objects present in their environment. This automatic detection is often performed using computer vision techniques, which implement automatic image analysis of an environment (sometimes called automatic object recognition) via detector or classifier neural networks.

When applied to images, classifier or detector neural networks assign (during an inference phase) a probability that an image (or portion of an image) represents a certain class of objects from among a plurality of candidate object classes. Detector neural networks also specify the location, within the image, of the image portion (often called the bounding box) that contains an object of this class. The inference phase is preceded by a learning phase during which the neural network learns, from examples, to distinguish the characteristics of the image portions corresponding to these candidate object classes.
For example, in the case of supervised learning of a "detector" type neural network, the examples provided are pairs, each associating an image with a list of bounding boxes in that image, each bounding box containing an object (called an object of interest) to which an object class is already assigned. The reliability of the results obtained depends greatly on the training performed, particularly on the number and diversity of the examples provided. Good training often requires providing the neural network with thousands, or even millions, of examples, which implies a significant amount of time for collecting and annotating these examples (often manually), as well as significant memory, processing capacity and processing time for the neural network being trained. Training a neural network is therefore a very time- and resource-intensive task.

Moreover, the reliability of detection systems based on such neural network architectures can vary greatly depending on the context in which the images are captured. Thus, it can be difficult, or even impossible, to automatically and accurately detect objects in color images taken by a camera (such as a webcam) in the dark, whereas such objects will be more easily detectable in an image acquired under the same conditions of darkness but with an infrared camera.

Solutions based on the use of multimodal images have been developed. These solutions leverage the advantages of capturing different modalities for better adaptation to various capture contexts. However, there remains a need for an object detection solution that further improves the reliability of automatic object detection systems. The purpose of this application is to remedy at least some of the drawbacks of the state of the art.

3. Description of the invention

This application aims to improve the situation using an object detection method comprising:
• Obtaining multimodal images representing the same real and/or virtual spatio-temporal scene;
• Obtaining a global image from said multimodal images, by applying at least one attention model;
• Detection of at least one object present in said scene by inference of an artificial intelligence model taking as input said global image and/or a tensor representing said global image.

As described in more detail later, multimodal images are defined as images acquired from diverse imaging sources (i.e., different types of sensors) and representing the same portion of a scene. The diversity of imaging sources allows multimodal images to complement each other in their representation (description) of the common portion of the scene.

The object detection method can be implemented at least partially in an electronic device. Depending on the implementation of the solution described in this application, the detection may concern various objects, connected or not, such as vehicles, people, animals, robots, and/or machines.

For the sake of brevity, the term "image" will be used in the remainder of the application to refer either to an image itself, or to a tensor of the same dimensions as that image (in terms of pixels) representing it.

In some embodiments, said artificial intelligence mode