US-12620234-B2 - Obtaining information about objects in camera images
Abstract
In a method (30) for obtaining information (18a, b) about a potential object (8a, b) in a camera image (10) of a scene (2), A) the camera image (10) is provided as a single 2D camera image, B) potential objects (8a, b) are detected in the camera image (10), and C) for at least one of the detected objects (8a, b), a frame projection (14a, b) into the camera image (10) of a 3D frame (16) directly surrounding the object (8a, b) is determined as at least part of the information (18a, b), wherein steps B) and C) are each performed with the aid of a neural network (32).
Inventors
- Andreas Wellhausen
- Emil Schreiber
- Florian Richter
- Gregor Blott
- Matthias Kirschner
- Moritz Michael Knorr
- Nils Zerrer
- Sven Wagner
Assignees
- ROBERT BOSCH GMBH
Dates
- Publication Date: 2026-05-05
- Application Date: 2024-08-15
- Priority Date: 2023-08-15
Claims (8)
- 1. A method (30) for obtaining information (18a, b) about an object (8a, b) in a camera image (10) of a scene (2), in which A) the camera image (10) is provided as a single 2D camera image, B) potential objects (8a, b) are detected in the camera image (10), C) for at least one of the detected objects (8a, b), a frame projection (14a, b) of a 3D frame (16) directly surrounding the object (8a, b) is determined in the camera image (10) as at least part of the information (18a, b), wherein steps B) and C) are each carried out via a neural network (32), and at least one anchor point projection (22a, b) of an anchor point correlated with the object (8a, b) is determined in the camera image (10) for at least one of the objects (8a, b) as part of the information (18a, b).
- 2. The method (30) according to claim 1, wherein corner points of the 3D frame are determined without using a heat map associated with the camera image (10), and/or the method (30) is performed without using calibration data of the camera generating the camera image and/or without assumptions about the scene depicted in the camera image.
- 3. The method (30) according to claim 1, wherein steps B) and C) are carried out together via a common neural network (32).
- 4. The method (30) according to claim 1, wherein at least one corner point projection (20a-h) of the frame (16) into the camera image (10) is determined for at least one of the objects (8a, b) as part of the information (18a, b).
- 5. The method (30) according to claim 1, wherein the anchor point projection (22a, b) is determined based on the frame projection (14a, b) previously determined for the object (8a, b).
- 6. The method (30) according to claim 1, wherein in step B), at least one vehicle (9a, b) is detected as a potential object (8a, b) in the camera image (10).
- 7. The method (30) according to claim 1, wherein at least one piece of orientation information (24a-c) of the frame (16) and/or the object (8a, b) is determined for at least one of the objects (8a, b) as part of the information (18a, b).
- 8. A method (28) for monitoring vehicles (9a, b) in a scene (2), in which the method (30) according to claim 1 is performed using the camera image (10) of the scene, wherein at least one vehicle (9a, b) is detected as a potential object (8a, b), and at least one of the following measures (34) is performed on the basis of the determined information (18a, b) relating to the objects (8a, b): a count of vehicles (9a, b) on a road (6) in the scene (2), a control of a traffic light correlated with the scene (2), monitoring of a crossing of an existing or imaginary line (26) in the scene (2) by an object (8a, b), an assignment of objects (8a, b) to a lane (36) in the scene (2), detection of an incorrect position and/or movement of an object (8a, b) in the scene (2), a determination of a relation between at least two detected objects (8a, b), tracking of objects (8a, b) in the scene (2), a conversion of 2D image coordinates with respect to at least one object (8a, b) into 3D world coordinates, or a speed estimation of at least one of the objects (8a, b) in the scene (2).
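As an illustration of one of the measures (34) listed in claim 8, the following minimal sketch checks whether a tracked object has crossed an existing or imaginary line (26). It assumes the single-image method is applied once per frame of a monitoring stream and that the object's anchor point projection (22a, b) serves as the tracked 2D point; the function names and the sign-change test are assumptions of this sketch, not part of the claims.

```python
import numpy as np

def crossed_line(anchor_prev, anchor_curr, line_a, line_b) -> bool:
    """One of the measures (34) of claim 8: monitoring of a crossing of an
    existing or imaginary line (26) by an object (8a, b). All points are
    2D image coordinates (u, v); no 3D world coordinates are needed."""
    def side(p, a, b):
        # sign of the 2D cross product: which side of line a-b does p lie on?
        return np.sign((b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]))
    # a crossing occurred if the anchor point changed sides between frames
    return side(anchor_prev, line_a, line_b) != side(anchor_curr, line_a, line_b)

# usage: anchor point projections (22a, b) of one object in two consecutive frames
print(crossed_line((100, 200), (100, 260), line_a=(0, 230), line_b=(640, 230)))  # True
```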
Description
BACKGROUND

Systems and methods for performing object recognition and position estimation in 3D from 2D images are known from WO 2021/167586 A1.

SUMMARY

The invention relates to detecting and obtaining information about objects in 2D camera images depicting a real-world scene. In particular, the scene contains at least one road on which vehicles can potentially be present as objects or are regularly present.

The method is used to obtain information about an object, i.e. it is an information-obtaining method. In particular, it is a method for obtaining information about an object during a monitoring task, especially when monitoring road users in a traffic scenario. In such a scenario, the objects about which information is to be obtained can be road users, where road users can be static objects, such as traffic infrastructure, or dynamic objects, such as vehicles or pedestrians.

The potential object is shown or mapped in a camera image. The camera image represents one such scene. The scene is a section of the real world, for example a section of a traffic network with one or more roads on which vehicles may move as objects. “Potential” should be understood to mean that a corresponding object is not necessarily depicted in the camera image, for example because there is simply no object in the scene at the moment or because it is not (recognizably/visibly) depicted in the camera image. In particular, the camera image is available digitally in the form of data that can be processed by the method.

In step A of the method, the camera image is provided as a single 2D camera image, e.g. by generating/recording it from the scene. “Single” should be understood to mean that the method according to the invention is carried out only on this one camera image and not, for example, on an image sequence consisting of several images. The method therefore evaluates only a single two-dimensional image or its image data.

In step B, potential objects are detected in the camera image. This means that the camera image, or its image content/image information, is evaluated using standard methods and searched for potential objects. Any objects found are detected, e.g. marked, labeled, etc. For example, a camera image is searched for vehicles as objects using standard methods, and the vehicles found are marked/localized for further processing of the image.

In step C, the following procedure is carried out for at least one of the objects detected in step B, in particular for several or all of them: a frame projection is determined for the respective object as at least part of the information that is to be obtained in the method (i.e. at least as partial information of a complete “information” that may also contain other aspects/data/partial information). The frame projection is the projection into the camera image of a 3D frame (imagined in the real world) directly surrounding the object. In particular, the frame projection and the camera image can be displayed superimposed, but the display is not required for the method described; for the method, it is sufficient to determine the frame projection for the object according to step C. The frame is formed in particular by straight sections. “Directly surrounding” should be understood to mean that a surface defined by the imaginary 3D frame (e.g. a polyhedron defined by the frame) surrounds the object tightly, or as tightly as possible, for example with the smallest possible volume. The respective faces and edges of the surface spanned by the frame touch the surface of the object. In other words, a so-called “3D bounding box” is determined as the 3D frame. The 3D frame is in particular an enveloping body in the form of a rectangular cuboid.
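The patent describes the network only functionally. As a minimal sketch of how steps B and C might be realized together in one common neural network (cf. claim 3), the following PyTorch-style model predicts, per grid cell of a single 2D image, class scores and the eight corner point projections of the imagined 3D frame. The backbone, grid resolution, head layout and all names are illustrative assumptions, not the patent's architecture.

```python
import torch
import torch.nn as nn

class FrameProjectionNet(nn.Module):
    """Illustrative common network (32) for steps B and C: from a single
    2D camera image (10) it detects potential objects (8a, b) and regresses
    the eight corner point projections (20a-h) of the imagined 3D frame (16)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        # tiny convolutional backbone over the single 2D image
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # step B: per-cell class scores (detection of potential objects)
        self.detect = nn.Conv2d(128, num_classes, 1)
        # step C: per-cell regression of 8 corners x (u, v) image coordinates;
        # the output is purely 2D -- no 3D world coordinates are produced
        self.corners = nn.Conv2d(128, 8 * 2, 1)

    def forward(self, image: torch.Tensor):
        feats = self.backbone(image)        # (N, 128, H/8, W/8)
        scores = self.detect(feats)         # step B output
        corner_proj = self.corners(feats)   # step C output: frame projection (14a, b)
        return scores, corner_proj

# usage sketch: one single 2D camera image, batch of 1
net = FrameProjectionNet()
scores, corner_proj = net(torch.randn(1, 3, 256, 256))
print(scores.shape, corner_proj.shape)  # (1, 2, 32, 32) and (1, 16, 32, 32)
```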
The decisive factor here is that in step C the determination takes place purely in the two-dimensional image, and only this two-dimensional frame projection of the frame is determined; 3D considerations/evaluations/observations/aspects are not used. It is not the actual three-dimensional imaginary frame that would surround the object in 3D reality itself, i.e. in 3D space, that is calculated, but only its 2D frame projection in or into the camera image.

Steps B and C are each carried out using a neural network. In step C in particular, the neural network is fed only with input variables that are taken exclusively from the 2D camera image or obtained from its data. The output of the neural network is the 2D frame projection, or mapping, of the imaginary 3D frame in the 2D camera image. The method therefore processes only information from the 2D camera image and is not partially executed in 3D world coordinates. 3D world coordinates are only determined, if at all, after the method, e.g. in a higher-level monitoring method. The method according to the invention is therefore a pure 2D method with regard to the camera images/the scene and the objects. In other words, the neural network used for step C “only” learns what the projection of the 3D frame looks like in the 2D camera image.
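Claim 5 allows the anchor point projection (22a, b) to be determined from the previously determined frame projection (14a, b); the text above does not fix how. Below is a minimal sketch of one plausible, purely 2D reading, under the assumptions that the anchor point is the object's ground contact point and that the first four regressed corners form the bottom face; both conventions belong to this sketch, not to the patent.

```python
import numpy as np

def anchor_point_projection(corner_proj: np.ndarray) -> np.ndarray:
    """Derive an anchor point projection (22a, b) from a previously determined
    frame projection (14a, b), cf. claim 5. corner_proj is an (8, 2) array of
    2D corner point projections (20a-h); the ordering with the bottom face in
    the first four rows is an assumed convention."""
    bottom_face = corner_proj[:4]        # assumed corner ordering
    return bottom_face.mean(axis=0)      # stays entirely in 2D image coordinates
```

Because this derivation consumes only the 2D corner projections, it preserves the pure 2D character of the method: no camera calibration data and no assumptions about the scene are required, consistent with claim 2.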