
EP-4128029-B1 - METHOD AND SYSTEM OF AUGMENTING A VIDEO FOOTAGE OF A SURVEILLANCE SPACE WITH A TARGET THREE-DIMENSIONAL (3D) OBJECT FOR TRAINING AN ARTIFICIAL INTELLIGENCE (AI) MODEL

EP 4128029 B1

Inventors

  • NADLER, INGO
  • MOHR, Jan-Philipp

Dates

Publication Date
2026-05-06
Application Date
2021-03-23

Claims (14)

  1. A method of augmenting a video footage of a surveillance space with a target three-dimensional (3D) object from one or more perspectives for training an artificial intelligence (AI) model, the method comprising: - acquiring (902) the video footage from a target camera in the surveillance space; - determining (904) a ground plane and one or more screen coordinates of one or more corners of the ground plane in the video footage; - normalizing (906) the one or more screen coordinates by calculating a homography matrix from the ground plane and determining a relative position of each of one or more objects in the ground plane; - preparing (908) a model of the target 3D object to be used for training the AI model; - iteratively generating (910) a random position and a random rotation for the target 3D object in the ground plane for positioning the target 3D object in front of or behind a distractor object from among the one or more objects in the ground plane; - rendering (912) the model of the target 3D object on the ground plane and composing the rendered 3D object and the ground plane with the acquired video footage to generate a composited image, wherein upon the relative position of the target 3D object on the ground plane being behind the relative position of the distractor object, a mask of the distractor object is used to obscure the target 3D object; and - calculating (914) coordinates of a bounding box that frames the relative position of the target 3D object in the composited image and saving the composited image along with the coordinates of the bounding box to be used subsequently for training the AI model.
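The compositing step of claim 1 (positioning the target either in front of or behind a distractor, and using the distractor's mask to obscure the target when it is behind) can be sketched as follows. This is a minimal NumPy illustration, not code from the patent: the function name, the scalar depth comparison and the alpha-over blend are assumptions.

```python
import numpy as np

def composite(background, render, render_alpha, distractor_mask,
              target_depth, distractor_depth):
    """Composite a rendered target object over real footage.

    background, render: HxWx3 float images in [0, 1].
    render_alpha, distractor_mask: HxW float masks in [0, 1].
    When the target stands behind the distractor on the ground plane,
    the distractor's mask cuts the target out before blending.
    """
    alpha = render_alpha.copy()
    if target_depth > distractor_depth:          # target is behind the distractor
        alpha = alpha * (1.0 - distractor_mask)  # obscure it with the distractor mask
    a = alpha[..., None]                         # broadcast mask over color channels
    return a * render + (1.0 - a) * background   # standard alpha-over blend
```

With the target behind the distractor, pixels inside the distractor mask keep the original footage; elsewhere the rendered object is blended in.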
  2. A method of claim 1, further comprising determining one or more edges of the ground plane and calculating a 3D rotation, a scale and a translation relative to a camera position and a lens characteristic using an aspect ratio of the ground plane, prior to normalizing the one or more screen coordinates.
  3. A method of claim 1 or 2, further comprising masking one or more objects standing on the ground plane by finding a bounding box around each of the one or more objects, prior to determining the relative position of each object in the ground plane.
  4. A method of any of the preceding claims, wherein the relative position of each of the one or more objects is determined by: - multiplying the homography matrix with a center position of a lower edge of the bounding box of an object from among the one or more objects; and - generating a two-dimensional (2D) coordinate representing the relative position of the object on the normalized ground plane.
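The normalization of claims 3-4 — applying the homography to the center of the bounding box's lower edge (the point where the object touches the ground) — can be sketched in a few lines of NumPy. This is a hedged illustration, not the patent's implementation; the function name and the bounding-box tuple layout are assumptions.

```python
import numpy as np

def ground_position(homography, bbox):
    """Map a screen-space bounding box to a 2D position on the
    normalized ground plane.

    homography: 3x3 matrix computed from the ground-plane corners.
    bbox: (x_min, y_min, x_max, y_max) in screen coordinates; the
    anchor is the center of the lower edge (image y grows downward).
    """
    x = (bbox[0] + bbox[2]) / 2.0    # horizontal center of the box
    y = bbox[3]                      # lower edge, where the object meets the ground
    p = np.array([x, y, 1.0])        # homogeneous screen coordinate
    q = homography @ p               # apply the homography
    return q[:2] / q[2]              # perspective divide -> 2D plane coordinate
```

With the identity homography the anchor point maps to itself, which makes the convention easy to check.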
  5. A method of any of the preceding claims, wherein upon the video footage comprising a 360-degree video footage, prior to rendering the model of the target 3D object, the target 3D object is illuminated based on global illumination by: - determining a random image from the video footage to be used as texture on a large sphere based on the randomized position of the target 3D object relative to the ground plane by matching the position of the target 3D object and the position of recording the video footage; and - placing the random image from the video footage on the large sphere to provide a realistic lighting to the target 3D object.
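Claim 5 selects a frame of the 360-degree footage to texture a large lighting sphere "by matching the position of the target 3D object and the position of recording the video footage". A plausible reading is a nearest-position lookup, sketched below; the nearest-neighbor rule, the function name and the position format are assumptions, not details from the patent.

```python
import numpy as np

def nearest_env_frame(recording_positions, object_position):
    """Pick the index of the 360-degree frame whose recording position
    is closest to the target object's randomized ground-plane position.
    That frame's equirectangular image would then be mapped onto a
    large sphere to light the rendered object.

    recording_positions: (N, 2) ground-plane positions, one per frame.
    object_position: (2,) ground-plane position of the target object.
    """
    d = np.linalg.norm(np.asarray(recording_positions, dtype=float)
                       - np.asarray(object_position, dtype=float), axis=1)
    return int(np.argmin(d))         # index of the closest recording position
```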
  6. A method of any of the preceding claims, wherein the ground plane is determined by applying at least one of: a computer vision algorithm or manual marking by a human.
  7. A method of any of the preceding claims, further comprising merging at least one of: a plurality of static reflections or a plurality of time-sequential reflections from an environment scene and one or more distractor objects with a plurality of simulated reflections generated by a simulated surface material property of the target 3D object, and generating a bounding cube to be used for training the AI model.
  8. A method of any of the preceding claims, wherein the video footage comprises a 360-degree video footage.
  9. A method of any of the preceding claims, wherein calculating the coordinates of the bounding box comprises: - enclosing the target 3D object in an invisible 3D cuboid; and - calculating the coordinates of one or more camera facing corners of the invisible 3D cuboid in the surveillance space.
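The bounding-box calculation of claims 9 and 14 — enclose the target in an invisible 3D cuboid, then take the screen coordinates of its camera-facing corners — can be sketched by projecting all eight corners and framing their extremes. This is a minimal pinhole-camera sketch, not the patent's renderer; the intrinsic-matrix model and the function name are assumptions.

```python
import numpy as np

def bbox_from_cuboid(corners_3d, camera_matrix):
    """Project the 8 corners of an invisible 3D cuboid enclosing the
    target object and return the 2D box that frames them.

    corners_3d: (8, 3) corner points in camera coordinates (z > 0).
    camera_matrix: 3x3 pinhole intrinsic matrix.
    Returns (x_min, y_min, x_max, y_max) in image coordinates.
    """
    pts = (camera_matrix @ corners_3d.T).T   # homogeneous image coordinates
    pts = pts[:, :2] / pts[:, 2:3]           # perspective divide
    x_min, y_min = pts.min(axis=0)           # frame all projected corners
    x_max, y_max = pts.max(axis=0)
    return (x_min, y_min, x_max, y_max)
```

For a cuboid spanning x, y in [-1, 1] between depths z = 2 and z = 4 with identity intrinsics, the near face dominates and the box is [-0.5, 0.5] in both axes.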
  10. A system (100) for augmenting a video footage of a surveillance space with a target three-dimensional (3D) object from one or more perspectives for training an artificial intelligence (AI) model, the system comprising: - a target camera (102) disposed in the surveillance space and communicatively coupled to a server (104), wherein the target camera is configured to capture the video footage of the surveillance space and transmit the captured video footage to the server; and - the server (104) communicatively coupled to the target camera and comprising: - a memory (106) that stores a set of modules; and - a processor (108) that executes the set of modules for augmenting a video footage of a surveillance space with a target 3D object from one or more perspectives for training an AI model, the modules comprising: - a footage acquisition module (110) for acquiring the video footage from the target camera in the surveillance space; - a ground plane module (112) for: - determining a ground plane and one or more screen coordinates of the ground plane corners in the video footage; and - normalizing the one or more screen coordinates by calculating a homography matrix from the ground plane and determining a relative position of each of one or more objects in the ground plane; - a model preparation module (114) for preparing a model of the target 3D object to be used for training the AI model; - a 3D object positioning module (116) for iteratively generating a random position and a random rotation for the target 3D object in the ground plane for positioning the target 3D object in front of or behind a distractor object from among the one or more objects in the ground plane; - a rendering module (118) for rendering the model of the target 3D object on the ground plane and composing the rendered 3D object and the ground plane with the acquired video footage to generate a composited image, wherein upon the relative position of the target 3D object on the ground plane 
being behind the relative position of the distractor object, a mask of the distractor object is used to obscure the target 3D object; and - a training data module (120) for calculating coordinates of a bounding box that frames the relative position of the target 3D object in the composited image and saving the composited image along with the coordinates of the bounding box to be used subsequently for training the AI model.
  11. A system of claim 10, further comprising an edge determination module (122) configured to determine one or more edges of the ground plane and calculate a 3D rotation, a scale and a translation relative to a camera position, and a lens characteristic using an aspect ratio of the ground plane, prior to normalizing the one or more screen coordinates.
  12. A system of claim 10 or 11, wherein the 3D object positioning module (116) is further configured to mask one or more objects standing on the ground plane by finding a bounding box around each of the one or more objects, prior to determining the relative position of each object in the ground plane.
  13. A system of any of the claims 10 to 12, wherein the 3D object positioning module (116) is further configured to: - multiply the homography matrix with a center position of a lower edge of the bounding box of an object from among the one or more objects; and - generate a two-dimensional (2D) coordinate representing the relative position of the object on the normalized ground plane.
  14. A system of any of the claims 10 to 13, wherein the training data module (120) is further configured to: - enclose the target 3D object in an invisible 3D cuboid; and - calculate the coordinates of one or more camera facing corners of the invisible 3D cuboid in the surveillance space.

Description

TECHNICAL FIELD

The present disclosure relates to methods of augmenting a video footage of a surveillance space with a target three-dimensional (3D) object from one or more perspectives for training an artificial intelligence (AI) model. Moreover, the present disclosure also relates to systems for augmenting a video footage of a surveillance space with a target 3D object from one or more perspectives for training an AI model.

BACKGROUND

Typically, artificial intelligence (AI) models used in computer vision need training like any other AI model. For object detection, images of the target 3D objects to be trained on are presented to the AI model along with "labels": small data files that describe the position and class of the target 3D objects in an image. Training an AI model typically requires presenting thousands of such labeled images to it. A commonly used technique for acquiring a large enough number of labeled training images is to take video footage containing the desired target 3D objects and employ humans to identify objects in the images associated with the video footage, draw a bounding box around each identified object and select an object class. Another known technique for acquiring a large enough number of labeled training images is to use 'game engines' (such as, for example, Zumo Labs) to create a virtual simulated environment containing the target 3D objects, calculate the bounding boxes and render a large number of images with appropriate labels. However, acquiring a large enough number of labeled training images is challenging. Employing humans to identify objects in images is an extremely time-consuming and expensive process. 
In the method where a virtual simulated environment is created, the main disadvantage is the very clean look of objects without a real-world background, which yields a weaker training set and hence less accurate object detection. Therefore, in light of the foregoing discussion, there is a need to overcome the aforementioned drawbacks of the existing techniques by providing a method and a system for augmenting a video footage with a target three-dimensional (3D) object from one or more perspectives for training an artificial intelligence (AI) model. Documents WO 2019/113510 A1 and US 2019/251397 A1 describe methods for producing training data for a machine learning system by incorporating a 3D model of an object into captured images of a scene.

SUMMARY

The present disclosure seeks to provide a method of augmenting a video footage of a surveillance space with a target three-dimensional (3D) object from one or more perspectives for training an artificial intelligence (AI) model. The present disclosure also seeks to provide a system for augmenting a video footage of a surveillance space with a target 3D object from one or more perspectives for training an AI model. An aim of the present disclosure is to provide a solution that at least partially overcomes the problems encountered in the prior art by providing a technique for (semi-)automatically augmenting video footage from actual surveillance cameras with target 3D objects. Using real video footage that includes 'distractor' objects, together with a blending possibility, as a training set of 3D objects for training the AI models significantly reduces training time and significantly increases the quality of training. 
In one aspect, the present disclosure provides a method of augmenting a video footage of a surveillance space with a target three-dimensional (3D) object from one or more perspectives for training an artificial intelligence (AI) model, the method comprising: acquiring the video footage from a target camera in the surveillance space; determining a ground plane and one or more screen coordinates of one or more corners of the ground plane in the video footage; normalizing the one or more screen coordinates by calculating a homography matrix from the ground plane and determining a relative position of each of one or more objects in the ground plane; preparing a model of the target 3D object to be used for training the AI model; iteratively generating a random position and a random rotation for the target 3D object in the ground plane for positioning the target 3D object in front of or behind a distractor object from among the one or more objects in the ground plane; rendering the model of the target 3D object on the ground plane and composing the rendered 3D object and the ground plane with the acquired video footage to generate a composited image, wherein upon the relative position of the target 3D object on the ground plane being behind the relative position of the distractor object, a mask of the distractor object is used to obscure the target 3D object; and calculating coordinates of a bounding box that frames the relative position of the target 3D object in the composit