KR-20260067517-A - METHOD FOR DETECTING 3-DIMENSIONAL OBJECT THROUGH SYNTHESIS OF VIDEO ACQUIRED FROM VEHICULAR CAMERA AND 3-DIMENSIONAL OBJECT DETECTING DEVICE USING THE SAME

KR 20260067517 A

Abstract

The present invention relates to a method for detecting a three-dimensional object through synthesis of images acquired from a vehicle camera, comprising: (a) when an input image is captured at a first camera pose of a first camera mounted on a vehicle, acquiring, by a 3D object detection device, the input image and a depth map representing depth values of first pixels of the input image; (b) obtaining, by the 3D object detection device, a relative pose of a second camera pose with respect to the first camera pose by referring to the first camera pose and the second camera pose of a reference camera that captured training images used to train a 3D object detection deep learning model, projecting a preset number of the first pixels onto a reference image corresponding to the reference camera according to the relative pose by referring to the depth map, obtaining a transformation relationship as the preset number of first pixels are projected onto a preset number of second pixels of the reference image, and obtaining a composite image by synthesizing the input image onto the reference image through inverse mapping of the second pixels of the reference image to the first pixels of the input image by referring to the transformation relationship; and (c) detecting, by the 3D object detection device, a 3D object in the input image by inputting the composite image and second camera pose information corresponding to the second camera pose into the 3D object detection deep learning model, thereby causing the model to perform learning operations on the composite image and output information on the 3D object.
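The projection in step (b) can be illustrated with a minimal pinhole-camera sketch: a pixel is back-projected with its depth value, transformed by the relative pose of the reference camera with respect to the input camera, and reprojected with the reference intrinsics. All names and numeric values below are illustrative, not taken from the patent:

```python
import numpy as np

def project_pixel(uv, depth, K_in, K_ref, R, t):
    """Project a pixel of the input image into the reference image
    using its depth value and the relative pose (R, t) of the
    reference camera with respect to the input camera."""
    u, v = uv
    # Back-project to a 3D point in the input-camera frame.
    X = depth * np.linalg.inv(K_in) @ np.array([u, v, 1.0])
    # Move the point into the reference-camera frame.
    X_ref = R @ X + t
    # Reproject with the reference-camera intrinsics.
    p = K_ref @ X_ref
    return p[:2] / p[2]

# Illustrative intrinsics; with an identity relative pose and identical
# intrinsics, every pixel maps back onto itself.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
print(project_pixel((100.0, 50.0), 10.0, K, K, np.eye(3), np.zeros(3)))
```

Repeating this for a preset number of pixels yields the point correspondences from which the transformation relationship of step (b) is obtained.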

Inventors

  • 김도형
  • 최송우
  • 오영철
  • 유병용

Assignees

  • (주) 오토노머스에이투지

Dates

Publication Date
2026-05-13
Application Date
2024-11-05

Claims (20)

  1. A method for detecting a three-dimensional object through synthesis of vehicle camera images, comprising: (a) when an input image is captured at a first camera pose of a first camera mounted on a vehicle, acquiring, by a 3D object detection device, the input image and a depth map representing depth values of first pixels of the input image; (b) obtaining, by the 3D object detection device, a relative pose of a second camera pose with respect to the first camera pose by referring to the first camera pose and the second camera pose of a reference camera that captured training images used to train a 3D object detection deep learning model, projecting a preset number of the first pixels onto a reference image corresponding to the reference camera according to the relative pose by referring to the depth map, obtaining a transformation relationship as the preset number of first pixels are projected onto a preset number of second pixels of the reference image, and obtaining a composite image by synthesizing the input image onto the reference image through inverse mapping of the second pixels of the reference image to the first pixels of the input image by referring to the transformation relationship; and (c) detecting, by the 3D object detection device, a 3D object in the input image by inputting the composite image and second camera pose information corresponding to the second camera pose into the 3D object detection deep learning model, thereby causing the 3D object detection deep learning model to perform learning operations on the composite image and output information on the 3D object.
  2. The method of claim 1, wherein in step (b), the 3D object detection device divides the input image to generate a first grid to an n-th grid, where n is an integer greater than or equal to 2, projects coordinates of m_1-st to m_4-th vertices of an m-th grid, where m is an integer increasing from 1 to n, onto the reference image according to the relative pose by referring to depth values of the m_1-st to m_4-th vertices, so that the m_1-st to m_4-th vertices are projected onto k_1-st to k_4-th points of the reference image, respectively, obtains a homography matrix through linear equations for a first corresponding pair of the m_1-st vertex and the k_1-st point to a fourth corresponding pair of the m_4-th vertex and the k_4-th point, and obtains the composite image by inversely mapping, using the homography matrix, m-th pixels located within an m-th rectangular patch of the reference image formed by the k_1-st to k_4-th points to the input image to obtain m-th pixel values of the m-th pixels.
  3. The method of claim 2, wherein in step (b), the 3D object detection device obtains the m-th pixel values of the m-th pixels through nearest-neighbor interpolation or bilinear interpolation when inversely mapping the m-th pixels to the input image.
  4. The method of claim 2, wherein in step (b), the 3D object detection device sets a reference camera focal length included in the second camera pose to be greater than a first camera focal length included in the first camera pose, in order to prevent an inverse-mapping coordinate of at least one of the m-th pixels from being located outside the input image during the inverse mapping.
  5. The method of claim 1, wherein in step (c), the 3D object detection device causes the 3D object detection deep learning model to extract image feature information by performing at least one convolution operation on the composite image through a feature extraction network, and to output the 3D object information by performing a learning operation on the image feature information and the second camera pose information through a 3D object detection head network.
  6. The method of claim 5, wherein in step (c), the 3D object detection device causes the 3D object detection deep learning model to concatenate, through the 3D object detection head network, the image feature information and pose feature information corresponding to the second camera pose information, and to output the 3D object information by performing a learning operation on the concatenated feature information.
  7. The method of claim 5, wherein in step (c), the 3D object detection device causes the 3D object detection deep learning model to output the 3D object information by performing, through the 3D object detection head network, a learning operation on bird's-eye-view feature information converted from the image feature information according to the second camera pose information.
  8. The method of claim 1, wherein in step (a), the 3D object detection device acquires the input image and the depth map from a stereo camera serving as the first camera.
  9. The method of claim 1, wherein in step (a), when the input image is acquired from a mono camera serving as the first camera, the 3D object detection device acquires a t-image corresponding to a specific frame of the input image and a (t-1)-image corresponding to one of the frames preceding the specific frame, inputs the t-image and the (t-1)-image into a depth-map-generating deep learning model, and causes the depth-map-generating deep learning model to acquire at least one feature point corresponding to a same object in the t-image and the (t-1)-image, to predict, using the at least one feature point, a pose change of the mono camera between t-position information of the mono camera at a time t when the t-image is acquired and (t-1)-position information of the mono camera at a time t-1 when the (t-1)-image is acquired, and to output the depth map by predicting a distance of the at least one feature point with reference to the predicted pose change.
  10. The method of claim 1, further comprising: (d) continuing to train, by the 3D object detection device, the 3D object detection deep learning model using the composite image.
  11. A 3D object detection device for performing 3D object detection through synthesis of vehicle camera images, comprising: a memory storing instructions for performing 3D object detection through vehicle camera image synthesis; and a processor configured to perform 3D object detection through vehicle camera image synthesis according to the instructions stored in the memory, wherein the processor performs: (I) a process of acquiring, when an input image is captured at a first camera pose of a first camera mounted on a vehicle, the input image and a depth map representing depth values of first pixels of the input image; (II) a process of obtaining a relative pose of a second camera pose with respect to the first camera pose by referring to the first camera pose and the second camera pose of a reference camera that captured training images used to train a 3D object detection deep learning model, projecting a preset number of the first pixels onto a reference image corresponding to the reference camera according to the relative pose by referring to the depth map, obtaining a transformation relationship as the preset number of first pixels are projected onto a preset number of second pixels of the reference image, and obtaining a composite image by synthesizing the input image onto the reference image through inverse mapping of the second pixels of the reference image to the first pixels of the input image by referring to the transformation relationship; and (III) a process of detecting a 3D object in the input image by inputting the composite image and second camera pose information corresponding to the second camera pose into the 3D object detection deep learning model, thereby causing the 3D object detection deep learning model to perform learning operations on the composite image and output information on the 3D object.
  12. The device of claim 11, wherein in process (II), the processor divides the input image to generate a first grid to an n-th grid, where n is an integer greater than or equal to 2, projects coordinates of m_1-st to m_4-th vertices of an m-th grid, where m is an integer increasing from 1 to n, onto the reference image according to the relative pose by referring to depth values of the m_1-st to m_4-th vertices, so that the m_1-st to m_4-th vertices are projected onto k_1-st to k_4-th points of the reference image, respectively, obtains a homography matrix through linear equations for a first corresponding pair of the m_1-st vertex and the k_1-st point to a fourth corresponding pair of the m_4-th vertex and the k_4-th point, and obtains the composite image by inversely mapping, using the homography matrix, m-th pixels located within an m-th rectangular patch of the reference image formed by the k_1-st to k_4-th points to the input image to obtain m-th pixel values of the m-th pixels.
  13. The device of claim 12, wherein in process (II), the processor obtains the m-th pixel values of the m-th pixels through nearest-neighbor interpolation or bilinear interpolation when inversely mapping the m-th pixels to the input image.
  14. The device of claim 12, wherein in process (II), the processor sets a reference camera focal length included in the second camera pose to be greater than a first camera focal length included in the first camera pose, in order to prevent an inverse-mapping coordinate of at least one of the m-th pixels from being located outside the input image during the inverse mapping.
  15. The device of claim 11, wherein in process (III), the processor causes the 3D object detection deep learning model to extract image feature information by performing at least one convolution operation on the composite image through a feature extraction network, and to output the 3D object information by performing a learning operation on the image feature information and the second camera pose information through a 3D object detection head network.
  16. The device of claim 15, wherein in process (III), the processor causes the 3D object detection deep learning model to concatenate, through the 3D object detection head network, the image feature information and pose feature information corresponding to the second camera pose information, and to output the 3D object information by performing a learning operation on the concatenated feature information.
  17. The device of claim 15, wherein in process (III), the processor causes the 3D object detection deep learning model to output the 3D object information by performing, through the 3D object detection head network, a learning operation on bird's-eye-view feature information converted from the image feature information according to the second camera pose information.
  18. The device of claim 11, wherein in process (I), the processor acquires the input image and the depth map from a stereo camera serving as the first camera.
  19. The device of claim 11, wherein in process (I), when the input image is acquired from a mono camera serving as the first camera, the processor acquires a t-image corresponding to a specific frame of the input image and a (t-1)-image corresponding to one of the frames preceding the specific frame, inputs the t-image and the (t-1)-image into a depth-map-generating deep learning model, and causes the depth-map-generating deep learning model to acquire at least one feature point corresponding to a same object in the t-image and the (t-1)-image, to predict, using the at least one feature point, a pose change of the mono camera between t-position information of the mono camera at a time t when the t-image is acquired and (t-1)-position information of the mono camera at a time t-1 when the (t-1)-image is acquired, and to output the depth map by predicting a distance of the at least one feature point with reference to the predicted pose change.
  20. The device of claim 11, wherein the processor further performs a process of continuing to train the 3D object detection deep learning model using the composite image.
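The patch-wise warping described in claims 2–3 (and 12–13) rests on two standard operations: estimating a homography from four point correspondences, and inverse-mapping patch pixels with bilinear interpolation. The sketch below illustrates both using the classical direct linear transform (DLT); it is an illustrative implementation, not the patent's own, and the numeric values are invented:

```python
import numpy as np

def homography_from_4_pairs(src, dst):
    """Estimate H (up to scale) such that dst ~ H @ src, from four
    point correspondences, via the standard DLT linear system."""
    A = []
    for (x, y), (xp, yp) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, xp * x, xp * y, xp])
        A.append([0, 0, 0, -x, -y, -1, yp * x, yp * y, yp])
    # The homography vector spans the null space of A: take the
    # right-singular vector for the smallest singular value.
    _, _, Vt = np.linalg.svd(np.array(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]  # normalize so that H[2, 2] == 1

def bilinear_sample(img, x, y):
    """Fetch a value at a sub-pixel location by bilinear interpolation,
    as used when inverse-mapping patch pixels to the input image."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * img[y0, x0]
            + dx * (1 - dy) * img[y0, x0 + 1]
            + (1 - dx) * dy * img[y0 + 1, x0]
            + dx * dy * img[y0 + 1, x0 + 1])

def apply(H, p):
    q = H @ np.array([p[0], p[1], 1.0])
    return (q[0] / q[2], q[1] / q[2])

# Recover a known homography exactly from the four corners of a patch.
H_true = np.array([[1.2, 0.1, 5.0], [0.0, 0.9, -3.0], [1e-3, 0.0, 1.0]])
src = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
dst = [apply(H_true, p) for p in src]
H = homography_from_4_pairs(src, dst)

# Bilinear sampling on a small ramp image.
ramp = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(ramp, 1.5, 2.0))  # 9.5
```

With exact correspondences, four pairs give eight equations for the eight degrees of freedom of the homography, so the matrix is recovered exactly up to floating-point precision.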

Description

Method for detecting 3-dimensional objects through synthesis of vehicle camera images and 3-dimensional object detection device using the same

The present invention relates to a method for detecting a three-dimensional object by synthesizing an image captured by a camera installed in a vehicle with an image of a reference camera, and to a three-dimensional object detection device using the same.

Driver assistance systems and autonomous vehicles require perception of the surrounding environment for the safety of the driver; in particular, technology for recognizing the distance to target objects is essential for determining the risk of collision with objects on the driving path in advance and taking appropriate measures. Recently, various studies on 3D object detection deep learning models have been conducted to detect the pose and position of objects from 2D images.

Meanwhile, since a 3D object detection deep learning model is trained using training images captured at a fixed camera pose, if the pose of the camera installed in the vehicle to which the trained model is applied differs from that pose, the 3D object detection performance of the model may be degraded by the mismatch with the learned information. To secure reliable 3D object detection performance, the camera installed in the vehicle using the trained model could be mounted at the same pose as the camera that captured the training images, but this cannot be applied uniformly to all vehicles, which differ in size and design.
In addition, to secure reliable 3D object detection performance, the 3D object detection deep learning model could be trained using images captured by the camera installed on each vehicle in which the model is to be used. However, this requires significant cost and time, because the model must be trained separately for each vehicle type. Accordingly, the applicant proposes a method for ensuring the 3D object detection performance of a 3D object detection deep learning model even for images captured at a pose different from that of the camera used to capture the training images.

The drawings attached below for use in describing embodiments of the present invention are merely some of the embodiments of the present invention, and other drawings can be obtained based on these drawings without inventive work by a person of ordinary skill in the art to which the present invention pertains (hereinafter, "person skilled in the art").

FIG. 1 schematically illustrates a 3D object detection device that performs 3D object detection through vehicle camera image synthesis according to an embodiment of the present invention; FIG. 2 schematically illustrates a method for detecting three-dimensional objects through vehicle camera image synthesis according to an embodiment of the present invention; FIG. 3 schematically illustrates acquisition of a depth map in the method according to an embodiment of the present invention; FIG. 4 schematically illustrates generation of a composite image in the method according to an embodiment of the present invention; FIG. 5 schematically illustrates projection of pixels of an input image onto a reference image in the method according to an embodiment of the present invention; and FIG. 6 schematically illustrates the process of detecting a 3D object using a composite image in the method according to an embodiment of the present invention.

The following detailed description of the invention refers to the accompanying drawings, which illustrate specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention, although different, are not necessarily mutually exclusive. For example, specific shapes, structures, and characteristics described herein may be modified from one embodiment to another without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual components within each embodiment may be modified without departing from the spirit and scope of the invention. Accordingly, the following detailed description is not intended to be taken in a limiting sense.
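The concatenation of image feature information and pose feature information described in claims 6 and 16 can be pictured as broadcasting the camera-pose vector over the spatial grid of the feature map and stacking along the channel axis. The shapes below are invented purely for illustration and do not come from the patent:

```python
import numpy as np

# Hypothetical shapes: a (C, H, W) image feature map produced by the
# feature extraction network, and a P-dimensional camera-pose vector.
C, H, W, P = 64, 20, 30, 6
image_features = np.random.randn(C, H, W).astype(np.float32)
pose_vector = np.random.randn(P).astype(np.float32)

# Broadcast the pose vector over every spatial location ...
pose_features = np.broadcast_to(pose_vector[:, None, None], (P, H, W))
# ... and concatenate along the channel axis before the head network
# performs its learning operation on the fused features.
fused = np.concatenate([image_features, pose_features], axis=0)
print(fused.shape)  # (70, 20, 30)
```

This gives the head network a per-location view of both appearance and camera-pose information, one plausible reading of the "concatenated feature information" in the claims.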