
CN-117315645-B - Three-dimensional target detection method, system, equipment and medium for time sequence fusion

CN 117315645 B

Abstract

An embodiment of the invention provides a three-dimensional target detection method, system, equipment, and medium with time-sequence fusion. The method comprises: converting the view angles of a plurality of coding features obtained from a view-angle camera into the view angle of a first coding feature among the plurality of coding features, to obtain a corresponding plurality of target coding features; performing feature superposition on the plurality of target coding features and the first coding feature to obtain the camera feature at the moment corresponding to the first coding feature; performing feature-space conversion on the camera features of the plurality of view-angle cameras to obtain the corresponding multi-view spatial features; and inputting the multi-view spatial features into a three-dimensional detection head for processing to obtain a three-dimensional target detection result. The depth-perception accuracy for three-dimensional targets is improved, and the accuracy of three-dimensional target detection is thereby improved.

Inventors

  • QI SHENGXIANG
  • DONG NAN

Assignees

  • Chongqing Changan Automobile Co., Ltd. (重庆长安汽车股份有限公司)

Dates

Publication Date
2026-05-08
Application Date
2023-09-27

Claims (11)

  1. A three-dimensional target detection method with long time-sequence fusion, the method comprising: converting the view angles of a plurality of coding features obtained from a view-angle camera into the view angle of a first coding feature among the plurality of coding features, to obtain a corresponding plurality of target coding features; performing feature superposition on the plurality of target coding features and the first coding feature to obtain the camera feature at the moment corresponding to the first coding feature; performing feature-space conversion on the camera features of the plurality of view-angle cameras to obtain the corresponding multi-view spatial features; and inputting the multi-view spatial features into a three-dimensional detection head for processing to obtain a three-dimensional target detection result; wherein performing feature-space conversion on the camera features of the plurality of view-angle cameras to obtain the corresponding multi-view spatial features comprises: performing monocular depth estimation on the camera features through a monocular depth estimation model to obtain a monocular depth estimation result; performing multi-view depth estimation on the camera features through a multi-view depth estimation model to obtain a multi-view depth estimation result; obtaining a depth map by summing the monocular depth estimation result and the multi-view depth estimation result; and fusing and view-converting, through cylinder pooling, the respective depth maps of the plurality of view-angle cameras and the context information in the corresponding multi-view depth estimation results, to obtain the multi-view spatial features at the corresponding moments; wherein, before obtaining the depth map by summing the monocular depth estimation result and the multi-view depth estimation result, the method further comprises: inputting the depth centers in the monocular depth estimation result and the multi-view depth estimation result into a weight network for processing to obtain a weight map; and processing, through a three-dimensional depth model, according to the weight map and the depth center and depth range in the multi-view depth estimation result, to obtain a target multi-view depth estimation result; and wherein obtaining the depth map by summing the monocular depth estimation result and the multi-view depth estimation result comprises: obtaining the depth map by summing the monocular depth estimation result and the target multi-view depth estimation result. (Minimal sketches of the homography warp and of the two-branch depth fusion follow the claims.)
  2. The three-dimensional target detection method with long time-sequence fusion according to claim 1, wherein, before inputting the multi-view spatial features into the three-dimensional detection head for processing, the method further comprises: converting the view angles of a plurality of multi-view spatial features into the view angle of a first multi-view spatial feature among the plurality of multi-view spatial features, to obtain a corresponding plurality of target multi-view spatial features; and performing feature stitching on the plurality of target multi-view spatial features and the first multi-view spatial feature to obtain the time-sequence fusion feature of the first multi-view spatial feature at the corresponding moment; and wherein inputting the multi-view spatial features into the three-dimensional detection head for processing to obtain the three-dimensional target detection result comprises: inputting the time-sequence fusion feature into the three-dimensional detection head for processing to obtain the three-dimensional target detection result. (A temporal-fusion sketch follows the claims.)
  3. The method according to claim 1, wherein converting the view angles of the plurality of coding features obtained from the view-angle camera into the view angle of the first coding feature among the plurality of coding features, to obtain the corresponding plurality of target coding features, comprises: determining the intrinsic parameters of the view-angle camera and the extrinsic parameters of the view-angle camera relative to the vehicle-body coordinate system; determining the first coding feature and the plurality of coding features at a plurality of moments before and after the moment corresponding to the first coding feature, obtained from the view-angle camera; and converting the view angles of the plurality of coding features into the view angle of the first coding feature through homography transformation according to the intrinsic parameters, the extrinsic parameters, and the first coding feature, to obtain the corresponding plurality of target coding features. (See the homography sketch after the claims.)
  4. The three-dimensional target detection method with long time-sequence fusion according to claim 1, wherein performing feature superposition on the plurality of target coding features and the first coding feature to obtain the camera feature at the moment corresponding to the first coding feature comprises: performing feature superposition on the plurality of target coding features and the first coding feature to obtain an initial camera feature at the moment corresponding to the first coding feature; and performing dimension-reduction convolution on the initial camera feature through a target convolution kernel to obtain the camera feature at the corresponding moment.
  5. The three-dimensional target detection method with long time-sequence fusion according to claim 2, wherein converting the view angles of the plurality of multi-view spatial features into the view angle of the first multi-view spatial feature among the plurality of multi-view spatial features, to obtain the corresponding plurality of target multi-view spatial features, comprises: determining the first multi-view spatial feature and the plurality of multi-view spatial features at a plurality of moments before and after the moment corresponding to the first multi-view spatial feature; and converting the view angles of the plurality of multi-view spatial features into the view angle of the first multi-view spatial feature through homography transformation according to the first multi-view spatial feature, to obtain the corresponding plurality of target multi-view spatial features.
  6. The three-dimensional target detection method with long time-sequence fusion according to claim 2, wherein inputting the time-sequence fusion feature into the three-dimensional detection head for processing to obtain the three-dimensional target detection result comprises: inputting the time-sequence fusion features at a plurality of moments respectively into the corresponding three-dimensional detection heads for processing, to obtain a corresponding plurality of three-dimensional target detection results.
  7. The three-dimensional target detection method with long time-sequence fusion according to claim 2, wherein inputting the multi-view spatial features into the three-dimensional detection head for processing to obtain the three-dimensional target detection result comprises: inputting the multi-view spatial features at a plurality of moments respectively into the corresponding three-dimensional detection heads for processing, to obtain a corresponding plurality of three-dimensional target detection results.
  8. The three-dimensional target detection method with long time-sequence fusion according to claim 1, wherein the training of the monocular depth estimation model comprises: constructing a model training sample set and a model verification sample set by performing depth labeling on a preset number of sample camera features; inputting each target sample camera feature in the model training sample set into a constructed initial monocular depth estimation model for training, to obtain a corresponding depth prediction result; determining, according to the depth prediction result, the corresponding depth label, and a preset loss function, whether the initial monocular depth estimation model is trained to a qualified state; verifying the qualified initial monocular depth estimation model through the verification sample set; and, in case the verification is passed, determining the initial monocular depth estimation model as the monocular depth estimation model. (See the training-loop sketch after the claims.)
  9. A three-dimensional target detection system with long time-sequence fusion, the system comprising: a view conversion module, configured to convert the view angles of a plurality of coding features obtained from a view-angle camera into the view angle of a first coding feature among the plurality of coding features, to obtain a corresponding plurality of target coding features; a feature superposition module, configured to perform feature superposition on the plurality of target coding features and the first coding feature to obtain the camera feature at the moment corresponding to the first coding feature; a feature-space conversion module, configured to perform feature-space conversion on the camera features of the plurality of view-angle cameras to obtain the corresponding multi-view spatial features; and a target detection module, configured to input the multi-view spatial features into a three-dimensional detection head for processing to obtain a three-dimensional target detection result; wherein the feature-space conversion module comprises: a monocular depth estimation module, configured to perform monocular depth estimation on the camera features through a monocular depth estimation model to obtain a monocular depth estimation result; a multi-view depth estimation module, configured to perform multi-view depth estimation on the camera features through a multi-view depth estimation model to obtain a multi-view depth estimation result; a summation module, configured to obtain a depth map by summing the monocular depth estimation result and the multi-view depth estimation result; and a feature-space conversion sub-module, configured to fuse and view-convert, through cylinder pooling, the respective depth maps of the plurality of view-angle cameras and the context information in the corresponding multi-view depth estimation results, to obtain the multi-view spatial features at the corresponding moments; the system further comprising: a weight map determination module, configured to input the depth centers in the monocular depth estimation result and the multi-view depth estimation result into a weight network for processing to obtain a weight map; and a target multi-view depth estimation module, configured to process, through a three-dimensional depth model, according to the weight map and the depth center and depth range in the multi-view depth estimation result, to obtain a target multi-view depth estimation result; wherein the summation module comprises: a summation sub-module, configured to obtain the depth map by summing the monocular depth estimation result and the target multi-view depth estimation result.
  10. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the three-dimensional target detection method with long time-sequence fusion according to any one of claims 1 to 8.
  11. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the three-dimensional target detection method with long time-sequence fusion according to any one of claims 1 to 8.
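
The homography-based view conversion invoked in claims 1, 3, and 5 can be sketched with a standard plane-induced homography. The code below is an illustrative assumption, not the patent's implementation: the intrinsics `K`, relative pose `R`, `t`, plane normal `n`, and distance `d` are placeholders, and the warp is applied with `grid_sample`.

```python
# Hypothetical sketch of the homography-based view conversion in claims 1 and 3
# (camera parameters, plane normal and distance are placeholders, not the patent's).
import torch
import torch.nn.functional as F


def plane_homography(K, R, t, n=(0.0, 0.0, 1.0), d=10.0):
    """H = K (R - t n^T / d) K^{-1}: maps source-view pixels to reference-view pixels."""
    K = torch.as_tensor(K, dtype=torch.float32)
    R = torch.as_tensor(R, dtype=torch.float32)
    t = torch.as_tensor(t, dtype=torch.float32).reshape(3, 1)
    n = torch.as_tensor(n, dtype=torch.float32).reshape(1, 3)
    return K @ (R - t @ n / d) @ torch.linalg.inv(K)


def warp_feature(feat, H):
    """Warp a (1, C, h, w) feature map by a 3x3 homography using grid_sample."""
    _, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).reshape(-1, 3)
    src = pix @ torch.linalg.inv(H).T                 # sample locations in the source view
    src = src[:, :2] / src[:, 2:3]
    grid = torch.stack([src[:, 0] / (w - 1) * 2 - 1,  # normalise to [-1, 1] for grid_sample
                        src[:, 1] / (h - 1) * 2 - 1], dim=-1).reshape(1, h, w, 2)
    return F.grid_sample(feat, grid, align_corners=True)


# Toy usage with placeholder intrinsics and a small lateral ego-motion.
K = [[500.0, 0.0, 64.0], [0.0, 500.0, 32.0], [0.0, 0.0, 1.0]]
H = plane_homography(K, torch.eye(3), [0.2, 0.0, 0.0])
past_coding_feature = torch.randn(1, 64, 64, 128)     # hypothetical coding feature
target_coding_feature = warp_feature(past_coding_feature, H)
```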
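Claim 1 blends a monocular depth branch and a multi-view depth branch through a weight network before summation. The following is a minimal sketch under assumed shapes and architectures (the patent specifies neither): hypothetical heads produce the two depth distributions plus a depth centre and depth range, the weight network consumes the two depth centres, and a stand-in "three-dimensional depth model" turns the weight map, centre, and range into the target multi-view result that is summed with the monocular result.

```python
# Hedged sketch of the two-branch depth estimation in claim 1; channel sizes,
# head architectures and the blending of the two branches are assumptions.
import torch
import torch.nn as nn


class DepthFusion(nn.Module):
    def __init__(self, in_ch=256, depth_bins=64):
        super().__init__()
        # Monocular branch: per-pixel depth distribution over depth_bins.
        self.mono_head = nn.Conv2d(in_ch, depth_bins, kernel_size=1)
        # Multi-view branch: depth distribution plus a depth-centre and a depth-range channel.
        self.multi_head = nn.Conv2d(in_ch, depth_bins + 2, kernel_size=1)
        # Weight network: consumes the two depth centres, returns a weight map in [0, 1].
        self.weight_net = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                                        nn.Conv2d(16, 1, 1), nn.Sigmoid())
        # Stand-in "three-dimensional depth model": weight map + centre + range -> per-bin gate.
        self.depth_model = nn.Sequential(nn.Conv2d(3, depth_bins, 1), nn.Sigmoid())

    def forward(self, cam_feat):
        mono = self.mono_head(cam_feat).softmax(dim=1)              # monocular depth result
        multi = self.multi_head(cam_feat)
        multi_dist = multi[:, :-2].softmax(dim=1)                   # multi-view depth result
        centre, depth_range = multi[:, -2:-1], multi[:, -1:]        # depth centre / depth range
        mono_centre = mono.argmax(dim=1, keepdim=True).float()      # coarse monocular centre
        w = self.weight_net(torch.cat([mono_centre, centre], dim=1))
        gate = self.depth_model(torch.cat([w, centre, depth_range], dim=1))
        target_multi = gate * multi_dist                            # target multi-view result
        return mono + target_multi                                  # summed depth map


depth_map = DepthFusion()(torch.randn(2, 256, 32, 88))              # shape (2, 64, 32, 88)
```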
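Claims 2 and 5 to 7 stitch the multi-view spatial features of several moments onto the current moment before detection. The sketch below assumes the past spatial features have already been view-converted to the current moment (as in the homography sketch above) and uses placeholder channel sizes and a placeholder detection head.

```python
# Illustrative sketch of the time-sequence fusion in claims 2 and 5-7
# (the detection head and channel sizes are placeholders, not the patent's design).
import torch
import torch.nn as nn


class TemporalFusionDetector(nn.Module):
    def __init__(self, feat_ch=128, num_frames=4, num_classes=10):
        super().__init__()
        # Reduce the stitched (concatenated) features back to feat_ch channels.
        self.reduce = nn.Conv2d(feat_ch * num_frames, feat_ch, kernel_size=1)
        # Hypothetical 3-D detection head: class scores plus 7 box parameters
        # (x, y, z, w, l, h, yaw) per spatial cell.
        self.head = nn.Conv2d(feat_ch, num_classes + 7, kernel_size=1)

    def forward(self, current_feat, past_feats):
        # past_feats are assumed to be already view-converted to the current moment.
        fused = torch.cat([current_feat, *past_feats], dim=1)   # feature stitching
        fused = self.reduce(fused)                              # time-sequence fusion feature
        return self.head(fused)                                 # 3-D target detection result


current = torch.randn(1, 128, 100, 100)                          # current multi-view spatial feature
past = [torch.randn(1, 128, 100, 100) for _ in range(3)]         # three earlier moments, aligned
detections = TemporalFusionDetector()(current, past)              # (1, 17, 100, 100)
```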
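Claim 8 trains the monocular depth estimation model on depth-labelled sample camera features and accepts it after a verification pass. The loop below is a generic sketch with made-up data, an assumed L1 loss, and an assumed acceptance threshold; none of these choices come from the patent.

```python
# Generic training/verification sketch for the monocular depth model of claim 8.
# The dataset, loss function and "qualified" threshold are illustrative assumptions.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split

sample_features = torch.randn(200, 256, 32, 88)     # hypothetical sample camera features
depth_labels = torch.rand(200, 1, 32, 88) * 60.0    # hypothetical depth annotations (metres)

dataset = TensorDataset(sample_features, depth_labels)
train_set, val_set = random_split(dataset, [160, 40])        # training / verification sample sets
train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
val_loader = DataLoader(val_set, batch_size=8)

model = nn.Sequential(nn.Conv2d(256, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 1, 1))                   # initial monocular depth model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()                                         # preset loss function (assumed)

for epoch in range(10):
    model.train()
    for feat, depth in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(feat), depth)                    # depth prediction vs. depth label
        loss.backward()
        optimizer.step()

model.eval()
with torch.no_grad():
    val_err = sum(loss_fn(model(f), d).item() for f, d in val_loader) / len(val_loader)
print("verification passed" if val_err < 0.1 else "verification failed", val_err)
```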

Description

Three-dimensional target detection method, system, equipment and medium for time sequence fusion

Technical Field

The invention relates to the technical field of three-dimensional target detection, and in particular to a three-dimensional target detection method, system, equipment, and medium with time-sequence fusion.

Background

An autonomous vehicle driving on the road needs to perceive the surrounding three-dimensional scene. Three-dimensional target detection with long time-sequence fusion acquires the position and type of objects in three-dimensional space; it is a foundation of the autonomous-driving perception system and provides important guidance for subsequent path planning, motion prediction, and collision avoidance. Inspired by the success of three-dimensional target detection algorithms based on long time-sequence fusion of laser point clouds, a new pseudo-LiDAR approach has been proposed for three-dimensional target detection based on long time-sequence fusion of camera images. Three-dimensional target detection with monocular or stereoscopic long time-sequence fusion is realized by computing disparity, re-projecting the image into three-dimensional space to obtain a pseudo-LiDAR point cloud, and then applying a high-accuracy three-dimensional target detection algorithm with long time-sequence fusion based on laser point clouds. Camera-based three-dimensional target detection has received widespread attention due to the stability and low cost of vision sensors. However, there is still a large performance gap compared with LiDAR-based three-dimensional target detection methods, because camera-based detection suffers from the well-known problem of inaccurate depth perception.

Disclosure of Invention

In view of this, the embodiments of the present invention provide a three-dimensional target detection method, system, equipment, and medium with time-sequence fusion, which aim to improve the depth-perception accuracy for three-dimensional targets and thereby improve the accuracy of three-dimensional target detection. A first aspect of the embodiments of the invention provides a three-dimensional target detection method with long time-sequence fusion, comprising the following steps: converting the view angles of a plurality of coding features obtained from a view-angle camera into the view angle of a first coding feature among the plurality of coding features, to obtain a corresponding plurality of target coding features; performing feature superposition on the plurality of target coding features and the first coding feature to obtain the camera feature at the moment corresponding to the first coding feature; performing feature-space conversion on the camera features of the plurality of view-angle cameras to obtain the corresponding multi-view spatial features; and inputting the multi-view spatial features into a three-dimensional detection head for processing to obtain a three-dimensional target detection result.
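
The four steps of the first aspect can be read as a small module pipeline. The following structural sketch uses placeholder modules and shapes only (the view conversion, the depth-based feature-space conversion with cylinder pooling, and the detection head are reduced to stubs); it is meant to show the data flow, not the patent's architecture.

```python
# Structural sketch of the first-aspect pipeline (placeholder modules; layer
# choices, shapes and the cylinder-pooling step are assumptions, not the patent's).
import torch
import torch.nn as nn


class Pipeline(nn.Module):
    def __init__(self, feat_ch=128, num_frames=4, bev_ch=128, num_classes=10):
        super().__init__()
        # Step 2: feature superposition followed by dimension-reduction (1x1) convolution.
        self.reduce = nn.Conv2d(feat_ch * num_frames, feat_ch, kernel_size=1)
        # Step 3: stand-in for depth estimation + cylinder pooling into spatial features.
        self.to_spatial = nn.Conv2d(feat_ch, bev_ch, kernel_size=3, padding=1)
        # Step 4: stand-in three-dimensional detection head.
        self.det_head = nn.Conv2d(bev_ch, num_classes + 7, kernel_size=1)

    def forward(self, coding_feats):
        # Step 1 (assumed done upstream): coding features of earlier moments are
        # already homography-warped into the view of the first coding feature.
        stacked = torch.cat(coding_feats, dim=1)          # target coding feats + first coding feat
        camera_feat = self.reduce(stacked)                # camera feature at the current moment
        spatial_feat = self.to_spatial(camera_feat)       # multi-view spatial feature (stub)
        return self.det_head(spatial_feat)                # three-dimensional detection result


frames = [torch.randn(1, 128, 32, 88) for _ in range(4)]  # one camera, four moments
result = Pipeline()(frames)                                # (1, 17, 32, 88)
```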
Optionally, before the multi-view spatial features are input into the three-dimensional detection head for processing to obtain the three-dimensional target detection result, the method further includes: converting the view angles of a plurality of multi-view spatial features into the view angle of a first multi-view spatial feature among the plurality of multi-view spatial features, to obtain a corresponding plurality of target multi-view spatial features; and performing feature stitching on the plurality of target multi-view spatial features and the first multi-view spatial feature to obtain the time-sequence fusion feature of the first multi-view spatial feature at the corresponding moment; and inputting the multi-view spatial features into the three-dimensional detection head for processing to obtain the three-dimensional target detection result includes: inputting the time-sequence fusion feature into the three-dimensional detection head for processing to obtain the three-dimensional target detection result. Optionally, converting the view angles of the plurality of coding features obtained from the view-angle camera into the view angle of the first coding feature among the plurality of coding features, to obtain the corresponding plurality of target coding features, includes: determining the intrinsic parameters of the view-angle camera and the extrinsic parameters of the view-angle camera relative to the vehicle-body coordinate system; determining the first coding feature and the plurality of coding features at a plurality of moments before and after the moment corresponding to the first coding feature, obtained from the view-angle camera; and converting the view angles of the plurality of coding features into the view angle of the first coding feature through homography transformation according to the intrinsic parameters, the extrinsic parameters, and the first coding feature, to obtain the corresponding plurality of target coding features. Optionally,