CN-115270918-B - Target detection method, method and device for establishing time-associated perception model
Abstract
Embodiments of the application disclose a target detection method and a method and device for establishing a time-associated perception model. The main technical scheme comprises: acquiring multi-frame sensor data; performing target detection on each frame of sensor data in the multi-frame sensor data to obtain a first target detection result of each frame; performing target tracking on the first target detection results to associate information of the same target across frames, so as to obtain a detection result sequence of each target; and performing a second correction prediction on the first target detection result of each frame based on the time-sequence correlation of the same target's information in its detection result sequence, so as to obtain a third target detection result of each frame. The technical scheme provided by the application can improve the accuracy of target detection.
Inventors
- ZHANG DA
- WU YUHUAN
- MIAO ZHENWEI
- ZHAN XIN
- QING QUAN
- YUAN TINGTING
Assignees
- Alibaba Damo Academy (Hangzhou) Technology Co., Ltd. (阿里巴巴达摩院(杭州)科技有限公司)
Dates
- Publication Date
- 20260512
- Application Date
- 20220620
Claims (13)
- 1. A method of target detection, the method comprising: acquiring multi-frame sensor data; generating a plurality of candidate regions for each frame of sensor data in the multi-frame sensor data, extracting features of the candidate regions of each frame of sensor data through a deep convolutional neural network, and obtaining a second target detection result of each frame of sensor data using the extracted features; inputting each frame of sensor data and its second target detection result into a spatial correlation perception model, the spatial correlation perception model performing a first correction prediction on the second target detection result of the currently input frame of sensor data based on the spatial correlation among the targets in that second target detection result, to obtain a first target detection result of the current frame of sensor data; performing target tracking on the first target detection result of each frame of sensor data to associate information of the same target, so as to obtain a detection result sequence of each target; and performing a second correction prediction on the first target detection result of each frame of sensor data based on the time-sequence correlation of the same target's information in its detection result sequence, so as to obtain a third target detection result of each frame of sensor data.
- 2. The method of claim 1, wherein performing target detection on each frame of sensor data in the multi-frame sensor data to obtain a first target detection result of each frame of sensor data comprises: generating a plurality of candidate regions for each frame of sensor data in the multi-frame sensor data; and extracting features of the candidate regions of each frame of sensor data through the deep convolutional neural network, and obtaining a first target detection result of each frame of sensor data using the extracted features.
- 3. The method of claim 1, wherein the spatial correlation perception model performing a first correction prediction on the second target detection result of the currently input frame of sensor data based on the spatial correlation among the targets in that second target detection result comprises: establishing an information graph of the currently input frame of sensor data using its second target detection result, and inputting the information graph into the spatial correlation perception model, wherein the information graph comprises nodes and edges between the nodes, the nodes represent targets detected in the currently input frame of sensor data, and the edges represent associations between the nodes; and the spatial correlation perception model extracting features of the information graph through a graph convolutional neural network, and performing the first correction prediction on the extracted features through a mapping layer, to obtain a first target detection result of the currently input frame of sensor data.
- 4. The method according to claim 3, wherein extracting features of the information graph through a graph convolutional neural network comprises: in each iteration, computing the feature of each edge in the information graph by applying a nonlinear transformation to the features of the nodes at its two ends, and fusing the features of the edges connected to a node by pooling, so as to obtain that node's feature for the next iteration; and after the iterations are finished, combining the features obtained for the same node in all iterations to obtain the node's feature extracted from the information graph, to be provided to the mapping layer.
- 5. The method of claim 1, wherein performing a second correction prediction on the first target detection result of each frame of sensor data based on the time-sequence correlation of the same target's information in its detection result sequence comprises: inputting each frame of sensor data and its first target detection result into a time-associated perception model; the time-associated perception model using the contextual target features of the currently input frame of sensor data to process, via a self-attention mechanism, the feature of each target in the first target detection result of the currently input frame, to obtain a feature representation of each target of the currently input frame, wherein the contextual target features comprise the features of the targets in the N frames of sensor data before and/or after the currently input frame, N being a preset positive integer; and mapping the feature representation of each target of the currently input frame of sensor data to obtain a third target detection result of the currently input frame of sensor data.
- 6. The method of claim 1, wherein the target tracking further obtains a confidence of the detection result sequence of each target; and performing a second correction prediction on the first target detection result of each frame of sensor data based on the time-sequence correlation of the same target's information in its detection result sequence comprises: determining a target whose detection result sequence has a confidence greater than or equal to a preset confidence threshold as a target to be corrected; and performing the second correction prediction on the information of the target to be corrected in the first target detection result of each frame of sensor data based on the time-sequence correlation reflected by the detection result sequence of the target to be corrected.
- 7. The method of any one of claims 1 to 6, wherein the sensor data comprises point cloud data, and the target detection result comprises category information, position information, size information, and orientation information of each target.
- 8. A method of establishing a time-associated perception model, the method comprising: acquiring a second training sample, wherein the second training sample comprises multi-frame sensor data, first target detection results obtained by performing target detection on the multi-frame sensor data, a detection result sequence of each target obtained by performing target tracking on the first target detection result of each frame of sensor data, and labels annotating the information of each target in the multi-frame sensor data; wherein performing target detection on the multi-frame sensor data to obtain the first target detection results comprises generating a plurality of candidate regions for each frame of sensor data in the multi-frame sensor data, extracting features of the candidate regions of each frame of sensor data through a deep convolutional neural network, obtaining a second target detection result of each frame of sensor data using the extracted features, inputting each frame of sensor data and its second target detection result into a spatial correlation perception model, and the spatial correlation perception model performing a first correction prediction on the second target detection result of the currently input frame of sensor data based on the spatial correlation among the targets in that result, to obtain a first target detection result of the current frame of sensor data; and training with the second training sample to obtain the time-associated perception model, wherein the time-associated perception model performs a second correction prediction on the first target detection result of each frame of sensor data based on the time-sequence correlation of the same target's information in its detection result sequence, to obtain a third target detection result of each frame of sensor data, and the training objective comprises minimizing the difference between the third target detection results of the frames of sensor data and the corresponding labels.
- 9. The method of claim 8, wherein the time-associated perception model performing a second correction prediction on the first target detection result of each frame of sensor data based on the time-sequence correlation of the same target's information in its detection result sequence comprises: inputting each frame of sensor data and its first target detection result into the time-associated perception model; the time-associated perception model using the contextual target features of the currently input frame of sensor data to process, via a self-attention mechanism, the feature of each target in the first target detection result of the currently input frame, to obtain a feature representation of each target of the currently input frame, wherein the contextual target features comprise the features of the targets in the N frames of sensor data before and/or after the currently input frame, N being a preset positive integer; and mapping the feature representation of each target of the currently input frame of sensor data to obtain a third target detection result of the currently input frame of sensor data.
- 10. A target detection device, comprising: a data acquisition module configured to acquire multi-frame sensor data; a target detection module configured to perform target detection on each frame of sensor data in the multi-frame sensor data to obtain a first target detection result of each frame of sensor data, which comprises generating a plurality of candidate regions for each frame of sensor data in the multi-frame sensor data, extracting features of the candidate regions of each frame of sensor data through a deep convolutional neural network, obtaining a second target detection result of each frame of sensor data using the extracted features, inputting each frame of sensor data and its second target detection result into a spatial correlation perception model, and the spatial correlation perception model performing a first correction prediction on the second target detection result of the currently input frame of sensor data based on the spatial correlation among the targets in that result, to obtain a first target detection result of the current frame of sensor data; a target tracking module configured to perform target tracking on the first target detection result of each frame of sensor data so as to associate information of the same target and obtain a detection result sequence of each target; and a time-sequence correlation perception module configured to perform a second correction prediction on the first target detection result of each frame of sensor data based on the time-sequence correlation of the same target's information in its detection result sequence, to obtain a third target detection result of each frame of sensor data.
- 11. An apparatus for establishing a time-associated perception network, the apparatus comprising: a second sample acquisition module configured to acquire a second training sample, wherein the second training sample comprises multi-frame sensor data, first target detection results obtained by performing target detection on the multi-frame sensor data, a detection result sequence of each target obtained by performing target tracking on the first target detection result of each frame of sensor data, and labels annotating the information of each target in the multi-frame sensor data; wherein performing target detection on the multi-frame sensor data to obtain the first target detection results comprises generating a plurality of candidate regions for each frame of sensor data in the multi-frame sensor data, extracting features of the candidate regions of each frame of sensor data through a deep convolutional neural network, obtaining a second target detection result of each frame of sensor data using the extracted features, inputting each frame of sensor data and its second target detection result into a spatial correlation perception model, and the spatial correlation perception model performing a first correction prediction on the second target detection result of the currently input frame of sensor data based on the spatial correlation among the targets in that result, to obtain a first target detection result of the current frame of sensor data; and a second model training module configured to train with the second training sample to obtain the time-associated perception network, wherein the time-associated perception network performs a second correction prediction on the first target detection result of each frame of sensor data based on the time-sequence correlation of the same target's information in its detection result sequence, to obtain a third target detection result of each frame of sensor data, and the training objective comprises minimizing the difference between the third target detection result of each frame of sensor data and the corresponding label.
- 12. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.
- 13. An electronic device, comprising: one or more processors; and a memory associated with the one or more processors, the memory storing program instructions that, when read and executed by the one or more processors, perform the steps of the method according to any one of claims 1 to 9.
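The self-attention step of claims 5 and 9 can be illustrated with a minimal scaled dot-product attention over per-target features: each target feature of the current frame attends over contextual target features gathered from the N frames before and/or after it. This is a sketch under assumed shapes and a residual update; `softmax` and `temporal_refine` are illustrative names, not the patented implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_refine(current_feats, context_feats):
    """Attend each target feature of the current frame (T, D) over the
    contextual target features (C, D) from neighboring frames, then
    apply a residual update to obtain refined feature representations."""
    d = current_feats.shape[-1]
    scores = current_feats @ context_feats.T / np.sqrt(d)  # (T, C) similarities
    weights = softmax(scores, axis=-1)                     # attention weights
    return current_feats + weights @ context_feats         # refined (T, D)
```

The refined feature representation of each target would then be mapped (e.g. by a small prediction head) to the third target detection result.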
Description
Target detection method, method and device for establishing time-associated perception model
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a target detection method and a method and device for establishing a time-associated perception model.
Background
Target detection is currently applied mainly in fields such as automatic driving and robotics. It is one of the important components of an autonomous driving system: the autonomous vehicle must not only recognize the type of an obstacle but also obtain precise position and orientation information of the target, so as to provide a reasonable route to the planning and control module. Target detection for autonomous vehicles is mainly based on sensors, including lidar, millimeter-wave radar, onboard cameras, and so on. The sensors acquire multiple frames of sensor data, and targets are then detected from the multi-frame sensor data through a target detection algorithm. Although some target detection algorithms based on sensor data already exist, their detection accuracy still needs improvement.
Disclosure of Invention
In view of the above, the present application provides a target detection method and a method and device for establishing a time-associated perception model, so as to improve the accuracy of target detection.
The application provides the following scheme. In a first aspect, there is provided a target detection method, the method comprising: acquiring multi-frame sensor data; performing target detection on each frame of sensor data in the multi-frame sensor data to obtain a first target detection result of each frame of sensor data; performing target tracking on the first target detection result of each frame of sensor data to associate information of the same target, so as to obtain a detection result sequence of each target; and performing a second correction prediction on the first target detection result of each frame of sensor data based on the time-sequence correlation of the same target's information in its detection result sequence, so as to obtain a third target detection result of each frame of sensor data. According to an implementation manner of the embodiment of the present application, performing target detection on each frame of sensor data in the multi-frame sensor data to obtain a first target detection result of each frame of sensor data comprises: generating a plurality of candidate regions for each frame of sensor data in the multi-frame sensor data; and extracting features of the candidate regions of each frame of sensor data through a deep convolutional neural network, and obtaining a first target detection result of each frame of sensor data using the extracted features.
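The three-stage scheme above (per-frame detection, cross-frame tracking, time-sequence correction) can be sketched as a thin orchestration loop. The callables `detector`, `tracker`, and `temporal_model` are placeholders standing in for the patent's components, not their actual implementations.

```python
from typing import Any, Callable, List

def detect_pipeline(frames: List[Any],
                    detector: Callable,
                    tracker: Callable,
                    temporal_model: Callable) -> List[Any]:
    # Stage 1: per-frame detection -> first target detection results
    first_results = [detector(frame) for frame in frames]
    # Stage 2: tracking associates the same target across frames,
    # yielding one detection-result sequence per target
    sequences = tracker(first_results)
    # Stage 3: second correction prediction using the time-sequence
    # correlation in each target's sequence -> third results
    return [temporal_model(result, sequences) for result in first_results]
```

Because each stage is a plain callable, the spatial correction of the second aspect can be folded into `detector` without changing the loop.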
According to an implementation manner of the embodiment of the present application, performing target detection on each frame of sensor data in the multi-frame sensor data to obtain a first target detection result of each frame of sensor data comprises: generating a plurality of candidate regions for each frame of sensor data in the multi-frame sensor data, extracting features of the candidate regions of each frame of sensor data through a deep convolutional neural network, and obtaining a second target detection result of each frame of sensor data using the extracted features; inputting each frame of sensor data and its second target detection result into a spatial correlation perception model; and the spatial correlation perception model performing a first correction prediction on the second target detection result of the currently input frame of sensor data based on the spatial correlation among the targets in that second target detection result, to obtain a first target detection result of the current frame of sensor data.
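The per-frame path just described (candidate regions, per-region deep-CNN features, second detection result, spatial correction) can be sketched as a short function chain; all of the callables here are illustrative placeholders, not the patent's networks.

```python
from typing import Any, Callable

def first_stage_detect(frame: Any,
                       propose: Callable,
                       backbone: Callable,
                       head: Callable,
                       spatial_model: Callable) -> Any:
    proposals = propose(frame)                              # candidate regions
    region_feats = [backbone(frame, p) for p in proposals]  # per-region deep-CNN features
    second_result = head(region_feats)                      # second target detection result
    # The spatial correlation perception model corrects the second result
    # using the spatial correlation among the detected targets
    return spatial_model(frame, second_result)              # first target detection result
```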
According to an implementation manner of the embodiment of the present application, the spatial correlation perception model performing the first correction prediction on the second target detection result of the currently input frame of sensor data based on the spatial correlation among the targets in that second target detection result comprises: establishing an information graph of the currently input frame of sensor data using its second target detection result, and inputting the information graph into the spatial correlation perception model, wherein the information graph comprises nodes and edges between the nodes, the nodes represent targets detected in the currently input frame of sensor data, and the edges represent associations between the nodes; and the spatial correlation perception model extracting features of the information graph through a graph convolutional neural network, and performing the first correction prediction on the extracted features through a mapping layer, to obtain a first target detection result of the currently input frame of sensor data.
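The edge-wise graph feature extraction described in claim 4 can be sketched with a minimal edge-convolution loop: per iteration, each edge feature is a nonlinear transform of its two endpoint node features, a node's incident edge features are fused by pooling, and the features a node obtained across all iterations are combined at the end. The choices of `np.tanh` as the nonlinearity and max-pooling as the fusion are assumptions for illustration only.

```python
import numpy as np

def edge_conv(node_feats, edges, iters=2):
    """node_feats: (n, d) array of node features; edges: list of (u, v)
    index pairs. Returns (n, d * iters): per-node features concatenated
    across iterations, ready to be fed to a mapping layer."""
    feats = np.asarray(node_feats, dtype=float)
    n = feats.shape[0]
    history = []
    for _ in range(iters):
        pooled = feats.copy()  # a node with no incident edge keeps its feature
        for v in range(n):
            # Neighbors of v across the (undirected) edge list
            nbrs = [u for (a, b) in edges
                    for u in ((b,) if a == v else (a,) if b == v else ())]
            if nbrs:
                edge_feats = np.tanh(feats[v] + feats[nbrs])  # one row per incident edge
                pooled[v] = edge_feats.max(axis=0)            # max-pool incident edges
        feats = pooled
        history.append(feats)
    return np.concatenate(history, axis=-1)  # combine features from all iterations
```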