CN-122024279-A - Pedestrian target detection method, device, electronic equipment and medium

CN 122024279 A

Abstract

The application provides a pedestrian target detection method, apparatus, electronic device, and medium. A continuous image sequence of preset length captured by a multi-view camera is acquired, and multi-scale image features are extracted from each frame. The multi-scale image features are projected into bird's-eye-view (BEV) space to generate a first temporal BEV feature sequence. A second temporal BEV feature sequence is then generated from the first temporal BEV feature sequence and instance feature feedback information, where the instance feature feedback information describes the features of pedestrian instances. The second temporal BEV feature sequence is temporally fused to obtain a first fused BEV feature, a three-dimensional pedestrian candidate box is generated from the first fused BEV feature, and instance features corresponding to the candidate box are extracted. The instance feature feedback information guides the model to focus on pedestrian regions, suppresses background-noise interference, improves the accuracy of the second temporal BEV feature sequence, and thereby improves the accuracy of subsequent pedestrian localization.

Inventors

  • XU BOCHENG
  • ZHANG CAO
  • LI YANG
  • SU XINGYI

Assignees

  • 重庆凤凰技术有限公司 (Chongqing Phoenix Technology Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2026-01-27

Claims (10)

  1. A pedestrian target detection method, the method comprising: acquiring a continuous image sequence of preset length captured by a multi-view camera, and extracting multi-scale image features from each frame; projecting the multi-scale image features into bird's-eye-view (BEV) space to generate a first temporal BEV feature sequence; generating a second temporal BEV feature sequence based on the first temporal BEV feature sequence and instance feature feedback information, wherein the instance feature feedback information describes the features of a pedestrian instance; performing temporal fusion on the second temporal BEV feature sequence to obtain a first fused BEV feature; and generating a three-dimensional pedestrian candidate box from the first fused BEV feature and extracting instance features corresponding to the three-dimensional pedestrian candidate box.
  2. The method of claim 1, wherein generating the second temporal bird's-eye-view (BEV) feature sequence based on the first temporal BEV feature sequence and instance feature feedback information comprises: on the first generation of the second temporal BEV feature sequence, initializing the instance feature feedback information from the first temporal BEV feature sequence; generating an instance attention map from the instance feature feedback information, wherein the instance attention map represents the probability of a pedestrian instance at each position in BEV space; fusing the instance attention map with the first temporal BEV feature sequence to obtain a second fused BEV feature; and performing temporal context modeling on the second fused BEV feature to obtain the second temporal BEV feature sequence.
  3. The method of claim 2, wherein after performing temporal context modeling on the second fused bird's-eye-view (BEV) feature to obtain the second temporal BEV feature sequence, the method further comprises: when iterating the second temporal BEV feature sequence, using the extracted instance features as the instance feature feedback information and regenerating the second temporal BEV feature sequence until a preset termination condition is met.
  4. The method of claim 1, wherein generating the three-dimensional pedestrian candidate box from the first fused bird's-eye-view (BEV) feature and extracting instance features corresponding to the three-dimensional pedestrian candidate box comprises: detecting on the first fused BEV feature with a preset detection head to generate the three-dimensional pedestrian candidate box; and projecting the three-dimensional pedestrian candidate box into the feature space of the first fused BEV feature and extracting the instance features corresponding to the three-dimensional pedestrian candidate box.
  5. The method of claim 4, wherein projecting the three-dimensional pedestrian candidate box into the feature space of the first fused bird's-eye-view (BEV) feature and extracting instance features corresponding to the three-dimensional pedestrian candidate box comprises: reverse-projecting the center coordinates of the three-dimensional pedestrian candidate box onto the first fused BEV feature to obtain two-dimensional reference point coordinates; predicting position offsets for a plurality of sampling points based on the three-dimensional pedestrian candidate box and the instance attention map; determining target sampling positions from the two-dimensional reference point coordinates and the position offsets; and extracting features at the target sampling positions from the first fused BEV feature to obtain the instance features.
  6. The method of claim 1, wherein after generating the three-dimensional pedestrian candidate box from the first fused bird's-eye-view (BEV) feature and extracting instance features corresponding to the three-dimensional pedestrian candidate box, the method further comprises: inputting the extracted instance features into a prediction head network; and performing regression and classification of the three-dimensional pedestrian candidate box through the prediction head network to obtain an optimized three-dimensional pedestrian detection box and a pedestrian detection confidence.
  7. The method of claim 1, further comprising: computing an instance contrastive learning loss based on the instance features, wherein the instance contrastive learning loss constrains feature representations of the same pedestrian instance in different time frames to be close to each other in feature space, and feature representations of different pedestrian instances to be far from each other in feature space; and optimizing the projection of the multi-scale image features into bird's-eye-view (BEV) space based on the instance contrastive learning loss.
  8. A pedestrian target detection apparatus, the apparatus comprising: an acquisition module configured to acquire a continuous image sequence of preset length captured by a multi-view camera and to extract multi-scale image features from each frame; a projection module configured to project the multi-scale image features into bird's-eye-view (BEV) space to generate a first temporal BEV feature sequence; a generation module configured to generate a second temporal BEV feature sequence based on the first temporal BEV feature sequence and instance feature feedback information, wherein the instance feature feedback information describes the features of a pedestrian instance; a fusion module configured to perform temporal fusion on the second temporal BEV feature sequence to obtain a first fused BEV feature; and a detection module configured to generate a three-dimensional pedestrian candidate box from the first fused BEV feature and to extract instance features corresponding to the three-dimensional pedestrian candidate box.
  9. An electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 7.
  10. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of any one of claims 1 to 7.
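The attention-guided fusion of claim 2 can be sketched in NumPy as follows. This is a minimal illustration, not the patented implementation: the 1x1-convolution scoring head (`w`, `b`), the residual fusion rule `F * (1 + A)`, and all tensor shapes are assumptions chosen for clarity; the claim only specifies that an instance attention map is derived from the feedback features and fused with the first temporal BEV feature sequence.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def instance_attention_fuse(bev_feat, feedback_feat, w, b):
    """Sketch of claim 2: fuse a BEV feature map with instance feature feedback.

    bev_feat:      (C, H, W) one frame of the first temporal BEV feature sequence
    feedback_feat: (C, H, W) instance feature feedback (initialized to bev_feat
                   on the first pass, as claim 2 describes)
    w, b:          hypothetical 1x1-conv scoring parameters, shapes (C,) and scalar
    """
    # Instance attention map: per-cell probability of a pedestrian instance.
    logits = np.tensordot(w, feedback_feat, axes=([0], [0])) + b  # (H, W)
    attn = sigmoid(logits)
    # Residual fusion: keep background features, amplify cells the
    # attention map marks as likely pedestrian instances.
    return bev_feat * (1.0 + attn[None, :, :])

C, H, W = 8, 4, 4
rng = np.random.default_rng(0)
bev = rng.standard_normal((C, H, W))
w = rng.standard_normal(C)
fused = instance_attention_fuse(bev, bev, w, 0.0)  # first pass: feedback = BEV
```

Because the attention values lie in (0, 1), the residual rule scales each cell by a factor between 1 and 2, so no feature is ever suppressed to zero; a learned fusion could of course behave differently.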

Description

Pedestrian target detection method, device, electronic equipment and medium

Technical Field

The embodiments of the application relate to the field of intelligent robots, and in particular to a pedestrian target detection method, apparatus, electronic device, and medium.

Background

When an automobile or an intelligent agent performs automated driving, the positions, poses, and other attributes of surrounding pedestrians must be detected from multi-modal sensor data such as lidar and cameras. In the related art, image or point-cloud features are projected into BEV (Bird's Eye View) space to construct a global feature map from which pedestrians are detected. This approach is prone to instance feature confusion in pedestrian-dense scenes and has low detection accuracy.

Disclosure of Invention

In view of the foregoing, embodiments of the present application aim to provide a pedestrian target detection method, apparatus, electronic device, and medium that overcome, or at least partially solve, the foregoing problems.
To achieve the above purpose, the following technical scheme is adopted. In a first aspect, an embodiment of the present application discloses a pedestrian target detection method, the method including: acquiring a continuous image sequence of preset length captured by a multi-view camera, and extracting multi-scale image features from each frame; projecting the multi-scale image features into bird's-eye-view (BEV) space to generate a first temporal BEV feature sequence; generating a second temporal BEV feature sequence based on the first temporal BEV feature sequence and instance feature feedback information, wherein the instance feature feedback information describes the features of a pedestrian instance; performing temporal fusion on the second temporal BEV feature sequence to obtain a first fused BEV feature; and generating a three-dimensional pedestrian candidate box from the first fused BEV feature and extracting instance features corresponding to the candidate box.
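The temporal-fusion step of the first aspect (collapsing the second temporal BEV feature sequence into one first fused BEV feature) can be sketched as below. The patent does not specify the fusion operator, so a recency-weighted running average stands in here as one plausible choice; a real model would more likely use a learned recurrent or attention module, and the `decay` parameter is purely illustrative.

```python
import numpy as np

def temporal_fuse(bev_sequence, decay=0.5):
    """Sketch of the temporal-fusion step of the first aspect.

    bev_sequence: (T, C, H, W) ego-aligned BEV features, oldest frame first.
    decay:        hypothetical weight controlling how quickly older frames fade.
    Returns the single fused BEV feature of shape (C, H, W).
    """
    fused = bev_sequence[0]
    for feat in bev_sequence[1:]:
        # Exponential moving average: recent frames dominate the fusion.
        fused = decay * fused + (1.0 - decay) * feat
    return fused
```

With a constant input sequence the running average is a fixed point, which gives a quick sanity check that the fusion preserves a static scene.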
In a second aspect, an embodiment of the present application discloses a pedestrian target detection apparatus, including: an acquisition module configured to acquire a continuous image sequence of preset length captured by a multi-view camera and to extract multi-scale image features from each frame; a projection module configured to project the multi-scale image features into BEV space to generate a first temporal BEV feature sequence; a generation module configured to generate a second temporal BEV feature sequence based on the first temporal BEV feature sequence and instance feature feedback information, wherein the instance feature feedback information describes the features of a pedestrian instance; a fusion module configured to perform temporal fusion on the second temporal BEV feature sequence to obtain a first fused BEV feature; and a detection module configured to generate a three-dimensional pedestrian candidate box from the first fused BEV feature and to extract instance features corresponding to the candidate box. In a third aspect, an embodiment of the present application discloses an electronic device including a processor and a memory, the memory storing a program or instructions executable on the processor, the program or instructions implementing the steps of the method according to the first aspect when executed by the processor. In a fourth aspect, embodiments of the present application disclose a readable storage medium having stored thereon a program or instructions which, when executed by a processor, implement the steps of the method according to the first aspect.
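The detection module's instance-feature extraction, detailed in claim 5 (reverse-projecting the candidate-box center onto the BEV grid, offsetting a set of sampling points, and gathering features there), can be sketched as below. This is a simplified illustration: nearest-neighbor lookup replaces the bilinear interpolation a real model would likely use, the offsets are taken as given rather than predicted from the box and attention map, and the BEV metric range is an assumed parameter.

```python
import numpy as np

def sample_instance_features(bev_feat, center_xy, offsets, bev_range=(0.0, 4.0)):
    """Sketch of claim 5: gather instance features at offset sampling points.

    bev_feat:  (C, H, W) first fused BEV feature map
    center_xy: (2,) candidate-box center in meters (ego frame, z dropped)
    offsets:   (K, 2) per-sampling-point position offsets in meters
    bev_range: (lo, hi) metric extent covered by the BEV grid (assumed square)
    """
    C, H, W = bev_feat.shape
    lo, hi = bev_range
    # Target sampling positions = reference point + predicted offsets.
    pts = center_xy[None, :] + offsets                       # (K, 2)
    # Reverse projection: meters -> grid indices on the BEV feature map.
    ij = (pts - lo) / (hi - lo) * np.array([H, W])
    ij = np.clip(ij.astype(int), 0, np.array([H - 1, W - 1]))
    feats = bev_feat[:, ij[:, 0], ij[:, 1]]                  # (C, K)
    # Pool the K sampled features into one instance feature vector.
    return feats.mean(axis=1)
```

Pooling by mean is again an assumption; the extracted vector would then feed the prediction head network of claim 6.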
In the embodiments of the application, a continuous image sequence of preset length captured by a multi-view camera is acquired, and multi-scale image features are extracted from each frame. The multi-scale image features are projected into BEV space to generate a first temporal BEV feature sequence, and a second temporal BEV feature sequence is generated based on the first temporal BEV feature sequence and instance feature feedback information, where the instance feature feedback information describes the features of a pedestrian instance. The second temporal BEV feature sequence is temporally fused to obtain a first fused BEV feature, a three-dimensional pedestrian candidate box is generated from the first fused BEV feature, and instance features corresponding to the candidate box are extracted. The instance feature feedback information guides the model to focus on pedestrian regions, suppresses background-noise interference, improves the accuracy of the second temporal BEV feature sequence, and thereby improves the accuracy of subsequent pedestrian localization.
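The instance contrastive learning loss of claim 7 (same pedestrian instance across time frames pulled together, different instances pushed apart) can be sketched with an InfoNCE-style formulation. The patent does not name a specific loss, so InfoNCE is used here as one plausible instantiation; the pairing convention (row i of each batch is the same instance in two different frames) and the temperature value are assumptions.

```python
import numpy as np

def instance_contrastive_loss(feats_a, feats_b, temperature=0.1):
    """InfoNCE-style sketch of claim 7's instance contrastive learning loss.

    feats_a, feats_b: (N, D) instance features from two different time frames;
                      row i of each is the same pedestrian instance (positive
                      pair), all cross pairings are negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                   # (N, N) similarity matrix
    # Cross-entropy with the diagonal (same-instance pairs) as the target.
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

When the two views of each instance match exactly and distinct instances are orthogonal, the loss is near zero; mismatched pairings drive it up, which is the gradient signal claim 7 uses to optimize the BEV projection.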