CN-115546297-B - Monocular ranging method, monocular ranging device, electronic equipment and storage medium


Abstract

The application provides a monocular ranging method, a monocular ranging device, electronic equipment and a storage medium. The method comprises: obtaining a video stream shot by a monocular camera, performing human head detection on an image frame in the video stream, and generating a human head detection frame representing the position of the human head in the image frame; and, according to the coordinate information of the human head detection frame in the world coordinate system and the internal reference matrix and distortion parameter matrix of the monocular camera, performing pose estimation on the human head in the image frame with a pose estimation module to obtain a translation matrix, and taking the Z-axis value in the translation matrix as the distance of the monocular camera relative to the human head. In this method, the coordinates of the detection frame are obtained through human head detection, and the distance between the monocular camera and the human head is obtained through the PnP (Perspective-n-Point) algorithm, which can improve the real-time performance of the robot's human body following, reduce cost and improve ranging precision.

Inventors

FANG QIN
PANG JIANXIN

Assignees

UBTECH Robotics Corp Ltd (深圳市优必选科技股份有限公司)

Dates

Publication Date: 20260512
Application Date: 20220920

Claims (10)

1. A monocular ranging method, comprising: acquiring a video stream shot by a monocular camera, performing human head detection on an image frame in the video stream, and generating a human head detection frame representing the position of the human head in the image frame; acquiring coordinate information of the human head detection frame in a world coordinate system, and acquiring an internal reference matrix and a distortion parameter matrix of the monocular camera; according to the coordinate information of the human head detection frame in the world coordinate system and the internal reference matrix and the distortion parameter matrix of the monocular camera, performing pose estimation on the human head in the image frame by using a pose estimation module to obtain a translation matrix, and taking a Z-axis value in the translation matrix as the distance between the monocular camera and the human head; wherein, after the step of generating the human head detection frame representing the position of the human head in the image frame, the method further comprises: judging whether a missed detection image frame exists in the video stream; if yes, counting the number of consecutively missed image frames in the video stream to obtain a first number, and judging whether the first number is greater than a preset number threshold; if the first number is greater than the preset number threshold, acquiring, from the video stream according to the first number, historical detection image frames in which a human head detection frame has been generated, to generate an image set to be sampled, wherein the number of historical detection image frames contained in the image set to be sampled is consistent with the first number; setting an index value for each historical detection image frame according to the position distance between that historical detection image frame and the first missed detection image frame, wherein the size of the index value is inversely related to the position distance; determining a weight value for each historical detection image frame according to its index value; acquiring a second number of sampling frames from the image set to be sampled according to a preset sampling rule, and correspondingly acquiring, according to the second number of sampling frames, the coordinate information in the world coordinate system of each of the second number of human head detection frames and the corresponding weight values, wherein the preset sampling rule sets the value of the second number and the ratio of the number of fixed-mode samples to the number of random-mode samples, and the second number equals the number of fixed-mode samples plus the number of random-mode samples; and, according to the coordinate information of the second number of human head detection frames in the world coordinate system and their corresponding weight values, multiplying the horizontal coordinate value and the longitudinal coordinate value of each human head detection frame by the corresponding weight value, summing the products respectively, dividing the summed horizontal coordinate value and the summed longitudinal coordinate value each by the second number to obtain the horizontal coordinate value and the longitudinal coordinate value of a compensation detection frame, and generating the corresponding compensation detection frame in the missed detection image frame according to the horizontal coordinate value and the longitudinal coordinate value of the compensation detection frame.
2. The monocular ranging method of claim 1, wherein the step of acquiring, from the video stream according to the first number, historical detection image frames in which a human head detection frame has been generated, to generate the image set to be sampled, comprises: acquiring the position of the first missed detection image frame in the video stream; according to the position, starting from the historical detection image frame adjacent to the first missed detection image frame in the video stream, acquiring a first number of historical detection image frames in order of increasing position distance; and aggregating the first number of historical detection image frames to generate the image set to be sampled.
3. The monocular ranging method according to claim 1 or 2, wherein the step of determining the weight value for each historical detection image frame according to its index value comprises: performing normalized scoring on each historical detection image frame according to its index value to obtain a score for each historical detection image frame; and multiplying the score of each historical detection image frame by the second number to obtain the weight value of that historical detection image frame.
4. The monocular ranging method of claim 1, wherein, after the step of counting the number of consecutively missed image frames in the video stream to obtain a first number and judging whether the first number is greater than a preset number threshold, the method further comprises: if the first number is less than or equal to the preset number threshold, acquiring from the video stream the historical detection image frame adjacent to the first missed detection image frame, and setting the human head detection frame generated in that historical detection image frame as the compensation detection frame of the missed detection image frame.
5. The monocular ranging method of claim 1, wherein, after the step of taking the Z-axis value in the translation matrix as the distance of the monocular camera relative to the human head, the method further comprises: correcting the distance of the monocular camera relative to the human head according to a preset fitting curve that represents the relationship between distance and prediction error.
6. The monocular ranging method of claim 1, wherein, before the step of performing human head detection on the image frame in the video stream and generating the human head detection frame representing the position of the human head in the image frame, the method further comprises: performing image calibration processing on the image frame by using a correction function.
7. The monocular ranging method of claim 1, wherein the step of performing human head detection on the image frame in the video stream comprises: performing human head detection on the image frames in the video stream with a lightweight human head detection model, wherein a shufflenetv network or an esnet network is used as the detection network in the lightweight human head detection model.
8. A monocular ranging device, comprising: a detection module, configured to acquire a video stream shot by a monocular camera, perform human head detection on an image frame in the video stream, and generate a human head detection frame representing the position of the human head in the image frame; an acquisition module, configured to acquire coordinate information of the human head detection frame in a world coordinate system, and an internal reference matrix and a distortion parameter matrix of the monocular camera; and a ranging module, configured to perform pose estimation on the human head in the image frame by using a pose estimation module, according to the coordinate information of the human head detection frame in the world coordinate system and the internal reference matrix and the distortion parameter matrix of the monocular camera, to obtain a translation matrix, and to take a Z-axis value in the translation matrix as the distance between the monocular camera and the human head; wherein the detection module is further configured to, after generating the human head detection frame representing the position of the human head in the image frame: judge whether a missed detection image frame exists in the video stream; if yes, count the number of consecutively missed image frames in the video stream to obtain a first number, and judge whether the first number is greater than a preset number threshold; if the first number is greater than the preset number threshold, acquire, from the video stream according to the first number, historical detection image frames in which a human head detection frame has been generated, to generate an image set to be sampled, wherein the number of historical detection image frames contained in the image set to be sampled is consistent with the first number; set an index value for each historical detection image frame according to the position distance between that historical detection image frame and the first missed detection image frame, wherein the size of the index value is inversely related to the position distance; determine a weight value for each historical detection image frame according to its index value; acquire a second number of sampling frames from the image set to be sampled according to a preset sampling rule, and correspondingly acquire, according to the second number of sampling frames, the coordinate information in the world coordinate system of each of the second number of human head detection frames and the corresponding weight values, wherein the preset sampling rule sets the value of the second number and the ratio of the number of fixed-mode samples to the number of random-mode samples, and the second number equals the number of fixed-mode samples plus the number of random-mode samples; and, according to the coordinate information of the second number of human head detection frames in the world coordinate system and their corresponding weight values, multiply the horizontal coordinate value and the longitudinal coordinate value of each human head detection frame by the corresponding weight value, sum the products respectively, divide the summed horizontal coordinate value and the summed longitudinal coordinate value each by the second number to obtain the horizontal coordinate value and the longitudinal coordinate value of a compensation detection frame, and generate the corresponding compensation detection frame in the missed detection image frame according to the horizontal coordinate value and the longitudinal coordinate value of the compensation detection frame.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
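
The detection frame compensation recited in claims 1, 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The function name compensate_box, the (x1, y1, x2, y2) box layout, the linear index values (n for the nearest frame down to 1 for the farthest), and the reading of "fixed mode" as "take the nearest frames" and "random mode" as "draw the remaining samples at random" are all hypothetical choices. The claims only require that the index value decreases as the position distance grows and that the second number equals the fixed-mode samples plus the random-mode samples.

```python
import random


def compensate_box(history, missed_count, threshold, second_number,
                   fixed_count, rng=None):
    """Return a compensation box (x1, y1, x2, y2) for a missed frame.

    history: boxes from frames with a detection, ordered nearest-first
             relative to the first missed frame (assumed layout).
    missed_count: the "first number" of consecutively missed frames.
    second_number: how many frames to sample from the image set.
    fixed_count: how many of those samples are taken in fixed mode.
    """
    if not history:
        raise ValueError("no historical detection available")

    # Claim 4: for a short gap, reuse the adjacent historical box.
    if missed_count <= threshold:
        return tuple(history[0])

    # Claim 1: the image set to be sampled holds "first number" frames.
    frames = history[:missed_count]
    n = len(frames)
    if second_number < 1 or not 0 <= fixed_count <= second_number <= n:
        raise ValueError("invalid sampling configuration")

    # Index values fall as position distance grows (assumed linear).
    index_values = [n - distance for distance in range(n)]
    total = sum(index_values)

    # Claim 3: normalized score times the second number gives the weight.
    weights = [second_number * value / total for value in index_values]

    # Assumed sampling rule: nearest frames are fixed, the rest are random.
    rng = rng or random.Random()
    picked = list(range(fixed_count))
    picked += rng.sample(range(fixed_count, n), second_number - fixed_count)

    # Weighted sum of each coordinate, divided by the second number.
    return tuple(
        sum(weights[i] * frames[i][c] for i in picked) / second_number
        for c in range(4)
    )
```

Under this literal reading, the scores are normalized over the whole image set while only the sampled frames are summed, so the result equals a true weighted average only when every frame in the set is sampled.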

Description

Monocular ranging method, monocular ranging device, electronic equipment and storage medium

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a monocular ranging method, a monocular ranging device, electronic equipment, and a storage medium.

Background

With the development of artificial intelligence technology, artificial intelligence robots are widely used in various fields. When following a human body, such a robot relies on ranging to drive itself to move or stop. Existing ranging schemes include ranging based on a binocular camera, ranging based on a pose estimation network, and ranging based on a monocular depth estimation network. However, binocular ranging must obtain the distance between the person and the camera with a stereo matching algorithm, which is costly, slow and poor in real-time performance, and ranging can also be inaccurate in dark environments. A pose estimation network increases the computational complexity of the following task, and a high-precision pose estimation network can hardly achieve real-time ranging. A monocular depth estimation network places high demands on the training data set: high-quality depth maps must be acquired in advance so that the network can ultimately predict high-precision depth estimates. The depth estimation network also suffers from low real-time performance and high input cost.

Disclosure of Invention

In view of the above, the embodiments of the application provide a monocular ranging system, a monocular ranging method, a monocular ranging device, monocular ranging equipment and a storage medium, which can improve the real-time performance of the robot's human body following, reduce cost and improve ranging precision.

A first aspect of an embodiment of the present application provides a monocular ranging method, including: acquiring a video stream shot by a monocular camera, performing human head detection on an image frame in the video stream, and generating a human head detection frame representing the position of the human head in the image frame; acquiring coordinate information of the human head detection frame in a world coordinate system, and acquiring an internal reference matrix and a distortion parameter matrix of the monocular camera; and, according to the coordinate information of the human head detection frame in the world coordinate system and the internal reference matrix and the distortion parameter matrix of the monocular camera, performing pose estimation on the human head in the image frame by using a pose estimation module to obtain a translation matrix, and taking a Z-axis value in the translation matrix as the distance between the monocular camera and the human head.

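As a rough illustration of why the Z-axis value of the translation can serve as the distance, the sketch below recovers a translation (X, Y, Z) from a human head detection frame in a simplified special case: the head is treated as a flat rectangle of known physical size facing the camera, and the image is assumed to be already undistorted. It is not the full PnP solve performed by the pose estimation module. The Intrinsics class, the function name translation_from_head_box and the 0.16 m by 0.22 m head size are hypothetical values chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Pinhole entries of the internal reference matrix, in pixels."""
    fx: float
    fy: float
    cx: float
    cy: float


def translation_from_head_box(box, k, head_width_m=0.16, head_height_m=0.22):
    """Estimate the camera-frame translation (X, Y, Z) of a head box.

    box: (x1, y1, x2, y2) in pixels of an undistorted image.
    head_width_m, head_height_m: assumed physical head size (hypothetical).
    """
    x1, y1, x2, y2 = box
    width_px = x2 - x1
    height_px = y2 - y1
    if width_px <= 0 or height_px <= 0:
        raise ValueError("box must have positive width and height")

    # Similar triangles: pixel size = focal length * real size / depth.
    z_from_width = k.fx * head_width_m / width_px
    z_from_height = k.fy * head_height_m / height_px
    z = 0.5 * (z_from_width + z_from_height)

    # Back-project the box center to obtain the lateral offsets.
    u_center = 0.5 * (x1 + x2)
    v_center = 0.5 * (y1 + y2)
    x = (u_center - k.cx) * z / k.fx
    y = (v_center - k.cy) * z / k.fy
    return (x, y, z)
```

With fx = fy = 800 and a principal point of (320, 240), a box of 80 by 110 pixels centered on the principal point yields a translation of approximately (0, 0, 1.6), that is, a distance of about 1.6 m along the Z axis.
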
With reference to the first aspect, in a first possible implementation manner of the first aspect, after the step of acquiring a video stream shot by a monocular camera, performing human head detection on an image frame in the video stream, and generating a human head detection frame representing the position of the human head in the image frame, the method further includes: judging whether a missed detection image frame exists in the video stream; and if so, performing detection frame compensation processing on the missed detection image frame according to a preset compensation rule to obtain a compensation detection frame, and using the compensation detection frame as the human head detection frame representing the position of the human head in the missed detection image frame.

With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the step of performing detection frame compensation processing on the missed detection image frame according to a preset compensation rule to obtain a compensation detection frame includes: counting the number of consecutively missed image frames in the video stream to obtain a first number, and judging whether the first number is greater than a preset number threshold; if the first number is greater than the preset number threshold, acquiring, from the video stream according to the first number, historical detection image frames in which a human head detection frame has been generated, to generate an image set to be sampled, wherein the number of historical detection image frames contained in the image set to be sampled is consistent with the first number; setting an index value for each historical detection image frame according to the position distance between that historical detection image frame and the first missed detection image frame, wherein the size of the index values is