CN-121982743-A - Human body posture serial detection method, device, equipment and medium for time sequence ROI multiplexing


Abstract

The invention provides a human body posture serial detection method, device, equipment and medium for time sequence ROI multiplexing. Full-image target detection is performed on key frames of an input video sequence: a lightweight target detection algorithm outputs human body bounding box coordinates, and redundant boxes whose confidence is lower than a preset threshold are filtered out. Local posture estimation is then performed on each human body bounding box region, outputting a set of 35 key joint point coordinates with their confidences, and low-confidence joint points whose confidence is lower than a preset threshold are filtered out. Adjacent non-key frames multiplex the extended ROI region of the previous frame for local posture estimation; the extended ROI region is obtained by proportionally expanding the minimum circumscribed rectangle of the valid joint points of the key frame. The video frame sequence is processed sequentially according to a preset key frame interval N, and the current frame is upgraded to a key frame whenever a full-image detection trigger condition is met. The processing speed of human body target detection and posture estimation in a video stream is thereby greatly improved with essentially no loss of precision.
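The key-frame upgrade decision can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the patented implementation: the thresholds (fewer than 5 valid joint points, ROI center offset above 15% of the image width or height, mean joint confidence below 0.4) are the ones listed in claims 4 and 8, while the function name `needs_full_detection`, its parameters, and the reading of "reaching the interval N" as `frames_since_key >= interval_n` are hypothetical choices made for illustration.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def roi_center(box: Box) -> Tuple[float, float]:
    """Return the center point of an ROI."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def needs_full_detection(
    frames_since_key: int,        # assumed counter: frames processed since the last key frame
    interval_n: int,              # preset key frame interval N
    valid_conf: Sequence[float],  # confidences of joints that passed the joint threshold
    prev_roi: Box,
    cur_roi: Box,
    img_w: int,
    img_h: int,
) -> bool:
    """Return True if the current frame should be upgraded to a key frame."""
    # (a) the preset interval N has been reached (assumed interpretation)
    if frames_since_key >= interval_n:
        return True
    # (b) fewer than 5 valid joint points on the non-key frame
    if len(valid_conf) < 5:
        return True
    # (c) ROI center moved by more than 15% of the image width or height
    (pcx, pcy), (ccx, ccy) = roi_center(prev_roi), roi_center(cur_roi)
    if abs(ccx - pcx) > 0.15 * img_w or abs(ccy - pcy) > 0.15 * img_h:
        return True
    # (d) mean joint confidence below 0.4 (at least 5 joints remain here)
    if sum(valid_conf) / len(valid_conf) < 0.4:
        return True
    return False
```

Any single condition is enough to trigger full-image detection, so the checks are ordered from cheapest to most expensive and return early.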

Inventors

ZHANG DENGPAN
ZHANG ZHUMING

Assignees

恒鸿达(福建)体育科技有限公司

Dates

Publication Date: 20260505
Application Date: 20251217

Claims (10)

1. A human body posture serial detection method for time sequence ROI multiplexing, characterized by comprising the following steps: Step 1, performing full-image target detection on key frames of an input video sequence, outputting human body bounding box coordinates (x1, y1, x2, y2) by adopting a lightweight target detection algorithm, filtering out redundant boxes whose confidence is lower than a preset threshold, then performing local posture estimation on the human body bounding box regions, outputting a set of 35 key joint point coordinates and their confidences, and filtering out low-reliability joint points whose confidence is lower than a preset joint point threshold; Step 2, performing local posture estimation on adjacent non-key frames by multiplexing the extended ROI region of the previous frame, the extended ROI region being obtained by equal-proportion expansion of the minimum circumscribed rectangle of the valid joint points of the key frame; Step 3, sequentially processing the video frame sequence according to a preset key frame interval N, and, when a full-image detection trigger condition is met, upgrading the current frame to a key frame and re-executing Step 1.
2. The human body posture serial detection method for time sequence ROI multiplexing according to claim 1, wherein the full-image target detection in Step 1 specifically comprises: preprocessing the key frame image according to the input size requirement of the target detection model; performing inference with a lightweight algorithm of the YOLOv-nano or SSD-MobileNetV class, or with a high-precision algorithm of the YOLOv-large or Fast R-CNN class; and outputting one or more human body bounding boxes in pixel coordinate format, thereby supporting multi-person scene detection; and the local posture estimation in Step 1 specifically comprises: cropping and scaling the human body bounding box region to the fixed input size of the posture estimation model, and applying pixel normalization and mean-variance standardization; performing inference with a Top-Down algorithm, including HRNet, LitePose or BlazePose; and outputting normalized coordinates or pixel coordinates of the 35 key joint points defined by the COCO data set, together with their confidences.
3. The human body posture serial detection method for time sequence ROI multiplexing according to claim 1, wherein the extended ROI region in Step 2 is specifically obtained as follows: based on the coordinate set of the valid joint points of the key frame, calculating the minimum value x_min and the maximum value x_max of the x coordinate, and the minimum value y_min and the maximum value y_max of the y coordinate, to generate the minimum circumscribed rectangle; setting expansion coefficients according to the motion amplitude classification, including a horizontal coefficient s_w and a vertical coefficient s_h, and calculating the expanded ROI coordinates:
x1' = max(0, x_min - (s_w - 1) × W / 2)
y1' = max(0, y_min - (s_h - 1) × H / 2)
x2' = min(img_w, x_max + (s_w - 1) × W / 2)
y2' = min(img_h, y_max + (s_h - 1) × H / 2)
wherein W = x_max - x_min, H = y_max - y_min, and img_w and img_h are the image width and height, respectively; the non-key-frame ROI multiplexing in Step 2 further includes preprocessing enhancement, specifically comprising: performing edge mirror padding or constant padding on the multiplexed ROI region; scaling it to the input size of the posture estimation model by bilinear interpolation; and performing Laplacian sharpening when the ROI center offset between adjacent frames exceeds 30 pixels.
4. The human body posture serial detection method for time sequence ROI multiplexing according to claim 1, wherein the key frame interval N and the trigger condition in Step 3 are specifically as follows: the full-image detection trigger condition includes any one of the following: a) the preset interval N is reached; b) the number of valid joint points of the non-key frame is less than 5; c) the ROI center offset between two consecutive frames exceeds 15% of the image width/height; d) the mean joint point confidence is lower than 0.4; after triggering, full-image detection is re-executed and the ROI cache data is overwritten, and independent ROIs are processed in parallel in multi-person scenes; a single-frame cache mechanism caches the extended ROI coordinates of the previous frame, the set of 35 joint point coordinates and their confidences, with a cache period of 1 frame.
5. A human body posture serial detection device for time sequence ROI multiplexing, characterized by comprising: a key frame setting unit, configured to perform full-image target detection on key frames of an input video sequence, output human body bounding box coordinates (x1, y1, x2, y2) by adopting a lightweight target detection algorithm, filter out redundant boxes whose confidence is lower than a preset threshold, then perform local posture estimation on the human body bounding box regions, output a set of 35 key joint point coordinates and their confidences, and filter out low-reliability joint points whose confidence is lower than a preset joint point threshold; a non-key frame setting unit, configured to perform local posture estimation on adjacent non-key frames by multiplexing the extended ROI region of the previous frame, the extended ROI region being obtained by equal-proportion expansion of the minimum circumscribed rectangle of the valid joint points of the key frame; and a detection processing unit, configured to sequentially process the video frame sequence according to a preset key frame interval N and, when a full-image detection trigger condition is met, upgrade the current frame to a key frame and re-invoke the key frame setting unit.
6. The human body posture serial detection device for time sequence ROI multiplexing according to claim 5, wherein the full-image target detection in the key frame setting unit specifically comprises: preprocessing the key frame image according to the input size requirement of the target detection model; performing inference with a lightweight algorithm of the YOLOv-nano or SSD-MobileNetV class, or with a high-precision algorithm of the YOLOv-large or Fast R-CNN class; and outputting one or more human body bounding boxes in pixel coordinate format, thereby supporting multi-person scene detection; and the local posture estimation in the key frame setting unit specifically comprises: cropping and scaling the human body bounding box region to the fixed input size of the posture estimation model, and applying pixel normalization and mean-variance standardization; performing inference with a Top-Down algorithm, including HRNet, LitePose or BlazePose; and outputting normalized coordinates or pixel coordinates of the 35 key joint points defined by the COCO data set, together with their confidences.
7. The human body posture serial detection device for time sequence ROI multiplexing according to claim 5, wherein the extended ROI region in the non-key frame setting unit is specifically obtained as follows: based on the coordinate set of the valid joint points of the key frame, calculating the minimum value x_min and the maximum value x_max of the x coordinate, and the minimum value y_min and the maximum value y_max of the y coordinate, to generate the minimum circumscribed rectangle; setting expansion coefficients according to the motion amplitude classification, including a horizontal coefficient s_w and a vertical coefficient s_h, and calculating the expanded ROI coordinates:
x1' = max(0, x_min - (s_w - 1) × W / 2)
y1' = max(0, y_min - (s_h - 1) × H / 2)
x2' = min(img_w, x_max + (s_w - 1) × W / 2)
y2' = min(img_h, y_max + (s_h - 1) × H / 2)
wherein W = x_max - x_min, H = y_max - y_min, and img_w and img_h are the image width and height, respectively; the non-key-frame ROI multiplexing in the non-key frame setting unit further includes preprocessing enhancement, specifically comprising: performing edge mirror padding or constant padding on the multiplexed ROI region; scaling it to the input size of the posture estimation model by bilinear interpolation; and performing Laplacian sharpening when the ROI center offset between adjacent frames exceeds 30 pixels.
8. The human body posture serial detection device for time sequence ROI multiplexing according to claim 5, wherein the key frame interval N and the trigger condition in the detection processing unit are specifically as follows: the full-image detection trigger condition includes any one of the following: a) the preset interval N is reached; b) the number of valid joint points of the non-key frame is less than 5; c) the ROI center offset between two consecutive frames exceeds 15% of the image width/height; d) the mean joint point confidence is lower than 0.4; after triggering, full-image detection is re-executed and the ROI cache data is overwritten, and independent ROIs are processed in parallel in multi-person scenes; a single-frame cache mechanism caches the extended ROI coordinates of the previous frame, the set of 35 joint point coordinates and their confidences, with a cache period of 1 frame.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 4 when executing the program.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1 to 4.
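The extended ROI computation recited in claims 3 and 7 can be illustrated with a short Python sketch. It follows the claimed formulas directly; the function name `expand_roi`, the default expansion coefficients, and the joint confidence threshold value are hypothetical, since the claims only state that the coefficients depend on the motion amplitude classification and that the joint threshold is preset.

```python
from typing import Optional, Sequence, Tuple

Joint = Tuple[float, float, float]       # (x, y, confidence) in pixel coordinates
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def expand_roi(
    joints: Sequence[Joint],
    img_w: float,
    img_h: float,
    s_w: float = 1.2,           # assumed horizontal expansion coefficient
    s_h: float = 1.2,           # assumed vertical expansion coefficient
    joint_thresh: float = 0.3,  # assumed joint confidence threshold
) -> Optional[Box]:
    """Expand the minimum circumscribed rectangle of the valid joints."""
    # Keep only joints whose confidence reaches the threshold.
    valid = [(x, y) for x, y, c in joints if c >= joint_thresh]
    if not valid:
        return None  # nothing to track; the caller would fall back to full detection

    xs = [p[0] for p in valid]
    ys = [p[1] for p in valid]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    # W and H are the width and height of the minimum circumscribed rectangle.
    w = x_max - x_min
    h = y_max - y_min
    dx = (s_w - 1.0) * w / 2.0
    dy = (s_h - 1.0) * h / 2.0

    # Clamp the expanded rectangle to the image bounds.
    return (
        max(0.0, x_min - dx),
        max(0.0, y_min - dy),
        min(float(img_w), x_max + dx),
        min(float(img_h), y_max + dy),
    )
```

For example, with valid joints at (100, 100) and (200, 300) and both coefficients set to 1.5, the rectangle is 100 wide and 200 high, so it grows by 25 pixels on the left and right and by 50 pixels at the top and bottom, giving (75, 50, 225, 350).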

Description

Human body posture serial detection method, device, equipment and medium for time sequence ROI multiplexing

Technical Field

The invention relates to the technical field of computer vision, in particular to a human body posture serial detection method, device, equipment and medium for time sequence ROI multiplexing.

Background

With the development of deep learning, video-based applications such as human motion analysis, human-computer interaction and security monitoring are increasingly popular. These applications typically rely on two core tasks: target detection (locating the human body in the image) and posture estimation (identifying the key joint points of the human body to form a skeleton map).

In the prior art, real-time analysis of human body posture is generally realized in one of the following two ways:

(1) Serial pipeline: a target detection model is run on every frame of the video to obtain human body bounding boxes; the region inside each bounding box is then cropped, scaled and input into a posture estimation model to detect key points. The logic of this approach is clear, but every frame requires two model inferences, so the computational cost is huge and the real-time processing capability of embedded devices or mobile terminals is severely restricted.

(2) Unified network: the target detection and posture estimation tasks are integrated into a single neural network. Although the cost of data transmission is reduced, the network structure is usually very complex, the number of model parameters is large, training is difficult, flexibility is poor, and optimization for specific tasks is difficult.

Defects of the prior art:

Computational redundancy: in a serial pipeline, consecutive video frames are highly correlated in space and time, and the position and posture of the human body usually do not change drastically between adjacent frames. Performing complete target detection and posture estimation independently on every frame therefore involves a large amount of repeated computation and wastes valuable computing power.

Poor real-time performance: because of this redundancy, the average processing time per frame is long, so real-time or faster-than-real-time processing is difficult to achieve for high-frame-rate video streams (such as 30 FPS or more), and the low-latency requirements of interactive applications cannot be met.

High resource demands: intensive computation places high demands on the CPU, GPU and memory of the device, which increases power consumption and limits deployment on battery-powered mobile devices or edge devices with limited computing power.

Therefore, there is an urgent need in the art for an efficient method that exploits the temporal correlation of video, significantly reduces computational overhead, and still ensures posture estimation accuracy.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a human body posture serial detection method, device, equipment and medium for time sequence ROI multiplexing, which can greatly improve the processing speed of human body target detection and posture estimation in a video stream with essentially no loss of precision.
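As a rough illustration of where the saving would come from, the sketch below counts model invocations for a per-frame serial pipeline and for a key-frame scheme. It is a simplified Python model under explicit assumptions that are not stated in the source: a single person per frame, one key frame every `interval_n` frames starting with the first frame, and no early triggers. The function name `inference_counts` is hypothetical, and the counts say nothing about the relative cost of the detector and the posture model.

```python
import math
from typing import Dict, Tuple


def inference_counts(num_frames: int, interval_n: int) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Count detector and posture model calls for both schemes."""
    # Serial pipeline: detection and posture estimation both run on every frame.
    serial = {"detector": num_frames, "pose": num_frames}

    # Key-frame scheme (assumed): detection runs only on key frames,
    # posture estimation still runs on every frame.
    key_frames = math.ceil(num_frames / interval_n)
    multiplexed = {"detector": key_frames, "pose": num_frames}
    return serial, multiplexed
```

Under these assumptions, 300 frames with an interval of 10 would need 300 detector calls in the serial pipeline but only 30 with key frames, while the number of posture model calls stays at 300 in both cases. Early triggers would raise the detector count in practice.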
In a first aspect, the invention provides a human body posture serial detection method for time sequence ROI multiplexing, comprising the following steps: Step 1, performing full-image target detection on key frames of an input video sequence, outputting human body bounding box coordinates (x1, y1, x2, y2) by adopting a lightweight target detection algorithm, filtering out redundant boxes whose confidence is lower than a preset threshold, then performing local posture estimation on the human body bounding box regions, outputting a set of 35 key joint point coordinates and their confidences, and filtering out low-reliability joint points whose confidence is lower than a preset joint point threshold; Step 2, performing local posture estimation on adjacent non-key frames by multiplexing the extended ROI region of the previous frame, the extended ROI region being obtained by equal-proportion expansion of the minimum circumscribed rectangle of the valid joint points of the key frame; Step 3, sequentially processing the video frame sequence according to a preset key frame interval N, and, when a full-image detection trigger condition is met, upgrading the current frame to a key frame and re-executing Step 1.

In a second aspect, the invention provides a human body posture serial detection device for time sequence ROI multiplexing, comprising: a key frame setting unit, configured to perform full-image target detection on key frames of an input video sequence, outputting human body bounding box coordinates (x1, y1, x2, y2) by a