CN-122024335-A - Human body local motion extraction method, system, equipment and storage medium
Abstract
The application belongs to the technical field of motion analysis in computer vision and discloses a method, system, device, and storage medium for extracting local motion of a human body. The method comprises: providing frame images of a video sequence and generating a human body mask; performing superpixel segmentation to generate superpixel regions; performing optical flow estimation on the frame images to generate a full-image dense optical flow field; generating a motion-salient region mask based on the superpixel regions and the dense optical flow field; and intersecting the motion-salient region mask with the human body mask to output a human body local motion mask. Because superpixel segmentation is executed only inside the human body mask region, computation spent on non-target regions is reduced, and using superpixel regions as the computation unit greatly lowers the complexity of optical flow aggregation and motion analysis. By combining the superpixel regions, the dense optical flow field, and the human body mask into a progressive constraint system, and by identifying salient motion regions with an adaptive threshold strategy, the method distinguishes a moving human body from a static background even under complex backgrounds, thereby reducing the false detection rate.
Inventors
- YANG JINXI
- XIE YURUI
- YU HAO
- LI ZHONGCAI
- HU ZIWEN
- CHEN YINXI
- ZHANG JUNCAI
- LEI QUANLANG
Assignees
- Chengdu University of Information Technology (成都信息工程大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260415
Claims (10)
- 1. A method for extracting local motion of a human body, characterized by comprising the following steps: providing a frame image of a video sequence, detecting a human body region in the frame image, and generating a human body mask; performing superpixel segmentation in the effective area defined by the human body mask to generate superpixel regions; performing optical flow estimation on the frame images of the video sequence to generate a full-image dense optical flow field; generating a motion-salient region mask based on the superpixel regions and the full-image dense optical flow field; and intersecting the motion-salient region mask with the human body mask to output a human body local motion mask.
- 2. The method for extracting local motion of a human body according to claim 1, wherein a frame image of a video sequence is provided, a human body detection box is obtained with an object detection model, human body joint points are extracted with a pre-trained pose detection model, and the human body mask is generated by fusing the human body detection box with the human body joint points.
- 3. The method of claim 1, wherein generating the superpixel regions comprises: uniformly initializing cluster centers within the effective area defined by the human body mask according to a preset number of superpixels; for each pixel in the effective area, constructing a 5-dimensional feature vector, computing the weighted feature distance between the pixel and each cluster center within that center's local search neighborhood, and assigning the pixel to its nearest cluster center, so that each cluster center obtains a set of corresponding pixels; taking the mean of the spatial coordinates of the pixels assigned to each cluster center as the new position of that center; and repeating the previous two steps until the displacement of every cluster center falls below a preset threshold or the maximum number of iterations is reached, yielding the superpixel regions.
- 4. The method for extracting local motion of a human body according to claim 3, wherein the 5-dimensional feature vector comprises the CIELAB color-space components $(l, a, b)$ and the normalized spatial coordinates $(x, y)$; the weighted feature distance between pixel $i$ and cluster center $k$ is calculated as $D_{i,k} = \sqrt{(l_i - l_k)^2 + (a_i - a_k)^2 + (b_i - b_k)^2 + \frac{m^2}{S^2}\left[(x_i - x_k)^2 + (y_i - y_k)^2\right]}$, where $D_{i,k}$ is the weighted feature distance between the $i$-th pixel and the $k$-th cluster center; $l_i$, $a_i$, $b_i$ are the luminance, red-green chrominance, and yellow-blue chrominance components of the $i$-th pixel in the CIELAB color space; $l_k$, $a_k$, $b_k$ are the corresponding color components of the $k$-th cluster center; $(x_i, y_i)$ are the spatial coordinates of the $i$-th pixel in the original image; $(x_k, y_k)$ are the initial grid position coordinates of the $k$-th cluster center; $S$ is the superpixel scale parameter, defined as the approximate side length of a desired superpixel region and computed as $S = \sqrt{WH/K}$, where $W$ and $H$ are the width and height of the image and $K$ is the preset number of superpixels; and $m$ is the compactness adjustment parameter.
- 5. The method of claim 1, wherein generating the full-image dense optical flow field comprises: acquiring two temporally adjacent color frames $I_t$ and $I_{t+1}$ of the video sequence; converting the two color frames into grayscale images to obtain the corresponding gray frames $G_t$ and $G_{t+1}$; performing quadratic polynomial expansion on $G_t$ and $G_{t+1}$ respectively to obtain the corresponding polynomial coefficients; establishing, under the brightness-constancy assumption, an optical flow constraint equation derived from those polynomial coefficients; and building a multi-scale representation of the frame images with a Gaussian pyramid, then repeating the previous two steps from the coarsest scale to the finest scale until all scales have been traversed, to obtain the displacement vector of every pixel of the frame image, i.e., the full-image dense optical flow field.
- 6. The method of claim 1, wherein generating the motion-salient region mask comprises: aggregating the two-dimensional displacement vector of each pixel in the full-image dense optical flow field into its superpixel region and computing the average motion magnitude of each superpixel region; screening motion-salient regions with an adaptive threshold; spatially constraining the motion-salient regions with the superpixel region boundaries and smoothing those boundaries with a boundary optimization algorithm; and closing holes in the motion-salient regions and eliminating isolated noise points by morphological operations, to generate the motion-salient region mask.
- 7. The method according to claim 6, wherein screening motion-salient regions with the adaptive threshold comprises: computing the average motion magnitude of every superpixel region within the human body region; deriving a motion threshold from a preset percentile of those average magnitudes and a preset minimum magnitude; and determining every superpixel region whose average motion magnitude is not lower than the motion threshold to be a motion-salient region.
- 8. A system for extracting local motion of a human body, characterized by comprising: a human body mask generation module for providing frame images of a video sequence, detecting human body regions in the frame images, and generating human body masks; a superpixel region generation module for performing superpixel segmentation in the effective area defined by the human body mask to generate superpixel regions; an optical flow estimation module for performing optical flow estimation on the frame images to generate a full-image dense optical flow field; a motion-salient region mask generation module for generating a motion-salient region mask based on the superpixel regions and the full-image dense optical flow field; and a motion output module for intersecting the motion-salient region mask with the human body mask and outputting the human body local motion mask.
- 9. A computer device, characterized in that it comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1-7.
- 10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1-7.
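As an illustration of the adaptive-threshold screening described in claims 6 and 7, the sketch below aggregates per-pixel flow magnitudes into superpixel means and thresholds them. The function name, percentile, and minimum-magnitude defaults are illustrative; the claims leave the preset values unspecified.

```python
import numpy as np

def motion_salient_labels(flow, labels, percentile=80.0, min_mag=0.5):
    """Screen motion-salient superpixels (claims 6-7, illustrative parameters).
    flow: (H, W, 2) dense optical flow field; labels: (H, W) superpixel ids,
    with -1 marking pixels outside the human body mask."""
    mag = np.linalg.norm(flow, axis=2)                 # per-pixel motion magnitude
    ids = np.unique(labels[labels >= 0])
    means = np.array([mag[labels == i].mean() for i in ids])
    # adaptive threshold: percentile of the per-superpixel means, floored at min_mag
    thresh = max(np.percentile(means, percentile), min_mag)
    return set(ids[means >= thresh].tolist())

# toy example: four 2-pixel superpixels, only region 1 moves
labels = np.array([[0, 0, 1, 1],
                   [2, 2, 3, 3]])
flow = np.zeros((2, 4, 2))
flow[0, 2:, 0] = 2.0                                   # horizontal motion in region 1
salient = motion_salient_labels(flow, labels)
```

Flooring the percentile-based threshold at a fixed minimum prevents nearly static frames, where all superpixel means are tiny, from being declared salient.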
Description
Human body local motion extraction method, system, equipment and storage medium
Technical Field
The invention belongs to the technical field of motion analysis in computer vision, and particularly relates to a method, system, device, and storage medium for extracting local motion of a human body.
Background
Human body local motion extraction is an important research direction in computer vision and is widely applied in scenarios such as intelligent surveillance, human-computer interaction, and behavior recognition. Traditional methods are mainly based on optical flow estimation, such as the global variational model of Horn-Schunck and the local weighted fitting of Lucas-Kanade, which construct a motion field from the brightness-constancy assumption and spatial smoothness constraints. However, under complex background interference, the pixel-by-pixel dense computation of traditional optical flow methods struggles to meet performance requirements, and the global smoothness assumption tends to blur motion boundaries, limiting the accuracy with which human dynamic information can be extracted. In recent years, deep learning has been applied to optical flow estimation; models such as FlowNet2.0 and PWC-Net improve estimation accuracy but still suffer from boundary blurring and background interference in complex scenes. Superpixel segmentation can preserve object boundaries but cannot by itself distinguish moving regions from stationary ones. How to accurately extract human body local motion under a complex background therefore remains an open problem.
Disclosure of Invention
The invention aims to provide a method for extracting local motion of a human body, so as to solve the problem of accurately extracting such motion under a complex background.
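The brightness-constancy assumption underlying the optical flow methods mentioned above can be made concrete with a much-reduced sketch: estimating a single global translation $(u, v)$ by least squares over the linearized constraint $I_x u + I_y v + I_t = 0$. This omits the smoothness term, polynomial expansion, and pyramid of the real methods and is purely illustrative.

```python
import numpy as np

def global_flow(f1, f2):
    """Least-squares global translation (u, v) from the linearized
    brightness-constancy constraint Ix*u + Iy*v + It = 0."""
    Ix = np.gradient(f1, axis=1)        # horizontal intensity gradient
    Iy = np.gradient(f1, axis=0)        # vertical intensity gradient
    It = f2 - f1                        # temporal difference
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    sol, *_ = np.linalg.lstsq(A, -It.ravel(), rcond=None)
    return sol                          # (u, v)

# smooth test pattern shifted right by one pixel
yy, xx = np.mgrid[0:64, 0:64]
f1 = np.exp(-((xx - 32.0) ** 2 + (yy - 32.0) ** 2) / 50.0)
f2 = np.roll(f1, 1, axis=1)
u, v = global_flow(f1, f2)
```

With a smooth pattern and a small displacement the linearization is accurate and the estimate comes out close to (1, 0); for larger motions, real methods recover accuracy with the coarse-to-fine pyramid described in claim 5.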
The embodiment of the application is realized as follows. The method for extracting local motion of a human body comprises the following steps:
S01, providing a frame image of a video sequence, detecting a human body region in the frame image, and generating a human body mask;
S02, performing superpixel segmentation in the effective area defined by the human body mask to generate superpixel regions;
S03, performing optical flow estimation on the frame images of the video sequence to generate a full-image dense optical flow field;
S04, generating a motion-salient region mask based on the superpixel regions and the full-image dense optical flow field;
S05, intersecting the motion-salient region mask with the human body mask and outputting the human body local motion mask.
In some embodiments, frame images of a video sequence are provided, a human body detection box is obtained with an object detection model, human body joint points are extracted with a pre-trained pose detection model, and the human body mask is generated by fusing the detection box with the joint points.
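The pipeline ends (step S05) with an element-wise intersection of the motion-salient mask and the human body mask; a minimal NumPy sketch, with a hypothetical function name:

```python
import numpy as np

def extract_local_motion_mask(motion_salient_mask: np.ndarray,
                              human_mask: np.ndarray) -> np.ndarray:
    """Intersect the motion-salient mask with the human body mask (step S05)."""
    return np.logical_and(motion_salient_mask, human_mask)

# toy 2x3 masks: only pixels that are both human and moving survive
human = np.array([[1, 1, 0], [1, 1, 0]], dtype=bool)
motion = np.array([[0, 1, 1], [1, 0, 1]], dtype=bool)
local = extract_local_motion_mask(motion, human)
```

The intersection suppresses background motion (salient but non-human) and static body parts (human but not salient) in one operation.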
In some embodiments, generating the superpixel regions comprises:
S021, uniformly initializing cluster centers within the effective area defined by the human body mask according to a preset number of superpixels;
S022, constructing a 5-dimensional feature vector for each pixel in the effective area, computing the weighted feature distance between the pixel and each cluster center within that center's local search neighborhood, and assigning the pixel to its nearest cluster center, so that each cluster center obtains a set of corresponding pixels;
S023, taking the mean of the spatial coordinates of the pixels assigned to each cluster center as the new position of that center;
S024, repeating the previous two steps until the displacement of every cluster center falls below a preset threshold or the maximum number of iterations is reached, yielding the superpixel regions.
In some embodiments, the 5-dimensional feature vector includes the CIELAB color-space components $(l, a, b)$ and the normalized spatial coordinates $(x, y)$; the weighted feature distance between pixel $i$ and cluster center $k$ is calculated as $D_{i,k} = \sqrt{(l_i - l_k)^2 + (a_i - a_k)^2 + (b_i - b_k)^2 + \frac{m^2}{S^2}\left[(x_i - x_k)^2 + (y_i - y_k)^2\right]}$, where $D_{i,k}$ is the weighted feature distance between the $i$-th pixel and the $k$-th cluster center; $l_i$, $a_i$, $b_i$ are the luminance, red-green chrominance, and yellow-blue chrominance components of the $i$-th pixel in the CIELAB color space; $l_k$, $a_k$, $b_k$ are the corresponding color components of the $k$-th cluster center; $(x_i, y_i)$ are the spatial coordinates of the $i$-th pixel in the original image; $(x_k, y_k)$ are the initial grid position coordinates of the $k$-th cluster center; and $S$ is the superpixel scale parameter, defined as the approximate side length of a desired superpixel region.
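The clustering loop and weighted distance described above can be sketched as a simplified k-means-style iteration. For brevity this sketch compares every pixel against every center instead of restricting each center to its local search neighborhood as the embodiment requires; function names are illustrative.

```python
import numpy as np

def scale_param(width, height, k):
    """S = sqrt(W*H / K): approximate side length of one superpixel region."""
    return np.sqrt(width * height / k)

def simple_slic(features, centers, S, m, max_iter=10, tol=1e-3):
    """Simplified SLIC-style clustering: assign each pixel by the weighted
    feature distance, move each center to the mean of its pixels, and stop
    when the center displacement drops below `tol` (or after max_iter).
    features, centers: rows of (l, a, b, x, y)."""
    for _ in range(max_iter):
        d_c = np.linalg.norm(features[:, None, :3] - centers[None, :, :3], axis=2)
        d_s = np.linalg.norm(features[:, None, 3:] - centers[None, :, 3:], axis=2)
        dist = np.sqrt(d_c ** 2 + (m / S) ** 2 * d_s ** 2)  # weighted distance D
        assign = dist.argmin(axis=1)                        # nearest center per pixel
        new_centers = np.array([features[assign == k].mean(axis=0)
                                for k in range(len(centers))])
        shift = np.linalg.norm(new_centers[:, 3:] - centers[:, 3:], axis=1).max()
        centers = new_centers
        if shift < tol:                                     # convergence criterion
            break
    return assign, centers

# two well-separated toy clusters in (l, a, b, x, y) space
feats = np.array([[0., 0, 0, 0, 0], [0, 0, 0, 1, 0],
                  [100, 0, 0, 10, 10], [100, 0, 0, 11, 10]])
labels, _ = simple_slic(feats, feats[[0, 2]].copy(), S=5.0, m=10.0)
```

For instance, a 640x480 frame with K = 300 superpixels gives S = sqrt(640*480/300) = 32, so each superpixel covers roughly a 32x32 patch, and m trades color fidelity against spatial compactness.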