CN-122024328-A - Video action recognition method and system based on skeleton key points
Abstract
The invention discloses a method and a system for recognizing video actions based on skeleton key points. A video to be recognized is processed into a time-ordered sequence of video frames; the human skeleton key point coordinates of each frame are extracted to build a key point sequence; continuous-time memory coding is applied to the key point sequence to generate a corresponding memory state sequence; action boundaries are determined from the memory state changes at adjacent moments, dividing the key point sequence into several action phases; joint weights are determined from the memory state of each phase to generate weighted joint trajectories; the weighted trajectory of each phase is matched against the corresponding template trajectory within that phase's range to obtain phase matching results; and continuity is judged by combining the matching results, the memory states and the phase order, outputting the action category of the video. The method achieves phase-wise processing of actions and fusion of temporal information, improving the accuracy and stability of action recognition.
Inventors
- LI ZHONG
Assignees
- 淞领智能科技(上海)有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-03-31
Claims (7)
- 1. A video action recognition system based on skeleton key points, comprising: a video acquisition module, which acquires a video to be recognized and generates a time-ordered video frame sequence; a key point sequence construction module, which performs posture analysis on the video frame sequence, extracts the coordinates of the human skeleton key points, and generates a skeleton key point sequence in the time order of the video frame sequence; a memory state generation module, which inputs the skeleton key point sequence into a Legendre memory unit for continuous-time memory coding and generates a memory state sequence corresponding to each moment of the skeleton key point sequence; an action boundary determining module, which determines action boundary points from the state changes at adjacent moments in the memory state sequence, divides the skeleton key point sequence at the action boundary points, and generates time-ordered action phase sub-sequences; a weighted trajectory generation module, which determines joint weights of the skeleton key points from the memory state sequence corresponding to each action phase sub-sequence and generates weighted joint trajectory sub-sequences in combination with the skeleton key point coordinates in the action phase sub-sequences; a matching recognition module, which defines a matching interval for each weighted joint trajectory sub-sequence from the action boundary points and, within the matching interval, matches each weighted joint trajectory sub-sequence against the corresponding phase template trajectory by a Frechet distance action recognition method to generate a phase matching distance value; and a category judging module, which judges inter-phase continuity from the phase matching distance values, the corresponding memory state sequences and the time order of the action boundary points, and outputs the action category corresponding to the video to be recognized.
- 2. A video action recognition method based on skeleton key points, characterized by comprising the following steps: acquiring a video to be recognized and generating a time-ordered video frame sequence; performing human posture analysis on the video frame sequence, extracting the human skeleton key point coordinates of every video frame, and constructing a skeleton key point sequence in the time order of the video frame sequence; inputting the skeleton key point sequence into a Legendre memory unit, performing continuous-time memory coding, and generating a memory state sequence in one-to-one correspondence with the moments of the skeleton key point sequence; determining action boundary points from the state change amounts at adjacent moments in the memory state sequence, dividing the skeleton key point sequence at the action boundary points, and generating time-ordered action phase sub-sequences; determining joint weights of the skeleton key points from the memory state sequence corresponding to each action phase sub-sequence, and generating a weighted joint trajectory sub-sequence in combination with the skeleton key point coordinates in each action phase sub-sequence; defining a matching interval for each weighted joint trajectory sub-sequence from the action boundary points and, within the matching interval, matching each weighted joint trajectory sub-sequence against the corresponding phase template trajectory by a joint-weighted, segment-constrained discrete Frechet distance to generate a phase matching distance value for each action phase sub-sequence; and performing cross-phase continuity judgment from the phase matching distance value of each action phase sub-sequence, the memory state sequence of each action phase sub-sequence and the time order of the action boundary points, and outputting the action category corresponding to the video to be recognized.
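The step order of the method claim above can be sketched as a minimal pipeline. This is a hedged illustration, not the patented implementation: every callable argument (keypoint extractor, memory encoder, boundary detector, joint weighter, phase matcher) is a hypothetical stand-in supplied by the caller, and aggregating phase distances by their maximum is an assumption rather than the claim's continuity judgment.

```python
import numpy as np

def recognize(frames, extract_keypoints, encode_memory, find_boundaries,
              weight_joints, match_phase, templates):
    """Hypothetical end-to-end sketch of the claimed pipeline.

    templates: {class_name: [per-phase template objects]} (illustrative format).
    """
    kp = np.stack([extract_keypoints(f) for f in frames])   # (T, J, 2) keypoints
    mem = encode_memory(kp)                                  # (T, D) memory states
    bounds = find_boundaries(mem)                            # phase cut indices
    phases = [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
    dists = {}
    for name, tmpl in templates.items():
        per_phase = []
        for k, (s, e) in enumerate(phases):
            w = weight_joints(mem[s:e])                      # (J,) joint weights
            traj = w[:, None] * kp[s:e]                      # weighted trajectory
            per_phase.append(match_phase(traj, tmpl[k % len(tmpl)]))
        dists[name] = max(per_phase)          # worst phase governs (assumption)
    return min(dists, key=dists.get)          # smallest distance wins
```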
- 3. The method for identifying video actions based on skeleton key points according to claim 2, wherein constructing the skeleton key point sequence comprises: receiving the time-ordered sequence of video frames and reading it frame by frame; scaling each frame image in equal proportion with boundary filling according to a preset input size, recording the scaling and filling offset parameters, and generating a posture analysis input frame; performing human body target detection on the posture analysis input frame, extracting human body detection boxes, selecting a single target region according to detection confidence, cropping the original image according to the detection box to generate the corresponding human body image patch, and marking its frame index; inputting the human body image patch into a key point detection network, outputting the two-dimensional coordinates and confidences of a preset number of human skeleton key points, and building a frame-level key point coordinate set in the topological order of the key points; performing structural constraint completion for key points whose confidence is below a preset threshold, namely, when a key point with a skeleton connection to the low-confidence key point exists in the same frame, generating the key point coordinates according to a preset skeleton length proportion, and, when completion within the same frame is impossible, searching for the same-name key point coordinates in the previous and next frames and generating the key point coordinates by linear interpolation; inversely mapping the frame-level key point coordinates from the human body image patch coordinate system back to the original video frame coordinate system, recovering the real spatial position from the recorded scaling and filling offset parameters during the inverse mapping; computing the coordinate difference of same-name key points in adjacent video frames, comparing it with a preset jump threshold, marking a key point as an outlier when the difference exceeds the jump threshold and replacing it with the same-name key point coordinates of the previous frame, so as to generate a temporally consistent key point coordinate set; and arranging the key point coordinate sets of all frames in the time order of the video frame sequence to construct the skeleton key point sequence, in which each key point coordinate is indexed by frame number and key point index, and outputting the skeleton key point sequence.
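One concrete reading of the completion and outlier steps in the claim above can be sketched as follows. Low-confidence keypoints are filled by temporal linear interpolation (the claim's fall-back when same-frame skeletal completion is impossible), and jump outliers are replaced with the previous frame's coordinates. The function name `clean_keypoints` and the thresholds are illustrative; the same-frame skeleton-length-proportion completion is omitted.

```python
import numpy as np

def clean_keypoints(kp, conf, conf_thr=0.3, jump_thr=50.0):
    """kp: (T, J, 2) keypoint coords, conf: (T, J) confidences.

    1) Fill low-confidence points per joint by linear interpolation over time.
    2) Replace sudden jumps with the previous frame's coordinate.
    Thresholds are illustrative, not the patent's values.
    """
    kp = kp.copy()
    T, J, _ = kp.shape
    for j in range(J):
        bad = conf[:, j] < conf_thr
        if bad.any() and not bad.all():
            good = np.flatnonzero(~bad)          # frames with reliable coords
            for d in range(2):
                kp[bad, j, d] = np.interp(np.flatnonzero(bad), good, kp[good, j, d])
    for t in range(1, T):
        jump = np.linalg.norm(kp[t] - kp[t - 1], axis=-1) > jump_thr
        kp[t, jump] = kp[t - 1, jump]            # carry previous frame for outliers
    return kp
```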
- 4. The method for identifying video actions based on skeletal keypoints of claim 2, wherein generating the memory state sequence comprises: receiving the skeleton key point sequence, splicing all key point coordinates of each frame into an input vector according to a fixed key point index order, writing the presentation time stamp of the frame into a time stamp queue, and forming a time-sequenced input record bound to the input vector; determining the order of the Legendre memory unit, fixing a set of sampling points on the chosen interval, writing the values of the Legendre polynomials from order 0 up to the chosen order at each sampling point into a basis function table, writing the values of the first derivative of each basis function at each sampling point into a derivative table, and using the basis function table and the derivative table as fixed look-up data during construction; building the continuous-time memory core from the basis function table and the derivative table, the building process comprising, for each pair of orders and using pre-stored Gauss-Legendre quadrature nodes and weights, evaluating the integral of the product of one basis function with the derivative of the other and writing it into a state coupling coefficient matrix, and evaluating the integral of the product of each basis function with the input basis function and writing it into an input projection coefficient matrix; writing the state coupling coefficient matrix into a continuous-time state transition matrix and the input projection coefficient matrix into a continuous-time input matrix, and solidifying and storing the quadrature nodes, weights, basis function table, derivative table and the two matrices; discretizing, which comprises reading the nominal frame interval of the video frame sequence, discretizing the continuous-time state transition matrix according to the frame interval to obtain a discrete state transition matrix, performing the equivalent discretization of the continuous-time input matrix to obtain a discrete input matrix, and writing both into the Legendre memory unit parameter area as fixed operators; in the construction flow of the Legendre memory unit, constraining the input injection structure by the skeleton topology, the constraint comprising reading the key point topology edge set and generating an adjacency table in which each joint contains only its own index and the indices of the joints connected to it in the edge set, dividing the input vector into joint channel blocks by joint index, generating from the adjacency table a mask matrix of matching shape whose entry is 1 when the joint channel block of the state dimension and the joint channel block of the input dimension satisfy the adjacency relation and 0 otherwise, applying the mask matrix to the discrete input matrix by element-by-element product to obtain a topology-constrained input matrix, and writing the topology-constrained input matrix into the parameter area as the replacement; in the construction flow of the Legendre memory unit, establishing a multi-step advance operator table for the unequal time intervals caused by frame loss and frame padding, the construction comprising setting an upper bound on the allowed number of advance steps, pre-computing and storing for the discrete state transition matrix the sequence of its powers of two, quantizing at run time the time stamp difference of two adjacent frames to obtain the advance step count, and performing binary expansion of the step count and multiplying the stored powers selected by the expansion to obtain the multi-step advance matrix, which serves as the state advance operator of the current frame in the memory state calculation; initializing the memory state vector and writing it into a state buffer, traversing the input records in increasing frame index order with their bound time stamps, executing state advance and input injection frame by frame by first generating the multi-step advance matrix from the advance step count and advancing the state with it, then injecting the result of mapping the input vector through the topology-constrained input matrix into the advanced state, and binding the resulting state with the frame index and writing it into the memory cache; and outputting the memory cache in frame index order to form a memory state sequence in one-to-one correspondence with the frames of the skeleton key point sequence, the memory state sequence being output to the subsequent action boundary determining step.
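The Legendre memory unit of the claim above can be grounded in the standard (A, B) state-space matrices of the Legendre Memory Unit (Voelker et al., 2019), which the sketch below builds in closed form rather than by the claim's quadrature-table construction. Discretization uses a simple forward-Euler step as an assumption (the claim does not reproduce the scheme), and the binary-expansion multi-step advance for missing frames follows the claim; the topology mask is omitted for brevity.

```python
import numpy as np

def lmu_matrices(d):
    """Closed-form continuous-time LMU (A, B) of order d (Voelker et al.)."""
    A = np.zeros((d, d))
    B = np.zeros((d, 1))
    for i in range(d):
        B[i, 0] = (2 * i + 1) * (-1.0) ** i
        for j in range(d):
            A[i, j] = (2 * i + 1) * (-1.0 if i < j else (-1.0) ** (i - j + 1))
    return A, B

def discretize_euler(A, B, dt):
    """Forward-Euler discretization (one plausible scheme; not specified in the claim)."""
    d = A.shape[0]
    return np.eye(d) + dt * A, dt * B

def multi_step(Ad, n, max_bits=16):
    """Ad**n via binary expansion of n, as in the claim's multi-step advance table."""
    pow2 = [Ad]
    for _ in range(max_bits - 1):
        pow2.append(pow2[-1] @ pow2[-1])      # precomputed Ad^(2^i)
    M = np.eye(Ad.shape[0])
    for i in range(max_bits):
        if (n >> i) & 1:                      # multiply powers selected by the bits
            M = M @ pow2[i]
    return M
```

At run time, frame drops quantize to a step count `n`, and `multi_step(Ad, n)` advances the state across the gap before the next input injection.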
- 5. The method for identifying video actions based on skeletal keypoints of claim 2, wherein generating the action phase sub-sequences comprises: receiving the memory state sequence and the skeleton key point sequence; dividing the memory state vector of each moment by Legendre order index into a low-order component, consisting of a preset number of leading dimensions, and a high-order component, consisting of the remaining dimensions; computing the high-order energy and the low-order energy at every moment, and computing the increment of the high-order energy between adjacent moments, normalized together with the low-order energy and a fixed positive constant, to form a boundary score sequence; performing candidate boundary detection on the score sequence by searching for local maxima with a fixed window length, recording as a candidate boundary moment any moment whose score is the maximum within the window and stays positive for no fewer than a preset number of frames; enforcing a rigid interval constraint on the candidate boundaries, with the minimum boundary interval set to a preset number of frames, by scanning the candidate boundary list in time order and, whenever the interval between two candidates is smaller than that number of frames, retaining only the candidate with the larger score; taking fixed-length frame segments immediately before and after each candidate boundary, computing and comparing the accumulated per-frame displacement of all joints in the two segments, and deleting any candidate boundary whose displacement accumulation does not meet a preset minimum variation, so as to obtain the action boundary point sequence; and adding the first and last frame indices to the action boundary point sequence and segmenting the skeleton key point sequence by the boundary points in time order to obtain a plurality of time-ordered action phase sub-sequences.
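A plausible reading of the boundary scoring above, whose formula was stripped from the translation, is s_t = (‖h_t_high‖² − ‖h_{t−1}_high‖²) / (‖h_t_low‖² + ε); this form, and the parameters `d_low`, `win` and `min_gap`, are assumptions. The windowed local-maximum search and minimum-gap suppression follow the claim, while the displacement-verification step is omitted.

```python
import numpy as np

def action_boundaries(mem, d_low=4, win=5, min_gap=10, eps=1e-6):
    """mem: (T, D) memory states. Return candidate boundary frame indices.

    Score = assumed normalized increment of high-order energy; keep windowed
    positive local maxima; suppress candidates closer than min_gap (greedy by
    score, one reading of the claim's interval constraint).
    """
    hi, lo = mem[:, d_low:], mem[:, :d_low]
    e_hi = (hi ** 2).sum(axis=1)
    e_lo = (lo ** 2).sum(axis=1)
    s = np.zeros(len(mem))
    s[1:] = (e_hi[1:] - e_hi[:-1]) / (e_lo[1:] + eps)
    cand = [t for t in range(1, len(s) - 1)
            if s[t] > 0 and s[t] == s[max(0, t - win):t + win + 1].max()]
    kept = []
    for t in sorted(cand, key=lambda t: -s[t]):   # highest-scoring first
        if all(abs(t - k) >= min_gap for k in kept):
            kept.append(t)
    return sorted(kept)
```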
- 6. The method for identifying video actions based on skeletal keypoints of claim 2, wherein generating the weighted joint trajectory sub-sequences comprises: receiving the skeleton key point sub-sequence and the memory state sub-sequence corresponding to each action phase sub-sequence, reading the phase start frame index, end frame index and phase length, and writing them into a phase parameter table; splitting the memory state vector according to a pre-built joint channel mapping table, each joint channel segment corresponding one-to-one to a skeleton key point index; intercepting the front-segment frame set and the back-segment frame set within the phase, computing for each joint channel segment the mean two-norm over the frames of the front segment and of the back segment respectively, and writing the means into a joint mean cache; for each joint, reading the back-segment mean and the front-segment mean, computing the raw joint score, and writing the joint score vector; performing topology-conforming processing on the joint score vector; normalizing the processed joint score vector; traversing the skeleton key point coordinates frame by frame within the phase, multiplying each coordinate component of a key point by its joint weight to generate the weighted key point coordinates of that joint in that frame, and writing them into a weighted frame buffer; and writing the weighted frame buffer into the phase trajectory buffer in frame index order from the start frame to the end frame to obtain the weighted joint trajectory sub-sequence of the phase, and outputting the weighted joint trajectory sub-sequences in the time order of the action phases.
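The joint weighting of the claim above can be sketched as follows, under simplifying assumptions: equal-sized joint channel blocks instead of the claim's pre-built mapping table, a front/back split at the phase midpoint, a back-to-front ratio as the raw score, and omission of the topology-conforming processing step.

```python
import numpy as np

def joint_weights(mem, joints, eps=1e-6):
    """mem: (T, joints*c) phase memory states with c channels per joint
    (equal-block assumption). Score each joint by the ratio of late-segment
    to early-segment mean channel norm, then normalize to sum to 1."""
    T, D = mem.shape
    c = D // joints
    half = T // 2
    w = np.empty(joints)
    for j in range(joints):
        blk = mem[:, j * c:(j + 1) * c]
        front = np.linalg.norm(blk[:half], axis=1).mean()
        back = np.linalg.norm(blk[half:], axis=1).mean()
        w[j] = back / (front + eps)          # raw activity-change score
    return w / w.sum()

def weight_trajectory(kp, w):
    """Multiply each keypoint coordinate by its joint weight, per the claim.
    kp: (T, J, 2), w: (J,)."""
    return kp * w[None, :, None]
```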
- 7. The method for identifying video actions based on skeleton key points according to claim 2, wherein generating the phase matching distance value of each action phase sub-sequence comprises: receiving the weighted joint trajectory sub-sequence of each action phase, the fixed joint weight vector of the phase and the template trajectory of the phase; reading the phase start frame index and end frame index given by the action boundary points, and defining the index set of the trajectory to be matched and the index set of the template trajectory; establishing a frame-pair distance table, in which the distance of a frame pair is computed from the per-joint weights, the number of joints, the skeleton topological edge set, the joint coordinates of the given frame of the trajectory to be matched, and the joint coordinates of the given frame of the template trajectory; in the construction flow of the discrete Frechet matching, establishing a segment constraint that forces the matching path through boundary anchor points, specifically: computing for the trajectory to be matched, frame by frame over the phase interval, an accumulated displacement table, where the accumulated displacement is the running sum over frames of the joint-weighted sum over all joints of the displacement distance between two adjacent frames; dividing the total accumulated displacement from start point to end point into a fixed number of equal parts; for each equal-part threshold, scanning the accumulated displacement table from front to back and recording the first frame index that reaches the threshold, so as to obtain the anchor frame index sequence of the trajectory to be matched; computing an accumulated displacement table by the same rule for the template trajectory and generating the template anchor frame index sequence; pairing the anchors in order and writing the anchor pairs into an anchor constraint table, which prescribes that the matching path must pass through each pair of anchors in ascending order; in the construction flow of the discrete Frechet matching, establishing an index constraint of hard pruning by a band-shaped reachable domain, specifically: computing the length of the trajectory to be matched and the length of the template, mapping each template index to a center index of the trajectory to be matched by integer division and remainder, expanding a fixed width on both sides of the center index to obtain the index interval of frames allowed to match that template index, writing all such intervals into a reachable-domain table, and prescribing that only frame pairs inside the intervals are reachable, unreachable frame pairs not entering the recursive computation; dividing the whole matching interval into a plurality of continuous sub-intervals according to the anchor constraint table, the end points of the sub-intervals being the successive anchor pairs, and computing the segment-constrained discrete Frechet distance sub-interval by sub-interval as follows: establishing in each sub-interval a two-dimensional recursion table whose row index is the frame index to be matched allowed in the sub-interval and whose column index is the template frame index allowed in the sub-interval; filling the recursion table in increasing row and column order, each cell allowing transfer only from its three neighboring cells above, to the left and above-left; taking the cell cost as the larger of the cell's frame-pair distance table value and the smallest of the three neighboring source cell costs; marking unreachable frame pairs directly as unreachable and skipping their filling; initializing the start cell of each sub-interval with the corresponding frame-pair distance; after each sub-interval is filled, reading the cost value of its end cell as the segment Frechet value of the sub-interval, and taking the maximum of the segment Frechet values of all sub-intervals as the phase matching distance value of the phase; and binding the phase matching distance value with the phase index, writing them into the phase matching result table, and outputting the set of matching distance values of the phases in the time order of the action phases.
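The core recursion above — a discrete Frechet distance computed by max/min dynamic programming under a band-shaped reachable domain — can be sketched as below. The anchor-based segmentation and the edge-set term of the frame-pair distance are omitted; the band width, the Sakoe-Chiba-style center mapping, and the plain joint-weighted Euclidean frame-pair distance are all assumptions of this sketch.

```python
import numpy as np

def banded_frechet(P, Q, w, band=3):
    """Discrete Frechet distance between weighted trajectories P (n, J, 2) and
    Q (m, J, 2), with band pruning |i - center(j)| <= band standing in for the
    claim's reachable-domain constraint. w: (J,) per-joint weights."""
    n, m = len(P), len(Q)

    def cost(i, j):
        # Joint-weighted sum of per-joint Euclidean distances (assumed form).
        return float((w * np.linalg.norm(P[i] - Q[j], axis=-1)).sum())

    INF = float("inf")
    dp = np.full((n, m), INF)                 # unreachable cells stay INF
    for j in range(m):
        ctr = round(j * (n - 1) / max(m - 1, 1))   # band center for column j
        for i in range(max(0, ctr - band), min(n, ctr + band + 1)):
            c = cost(i, j)
            if i == 0 and j == 0:
                prev = 0.0                    # start cell initialization
            else:
                # Transfer only from the three neighbors: up, left, up-left.
                prev = min(dp[i - 1, j] if i else INF,
                           dp[i, j - 1] if j else INF,
                           dp[i - 1, j - 1] if i and j else INF)
            dp[i, j] = max(c, prev)           # Frechet max/min recursion
    return dp[n - 1, m - 1]
```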
Description
Video action recognition method and system based on skeleton key points
Technical Field
The invention relates to the technical field of computer vision and intelligent video, in particular to a method and a system for recognizing video actions based on skeleton key points.
Background
With the development of applications such as intelligent security, human-machine interaction and sports analysis, video-based human action recognition has become an important research direction in computer vision. Mainstream methods mostly adopt temporal modeling based on image features or skeleton key points, modeling the video sequence as a whole with a deep learning model to judge the action category. Existing video action recognition techniques still have obvious limitations in complex action scenes, multi-phase actions and uneven temporal rhythm. On the one hand, traditional methods generally use a fixed time window or whole-sequence modeling, lack an accurate depiction of action boundaries, and struggle to distinguish the phase changes within a continuous action, so that different action fragments interfere with one another and recognition accuracy drops. On the other hand, in temporal feature representation, existing methods mostly rely on short-time memory or discrete sampling; they adapt poorly to long-range dependencies and non-uniform time intervals, and have difficulty characterizing the dynamics of complex actions stably.
In addition, existing skeleton-key-point-based recognition methods generally assign each joint a fixed or simple weight, which cannot fully reflect the differing importance of key joints across action phases, so the feature representation is not fine enough. Meanwhile, conventional trajectory matching methods lack phase constraints and structural constraints, making high-precision matching hard to achieve under action segmentation and prone to matching deviation, which affects the stability and reliability of the final classification result. Therefore, how to provide a method and a system for recognizing video actions based on skeleton key points is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a video action recognition method and system based on skeleton key points, which realize automatic division of action boundaries and multi-phase action modeling by constructing temporal skeleton key point data and introducing a continuous-time memory coding mechanism, and accomplish fine recognition of complex actions by combining adaptive joint weight assignment with segment-constrained trajectory matching.
According to an embodiment of the invention, the system for recognizing video actions based on skeleton key points comprises: a video acquisition module, which acquires the video to be recognized and generates a time-ordered video frame sequence; a key point sequence construction module, which performs posture analysis on the video frame sequence, extracts the human skeleton key point coordinates, and generates a skeleton key point sequence in the time order of the video frame sequence; a memory state generation module, which inputs the skeleton key point sequence into a Legendre memory unit for continuous-time memory coding and generates a memory state sequence corresponding to each moment of the skeleton key point sequence; an action boundary determining module, which determines action boundary points from the state changes at adjacent moments in the memory state sequence, divides the skeleton key point sequence at the action boundary points, and generates time-ordered action phase sub-sequences; a weighted trajectory generation module, which determines joint weights of the skeleton key points from the memory state sequence corresponding to each action phase sub-sequence and generates weighted joint trajectory sub-sequences in combination with the skeleton key point coordinates in the action phase sub-sequences; a matching recognition module, which defines a matching interval for each weighted joint trajectory sub-sequence from the action boundary points and, within the matching interval, matches each weighted joint trajectory sub-sequence against the corresponding phase template trajectory by a Frechet distance action recognition method to generate a phase matching distance value; and a category judging module, which judges inter-phase continuity from the phase matching distance values, the corresponding memory state sequences and the time order of the action boundary points, and outputs the action category corresponding to the video to be recognized.