CN-122023465-A - Monocular video multi-person 3D human motion reconstruction method based on multi-module fusion
Abstract
The invention relates to a monocular video multi-person 3D human motion reconstruction method based on multi-module fusion, which addresses the defect that tracking is lost when a target is partially occluded or moves rapidly, producing discontinuous or unstable reconstructions when the target reappears. The method comprises the following steps: acquiring a single-frame or multi-frame motion video sequence, constructing a motion-aware semantic tracking module, generating a coherent human mesh sequence, predicting future pose features in occlusion scenes, and outputting the 3D motion state. Through the collaborative design of motion-aware tracking, temporally enhanced reconstruction, motion prediction, and adaptive fusion, the invention achieves occlusion-robust, identity-stable, and temporally consistent 3D motion reconstruction.
Inventors
- ZHONG JINQIN
- ZHU YUZHU
- ZHU NING
- LI LEI
- JIA SEN
- LIANG ZHENG
Assignees
- Anhui University (安徽大学)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-02-03
Claims (7)
- 1. A monocular video multi-person 3D human motion reconstruction method based on multi-module fusion, characterized by comprising the following steps:
11) Acquiring a single-frame or multi-frame motion video sequence;
12) Constructing a motion-aware semantic tracking module based on the SAM2 visual segmentation large model, taking the single-frame or multi-frame video sequence as input and achieving stable cross-frame identity association through a motion-guided selector and a temporal buffer; the motion-guided selector fuses Kalman filtering with a motion consistency score, and the temporal buffer uses an exponential moving average strategy to generate the cropped image and features of the currently tracked instance together with a temporal memory buffer from past frames, updating them into a memory bank;
13) Building a T-HMR module that takes as input the cropped image, features, and previous-frame temporal memory buffer of the currently tracked instance from the memory bank generated by the motion-aware semantic tracking module; effective temporal features in neighboring frames are screened through the memory buffer, and a MemFormer structure injects a spatio-temporal prior into the 3D human mesh reconstruction process to generate a coherent Skinned Multi-Person Linear (SMPL) model parameter sequence over a plurality of past frames;
14) Designing a lightweight Predictor module: a motion dynamics model built on a Transformer network takes the SMPL parameter sequence of past frames generated by the T-HMR module as input, stores historical motion states through a sliding window, and predicts future pose features in occlusion scenes;
15) Outputting the 3D motion state: a gated fusion module fuses the current-frame SMPL parameters reconstructed by the T-HMR module with the future pose features predicted by the Predictor module through a learnable gate to generate a stable next motion, regresses the SMPL parameters with a single regression head, and outputs a temporally consistent 3D motion state.
- 2. The monocular video multi-person 3D human motion reconstruction method based on multi-module fusion of claim 1, wherein constructing the motion-aware semantic tracking module comprises the steps of:
21) Inputting the current frame image into the SAM2 visual segmentation large model and representing its observation of the instance detected at time step $k$ as the bounding-box parameters $z_k = [x_k, y_k, w_k, h_k]^\top$, where $x_k$ is the x coordinate of the bounding-box center, $y_k$ is the y coordinate of the bounding-box center, $w_k$ is the bounding-box width, and $h_k$ is the bounding-box height; the latent motion state is modeled as $s_k = [x_k, y_k, w_k, h_k, \dot{x}_k, \dot{y}_k, \dot{w}_k, \dot{h}_k]^\top$, where the dotted terms are velocity states: $\dot{x}_k$, the velocity of the center x coordinate, represents the horizontal motion of the target; $\dot{y}_k$, the velocity of the center y coordinate, represents the vertical motion of the target; $\dot{w}_k$, the rate of change of the bounding-box width, and $\dot{h}_k$, the rate of change of the bounding-box height, represent motion or scale change of the target in the depth direction;
22) Modeling the temporal dynamics with a Kalman filter (KF) under Gaussian-noise and linearity assumptions: given the bounding-box parameters of the previous state, i.e. the past frame, the KF predicts the bounding box of the next frame $\hat{b}^{KF}_k$ and estimates the uncertainty of the current frame; the filter performs a conditional update only when the quality of the target segmentation mask detected in the current frame and its consistency with the motion prediction both satisfy a strict threshold, and the cropped image of the currently tracked instance is placed into the memory bank, allowing robust state propagation under partial or noisy observations;
23) Computing the motion consistency score from the candidate segmentation mask produced by the SAM2 visual segmentation large model as the IoU, i.e. the ratio of the intersection area to the union area of two bounding boxes, between its bounding box $b_k$ and the KF prediction: $S_{motion} = \mathrm{IoU}(b_k, \hat{b}^{KF}_k)$;
24) Combining it with the SAM2 mask affinity $S_{mask}$ through a gated sum: $S_{fused} = \lambda S_{motion} + (1-\lambda) S_{mask}$, where $\lambda$ is the fusion weight and $S_{fused}$ is the fused confidence score;
25) Introducing a confidence-gated update mechanism: when the mask confidence continuously exceeds a threshold $\tau$, as tracked by a consecutive-reliable-association counter $c$, the Kalman posterior state is updated and the image features are written into the memory bank; otherwise the historical state is retained. The update formula is $\hat{s}_{k|k} = \hat{s}_{k|k-1} + K_k\,(z_k - H\,\hat{s}_{k|k-1})$, where $K_k$ is the Kalman gain, $H$ is the observation matrix, $\hat{s}_{k|k}$ is the optimal estimate of the target state obtained after fusing the observation of the current (k-th) frame, $\hat{s}_{k|k-1}$ is the prediction of the current target state based only on the optimal state of the previous (k-1-th) frame and the motion model of the system, and $\hat{s}_{k-1|k-1}$ is the optimal state estimate obtained after the previous frame fused its own observation;
26) In the temporal buffer, adaptively updating the previous frame's temporal memory buffer into the memory bank by smoothed weighting: at each time step $t$, the memory embedding $m_{t-1}$ and the current key features $f_t$ are combined as $m_t = (1-\alpha_t)\,m_{t-1} + \alpha_t\,f_t$, with $\alpha_t = \alpha_{max}\cdot S_{motion}$, where $\alpha_t$ is the adaptive decay factor, $S_{motion}$ is the Kalman motion consistency score, and $\alpha_{max}$ is the decay threshold specifying the maximum weight that current observations may occupy in the memory update, modulated by the motion consistency score (a minimal code sketch of this selector and memory update appears after the claims).
- 3. The monocular video multi-person 3D human motion reconstruction method based on multi-module fusion of claim 1, wherein generating the coherent human mesh sequence comprises the steps of:
31) Building a memory buffer module: collecting the ViT encoding features of the L frames in the neighborhood of the current frame and constructing a memory feature matrix $M \in \mathbb{R}^{L \times N \times d}$, where $N$ is the number of spatial tokens and $d$ is the feature dimension;
32) Pooling the current-frame representation and the memory features along the spatial dimension to obtain global representations $\bar{f}_t$ and $\bar{M}$, and adopting a dual-branch scoring mechanism to rank the importance of each frame in $\bar{M}$, with a unified attention-based scoring function $A(Q, K) = \mathrm{softmax}\big((Q W_Q)(K W_K)^\top / \sqrt{d}\big)$, where the query feature matrix $Q$ represents the current frame, i.e. the feature representation of the human instance to be reconstructed at time step $t$; the key feature matrix $K$ represents the features of the history frames stored in the memory buffer; the query projection matrix $W_Q$ and key projection matrix $W_K$ are two independent learnable parameter matrices that transform the raw query and key features, mapping them into an attention space better suited to similarity comparison; softmax denotes normalization along the last (key) dimension;
33) Computing frame importance scores with the dual-branch mechanism: the first branch measures the correlation between the current frame and the memory frames via cross attention, and the second branch evaluates the internal consistency of the memory frames via self attention, averaged over its query dimension to yield one score per frame; the composite score is $s = A(\bar{f}_t, \bar{M}) + \overline{A(\bar{M}, \bar{M})}$, where $s$ is the composite importance score vector of the memory frames, $\bar{f}_t$ is the globally pooled representation of the current-frame features, and $\bar{M}$ is the globally pooled representation of the memory-frame features; the $k$ frames with the highest composite importance scores are selected to form the effective memory bank for subsequent MemFormer reasoning (a minimal code sketch of this scoring appears after the claims);
34) The MemFormer structure comprises N stacked blocks, each of which: concatenates a learnable SMPL token with the current-frame features and captures local body-structure information through intra-frame feature interaction in a self-attention layer; then takes the representation after intra-frame interaction as the query and the effective-memory features, pooled along the spatial dimension, as the Key and Value, performing a temporal cross-attention operation that injects a motion-consistent spatio-temporal prior;
35) Pooling the memory features along the temporal dimension to serve as keys and values and performing a second cross-attention operation, injecting spatially aligned semantic information;
36) Extracting the SMPL token after the multi-layer attention interaction and decoding it through an MLP into the SMPL pose and shape parameters, generating the 3D human mesh of the current frame.
- 4. The monocular video multi-person 3D human motion reconstruction method based on multi-module fusion of claim 1, wherein predicting future pose features in an occlusion scene comprises the steps of:
41) The Predictor module maintains a FIFO queue $Q_t = [\hat{m}_{t-T+1}, \ldots, \hat{m}_t]$, where $Q_t$ is the queue at time step $t$, $\hat{m}_t$ is the motion state reconstructed at time step $t$, referring to the SMPL pose parameters $\theta_t$ or the 3D joint coordinates extracted from them, and $T$ is the fixed queue length, e.g. $T = 8$ or $T = 16$ frames;
42) Storing the motion states of the last $T$ frames, capturing motion dynamics through $L$-layer Transformer blocks, and outputting the latent representation of the next frame $\hat{z}_{t+1}$; this representation contains the predicted pose feature information, i.e. the future pose features, and provides a motion prior supporting 3D reconstruction in occlusion scenes;
43) The motion queue is updated online during inference, adapting in real time to dynamic motion patterns (a minimal code sketch of this predictor appears after the claims).
- 5. The monocular video multi-person 3D human motion reconstruction method based on multi-module fusion of claim 1, wherein outputting the 3D motion state comprises the steps of:
51) Computing a gating vector $g$ through an MLP layer from the current-frame SMPL parameters reconstructed by the T-HMR module and the next-frame features predicted by the Predictor module: $g = \sigma(\mathrm{MLP}([f_{rec}; f_{pred}]))$, where $\sigma$ is the sigmoid activation function, $f_{rec}$ is the T-HMR reconstruction feature, $f_{pred}$ is the next-frame feature based on historical motion prediction, and MLP is the multi-layer perceptron that generates the gating vector;
52) Fusing the features by weighted interpolation: $f_{fused} = (1-g) \odot f_{rec} + g \odot f_{pred}$, where $\odot$ denotes element-wise multiplication; when the scene is occlusion-free and the reconstruction features are reliable, $g$ approaches 0 and the fused features are dominated by the reconstruction features; when the scene contains severe occlusion and the reconstructed features are unreliable, $g$ approaches 1 and the fused features are dominated by the prediction features;
53) Feeding the fused features $f_{fused}$ into a regression head that maps them to the final SMPL pose and shape parameters, completing the 3D human motion reconstruction of the current frame (a minimal code sketch of this gated fusion appears after the claims).
- 6. A computer-readable storage medium, wherein a computer program is stored on the storage medium which, when executed by a processor, implements the monocular video multi-person 3D human motion reconstruction method based on multi-module fusion according to any one of claims 1-5.
- 7. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the monocular video multi-person 3D human motion reconstruction method based on multi-module fusion according to any one of claims 1-5.
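The following is a minimal, illustrative Python sketch of the motion-guided selector of claim 2 (steps 23-26): IoU-based motion consistency, the gated score fusion, and the EMA memory update. It is not taken from the patent; the function names and the default values of lambda_fuse and alpha_max are assumptions for illustration.

```python
# Sketch of claim 2: motion consistency via IoU, gated score fusion,
# and the motion-scaled EMA memory update. Names and defaults are
# illustrative assumptions, not identifiers from the patent.
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (cx, cy, w, h) center-size tuples."""
    def corners(b):
        cx, cy, w, h = b
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
    ax1, ay1, ax2, ay2 = corners(box_a)
    bx1, by1, bx2, by2 = corners(box_b)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def fuse_scores(s_motion, s_mask, lambda_fuse=0.5):
    """Gated sum S_fused = lambda * S_motion + (1 - lambda) * S_mask."""
    return lambda_fuse * s_motion + (1.0 - lambda_fuse) * s_mask

def ema_update(memory, feat, s_motion, alpha_max=0.3):
    """EMA memory update m_t = (1 - a_t) m_{t-1} + a_t f_t, where the
    decay a_t = alpha_max * S_motion is scaled by motion consistency."""
    alpha = alpha_max * s_motion
    return (1.0 - alpha) * memory + alpha * feat

# Example: a SAM2 box close to the KF prediction yields a high motion
# score, so the memory embedding absorbs more of the current feature.
s_motion = iou((100, 100, 40, 80), (102, 98, 42, 78))
s_fused = fuse_scores(s_motion, s_mask=0.9)
memory = ema_update(np.zeros(256), np.ones(256), s_motion)
```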
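Next, a sketch of the dual-branch frame scoring of claim 3 (steps 32-33), assuming PyTorch tensors. The pooling choices and the query-averaging of the self-attention branch follow my reading of the claim; every identifier is hypothetical.

```python
import torch
import torch.nn.functional as F

def attention_score(q, keys, w_q, w_k):
    """A(Q, K) = softmax((Q W_Q)(K W_K)^T / sqrt(d)) over the key dim."""
    d = w_q.shape[1]
    return F.softmax((q @ w_q) @ (keys @ w_k).t() / d ** 0.5, dim=-1)

def select_memory(f_t, mem, w_q, w_k, k=4):
    """Dual-branch scoring: cross attention of the current frame against
    the memory plus the (query-averaged) self attention of the memory;
    keeps the top-k frames. f_t: (N, d) tokens; mem: (L, N, d) memory."""
    q = f_t.mean(dim=0, keepdim=True)        # (1, d) pooled current frame
    m_bar = mem.mean(dim=1)                  # (L, d) pooled memory frames
    s_cross = attention_score(q, m_bar, w_q, w_k).squeeze(0)      # (L,)
    s_self = attention_score(m_bar, m_bar, w_q, w_k).mean(dim=0)  # (L,)
    s = s_cross + s_self                     # composite importance scores
    idx = torch.topk(s, k=min(k, s.numel())).indices
    return mem[idx], s

# Example with L=6 memory frames, N=16 spatial tokens, d=64 features.
w_q, w_k = torch.randn(64, 64), torch.randn(64, 64)
top_mem, scores = select_memory(torch.randn(16, 64),
                                torch.randn(6, 16, 64), w_q, w_k)
```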
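A sketch of the sliding-window Predictor of claim 4 follows, again under stated assumptions: the feature size (72, matching flattened SMPL pose parameters), layer count, and head count are illustrative, and a standard Transformer encoder stands in for the patent's motion dynamics model.

```python
import torch
import torch.nn as nn
from collections import deque

class MotionPredictor(nn.Module):
    """Sliding-window motion model: a FIFO queue of the last T motion
    states feeds L Transformer encoder layers; the last output token is
    taken as the latent representation of the next frame."""
    def __init__(self, dim=72, T=8, layers=2, heads=4):
        super().__init__()
        self.queue = deque(maxlen=T)     # FIFO queue Q_t, oldest dropped
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, dim)

    def push(self, state):
        """Online queue update during inference (claim 4, step 43)."""
        self.queue.append(state)

    def forward(self):
        x = torch.stack(list(self.queue)).unsqueeze(0)  # (1, <=T, dim)
        return self.head(self.encoder(x)[:, -1])        # next-frame latent

# Example: push 8 reconstructed pose vectors, predict the next latent.
pred = MotionPredictor()
for _ in range(8):
    pred.push(torch.randn(72))
z_next = pred()  # (1, 72)
```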
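Finally, a sketch of the gated fusion of claim 5. The gate equations are taken from the claim; the regression head is a placeholder linear layer, since the patent's SMPL regressor is not specified here.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Learnable gate g = sigmoid(MLP([f_rec; f_pred])) blending the
    reconstructed and predicted features, then a placeholder regressor."""
    def __init__(self, dim=72):
        super().__init__()
        self.gate_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.regressor = nn.Linear(dim, dim)  # stands in for the SMPL head

    def forward(self, f_rec, f_pred):
        g = self.gate_mlp(torch.cat([f_rec, f_pred], dim=-1))
        # g -> 0 when reconstruction is reliable; g -> 1 under occlusion
        f_fused = (1.0 - g) * f_rec + g * f_pred
        return self.regressor(f_fused)

# Example: fuse a current-frame reconstruction with a predicted feature.
fuser = GatedFusion()
params = fuser(torch.randn(1, 72), torch.randn(1, 72))
```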
Description
Monocular video multi-person 3D human motion reconstruction method based on multi-module fusion

Technical Field

The invention relates to the technical field of computer vision and motion capture, and in particular to a monocular video multi-person 3D human motion reconstruction method based on multi-module fusion.

Background

With the development of computer vision technology, reconstructing 3D human motion from monocular video has become a popular research direction. Traditional motion capture systems depend on multi-view cameras or wearable marker points, suffer from high cost, complex deployment, and strong invasiveness, and are difficult to apply in in-the-wild scenes. Monocular video, by virtue of its low cost and non-invasive nature, is an ideal data source for multi-person 3D motion capture, but existing methods face a number of challenges:

1. Occlusion: partial or complete occlusion occurs frequently in multi-person interaction scenes, interrupting identity association and losing tracks. Existing identity association relies heavily on appearance feature matching (e.g. via the Hungarian algorithm) and easily fails under severe occlusion, causing identity switches or track loss. Methods such as PHALP and 4DHumans incorporate 3D features, but association robustness remains insufficient under severe occlusion;

2. Fast motion and viewpoint change: appearance-based tracking is extremely sensitive to the blur and deformation caused by fast motion, and trackers based on 2D appearance features suffer identity switches when the target moves rapidly or the viewpoint changes sharply, fragmenting the 3D motion sequence. Most existing methods lack explicit, robust dynamics modeling to compensate for unreliable appearance information;

3. Poor temporal consistency: existing methods rely on single-frame regression and lack effective spatio-temporal prior fusion, so the reconstructed motion is incoherent and jitters severely. Many existing 3D reconstruction methods (such as HMR and SPIN) are essentially single-frame estimators that do not fully exploit the temporal continuity of video, resulting in jittery and incoherent reconstructions;

4. Insufficient generalization: many multi-target tracking and 3D human reconstruction methods (e.g. 4DHumans, CoMotion) must be trained or fine-tuned on specific datasets (e.g. PoseTrack); their performance drops in out-of-domain or complex real scenes, they are not robust in complex in-the-wild scenes (e.g. sports matches), and zero-shot application is difficult to realize.

Disclosure of Invention

The invention aims to overcome the defect in the prior art that tracking is lost when a target is partially occluded or moves fast, producing discontinuous or unstable reconstructions when the target reappears, and provides a monocular video multi-person 3D human motion reconstruction method based on multi-module fusion to solve this problem.
In order to achieve the above object, the technical scheme of the present invention is as follows. A monocular video multi-person 3D human motion reconstruction method based on multi-module fusion comprises the following steps: acquiring a single-frame or multi-frame motion video sequence; constructing a motion-aware semantic tracking module based on the SAM2 visual segmentation large model, taking the single-frame or multi-frame video sequence as input and achieving stable cross-frame identity association through a motion-guided selector and a temporal buffer, where the motion-guided selector fuses Kalman filtering with a motion consistency score and the temporal buffer uses an exponential moving average strategy to generate the cropped image and features of the currently tracked instance together with a temporal memory buffer from past frames, updating them into a memory bank; generating a coherent human mesh sequence, namely building a T-HMR module that takes as input the cropped image, features, and previous-frame temporal memory buffer of the currently tracked instance from the memory bank generated by the motion-aware semantic tracking module, screening effective temporal features in neighboring frames through the memory buffer, and using a MemFormer structure to inject a spatio-temporal prior into the 3D human mesh reconstruction process so as to generate a coherent Skinned Multi-Person Linear (SMPL) model parameter sequence over a plurality of past frames; designing a lightweight Predictor module, building a motion dynamics model on a Transformer network and taking the SMPL parameter sequence of past frames generated by the T-HMR module as input