CN-119992651-B - Motion capture optimization method and model training method based on monocular video


Abstract

The application provides a motion capture optimization method and a model training method based on monocular video, and relates to the field of computer technology. The motion capture optimization method comprises: inputting a monocular video into a ground contact detection model to obtain a ground contact detection result indicating the ground contact probability of the feet of a human figure; obtaining first three-dimensional motion data of the human figure from the monocular video, and from it the foot trajectory of the human figure; performing ground contact optimization on the foot trajectory according to the ground contact probability; performing jitter optimization on the foot trajectory after the ground contact optimization; and reconstructing the first three-dimensional motion data through an inverse kinematics algorithm according to the jitter-optimized foot trajectory to obtain second three-dimensional motion data of the human figure. The method addresses the problem that ground penetration and jitter of the feet, which may occur during three-dimensional reconstruction, reduce the accuracy of the three-dimensional motion data reconstructed by the inverse kinematics algorithm.

Inventors

MING ANLONG
CHI HAORAN

Assignees

Beijing University of Posts and Telecommunications (北京邮电大学)

Dates

Publication Date: 2026-05-12
Application Date: 2025-01-15

Claims (9)

1. A motion capture optimization method based on monocular video, the method comprising: obtaining a monocular video, and inputting the monocular video into a preset ground contact detection model to obtain a ground contact detection result output by the ground contact detection model, wherein the ground contact detection result indicates the ground contact probability of the feet of a human figure in the monocular video, the ground contact detection model is trained on a labeled ground contact data set, and the labeled ground contact data set indicates whether the feet of the human figure in each training frame touch the ground; obtaining first three-dimensional motion data of the human figure from the monocular video, and obtaining a foot trajectory of the human figure from the first three-dimensional motion data; performing ground contact optimization on the foot trajectory according to the ground contact probability, obtaining a foot state according to the ground contact probability and the foot end position, and performing jitter optimization on the foot trajectory after the ground contact optimization according to the foot state, wherein the foot state indicates the ground contact state of both feet of the human figure; and reconstructing the first three-dimensional motion data through an inverse kinematics algorithm according to the jitter-optimized foot trajectory to obtain second three-dimensional motion data of the human figure; wherein performing ground contact optimization on the foot trajectory according to the ground contact probability comprises: obtaining a first foot position from the foot trajectory through a forward kinematics algorithm, and constructing an estimated ground through a clustering algorithm according to the ground contact probability and the first foot position; and performing ground contact optimization on the foot trajectory according to the ground contact probability, the first foot position, the estimated ground, and the foot speed indicated by the foot trajectory.
2. The method of claim 1, wherein the first foot position comprises a toe position and a heel position, the ground contact probability comprises a toe contact probability and a heel contact probability, and the foot speed comprises a toe speed and a heel speed; and wherein performing ground contact optimization on the foot trajectory according to the ground contact probability, the first foot position, the estimated ground, and the foot speed indicated by the foot trajectory comprises: obtaining a slip loss according to the toe contact probability, the heel contact probability, the toe speed, and the heel speed; rasterizing the foot according to the toe position and the heel position, and obtaining a penetration loss according to the estimated ground and the rasterized foot; computing, for each of a plurality of preset weight combinations, a weighted result of the slip loss, the penetration loss, and a preset pose loss; and performing ground contact optimization on the foot trajectory according to the weighted result with the smallest value among the weighted results.
3. The method of claim 1, further comprising: obtaining a second foot position from the foot trajectory after ground contact optimization through a forward kinematics algorithm, and removing jitter at the end of the second foot position through a first filtering algorithm to obtain the foot end position.
4. The method of claim 3, wherein the foot state is one of: both feet on the ground, both feet off the ground, and partial ground contact; when the foot state is both feet on the ground, performing jitter optimization on the foot trajectory after ground contact optimization according to the foot state comprises: eliminating foot movement; when the foot state is both feet off the ground, performing jitter optimization on the foot trajectory after ground contact optimization according to the foot state comprises: performing jitter optimization on the foot trajectory after ground contact optimization through trajectory acceleration and an acceleration constraint; and when the foot state is partial ground contact, performing jitter optimization on the foot trajectory after ground contact optimization according to the foot state comprises: performing jitter optimization on the foot trajectory after ground contact optimization through a second filtering algorithm and a frame interpolation algorithm.
5. The method of claim 1, wherein obtaining the first three-dimensional motion data of the human figure from the monocular video comprises: obtaining, through pose estimation on the monocular video, a first trajectory of the human figure, a two-dimensional loss of the human figure, a smoothing function, and an initial pose; optimizing the rotation and translation in the camera parameters according to the first trajectory, the two-dimensional loss, and the smoothing function to obtain a second trajectory of the human figure; and optimizing the rotation, translation, and pose of the second trajectory according to the two-dimensional loss, the smoothing function, and the initial pose to obtain the first three-dimensional motion data.
6. A model training method, the method comprising: obtaining a labeled ground contact data set, wherein the ground contact data set comprises a plurality of training frames, and each training frame contains a human figure; and training a ground contact detection model on the labeled ground contact data set, wherein the ground contact detection model is used in the monocular video-based motion capture optimization method according to any one of claims 1 to 5.
7. The method of claim 6, wherein training the ground contact detection model on the labeled ground contact data set comprises: performing multi-view rendering on the labeled ground contact data set to obtain a key point model, wherein the key point model indicates rotation information and trajectory information of key points of the human figure; obtaining intra-frame and inter-frame joint interaction information from the key point model through a preset DSTformer network and a transformer attention mechanism; and training the ground contact detection model with a preset binary cross-entropy loss function according to the joint interaction information.
8. The method of claim 7, wherein in each training iteration the method further comprises: calculating a first loss function of the ground contact probability output by the ground contact detection model, and an inter-frame speed loss function; obtaining a second loss function as a weighted combination of the first loss function and the speed loss function, wherein the weight of the first loss function is larger than the weight of the speed loss function; and taking the second loss function as the new first loss function.
9. The method of claim 6, wherein obtaining the labeled ground contact data set comprises: obtaining the ground contact data set; constructing a spring model of the human figure from the ground contact data set to obtain the forces acting on the feet of the human figure; obtaining labeling data for each training frame according to the forces acting on the feet, wherein the labeling data indicates whether the feet of the human figure in the corresponding training frame touch the ground; and labeling the ground contact data set according to the labeling data of each training frame.
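
The ground estimation step of claim 1 can be illustrated with a minimal sketch. The claim names neither the clustering algorithm nor the ground representation, so everything below is an assumption made for illustration: the ground is treated as a horizontal plane described by a single height, frames count as contact frames when the ground contact (touchdown) probability is at least 0.5, and a simple gap-based one-dimensional clustering stands in for the unspecified clustering algorithm. The function name and parameters are hypothetical.

```python
from statistics import mean


def estimate_ground_height(foot_heights, contact_probs, prob_threshold=0.5, gap=0.03):
    """Estimate a ground height from per-frame foot heights.

    Assumptions (not taken from the patent): horizontal ground plane,
    heights in metres, gap-based 1D clustering, fixed probability threshold.
    """
    # Keep only frames where the detector considers the foot to be on the ground.
    candidates = sorted(
        h for h, p in zip(foot_heights, contact_probs) if p >= prob_threshold
    )
    if not candidates:
        raise ValueError("no frame reaches the contact threshold")

    # Start a new cluster whenever two neighbouring heights differ by more than `gap`.
    clusters = [[candidates[0]]]
    for h in candidates[1:]:
        if h - clusters[-1][-1] > gap:
            clusters.append([h])
        else:
            clusters[-1].append(h)

    # Take the most populated cluster as the ground and return its mean height.
    return mean(max(clusters, key=len))


heights = [0.01, 0.02, 0.30, 0.00, 0.25, 0.01, 0.40]
probs = [0.90, 0.80, 0.10, 0.95, 0.70, 0.85, 0.05]
ground = estimate_ground_height(heights, probs)
```

In this example the frame at height 0.25 passes the probability threshold but falls into its own cluster, so it does not pull the estimate away from the dominant cluster near zero.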
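
The loss selection in claim 2 can be sketched as follows. The claim does not give formulas, so the forms used here are assumptions: the slip loss weights squared toe and heel speeds by their contact probabilities, rasterization is approximated by sampling points on the heel-to-toe segment, the penetration loss sums squared depths below a horizontal ground, and the pose loss is passed in as a precomputed number. All function names and the example weight combinations are hypothetical.

```python
def slip_loss(toe_probs, heel_probs, toe_speeds, heel_speeds):
    """Penalise foot speed in frames where contact is likely (assumed form)."""
    return sum(
        pt * vt ** 2 + ph * vh ** 2
        for pt, ph, vt, vh in zip(toe_probs, heel_probs, toe_speeds, heel_speeds)
    )


def rasterize_foot(heel, toe, samples=5):
    """Sample points on the heel-to-toe segment as a stand-in for rasterisation."""
    return [
        tuple(h + (t - h) * i / (samples - 1) for h, t in zip(heel, toe))
        for i in range(samples)
    ]


def penetration_loss(foot_points, ground_height):
    """Sum of squared depths below the ground; the last coordinate is the height."""
    return sum(max(0.0, ground_height - p[-1]) ** 2 for p in foot_points)


def best_weighting(losses, weight_combinations):
    """Return (weighted result, weights) for the combination with the smallest result."""
    scored = [
        (sum(w * l for w, l in zip(weights, losses)), weights)
        for weights in weight_combinations
    ]
    return min(scored, key=lambda item: item[0])


slip = slip_loss([1.0, 0.0], [1.0, 0.5], [0.1, 2.0], [0.2, 0.2])
points = rasterize_foot(heel=(0.0, 0.0, -0.02), toe=(0.2, 0.0, 0.02))
penetration = penetration_loss(points, ground_height=0.0)
pose = 0.1  # placeholder for the preset pose loss, which the claim does not define

combos = [(1.0, 1.0, 1.0), (2.0, 10.0, 0.5), (0.5, 5.0, 1.0)]
value, weights = best_weighting((slip, penetration, pose), combos)
```

Note that the fast toe in the second frame contributes nothing to the slip loss because its contact probability is zero, which is the behaviour a probability-weighted slip term is meant to have.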
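
The three foot states of claim 4 and the "eliminating foot movement" branch can be sketched as below. This is a simplified illustration under stated assumptions: the state is derived from per-foot contact probabilities with a fixed threshold of 0.5, whereas claim 1 also uses the foot end position; and foot movement is eliminated by holding the foot at the position where each full-contact run begins. The state labels and function names are hypothetical, and the branches for the other two states are not shown.

```python
def classify_foot_state(left_prob, right_prob, threshold=0.5):
    """Map two contact probabilities to one of three states (assumed rule)."""
    left_down = left_prob >= threshold
    right_down = right_prob >= threshold
    if left_down and right_down:
        return "full_contact"
    if not left_down and not right_down:
        return "airborne"
    return "partial_contact"


def pin_during_full_contact(positions, states):
    """Hold the foot at the first position of every run of full-contact frames."""
    pinned = []
    anchor = None
    for pos, state in zip(positions, states):
        if state == "full_contact":
            if anchor is None:
                anchor = pos  # first frame of a new full-contact run
            pinned.append(anchor)
        else:
            anchor = None  # leaving the run releases the anchor
            pinned.append(pos)
    return pinned


probs = [(0.9, 0.8), (0.9, 0.9), (0.2, 0.7), (0.1, 0.1)]
states = [classify_foot_state(l, r) for l, r in probs]
positions = [(0.0, 0.0), (0.01, 0.0), (0.05, 0.02), (0.1, 0.2)]
stable = pin_during_full_contact(positions, states)
```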
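
The weighted training loss of claim 8 can be sketched in plain Python. Two points are assumptions rather than statements from the claims: the first loss is taken to be the binary cross-entropy named in claim 7, and the inter-frame speed loss, which the claim does not define, is assumed to compare the frame-to-frame change of the predicted probability with the frame-to-frame change of the label. The default weights are illustrative; the only constraint taken from the claim is that the first weight is larger.

```python
import math


def bce(preds, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(preds)


def speed_loss(preds, labels):
    """Mean squared mismatch of frame-to-frame changes (assumed definition)."""
    diffs = [
        ((preds[i + 1] - preds[i]) - (labels[i + 1] - labels[i])) ** 2
        for i in range(len(preds) - 1)
    ]
    return sum(diffs) / len(diffs)


def combined_loss(preds, labels, w_first=1.0, w_speed=0.1):
    """Weighted sum in which the first loss must outweigh the speed loss."""
    if w_first <= w_speed:
        raise ValueError("w_first must be larger than w_speed")
    return w_first * bce(preds, labels) + w_speed * speed_loss(preds, labels)


loss = combined_loss([0.9, 0.8, 0.2], [1, 1, 0])
```

In a real training loop these functions would operate on tensors in an automatic differentiation framework; the scalar version only shows how the two terms are combined.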

Description

Motion capture optimization method and model training method based on monocular video

Technical Field

The application relates to the field of computer technology, and in particular to a motion capture optimization method and a model training method based on monocular video.

Background

Motion capture is an interdisciplinary technology that integrates image processing, computer vision, and human kinematics. It captures human motion information and provides a data foundation for fields such as virtual reality, augmented reality, animation production, and medical rehabilitation. Among the many motion capture techniques, monocular video-based motion capture is becoming a research focus because it is economical, convenient, and widely applicable. Using only a single camera or a single video stream as input, it extracts key points of the human figure and reconstructs three-dimensional motion data of the human figure through an inverse kinematics (IK) algorithm.

However, because depth information is missing, monocular video-based motion capture often suffers from reduced spatial positioning accuracy and occluded key points, and key point extraction may also be disturbed by changes in environmental factors such as lighting conditions. As a result, ground penetration and jitter of the feet may occur during three-dimensional reconstruction, which in turn reduces the accuracy of the three-dimensional motion data reconstructed through the inverse kinematics algorithm.

Disclosure of Invention

The application provides a motion capture optimization method and a model training method based on monocular video, which address the problem that ground penetration and jitter of the feet, which may occur during three-dimensional reconstruction, reduce the accuracy of the three-dimensional motion data reconstructed by the inverse kinematics algorithm.

A first aspect of the application provides a motion capture optimization method based on monocular video, comprising: obtaining a monocular video, and inputting the monocular video into a preset ground contact detection model to obtain a ground contact detection result output by the ground contact detection model, wherein the ground contact detection result indicates the ground contact probability of the feet of a human figure in the monocular video; obtaining first three-dimensional motion data of the human figure from the monocular video, and obtaining a foot trajectory of the human figure from the first three-dimensional motion data; performing ground contact optimization on the foot trajectory according to the ground contact probability, and performing jitter optimization on the foot trajectory after ground contact optimization according to the ground contact probability; and reconstructing the first three-dimensional motion data through an inverse kinematics algorithm according to the jitter-optimized foot trajectory to obtain second three-dimensional motion data of the human figure.

In one possible design, performing ground contact optimization on the foot trajectory according to the ground contact probability comprises: obtaining a first foot position from the foot trajectory through a forward kinematics algorithm, and constructing an estimated ground through a clustering algorithm according to the ground contact probability and the first foot position; and performing ground contact optimization on the foot trajectory according to the ground contact probability, the first foot position, the estimated ground, and the foot speed indicated by the foot trajectory.

In one possible design, the first foot position comprises a toe position and a heel position, the ground contact probability comprises a toe contact probability and a heel contact probability, and the foot speed comprises a toe speed and a heel speed. Performing ground contact optimization on the foot trajectory according to the ground contact probability, the first foot position, the estimated ground, and the foot speed indicated by the foot trajectory comprises: obtaining a slip loss according to the toe contact probability, the heel contact probability, the toe speed, and the heel speed; rasterizing the foot according to the toe position and the heel position, and obtaining a penetration loss according to the estimated ground and the rasterized foot; computing, for each of a plurality of preset weight combinations, a weighted result of the slip loss, the penetration loss, and a preset pose loss; and performing ground contact optimization on the foot trajectory according to the weighted result with the smallest value among the weighted results.

In one possible design