CN-122024325-A - 4D interactive reconstruction method for hand and articulated object based on monocular video input


Abstract

A 4D interactive reconstruction method for a hand and an articulated object based on monocular video input, belonging to the technical field of three-dimensional reconstruction. The invention addresses the problem that the 4D physical interaction between a hand and an articulated object cannot be reconstructed from monocular video alone. The method comprises: obtaining a monocular video sequence and a repaired video sequence with the hand regions removed; obtaining a canonical three-dimensional mesh of the articulated object from a canonical frame image; optimizing the canonical three-dimensional mesh with an adaptive sampling refinement method; performing part-level dense point tracking on the repaired video sequence to obtain a three-dimensional tracking trajectory for each part; solving and optimizing rigid body transformation parameters to obtain the rigid body motion reconstruction result of each part; and, guided by contact interaction information, optimizing the metric scale and jointly optimizing the hand joint pose parameters and the hand global rigid body transformation to obtain the 4D hand interaction reconstruction result. The invention can recover a physically plausible 4D representation in in-the-wild interaction scenes with complex motion and severe occlusion.

Inventors

ZHANG ZHILU
WANG ZIKAI
WANG YIQING
LI HUI
ZUO WANGMENG

Assignees

Harbin Institute of Technology (哈尔滨工业大学)

Dates

Publication Date: 2026-05-12
Application Date: 2026-02-05

Claims (10)

1. A 4D interactive reconstruction method for a hand and an articulated object based on monocular video input, characterized by comprising: acquiring a monocular video sequence comprising multiple frames of RGB images; removing the hand region from each frame of RGB image to obtain a repaired video sequence composed of repaired RGB images; selecting one frame of RGB image as a canonical frame image, obtaining a canonical three-dimensional mesh of the articulated object based on the canonical frame image, and estimating the metric depth of the repaired video sequence using a monocular depth estimation model; taking the canonical frame image, the metric depth of the repaired video sequence, and the canonical three-dimensional mesh of the articulated object as input, processing the canonical three-dimensional mesh with an adaptive sampling refinement method, and solving by optimization for the metric scale and the six-degree-of-freedom spatial pose of the articulated object in the canonical frame image, thereby obtaining a metric canonical articulated object mesh in the world coordinate system; performing part-level dense point tracking on the repaired video sequence to obtain two-dimensional motion trajectories of each part of the articulated object, and combining these with the repaired video sequence to obtain a three-dimensional tracking trajectory of each part; based on the three-dimensional tracking trajectory of each part, solving and optimizing the rigid body transformation parameters of each part of the articulated object in the metric canonical articulated object mesh to obtain a rigid body motion parameter reconstruction result for each part; reconstructing the hand region of the monocular video sequence to obtain a hand 4D mesh, while using a multimodal large model to infer, from visual and geometric information, the contact interaction information of the monocular video sequence; and then, with the optimized metric scale of the articulated object held fixed, jointly optimizing the hand joint pose parameters and the hand global rigid body transformation in the hand 4D mesh to obtain the reconstruction result of the 4D hand interaction.
2. The method for 4D interactive reconstruction of a hand and an articulated object based on monocular video input according to claim 1, wherein the canonical three-dimensional mesh of the articulated object is obtained as follows: processing the monocular video sequence with a video segmentation model to obtain hand masks; based on the hand masks, removing the hand regions from the frames of RGB images of the monocular video sequence and completing the articulated object with a video inpainting model, to obtain the repaired video sequence; obtaining, with the video segmentation model, an articulated object mask for each frame of repaired RGB image in the repaired video sequence; selecting the c-th frame repaired RGB image as the canonical frame image, the corresponding mask being the canonical frame articulated object mask; applying the canonical frame articulated object mask to the canonical frame image to obtain an articulated object image; and inputting the articulated object image into a three-dimensional reconstruction model to obtain the canonical three-dimensional mesh of the articulated object.
3. The method for 4D interactive reconstruction of a hand and an articulated object based on monocular video input according to claim 2, wherein the metric canonical articulated object mesh in the world coordinate system is obtained as follows: denoting the metric depth of the canonical frame image as the canonical frame metric depth; back-projecting the canonical frame metric depth to obtain a canonical frame depth point cloud, and combining it with the canonical frame articulated object mask to obtain an initial coarse scale estimate of the articulated object in the canonical three-dimensional mesh; defining an adaptive sampling interval based on the initial coarse scale estimate; iterating over all candidate metric scales within the adaptive sampling interval, and in each iteration scaling the canonical three-dimensional mesh of the articulated object by the current candidate metric scale to obtain the corresponding candidate object mesh in the world coordinate system; processing the candidate object mesh to obtain a candidate object contour mask; computing the overlap metric between the candidate object contour mask corresponding to the candidate metric scale and the articulated object mask obtained from the video segmentation model; comparing the current overlap metric and iteration round with the recorded optimal overlap metric and its iteration round; according to the comparison result, choosing whether to update the optimal metric scale and the optimal overlap metric, and whether to expand the adaptive sampling interval; and, when the iteration ends, taking the candidate object mesh with the highest overlap metric as the metric canonical articulated object mesh in the world coordinate system.
4. The method for 4D interactive reconstruction of a hand and an articulated object based on monocular video input according to claim 3, wherein the candidate object contour mask is obtained as follows: taking the candidate object mesh, the canonical frame image, the canonical frame metric depth, the canonical frame articulated object mask, and the camera intrinsic matrix as input, estimating a candidate six-degree-of-freedom spatial pose with a pose estimation model; combining the candidate six-degree-of-freedom spatial pose with the candidate object mesh and rendering the contour under the camera intrinsic matrix to obtain the candidate object contour mask; the metric canonical articulated object mesh is obtained from the canonical three-dimensional mesh of the articulated object by transformation with the optimal metric scale and the optimal six-degree-of-freedom spatial pose.
5. The method for 4D interactive reconstruction of a hand and an articulated object based on monocular video input according to claim 4, wherein the mesh region of each part of the articulated object in the metric canonical articulated object mesh is obtained as follows: assuming that the articulated object in the i-th frame repaired RGB image is divided into W parts, each represented by a part mask; to divide the metric canonical articulated object mesh into parts, obtaining vertex features of the mesh through the PARTFIELD network, grouping the vertices of the mesh based on the vertex features, and matching the groups to the parts according to the part masks, to obtain the mesh region of each part.
6. The method for 4D interactive reconstruction of a hand and an articulated object based on monocular video input according to claim 5, wherein the three-dimensional tracking trajectory of each part of the articulated object is obtained as follows: for the w-th part, randomly sampling pixels within the part mask as query points; tracking the query points to obtain their two-dimensional motion trajectories; and lifting the two-dimensional motion trajectories into three-dimensional space by combining the canonical frame depth point cloud, to obtain a three-dimensional tracking trajectory and a visibility indicator for each query point, where the three-dimensional tracking trajectory is that of the q-th query point of the w-th part in the i-th frame repaired RGB image, and the visibility indicator indicates whether the q-th query point is visible in the i-th frame repaired RGB image.
7. The method for 4D interactive reconstruction of a hand and an articulated object based on monocular video input according to claim 6, wherein solving and optimizing the rigid body transformation parameters of each part of the articulated object comprises, based on the current reference frame interval, optimizing the part rigid body transformations of the articulated object for the repaired RGB images of the frames adjacent to the current reference frame interval on both sides: first, a reference frame interval is formed from the canonical frame image; based on the reference frame interval, a consistency constraint based on three-dimensional point tracking trajectories is applied to the initial rigid body transformation parameters of each part of the articulated object for the adjacent-frame repaired RGB images on both sides of the reference frame interval, to obtain the rigid body transformation optimization result; the tracking consistency loss corresponding to this consistency constraint is defined in terms of the following quantities: the index set of the reference frame interval; the set of query points visible in both the i-th frame repaired RGB image and the j-th reference frame; the rigid body transformation of the w-th part in the i-th frame repaired RGB image; and the optimized rigid body transformation of the w-th part in the j-th reference frame; the set of jointly visible query points is determined using the indicator of whether the q-th query point is visible in the j-th reference frame.
8. The method for 4D interactive reconstruction of a hand and an articulated object based on monocular video input according to claim 7, wherein solving and optimizing the rigid body transformation parameters of each part of the articulated object further comprises computing a temporal smoothness regularization loss, defined using a discrete second-order difference operator applied to the rigid body transformation along the time dimension; the temporal motion optimization objective function of the w-th part is then defined, in which the regularization term is weighted by a regularization weight coefficient.
9. The method for 4D interactive reconstruction of a hand and an articulated object based on monocular video input according to claim 8, wherein the hand 4D mesh is obtained as follows: processing the monocular video sequence with a hand pose estimation model to obtain, for the i-th frame RGB image, the MANO joint pose parameters, the hand shape parameters, and the hand global rigid body transformation of the hand, where the MANO joint pose parameters are a joint angle parameter vector of dimension 45; and completing the hand reconstruction by spherical linear interpolation of the MANO joint pose parameters and the hand global rigid body transformations of adjacent frames, to obtain a hand 4D mesh that is continuous and smooth in time.
10. The method for 4D interactive reconstruction of a hand and an articulated object based on monocular video input according to claim 9, wherein optimizing the metric scale of the articulated object in the rigid body motion reconstruction results of each part is realized through a contact constraint loss based on dynamic hand-object contact, defined in terms of the following quantities: the set of frames of the monocular video sequence in which the hand is in contact with the object; the set of fingers determined to be in contact in the i-th frame RGB image; the set of MANO mesh vertices corresponding to the fingertips of the fingers in that set; a fingertip vertex; and the closest point on the metric canonical articulated object mesh to that fingertip vertex; the set of frames in which the hand is in contact with the object is obtained by multimodal large language model inference from the frame RGB images of the monocular video sequence and the colorized metric depth, the metric depth of the monocular video sequence being estimated from the frame RGB images by the monocular depth estimation model; the joint optimization of the hand joint pose parameters and the hand global rigid body transformation in the hand 4D mesh is realized jointly through the contact constraint loss based on dynamic hand-object contact and a hand motion regularization term, the hand motion regularization term being defined in terms of: a regularization constraint weight coefficient for the hand global rigid body transformation; a discrete second-order difference operator applied to the hand global rigid body transformation along the time dimension; a weight coefficient for the hand joint pose parameters; and the initial estimates of the hand joint poses; the overall optimization objective function of the hand interaction is then formed, and finally the reconstruction result of the 4D hand interaction is obtained.
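The adaptive scale search of claim 3 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `render_disc` is a toy stand-in for the pose estimation and contour rendering of claim 4, intersection-over-union is assumed as the overlap metric, and the rule "expand the interval when the best candidate lies on its boundary" is one plausible reading of the interval expansion step, not the patented procedure.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean masks (assumed overlap metric)."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0

def render_disc(scale, size=128, px_per_unit=20.0):
    """Hypothetical renderer: a disc whose pixel radius grows with the scale."""
    yy, xx = np.mgrid[:size, :size]
    c = (size - 1) / 2.0
    return (xx - c) ** 2 + (yy - c) ** 2 <= (scale * px_per_unit) ** 2

def search_scale(target_mask, s_init, half_width=0.2, n_samples=9, max_rounds=10):
    """Sample candidate scales around s_init; widen the interval while the
    best candidate keeps landing on an interval boundary."""
    lo, hi = s_init * (1 - half_width), s_init * (1 + half_width)
    best_s, best_iou = s_init, -1.0
    for _ in range(max_rounds):
        improved_at_edge = False
        for k, s in enumerate(np.linspace(lo, hi, n_samples)):
            iou = mask_iou(render_disc(s), target_mask)
            if iou > best_iou:
                best_s, best_iou = float(s), float(iou)
                improved_at_edge = k in (0, n_samples - 1)
        if not improved_at_edge:
            break  # optimum is interior (or nothing improved): stop
        width = hi - lo
        lo, hi = max(best_s - width, 1e-6), best_s + width
    return best_s, best_iou

# Usage: the coarse estimate 1.0 is far from the true scale 2.0.
target = render_disc(2.0)
best_scale, best_overlap = search_scale(target, s_init=1.0)
```

Starting from a coarse estimate and widening only when needed keeps the number of rendered candidates small while still escaping a poor initial guess.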
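Claim 4 obtains the metric canonical mesh by applying a metric scale and a six-degree-of-freedom pose to the canonical mesh, then rendering under the camera intrinsic matrix. The sketch below shows only the vertex transform and pinhole projection, assuming the pose is given as a rotation matrix `R` and translation `t` (hypothetical names); rasterizing the projected mesh into a contour mask would need a renderer and is not shown.

```python
import numpy as np

def place_and_project(vertices, scale, R, t, K):
    """Scale canonical vertices, move them into the camera frame, and project.

    vertices: (V, 3) canonical mesh vertices
    scale:    candidate metric scale (scalar)
    R, t:     assumed 6-DoF pose as a (3, 3) rotation and (3,) translation
    K:        (3, 3) camera intrinsic matrix
    """
    cam = (scale * vertices) @ R.T + t      # metric points in the camera frame
    uvw = cam @ K.T                         # homogeneous pixel coordinates
    return cam, uvw[:, :2] / uvw[:, 2:3]    # perspective division

# Usage with illustrative values.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
verts = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
cam_pts, pixels = place_and_project(verts, 2.0, np.eye(3), np.array([0.0, 0.0, 4.0]), K)
```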
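The lifting of two-dimensional tracks into three dimensions in claim 6 can be sketched with pinhole back-projection. This is a sketch under stated assumptions: it reads depth from per-frame metric depth maps at the rounded track position, whereas the claim refers to the canonical frame depth point cloud, and the array layouts and function names are hypothetical.

```python
import numpy as np

def backproject(uv, depth, K):
    """Pinhole back-projection of pixels uv (Q, 2) with depth (Q,) to 3D."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (uv[:, 0] - cx) / fx * depth
    y = (uv[:, 1] - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def lift_tracks(tracks_2d, visible, depth_maps, K):
    """tracks_2d: (T, Q, 2) pixels, visible: (T, Q) bool, depth_maps: (T, H, W).
    Returns (T, Q, 3) points (NaN where not visible) and the updated visibility."""
    T, Q, _ = tracks_2d.shape
    H, W = depth_maps.shape[1:]
    pts = np.full((T, Q, 3), np.nan)
    vis = visible.copy()
    for t in range(T):
        u = np.rint(tracks_2d[t, :, 0]).astype(int)
        v = np.rint(tracks_2d[t, :, 1]).astype(int)
        vis[t] &= (u >= 0) & (u < W) & (v >= 0) & (v < H)  # drop off-image points
        idx = np.flatnonzero(vis[t])
        d = depth_maps[t, v[idx], u[idx]]
        pts[t, idx] = backproject(tracks_2d[t, idx], d, K)
    return pts, vis

# Usage with illustrative values: constant depth of 2 m, one track leaving the image.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
depth_maps = np.full((2, 48, 64), 2.0)
tracks = np.array([[[32.0, 24.0], [42.0, 24.0]],
                   [[32.0, 24.0], [100.0, 24.0]]])
pts3d, vis3d = lift_tracks(tracks, np.ones((2, 2), dtype=bool), depth_maps, K)
```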
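The two loss terms of claims 7 and 8 can be illustrated as follows. The formulas themselves are not available in this text, so the forms below are assumptions: the consistency term assumes each part transform maps part coordinates into frame coordinates, so that a point tracked from reference frame j to frame i should satisfy p_i ≈ T_i T_j⁻¹ p_j, and the smoothness term applies the discrete second-order difference to a generic per-frame parameter vector.

```python
import numpy as np

def apply_rigid(T, pts):
    """Apply a 4x4 rigid transform to (Q, 3) points."""
    return pts @ T[:3, :3].T + T[:3, 3]

def tracking_consistency(T_i, T_j, pts_i, pts_j, vis_i, vis_j):
    """Mean squared residual over query points visible in both frames
    (assumed form: p_i should equal T_i T_j^{-1} p_j)."""
    both = vis_i & vis_j
    if not both.any():
        return 0.0
    rel = T_i @ np.linalg.inv(T_j)
    res = apply_rigid(rel, pts_j[both]) - pts_i[both]
    return float(np.mean(np.sum(res ** 2, axis=1)))

def second_order_smoothness(params):
    """Sum of squared second differences x[t-1] - 2 x[t] + x[t+1]
    along the time axis of a (T, D) parameter array."""
    d2 = params[:-2] - 2.0 * params[1:-1] + params[2:]
    return float(np.sum(d2 ** 2))

# Usage with illustrative values.
T_ref = np.eye(4)
T_cur = np.eye(4)
T_cur[:3, 3] = [1.0, 0.0, 0.0]                      # part moved 1 unit along x
p_ref = np.array([[0.0, 0.0, 1.0], [1.0, 2.0, 3.0]])
all_vis = np.ones(2, dtype=bool)
loss_ok = tracking_consistency(T_cur, T_ref, p_ref + [1.0, 0.0, 0.0], p_ref, all_vis, all_vis)
loss_bad = tracking_consistency(T_cur, T_ref, p_ref, p_ref, all_vis, all_vis)
smooth_linear = second_order_smoothness(np.array([[0.0], [1.0], [2.0], [3.0]]))
smooth_curved = second_order_smoothness(np.array([[0.0], [1.0], [4.0]]))
```

A second-order difference penalizes acceleration rather than velocity, so uniform motion of a part is not penalized while jitter is.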
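Claim 9 uses spherical linear interpolation to complete hand rotations from adjacent frames. A minimal sketch follows, assuming rotations are stored as unit quaternions in (w, x, y, z) order and that a boolean `valid` flag (a hypothetical input) marks the frames that have an estimate; how the frames to be completed are chosen is not specified in the claim text.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions, t in [0, 1]."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:                 # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:              # nearly parallel: fall back to normalized lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def fill_missing_rotations(quats, valid):
    """Fill invalid frames lying between two valid frames by slerp.
    quats: (T, 4) quaternions, valid: (T,) bool."""
    out = quats.copy()
    idx = np.flatnonzero(valid)
    for a, b in zip(idx[:-1], idx[1:]):
        for t in range(a + 1, b):
            out[t] = slerp(quats[a], quats[b], (t - a) / (b - a))
    return out

# Usage: identity, a missing frame, then a 90 degree rotation about z.
q_id = np.array([1.0, 0.0, 0.0, 0.0])
q_z90 = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
filled = fill_missing_rotations(np.stack([q_id, np.zeros(4), q_z90]),
                                np.array([True, False, True]))
```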
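The contact constraint of claim 10 pulls the fingertip vertices of contacting fingers toward their closest points on the object mesh in the frames flagged as contact frames. The sketch below is an assumed form, not the claimed formula: it approximates the closest point on the mesh by the closest mesh vertex and averages squared distances, and the per-frame dictionaries are a hypothetical data layout.

```python
import numpy as np

def contact_loss(fingertips_per_frame, object_vertices_per_frame, contact_frames):
    """Mean squared distance from each contacting fingertip vertex to its
    nearest object vertex, accumulated over the contact frames only.

    fingertips_per_frame:      dict frame -> (F, 3) fingertip vertices in contact
    object_vertices_per_frame: dict frame -> (V, 3) posed object mesh vertices
    contact_frames:            iterable of frame indices with hand-object contact
    """
    total, count = 0.0, 0
    for i in contact_frames:
        tips = fingertips_per_frame[i]
        obj = object_vertices_per_frame[i]
        if len(tips) == 0:
            continue
        # Pairwise squared distances, shape (F, V); nearest vertex per fingertip.
        d2 = np.sum((tips[:, None, :] - obj[None, :, :]) ** 2, axis=-1)
        total += float(d2.min(axis=1).sum())
        count += len(tips)
    return total / count if count else 0.0

# Usage: frame 0 is a contact frame, frame 1 is not and must be ignored.
obj = {0: np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]),
       1: np.array([[0.0, 0.0, 0.0]])}
tips = {0: np.array([[0.0, 0.0, 0.1], [1.0, 0.2, 0.0]]),
        1: np.array([[5.0, 5.0, 5.0]])}
loss_contact = contact_loss(tips, obj, contact_frames=[0])
```

Restricting the sum to contact frames matters: fingertips in non-contact frames are legitimately far from the object and should not be pulled onto it.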

Description

4D interactive reconstruction method for hand and articulated object based on monocular video input

Technical Field

The invention relates to a 4D interactive reconstruction method for a hand and an articulated object based on monocular video input, and belongs to the technical field of three-dimensional reconstruction.

Background

Hand-object interaction (HOI) reconstruction refers to recovering, from visual observations, the three-dimensional structure and temporal evolution of hands, objects, and their interaction relationships. It has important application value in fields such as human behavior analysis, robotic manipulation, and augmented reality. Early reconstruction methods generally depend on predefined object templates or category priors, which limits their applicability and makes it difficult to cope with the diverse and unknown objects found in real environments. In recent years, some template-free and category-agnostic HOI reconstruction methods have emerged, but most of them assume rigid objects and therefore struggle with articulated objects that have movable parts.

Separately, there has been research progress on four-dimensional reconstruction of articulated objects, but existing methods typically rely on pre-scanning the object to obtain its canonical form, or require multi-view video as input. In natural scenarios such conditions are often difficult to meet; in particular, when only monocular video is available, reconstruction of the interaction between a hand and an articulated object still lacks an effective technical path. In a real interaction, the articulated object often undergoes complex motion and frequent occlusion, so HOI reconstruction from monocular video becomes a highly ill-posed problem.
To alleviate this lack of information, researchers have begun to introduce various foundation model priors, such as image-to-3D models for recovering object geometry, depth and tracking models for estimating scale and motion, dedicated models for hand reconstruction, and multimodal large language models for reasoning about interaction states. In the prior art, however, a simple combination of these models rarely achieves stable results. Even if four-dimensional representations of the hand and the object are obtained separately, combining them directly does not determine the true scale and spatial position, and produces physically implausible artifacts such as inconsistent scale, interpenetration, and loss of contact. Therefore, how to reconstruct the 4D physical interaction of a hand with an articulated object under monocular video conditions remains a key technical problem to be solved in the art.

Disclosure of Invention

To address the problem that the 4D physical interaction between a hand and an articulated object cannot be reconstructed under monocular video conditions, the invention provides a 4D interactive reconstruction method for a hand and an articulated object based on monocular video input.
The invention relates to a 4D interactive reconstruction method for a hand and an articulated object based on monocular video input, which comprises the following steps: acquiring a monocular video sequence comprising multiple frames of RGB images; removing the hand region from each frame of RGB image to obtain a repaired video sequence composed of repaired RGB images; selecting one frame of RGB image as a canonical frame image, obtaining a canonical three-dimensional mesh of the articulated object based on the canonical frame image, and estimating the metric depth of the repaired video sequence using a monocular depth estimation model; taking the canonical frame image, the metric depth of the repaired video sequence, and the canonical three-dimensional mesh of the articulated object as input, processing the canonical three-dimensional mesh with an adaptive sampling refinement method, and solving by optimization for the metric scale and the six-degree-of-freedom spatial pose of the articulated object in the canonical frame image, thereby obtaining a metric canonical articulated object mesh in the world coordinate system; performing part-level dense point tracking on the repaired video sequence to obtain two-dimensional motion trajectories of each part of the articulated object, and combining these with the repaired video sequence to obtain a three-dimensional tracking trajectory of each part; based on the three-dimensional tracking trajectory of each part, solving and optimizing the rigid body transformation parameters of each part of the articulated object in the metric canonical articulated object mesh to obtain a rigid body motion parameter reconstruction result for each part; reconstructing the hand region of the monocular video sequence to obtain a hand 4D mesh, while using a multi-m